Scikit-learn¶

A package for running machine learning in Python.

Iris data¶

See https://en.wikipedia.org/wiki/Iris_flower_data_set
In 1936 the British statistician Ronald Fisher used this dataset as an example of a linear classification problem, and it has since become one of the standard examples in machine learning.
Features: sepal length, sepal width, petal length, petal width
Target values: setosa, versicolor, virginica
Number of samples: 150 (50 for each of the three species)

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris  # The dataset is used so often that scikit-learn ships a loader function for it.

iris = load_iris()

dir(iris)

['DESCR', 'data', 'feature_names', 'target', 'target_names']

iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

display(iris.target, iris.target.shape)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

(150,)

display(iris.target[iris.target==0].shape, iris.target[iris.target==1].shape, iris.target[iris.target==2].shape)
# display() shows each argument on its own line

(50,)

(50,)

(50,)

display(iris.data.shape, iris.data[:5])

(150, 4)

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

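Splitting into a training set and a test set¶

Machine learning is divided into two phases: training and testing.
We train a predictive model on the training set, then judge how well the training worked using the test set.
The sklearn.model_selection.train_test_split() function makes the split convenient.
By default, train_test_split() puts 75% of the samples in the training set and 25% in the test set.
Note: by convention, the data is written as an uppercase X and the target (or label) as a lowercase y.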

from sklearn.model_selection import train_test_split

#help(train_test_split)

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)

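# We could build a model from all of the data, but then we would need new data to test it.
# So we hold some data back when building the model and use the held-out part as test data.
# train is for training, test is for testing: 75% / 25%, shuffled randomly (how the samples are drawn can be configured).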

display(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# The iris data is 150x4 and the target has 150 entries; the data is split into 112x4 for training and 38x4 for testing, and the target into 112 entries for training and 38 for testing.

(112, 4)

(38, 4)

(112,)

(38,)

display(X_train[:5], y_train[:5])

array([[5.5, 2.4, 3.8, 1.1],
       [5. , 3.6, 1.4, 0.2],
       [4.6, 3.6, 1. , 0.2],
       [5. , 2.3, 3.3, 1. ],
       [5.7, 3.8, 1.7, 0.3]])

array([1, 0, 0, 1, 0])
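
The split above is random, so the rows shown will differ from run to run. The following is a minimal sketch of how the split could be made reproducible, assuming the standard test_size, random_state and stratify parameters of train_test_split(). The seed value 0 is an arbitrary choice for illustration and is not part of the original post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# test_size=0.25 matches the default split described above.
# random_state=0 (an arbitrary seed) fixes the shuffle so reruns give the same split.
# stratify keeps the class proportions similar in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0, stratify=iris.target
)

print(X_train.shape, X_test.shape)
```

Leaving random_state out restores the behaviour used in the rest of this post, where every run draws a different split.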

Scatter plots of the four features¶

A scatter plot shows data with two features as points on a graph.
Let's pair up the four features of the Iris data and draw the scatter plots.

import pandas as pd

iris_df = pd.DataFrame(X_train, columns=iris.feature_names)
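# A DataFrame stores column names alongside the data, and an index is assigned automatically.
iris_df[:5] # NumPy-style indexing such as [:5,1] does not work here; use iris_df.iloc[:5, 1] instead.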

pd.plotting.scatter_matrix(iris_df, c=y_train, s=60, alpha=0.8, figsize=[12,12])
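# pandas is used here for its scatter_matrix function.
# The four features are paired up: there are 6 distinct pairs, and each appears twice (mirrored) in the 4x4 grid.
# The diagonal, where a feature is paired with itself, shows a histogram.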
print('')

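k-NN (nearest neighbors) prediction model¶

The k-NN model is a machine learning model that predicts based on the k neighboring points closest to the query point.
A model is trained on the training set, so we use X_train and y_train.
As the code below shows, training takes just two lines: define the model, then call fit().
The model is geometric in nature: it relies on distances between points.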

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=1) # the default is 5
model.fit(X_train, y_train) # fit: hand the data to the model and let it learn; the fitted model is then used for prediction

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')
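
To make the idea of "nearest neighbors" concrete, here is a minimal from-scratch sketch of a k-NN prediction using Euclidean distance and a majority vote. It is an illustration only, not how scikit-learn implements KNeighborsClassifier. The function name knn_predict and the toy arrays are hypothetical and do not come from the original post.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=1):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated clusters in 2-D.
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y_toy = np.array([0, 0, 1, 1])

pred_near_first = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=1)
pred_near_second = knn_predict(X_toy, y_toy, np.array([4.8, 5.1]), k=3)
print(pred_near_first, pred_near_second)
```

With k=3 the second query has two neighbors from class 1 and one from class 0, so the vote still goes to class 1.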

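Predicting an arbitrary point¶

The code below shows the prediction for a sample with the values (6, 3, 4, 1.5). (The result is 1, which means versicolor.)

model.predict([[6,3,4,1.5]]) # even a single sample must be passed as a 2-D array
# predict returns the predicted class label.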

array([1])
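
The predicted value is an integer label. As a small sketch, the label can be turned back into a species name by indexing iris.target_names with it. The block below rebuilds the model so that it runs on its own; random_state=0 is an arbitrary seed chosen for this illustration and is not used in the original post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0  # arbitrary seed, for reproducibility only
)
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# predict() returns an array of integer labels, one per sample.
labels = model.predict([[6, 3, 4, 1.5]])
# Indexing target_names with that array maps each label to its species name.
names = iris.target_names[labels]
print(labels, names)
```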

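Model evaluation¶

Now that we have a model, let's check how accurate it is on the test set, X_test and y_test.
A result of 0.973 means the model predicted 97.3% of the test set correctly, i.e. 37 of the 38 test samples. (The result changes from run to run, because the split is random.)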

score = model.score(X_test, y_test) # evaluate with score()
print(score) # the score is high because the classes in this dataset are well separated to begin with

0.9736842105263158

pred_y=model.predict(X_test)
pred_y==y_test

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True])

(model.predict(X_test)==y_test).mean()

0.9736842105263158
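
The comparison above shows that a prediction was wrong but not which classes were confused. One possible way to look closer, sketched here with np.flatnonzero and sklearn.metrics.confusion_matrix, is to list the misclassified test samples and count predictions per class. The seed random_state=0 is an assumption made for reproducibility, so the numbers will not necessarily match the outputs shown in this post.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0  # arbitrary seed, for reproducibility only
)
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
pred_y = model.predict(X_test)

# Positions in the test set where the prediction differs from the true label.
wrong = np.flatnonzero(pred_y != y_test)
for i in wrong:
    print(i, iris.target_names[y_test[i]], "->", iris.target_names[pred_y[i]])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, pred_y)
print(cm)

# The diagonal holds the correct predictions, so this equals model.score().
accuracy = np.trace(cm) / cm.sum()
print(accuracy)
```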

Full code¶

As shown below, the core code is only five lines.
This example uses all four features, so the data lives in a 4-dimensional space and the result is not easy to show in a graph.

import numpy as np

from sklearn.datasets import load_iris  # import a function
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier  # import a class

iris = load_iris()

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)

model = KNeighborsClassifier(n_neighbors=1) # n_neighbors: how many nearby points to take into account
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
print(score)

fit, predict, and score: these three methods are all you need.

import numpy as np

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)

model = KNeighborsClassifier(n_neighbors=1)  # with k=1, isolated "islands" can appear in the decision regions
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
print(score)

0.9473684210526315

import numpy as np

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
print(score)

0.9473684210526315

import numpy as np

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)

model = KNeighborsClassifier(n_neighbors=10)
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
print(score)

0.9210526315789473

As n_neighbors gets larger, the decision boundary becomes smoother and tends toward a straight line.
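
The boundary itself is hard to draw in four dimensions, but a rough way to see the effect of n_neighbors is to compare accuracy on the training set and on the test set for several values of k on one fixed split. This is only a sketch: the values of k and the seed random_state=0 are assumptions chosen for illustration. A small k tends to follow the training points very closely, while a larger k averages over more neighbors.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0  # arbitrary seed so every k sees the same split
)

results = {}
for k in (1, 3, 10, 30):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for this k.
    results[k] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(k, results[k])
```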

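To try a different model, replace the line from sklearn.neighbors import KNeighborsClassifier (and the line that creates the model).

e.g.
from sklearn.svm import SVC
model = SVC()
Textbook p. 131, Support Vector Machine: this model is characterized by very smooth decision curves. In higher dimensions it draws smooth surfaces.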
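
Because every scikit-learn classifier exposes the same fit() and score() methods, swapping models can also be sketched as a loop over several estimators on the same split. The dictionary keys and the seed random_state=0 below are illustrative choices and are not taken from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0  # arbitrary seed so both models see the same split
)

# Different model classes, same interface.
models = {
    "k-NN (k=1)": KNeighborsClassifier(n_neighbors=1),
    "SVC (defaults)": SVC(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(name, scores[name])
```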

import numpy as np

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)

model = SVC(C=1.0, gamma=0.1)

model.fit(X_train, y_train)

score = model.score(X_test, y_test)
print(score)

0.9736842105263158

Because the dataset is small, the score changes from run to run depending on how the samples are drawn.
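
One way to see this variation directly is to repeat the split with several seeds and look at the spread of the scores. The sketch below assumes ten seeds (0 to 9) and a k-NN model with n_neighbors=1. Both are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

scores = []
for seed in range(10):
    # Each seed draws a different 75% / 25% split of the same 150 samples.
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, random_state=seed
    )
    model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

scores = np.array(scores)
print(scores.round(3))
print("mean:", scores.mean(), "std:", scores.std())
```

With only 38 test samples, a single misclassified sample moves the score by about 0.026, which is why individual runs differ so visibly.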


조환희's study blog

Scikit-learn basics