Wine Data Test

Applying k-NN

import numpy as np
import matplotlib.pyplot as plt

wine = np.loadtxt('winequality-red.csv', skiprows=1, delimiter=';')

wine.shape

(1599, 12)

wine[:100,-1]

array([5., 5., 5., 6., 5., 5., 5., 7., 7., 5., 5., 5., 5., 5., 5., 5., 7.,
       5., 4., 6., 6., 5., 5., 5., 6., 5., 5., 5., 5., 6., 5., 6., 5., 6.,
       5., 6., 6., 7., 4., 5., 5., 4., 6., 5., 5., 4., 5., 5., 5., 5., 5.,
       6., 6., 5., 6., 5., 5., 5., 5., 6., 5., 5., 7., 5., 5., 5., 5., 5.,
       5., 6., 6., 5., 5., 4., 5., 5., 5., 6., 5., 4., 5., 5., 5., 5., 6.,
       5., 6., 5., 5., 5., 5., 6., 5., 5., 4., 6., 5., 5., 5., 6.])

np.bincount(np.array(wine[:,-1],dtype=int))

array([  0,   0,   0,  10,  53, 681, 638, 199,  18], dtype=int64)
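The raw counts are easier to read as proportions. A minimal sketch, using only the counts printed above (the variable names are just for illustration):

```python
import numpy as np

# Counts per quality score 0..8, copied from the np.bincount output above
counts = np.array([0, 0, 0, 10, 53, 681, 638, 199, 18])

shares = counts / counts.sum()
for score, share in enumerate(shares):
    if counts[score] > 0:
        print(f"quality {score}: {share:.1%}")

# Scores 5 and 6 dominate the data set
share_5_6 = shares[5] + shares[6]
print(f"scores 5 and 6 together: {share_5_6:.1%}")
```

Scores 5 and 6 together make up roughly 82% of the 1599 wines, so the classes are heavily imbalanced.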

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = wine[:,:-1]
y = wine[:,-1]      # No train/test split here, so all of the data is used for training.

model = KNeighborsClassifier(3)
model.fit(X,y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')
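To see what a k-NN classifier with k=3 does conceptually, here is a rough pure-Python sketch of the idea: find the k closest training points and take a majority vote. This is not scikit-learn's implementation; `knn_predict` and the toy data are hypothetical, made up for illustration.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data (hypothetical, not taken from the wine file)
train_X = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.2, 4.8)]
train_y = [5, 5, 5, 7, 7]

print(knn_predict(train_X, train_y, (1.1, 1.0)))
print(knn_predict(train_X, train_y, (5.1, 5.0)))
```

Because the vote is a simple majority, a class with many samples tends to win whenever the neighbourhood is mixed.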

pred_y = model.predict(X)
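display(pred_y[:20], y[:20]) # Compare only the first 20 values.
# The predictions tend to come out lower than the actual scores, because scores 5 and 6 dominate the data.
# Before training, we should collect enough wines scored 3, 4, 7, and 8 so that the classes are roughly balanced.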

array([5., 5., 5., 6., 5., 5., 5., 6., 5., 5., 5., 5., 5., 5., 5., 5., 5.,
       4., 5., 5.])

array([5., 5., 5., 6., 5., 5., 5., 7., 7., 5., 5., 5., 5., 5., 5., 5., 7.,
       5., 4., 6.])

model.score(X,y)

0.7554721701063164
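Keep in mind that this score is measured on the same data the model was trained on, so every point counts as one of its own neighbours. A small NumPy sketch of why that inflates the score, using random (hypothetical) data and a 1-nearest-neighbour rule for simplicity rather than the k=3 model above:

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))        # hypothetical features
y_demo = rng.integers(3, 9, size=200)     # random labels 3..8: nothing to learn

# Pairwise Euclidean distances between all points
dist = np.linalg.norm(X_demo[:, None, :] - X_demo[None, :, :], axis=2)

# 1-NN evaluated on its own training data: each point's nearest neighbour is itself
train_acc = (y_demo[dist.argmin(axis=1)] == y_demo).mean()

# Leave-one-out: forbid a point from being its own neighbour
np.fill_diagonal(dist, np.inf)
loo_acc = (y_demo[dist.argmin(axis=1)] == y_demo).mean()

print("training accuracy:", train_acc)
print("leave-one-out accuracy:", loo_acc)
```

Even with purely random labels the training accuracy is perfect, while the leave-one-out accuracy drops to about chance level. With k=3 the effect is weaker but still present, which is one reason to evaluate on held-out data.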

(np.abs(pred_y-y)>2).sum() # Number of wines whose prediction is off by more than 2 points (12 cases).

12

idx = np.where(np.abs(pred_y-y)>2)[0]
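display(y[idx], pred_y[idx])  # Actual vs. predicted scores for the cases that differ by more than 2 points
# An 8 being predicted as 5 is understandable, since the data contains so many 5s.
# But a 6, which has plenty of samples, being predicted as 3 deserves a closer look.
# When a low actual score gets a high prediction (here a 4 predicted as 7), could it actually be a good wine?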

array([8., 8., 7., 7., 8., 6., 7., 7., 8., 4., 8., 8.])

array([5., 5., 4., 4., 5., 3., 3., 4., 5., 7., 5., 4.])
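Tallying the (actual, predicted) pairs makes the pattern easier to see. A short sketch using only the two arrays printed above:

```python
from collections import Counter

# Actual and predicted scores for the 12 large-error cases, copied from the output above
actual    = [8, 8, 7, 7, 8, 6, 7, 7, 8, 4, 8, 8]
predicted = [5, 5, 4, 4, 5, 3, 3, 4, 5, 7, 5, 4]

pair_counts = Counter(zip(actual, predicted))
for (a, p), n in pair_counts.most_common():
    print(f"actual {a} -> predicted {p}: {n}")

# How many large errors are too low, and how many are too high?
under = sum(p < a for a, p in zip(actual, predicted))
over = sum(p > a for a, p in zip(actual, predicted))
print("under-predicted:", under, "over-predicted:", over)
```

Eleven of the twelve large errors are under-predictions, and the most common one is an actual 8 predicted as 5.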

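plt.figure(figsize=[12,10])   # Plot the predicted values against the actual values.
plt.plot([2,9],[2,9],':')
plt.scatter(y+np.random.randn(len(y))/10, pred_y+np.random.randn(len(pred_y))/10, alpha = 0.3)

# Points on the dotted diagonal are correct predictions; the farther a cluster is from it, the larger the error.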

<matplotlib.collections.PathCollection at 0x48a9642198>
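The small random noise added to both axes is jitter: the scores are integers, so without it all points with the same (actual, predicted) pair would be drawn on top of each other. A minimal sketch of the same idea, with hypothetical scores and a seeded generator so the result is reproducible:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

scores = np.array([5., 5., 6., 5., 7., 6., 5.])  # hypothetical integer-valued scores

# Gaussian noise with standard deviation 0.1, the same scale as np.random.randn(...)/10
jittered = scores + rng.normal(scale=0.1, size=scores.shape)

print(jittered.round(2))
```

The noise is small enough that each point stays visibly close to its true integer score, while overlapping points spread into a cloud whose density shows how common each pair is.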

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
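Passing `stratify=y` asks `train_test_split` to keep the class proportions the same in both parts, which matters here because some scores are rare. A small sketch with hypothetical labels (80 fives and 20 sixes); `random_state=0` is only there to make the example repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% score 5, 20% score 6
labels = np.array([5] * 80 + [6] * 20)
features = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, stratify=labels, random_state=0
)

# With the default test_size of 0.25, the test set has 25 samples
train_share = (y_tr == 5).mean()
test_share = (y_te == 5).mean()
print("share of 5s in train:", train_share)
print("share of 5s in test: ", test_share)
```

Without stratification, a rare class can end up almost entirely in one of the two parts just by chance.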

model2 = KNeighborsClassifier(n_neighbors=3)
model2.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

pred_y_2 = model2.predict(X_test)

plt.figure(figsize=[12,10])
plt.plot([2,9],[2,9],':')
plt.scatter(y_test+np.random.randn(len(y_test))/10, pred_y_2+np.random.randn(len(y_test))/10, alpha=0.3)

<matplotlib.collections.PathCollection at 0x48a9764358>

Applying Linear Regression

from sklearn.linear_model import LinearRegression

model3 = LinearRegression()

model3.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

pred_y_3 = model3.predict(X)

(np.abs(pred_y_3 - y)>2).sum()

9

plt.hist(pred_y_3 - y, bins=np.arange(-4,5,0.5))

(array([  0.,   0.,   0.,   1.,  17.,  70., 275., 371., 575., 201.,  58.,
         23.,   7.,   1.,   0.,   0.,   0.]),
 array([-4. , -3.5, -3. , -2.5, -2. , -1.5, -1. , -0.5,  0. ,  0.5,  1. ,
         1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5]),
 <a list of 17 Patch objects>)
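Unlike the classifier, linear regression returns continuous values, so the errors are better summarized with residual statistics than with exact-match accuracy. A rough sketch with made-up numbers (not the wine results) showing the residuals, the RMSE, and what happens if the outputs are rounded to the nearest score:

```python
import numpy as np

y_true = np.array([5., 6., 7., 5.])        # hypothetical actual scores
y_hat = np.array([5.4, 5.6, 6.2, 4.9])     # hypothetical regression outputs

residuals = y_hat - y_true                 # same sign convention as pred_y_3 - y
rmse = np.sqrt(np.mean(residuals ** 2))    # root mean squared error

# Round to the nearest integer score to compare with a classifier
rounded = np.rint(y_hat).astype(int)
exact_match = (rounded == y_true).mean()

print("residuals:", residuals)
print("RMSE:", rmse)
print("rounded:", rounded, "exact-match rate:", exact_match)
```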

Applying Regression to the Wine Data

a = wine[:,0]
b = wine[:,1]

display(a,b)

array([7.4, 7.8, 7.8, ..., 6.3, 5.9, 6. ])

array([0.7  , 0.88 , 0.76 , ..., 0.51 , 0.645, 0.31 ])

plt.scatter(a,b,alpha=0.3)

<matplotlib.collections.PathCollection at 0x48a8529940>

fig = plt.figure(figsize=[20,20])

count = 0

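# Scatter plot for every pair of the 11 feature columns (55 pairs fit in the 10x10 grid).
for i in range(11):
    for j in range(i+1, 11):   # was range(i+1, 10), which skipped the last feature column
        count+=1
        plt.subplot(10,10,count)
        plt.scatter(wine[:,i],wine[:,j],s=30,alpha=0.5)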
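The same set of column pairs can be listed with `itertools.combinations`, which makes it easy to check how many subplots are needed. A minimal sketch, assuming the 11 feature columns of the wine array (the 12th column is the quality score):

```python
from itertools import combinations

n_features = 11  # the wine array has 12 columns; the last one is the quality score

# Every unordered pair (i, j) with i < j
pairs = list(combinations(range(n_features), 2))

print(len(pairs))   # 55
print(pairs[:3])
```

Since 55 is less than 100, a 10x10 subplot grid has room for every pair.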

a = wine[:,[0]]   # Keep a as a 2-D column, as required by fit()
b = wine[:,2]*10  # Multiply by 10: a simple rescaling, one kind of preprocessing.



plt.scatter(a,b,alpha=0.3)

<matplotlib.collections.PathCollection at 0x48a64f1160>

model = LinearRegression()
model.fit(a, b)

w = model.coef_                # slope
intercept = model.intercept_   # stored separately so the target array b is not overwritten

print('w =',w)
print('b =',intercept)

w = [0.75152989]
b = -3.54270001819109
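The fitted line can be used directly for prediction. A small sketch using the coefficients printed above; the input values are hypothetical, and it assumes the standard red-wine file, where column 0 is fixed acidity and column 2 is citric acid:

```python
import numpy as np

# Coefficients copied from the printed output above
w = 0.75152989
b = -3.54270001819109

fixed_acidity = np.array([6.0, 7.4, 9.0])   # hypothetical inputs
pred_scaled = w * fixed_acidity + b          # prediction for wine[:, 2] * 10
pred_original = pred_scaled / 10             # undo the x10 rescaling

print(pred_scaled)
print(pred_original)
```

Because the target was multiplied by 10 before fitting, the predictions have to be divided by 10 to get back to the original units.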

조환희's Study Blog

Wine Data Analysis
