Linear Regression

Regression is a method for predicting a real-valued target from given data.

Age	Sex	Height (cm)	Weight (kg)
35	M	175	67
...	...	...	...
27	F	163	52

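Given data like the above, predicting weight (the target) from height (the data) is a regression problem.
Within regression, predicting the target with a straight line or a flat plane (not a curved surface) is called linear regression. The simplest case is fitting a single straight line through the points.

With only one feature the model is a straight line; with two features it is a flat plane, and with three or more it is a hyperplane.

Although linear regression singles out the target as a separate value, it can equally be understood as expressing the relationship among all attributes, the target included, as one plane.
Below, we apply linear regression to the Iris data.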

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris

iris = load_iris()

X = iris.data[:,:3]
y = iris.data[:,3]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

model.score(X_test, y_test)

0.9303708802317243

pred_y = model.predict(X_test)
display(pred_y[:10], y_test[:10])

array([0.21759245, 1.29023169, 0.00583782, 1.79081403, 2.02604528,
       1.23760017, 0.32242506, 1.9431344 , 1.74635629, 1.60663507])

array([0.1, 1.5, 0.3, 2. , 2.1, 1.3, 0.4, 1.8, 2.3, 1.5])

model.coef_  # three slopes, one per feature

array([-0.24011138,  0.25055843,  0.53676221])

model.intercept_ # y-intercept

-0.18773621737029322

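u = ax + by + cz + d

With three features x, y, z and target u, the fitted model has the form above. Treating the target as a fourth variable $x_4$ and moving everything to one side, the same hyperplane can be written in the general form:

$w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 = 0$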

import pandas as pd

# Draw all pairwise scatter plots at once
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
pd.plotting.scatter_matrix(iris_df, c=y, s=60, alpha=0.8, figsize=[12,12])

[Figure: scatter matrix of the four Iris features]

fig = plt.figure(figsize=[12,8])
fig.suptitle('Iris\n(setosa:0, versicolor:1, virginica:2)', fontsize=20)
count=0

for i in range(3):
    for j in range(i+1, 4):
        count+=1
        plt.subplot(2,3,count)
        plt.scatter(iris.data[:,i],iris.data[:,j],c=iris.target,s=60,alpha=0.5)
        plt.xlabel(iris.feature_names[i])
        plt.ylabel(iris.feature_names[j])
        
plt.colorbar(shrink=0.7)

[Figure: pairwise scatter plots of the Iris features, colored by species (setosa: 0, versicolor: 1, virginica: 2)]

The feature array has to be reshaped from (150,) to (150, 1), that is, from the form [1, 2, 3, 4] to [[1], [2], [3], [4]].

X = iris.data[:,0]
X

array([5.1, 4.9, 4.7, 4.6, 5. , 5.4, 4.6, 5. , 4.4, 4.9, 5.4, 4.8, 4.8,
       4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5. ,
       5. , 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5. , 5.5, 4.9, 4.4,
       5.1, 5. , 4.5, 4.4, 5. , 5.1, 4.8, 5.1, 4.6, 5.3, 5. , 7. , 6.4,
       6.9, 5.5, 6.5, 5.7, 6.3, 4.9, 6.6, 5.2, 5. , 5.9, 6. , 6.1, 5.6,
       6.7, 5.6, 5.8, 6.2, 5.6, 5.9, 6.1, 6.3, 6.1, 6.4, 6.6, 6.8, 6.7,
       6. , 5.7, 5.5, 5.5, 5.8, 6. , 5.4, 6. , 6.7, 6.3, 5.6, 5.5, 5.5,
       6.1, 5.8, 5. , 5.6, 5.7, 5.7, 6.2, 5.1, 5.7, 6.3, 5.8, 7.1, 6.3,
       6.5, 7.6, 4.9, 7.3, 6.7, 7.2, 6.5, 6.4, 6.8, 5.7, 5.8, 6.4, 6.5,
       7.7, 7.7, 6. , 6.9, 5.6, 7.7, 6.3, 6.7, 7.2, 6.2, 6.1, 6.4, 7.2,
       7.4, 7.9, 6.4, 6.3, 6.1, 7.7, 6.3, 6.4, 6. , 6.9, 6.7, 6.9, 5.8,
       6.8, 6.7, 6.7, 6.3, 6.5, 6.2, 5.9])

X.shape

(150,)

X must be a 2-D array.

X.reshape(150,1)[:10] # equivalent: X.reshape(-1,1) or iris.data[:,[0]]

array([[5.1],
       [4.9],
       [4.7],
       [4.6],
       [5. ],
       [5.4],
       [4.6],
       [5. ],
       [4.4],
       [4.9]])

iris.data[:,[2,3]][:10] # equivalent: iris.data[:,2:4]

array([[1.4, 0.2],
       [1.4, 0.2],
       [1.3, 0.2],
       [1.5, 0.2],
       [1.4, 0.2],
       [1.7, 0.4],
       [1.4, 0.3],
       [1.5, 0.2],
       [1.4, 0.2],
       [1.5, 0.1]])
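The 2-D requirement can be checked on a small toy example. This is only a sketch: the arrays `x_1d` and `y_toy` and the name `toy_model` are made up for illustration and are not part of the Iris walkthrough. It assumes scikit-learn rejects a 1-D feature array with a `ValueError`, which is how current versions behave.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: y is exactly 2 * x
x_1d = np.array([1.0, 2.0, 3.0, 4.0])
y_toy = np.array([2.0, 4.0, 6.0, 8.0])

# Passing a 1-D feature array is rejected
try:
    LinearRegression().fit(x_1d, y_toy)
    fit_1d_failed = False
except ValueError as err:
    fit_1d_failed = True
    print('1-D X rejected:', type(err).__name__)

# Reshaping to one column per feature makes the same data acceptable
x_2d = x_1d.reshape(-1, 1)
toy_model = LinearRegression().fit(x_2d, y_toy)
print(x_1d.shape, '->', x_2d.shape)
print('slope:', toy_model.coef_, 'intercept:', toy_model.intercept_)
```

The `-1` in `reshape(-1, 1)` lets NumPy infer the number of rows, so the same call works regardless of how many samples there are.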

X = iris.data[:,[0]]
y = iris.data[:,1]

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)

w = model.coef_[0] # slope
b = model.intercept_ # y-intercept

print('score =', model.score(X,y))
print('w =', w)
print('b =', b)

score = 0.011961632834767588
w = -0.05726823379716482
b = 3.3886373794881

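In the scatter plot of the first and second columns (sepal length vs. sepal width) the points are widely scattered, which is why the score comes out close to 0.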

model.coef_ # the result is a NumPy array

array([-0.05726823])

X = iris.data[:,[0]]
y = iris.data[:,2]

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)

w = model.coef_[0] # slope
b = model.intercept_ # y-intercept

print('score =', model.score(X,y))
print('w =', w)
print('b =', b)

score = 0.7599553107783261
w = 1.857509665421445
b = -7.095381478279311

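In the scatter plot of the first and third columns (sepal length vs. petal length), many points lie close to a straight line. The slope may look like 1, but because the x and y axes span different ranges it is actually close to 2.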

fig=plt.figure(figsize=(6,10))
plt.title('Linear Regression\n(sepal length vs petal length)')

plt.scatter(X,y,c=iris.target,s=60)
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[2])
plt.colorbar()

plt.plot([4,8],[4*w+b,8*w+b],'y',lw=10,alpha=0.5)
# The endpoints 4 and 8 were chosen by eye from the plot; the line runs from (4, 4*w+b) to (8, 8*w+b)
plt.text(0,3,'coef: %f\nintercept: %f' % (w,b), va='top', fontsize=15,color='b') # use %.3f to show three decimal places
plt.axis('equal')
plt.grid()

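Most machine learning algorithms define an error (a cost function) and then minimize it. Linear regression typically uses the least-squares method, which minimizes the mean squared error (MSE).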
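To make the "minimize the error" idea concrete, the sketch below solves the least-squares problem directly with NumPy and compares the result with `LinearRegression` on the same two Iris columns (sepal length and petal length). The names `x_ls`, `y_ls`, `ls_model` and `mse_of_line` are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

iris_ls = load_iris()
x_ls = iris_ls.data[:, 0]   # sepal length
y_ls = iris_ls.data[:, 2]   # petal length

# Design matrix with a column of ones so that the intercept is fitted too
A = np.column_stack([x_ls, np.ones_like(x_ls)])
(w_ls, b_ls), *_ = np.linalg.lstsq(A, y_ls, rcond=None)

ls_model = LinearRegression().fit(x_ls.reshape(-1, 1), y_ls)

def mse_of_line(w, b):
    """Mean squared error of the line y = w*x + b on the data above."""
    return np.mean((y_ls - (w * x_ls + b)) ** 2)

print('least squares:', w_ls, b_ls)
print('sklearn      :', ls_model.coef_[0], ls_model.intercept_)
print('MSE at the least-squares solution:', mse_of_line(w_ls, b_ls))
print('MSE with the slope nudged by 0.1 :', mse_of_line(w_ls + 0.1, b_ls))
```

Both approaches should give the same slope and intercept, and any change to the fitted slope increases the MSE.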

result = model.predict([[5],[6],[7]])
display(result, 5*w+b, 6*w+b, 7*w+b)

array([2.19216685, 4.04967651, 5.90718618])

2.192166848827913

4.049676514249359

5.907186179670804

model.predict(X)[:10]

array([2.37791782, 2.00641588, 1.63491395, 1.44916298, 2.19216685,
       2.93517071, 1.44916298, 2.19216685, 1.07766105, 2.00641588])

y[:10]

array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5])

score = model.score(X,y)
display(score) # R^2 value

0.7599553107783261

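The evaluation score used for regression is $R^2$.

$R^2 = 1 - \frac{\sum (y-\hat{y})^2} {\sum (y-\bar{y})^2} $ ($\bar{y}$ is the mean, $\hat{y}$ is the predicted value)
An $R^2$ of 1 means the predictions are perfect, and 0 means the model does no better than always predicting the mean. A negative value means the model does worse than predicting the mean.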
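The formula can be checked by hand against `model.score()`. The following is a small sketch on the same two Iris columns (sepal length as the feature, petal length as the target); the names `X_demo`, `y_demo` and `demo_model` are introduced here so that the notebook's own variables are left untouched.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

iris_demo = load_iris()
X_demo = iris_demo.data[:, [0]]   # sepal length, kept 2-D
y_demo = iris_demo.data[:, 2]     # petal length

demo_model = LinearRegression().fit(X_demo, y_demo)
y_hat = demo_model.predict(X_demo)

# Numerator and denominator of the R^2 formula
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Baseline: always predict the mean, which gives R^2 = 0 by construction
y_mean_pred = np.full_like(y_demo, y_demo.mean())
r2_mean = 1 - np.sum((y_demo - y_mean_pred) ** 2) / ss_tot

print('manual R^2      :', r2_manual)
print('model.score()   :', demo_model.score(X_demo, y_demo))
print('mean-only R^2   :', r2_mean)
```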

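As shown above, after calling model.fit() the slope is available as model.coef_ and the y-intercept as model.intercept_.
The earlier example, however, did not split the data into a training set and a test set, even though that split was stressed as important. The whole dataset was used because the only goal was to see the relationship between two features.
To follow a proper machine learning workflow, the example below splits the data into training and test sets. Doing so makes it possible to judge which of several predictive models is better.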

from sklearn.model_selection import train_test_split

X = iris.data[:,0].reshape(-1,1) # note the reshape(): X must be 2-D
y = iris.data[:,2]

X_train,X_test,y_train,y_test = train_test_split(X, y)

model = LinearRegression()
model.fit(X_train, y_train)

w = model.coef_[0] # slope
b = model.intercept_ # y-intercept

print('w =',w)
print('b =',b)

w = 1.9178837052062192
b = -7.417208297111942

score1 = model.score(X_train, y_train)
score2 = model.score(X_test, y_test)
display('Training set score: %f' % score1, 'Test set score: %f' % score2) # R^2 values

'Training set score: 0.758323'

'Test set score: 0.760598'

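In this run the test-set $R^2$ (0.760598) is slightly higher than the training-set $R^2$ (0.758323).
Usually the test score is the lower of the two, because the test data was not used for training, but with a random split the opposite can happen, as it did here.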

Now let us evaluate the fit in a different way.
RMSE (root-mean-square error) = $\sqrt{\frac{\sum (y-\hat{y})^2} {N} }$ (the smaller the RMSE, the better the result).
Mathematically, linear regression finds the hyperplane that minimizes the RMSE.
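As a cross-check of the formula, here is a minimal sketch on three made-up values (the arrays `y_true_toy` and `y_pred_toy` are illustrative only). It computes the MSE and RMSE by hand and compares them with `sklearn.metrics.mean_squared_error`; the mean absolute error (MAE) is included for comparison.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Hypothetical targets and predictions; the errors are 1, 0 and -2
y_true_toy = np.array([3.0, 5.0, 7.0])
y_pred_toy = np.array([2.0, 5.0, 9.0])

# By hand: mean of the squared errors, then the square root
mse_manual = np.mean((y_true_toy - y_pred_toy) ** 2)   # (1 + 0 + 4) / 3
rmse_manual = np.sqrt(mse_manual)

# The same quantities through scikit-learn
mse_sklearn = mean_squared_error(y_true_toy, y_pred_toy)
rmse_sklearn = np.sqrt(mse_sklearn)
mae_sklearn = mean_absolute_error(y_true_toy, y_pred_toy)   # (1 + 0 + 2) / 3

print('MSE :', mse_manual, mse_sklearn)
print('RMSE:', rmse_manual, rmse_sklearn)
print('MAE :', mae_sklearn)
```

Because RMSE squares the errors before averaging, the single error of size 2 weighs more heavily in the RMSE than in the MAE.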

pred_y = model.predict(X_test)
pred_y

array([7.73407297, 3.13115208, 2.3639986 , 1.78863349, 3.70651719,
       2.17221023, 3.32294045, 2.93936371, 1.02148001, 2.17221023,
       6.19976601, 4.85724742, 4.09009393, 3.13115208, 4.66545905,
       5.04903579, 5.43261253, 5.81618927, 4.66545905, 1.98042186,
       5.43261253, 4.85724742, 3.13115208, 1.98042186, 5.24082416,
       1.98042186, 1.98042186, 4.2818823 , 5.81618927, 7.35049623,
       2.17221023, 5.43261253, 4.09009393, 5.24082416, 4.66545905,
       2.93936371, 3.51472882, 1.59684512])

y_test

array([6.4, 1.3, 1.7, 1.6, 3.9, 1.6, 3.9, 1.7, 1.3, 1.5, 5.9, 5.6, 4. ,
       4.4, 5. , 5.1, 4.7, 4.9, 4.9, 3.3, 5.2, 5.3, 3.8, 4.5, 4.4, 1.5,
       1.5, 4. , 5.7, 6.7, 1.6, 5.6, 4.5, 4.6, 4.9, 1.7, 3.5, 1.6])

MSE = ((y_test - pred_y)**2).sum()/len(y_test)
MSE

11.70816591661198

RMSE = np.sqrt(MSE)
RMSE

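3.421719730868088

Note that this value does not match the arrays printed above. Computing the MSE by hand from the displayed pred_y and y_test gives roughly 0.66 (an RMSE of about 0.81), which is in line with the $R^2$ of about 0.76. The MSE cell was probably executed against variables left over from a different split.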

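Now let us apply linear regression using all three remaining features, with petal_width as the target.
In this case the prediction is a 3-dimensional hyperplane in 4-dimensional space, so the result is not easy to show in a figure.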

Three features

X = iris.data[:,:3]
y = iris.data[:,3]

X_train,X_test,y_train,y_test = train_test_split(X, y)

model = LinearRegression()
model.fit(X_train, y_train)

w = model.coef_ # slopes (three values, one per feature: column 1 vs. 4, column 2 vs. 4, column 3 vs. 4)
b = model.intercept_ # y-intercept

print('w =',w)
print('b =',b)

w = [-0.16689804  0.25010982  0.51211603]
b = -0.5210593519382603

score1 = model.score(X_train, y_train)
score2 = model.score(X_test, y_test)
display('Training set score: %f' % score1, 'Test set score: %f' % score2) # R^2 values

'Training set score: 0.934797'

'Test set score: 0.942457'

pred_y = model.predict(X_test)
pred_y

array([ 0.22852355,  1.3810536 ,  0.24630809,  1.39431683,  1.25604614,
        1.56212612,  0.12591055,  0.27127163,  1.28361989,  0.34863641,
        0.39627909,  1.82303129,  1.59916324,  2.30157793,  1.50739304,
        0.7893833 ,  1.80030469,  1.8681112 ,  2.15032239,  2.38840607,
        1.33821062,  1.80624659,  1.97895048,  1.54300959,  1.99787722,
        0.12843217,  0.27255616,  0.21416557,  0.3142095 ,  1.61214808,
        1.61205319,  0.29866189,  0.20465475, -0.03109709,  0.18687021,
        1.91100163,  0.16062214,  1.59093696])

y_test

array([0.2, 1.4, 0.2, 1.3, 1.3, 1.5, 0.2, 0.2, 1.4, 0.3, 0.4, 2.4, 1.5,
       2.5, 1.5, 1.1, 1.9, 2.1, 1.8, 2. , 1.3, 2. , 2.1, 1.5, 1.6, 0.2,
       0.2, 0.2, 0.4, 1.8, 1.5, 0.2, 0.1, 0.3, 0.2, 2.1, 0.2, 1.7])

MSE = ((y_test - pred_y)**2).sum()/len(y_test)
MSE

0.03472840593535785

RMSE = np.sqrt(MSE)
RMSE

0.18635559002980792

MAE = np.abs(y_test - pred_y).sum()/len(y_test)
MAE

0.13405310588599553

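Because there are three features, w has three values.

$pred\_y = w_1 \cdot x_1 + w_2 \cdot x_2 + w_3 \cdot x_3 + b$
Each call to train_test_split() builds a new training/test split, so the scores can change from run to run, and the test score is sometimes higher than the training score.
For the various linear regression models, see the following URL. (http://scikit-learn.org/stable/modules/linear_model.html)
Ridge regression and lasso regression, which are variants of linear regression, are covered later.
There is also polynomial regression, which fits polynomials such as quadratic or cubic curves. (See section 1.1.16 at the URL above.)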
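The run-to-run variation of the scores can be made visible, and reproducible, by fixing the `random_state` argument of `train_test_split()`. The sketch below repeats the three-feature fit for a few seeds; the helper `split_scores` and the names `X_all` and `y_all` are introduced here for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

iris_rs = load_iris()
X_all = iris_rs.data[:, :3]   # first three features
y_all = iris_rs.data[:, 3]    # petal width as the target

def split_scores(seed):
    """Fit on one random split and return (train R^2, test R^2)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, random_state=seed)
    m = LinearRegression().fit(X_tr, y_tr)
    return m.score(X_tr, y_tr), m.score(X_te, y_te)

# Different seeds give different splits and therefore different scores
for seed in range(5):
    train_r2, test_r2 = split_scores(seed)
    print('seed %d: train %.4f, test %.4f' % (seed, train_r2, test_r2))
```

Calling `split_scores()` twice with the same seed returns identical scores, which is convenient when comparing models on equal terms.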

x = [[1,2,3],[2,2,4]]
model.predict(x)

array([1.65958903, 1.96133602])

model.coef_[0]*1 + model.coef_[1]*2 + model.coef_[2]*3 + model.intercept_
# Compute the prediction for the sample [1,2,3] by hand

1.6595890305490522

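The linear model used in linear regression is the basis of many other machine learning algorithms.
In particular, the core computation of a neural network derives from linear regression, and the simplest neural network (a single neuron with no activation function) is equivalent to linear regression.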
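The equivalence can be illustrated with a single linear neuron trained by gradient descent on the squared error. This is a sketch on synthetic data, not the Iris set: the generated data, the learning rate and the number of steps are all assumptions chosen so that the loop converges. The learned weights should agree with the closed-form least-squares solution.

```python
import numpy as np

# Synthetic data from a known linear rule plus a little noise (assumed values)
rng = np.random.default_rng(0)
x_nn = rng.uniform(-1, 1, size=(200, 3))
y_nn = x_nn @ np.array([1.5, -2.0, 0.5]) + 0.3 + rng.normal(0, 0.05, size=200)

# One neuron without activation: output = x @ nn_w + nn_b
nn_w = np.zeros(3)
nn_b = 0.0
learning_rate = 0.1

for _ in range(2000):
    err = x_nn @ nn_w + nn_b - y_nn                        # prediction error
    nn_w -= learning_rate * 2 * x_nn.T @ err / len(y_nn)   # gradient of MSE w.r.t. weights
    nn_b -= learning_rate * 2 * err.mean()                 # gradient of MSE w.r.t. bias

# Closed-form least-squares solution for comparison
A_nn = np.column_stack([x_nn, np.ones(len(y_nn))])
solution, *_ = np.linalg.lstsq(A_nn, y_nn, rcond=None)

print('gradient descent:', nn_w, nn_b)
print('least squares   :', solution[:3], solution[3])
```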

조환희의 학습 블로그
