Simple Linear Regression ②

johh 2019. 1. 28. 20:22

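This time we will practice with real data.

1. Loading the Boston dataset

from sklearn import datasets

datasets is the sklearn module used to load the open datasets that the package provides.

boston_house_prices = datasets.load_boston()

The datasets module loads the Boston house-price data, which is stored in the variable boston_house_prices. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this call requires an older release.

print(boston_house_prices.keys())

Prints the keys of the loaded Boston dataset.

print(boston_house_prices.data.shape)

Prints the shape of the data array, that is, its number of rows and columns.

print(boston_house_prices.feature_names)

Prints the column names used in the Boston data.

The following loads the Boston dataset.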

In:

from sklearn import datasets

boston_house_prices = datasets.load_boston()

print(boston_house_prices.keys())

print(boston_house_prices.data.shape)

print(boston_house_prices.feature_names)

Out:

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
(506, 13)
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']

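2. Checking the Boston dataset information

Next we print the information that comes with the Boston dataset. The open datasets provided by sklearn include a DESCR attribute that describes the data. The output gives a description of each column along with the number of instances and attributes. The following shows the Boston dataset information.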

In:

print(boston_house_prices.DESCR)

Out:

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

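3. Converting the Boston dataset into a DataFrame

Next we convert the Boston dataset into a DataFrame.

data_frame = pd.DataFrame(boston_house_prices.data)

Of everything stored in boston_house_prices, only the data array is converted to a DataFrame and stored in data_frame.

data_frame.tail()

The second line uses the tail function to print only the last five rows of the data.

Why are five rows printed?

Because tail defaults to five rows when no number is given in the parentheses. After this we will replace the column names of the DataFrame stored in data_frame: as the output shows, the existing data_frame column names are plain numbers.

The following converts the Boston dataset into DataFrame form.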

In:	import pandas as pd

In:

data_frame = pd.DataFrame(boston_house_prices.data)

data_frame.tail()

Out:

	0	2	4	5	6	7	8	9	10	11	12
501	0.06263	11.93	0.573	6.593	69.1	2.4786	1.0	273.0	21.0	391.99	9.67
502	0.04527	11.93	0.573	6.120	76.7	2.2875	1.0	273.0	21.0	396.90	9.08
503	0.06076	11.93	0.573	6.976	91.0	2.1675	1.0	273.0	21.0	396.90	5.64
504	0.10959	11.93	0.573	6.794	89.3	2.3889	1.0	273.0	21.0	393.45	6.48
505	0.04741	11.93	0.573	6.030	80.8	2.5050	1.0	273.0	21.0	396.90	7.88
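The default of five rows can be checked on a small frame. The sketch below uses a hypothetical eight-row DataFrame (the name frame and its value column are made up for illustration) and compares tail() with an explicit tail(3).

```python
import pandas as pd

# Hypothetical frame with eight rows, used only to illustrate tail().
frame = pd.DataFrame({"value": range(8)})

last_default = frame.tail()    # no argument: the default number of rows
last_three = frame.tail(3)     # explicit argument: the last 3 rows

print(len(last_default))
print(list(last_three["value"]))
```

The same default applies to head(), which returns rows from the top instead of the bottom.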

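data_frame.columns = boston_house_prices.feature_names
boston_house_prices.feature_names

Now we replace these numeric column names with feature_names, the original column names of the Boston data. Simply write columns after the variable holding the DataFrame and assign the names you want to use. The array is assigned directly, not wrapped in a list. Writing data_frame.columns = [boston_house_prices.feature_names] would instead create a MultiIndex; data_frame["RM"] and data_frame["Price"] can then come back as DataFrames rather than Series, which shows up as extra brackets around the fitted coefficients, a Price label on the residual statistics, and a KeyError when plotting the regression line. The output lists the new column names.

In:	data_frame.columns = boston_house_prices.feature_names

In:	boston_house_prices.feature_names
Out:	array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')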
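How the right-hand side of a columns assignment is written matters: a bare array of names gives a flat Index, while the same array wrapped in a list gives a one-level MultiIndex. The sketch below shows the difference on a hypothetical two-column frame; values and names are stand-ins for the dataset's data and feature_names, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for boston_house_prices.data and .feature_names.
values = np.array([[6.5, 21.0], [6.1, 19.5], [7.0, 30.2]])
names = np.array(["RM", "PTRATIO"])

direct = pd.DataFrame(values)
direct.columns = names            # bare array: flat Index

wrapped = pd.DataFrame(values)
wrapped.columns = [names]         # list of arrays: one-level MultiIndex

print(type(direct.columns).__name__)
print(type(wrapped.columns).__name__)
print(type(direct["RM"]).__name__)
```

With the flat Index, direct["RM"] is a one-dimensional Series, which is the form the later fitting and plotting steps expect.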

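data_frame['Price'] = boston_house_prices.target

Next we add the dependent variable y, the value we want to predict, to the DataFrame. The first line creates a column named 'Price' in data_frame and stores in it the target data from boston_house_prices, that is, the dependent variable. The output shows that the Price column has been added.

The following shows the DataFrame after the Price column is added.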

In:

data_frame['Price'] = boston_house_prices.target
data_frame.tail()

Out:

	0	2	4	5	6	7	8	9	10	11	12	Price
501	0.06263	11.93	0.573	6.593	69.1	2.4786	1.0	273.0	21.0	391.99	9.67	22.4
502	0.04527	11.93	0.573	6.120	76.7	2.2875	1.0	273.0	21.0	396.90	9.08	20.6
503	0.06076	11.93	0.573	6.976	91.0	2.1675	1.0	273.0	21.0	396.90	5.64	23.9
504	0.10959	11.93	0.573	6.794	89.3	2.3889	1.0	273.0	21.0	393.45	6.48	22.0
505	0.04741	11.93	0.573	6.030	80.8	2.5050	1.0	273.0	21.0	396.90	7.88	11.9
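As a compact alternative to renaming the columns afterwards, the names can be passed when the DataFrame is built. This is only a sketch on hypothetical arrays (data, feature_names and target stand in for the dataset's attributes), not the approach used in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the .data, .feature_names and .target arrays.
data = np.array([[0.06, 6.59], [0.05, 6.12], [0.06, 6.98]])
feature_names = ["CRIM", "RM"]
target = np.array([22.4, 20.6, 23.9])

# Name the columns at construction time, then append the target column.
frame = pd.DataFrame(data, columns=feature_names)
frame["Price"] = target

print(frame.columns.tolist())
print(frame.shape)
```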

4. Scatter plot

data_frame.plot(kind="scatter", x="RM", y="Price", figsize=(6,6),
color="black", xlim = (4,8), ylim = (10,45))

Next we draw the data as a scatter plot. The plot function places the independent variable "RM" on the x-axis and the dependent variable "Price" on the y-axis. Note the new parameters xlim and ylim, which let the user set the ranges of the x-axis and y-axis. Here the x-axis runs from 4 to 8 and the y-axis from 10 to 45, as specified.

The following draws the scatter plot.

In:

data_frame.plot(kind="scatter", x="RM", y="Price", figsize=(6,6),
color="black", xlim = (4,8), ylim = (10,45))

Out:

<matplotlib.axes._subplots.AxesSubplot at 0x1a615467160>

5. Training on the data

Next we build a linear regression model and train it on the data.

from sklearn import linear_model

The linear_model module of sklearn provides the regression model, so it is imported first.

linear_regression = linear_model.LinearRegression()

linear_model.LinearRegression creates a linear regression model, which is stored in the variable linear_regression.

linear_regression.fit(X = pd.DataFrame(data_frame["RM"]), y = data_frame["Price"])

linear_regression.fit trains the model. As mentioned earlier, X must be passed in two-dimensional form, while y can be passed as it is. Here X is the "RM" column of data_frame and y is the "Price" column.

prediction = linear_regression.predict(X = pd.DataFrame(data_frame["RM"]))

linear_regression.predict feeds the "RM" values into the trained model to predict y. The predicted y values are stored in the variable prediction.

print('a value = ', linear_regression.intercept_)

linear_regression.intercept_ holds the coefficient a (the intercept) of the model, which is printed.

print('b value = ', linear_regression.coef_)

linear_regression.coef_ holds the coefficient b (the slope) of the model, which is printed.

The following trains the model on the data.

In:	from sklearn import linear_model

In:

linear_regression = linear_model.LinearRegression()

linear_regression.fit(X = pd.DataFrame(data_frame["RM"]), y = data_frame["Price"])

prediction = linear_regression.predict(X = pd.DataFrame(data_frame["RM"]))

print('a value = ', linear_regression.intercept_)

print('b value = ', linear_regression.coef_)

Out:

a value = [-34.67062078]
b value = [[9.10210898]]
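The two printed values define the fitted line, here roughly Price = -34.67 + 9.10 × RM. To see where such numbers come from, the sketch below fits hypothetical data that lies exactly on y = 2 + 3x and cross-checks the model against the closed-form least-squares formulas for a single predictor. The data and variable names are made up for illustration, and LinearRegression is imported directly from sklearn.linear_model, which is the same class reached through linear_model.LinearRegression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on y = 2 + 3x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 3.0 * x

# scikit-learn expects X with shape (n_samples, n_features).
X = x.reshape(-1, 1)
model = LinearRegression().fit(X, y)

# Closed-form least-squares estimates for one predictor.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

print(model.intercept_, model.coef_[0])
print(a, b)
```

The reshape(-1, 1) call plays the same role as pd.DataFrame(data_frame["RM"]) in the post: both turn a one-dimensional column into the two-dimensional input that fit requires.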

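6. Goodness-of-fit check

Next we compute the residuals.

residuals = data_frame["Price"] - prediction

Following the residual formula, the predicted y values stored in prediction are subtracted from the actual "Price" values, and the result is stored in residuals.

residuals.describe()

The describe function produces a set of summary statistics.

The following computes the residuals as part of the goodness-of-fit check.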

In:

residuals = data_frame["Price"] - prediction

residuals.describe()

Out:

	Price
count	5.060000e+02
mean	1.899227e-15
std	6.609606e+00
min	-2.334590e+01
25%	-2.547477e+00
50%	8.976267e-02
75%	2.985532e+00
max	3.943314e+01
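The mean in this summary is effectively zero (about 1.9e-15). That is expected: when a least-squares line includes an intercept, its residuals always sum to zero up to floating-point error. The sketch below checks this on hypothetical noisy data, using np.polyfit as a stand-in for the sklearn model.

```python
import numpy as np

# Hypothetical noisy data; the seed only makes the run repeatable.
rng = np.random.default_rng(0)
x = np.linspace(4.0, 8.0, 50)
y = -30.0 + 9.0 * x + rng.normal(scale=5.0, size=x.size)

# Fit y = a + b*x by least squares (np.polyfit returns the slope first).
b, a = np.polyfit(x, y, deg=1)
residuals = y - (a + b * x)

print(residuals.mean())
```

Because the mean carries no information here, the spread of the residuals (std, min, max and the quartiles) is the part of describe() worth reading.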

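Next we compute the coefficient of determination, another goodness-of-fit measure.

SSE = (residuals**2).sum()

The residuals stored in residuals are squared and added up with the sum method, and the result is stored in SSE.

SST = ((data_frame["Price"]-data_frame["Price"].mean())**2).sum()

The mean of "Price", obtained with the mean method, is subtracted from the actual y values in "Price"; the differences are squared and added up with sum, and the result is stored in SST.

R_squared = 1 - (SSE/SST)

Following the formula for the coefficient of determination, SSE is divided by SST and the quotient is subtracted from 1; the result is stored in R_squared.

print('R_squared = ', R_squared)

Prints the coefficient of determination stored in R_squared.

The resulting coefficient of determination is 48.35%, meaning that RM alone accounts for about 48% of the variance in Price. That may look low, but considering that RM is only one of 13 independent variables, it is a fairly high figure.

The following computes the coefficient of determination.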

In:

SSE = (residuals**2).sum()

SST = ((data_frame["Price"]-data_frame["Price"].mean())**2).sum()

R_squared = 1 - (SSE/SST)

print('R_squared = ', R_squared)

Out:

R_squared = Price 0.483525
dtype: float64
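The formula R² = 1 - SSE/SST can be tried on a handful of numbers. The sketch below uses hypothetical x and y values and a line fitted with np.polyfit, and also checks a known property of simple regression: with one predictor and an intercept, R² equals the squared correlation between x and y.

```python
import numpy as np

# Hypothetical observations and a least-squares line through them.
x = np.array([4.5, 5.0, 5.8, 6.2, 6.9, 7.4])
y = np.array([12.0, 15.5, 19.0, 24.0, 27.5, 36.0])
b, a = np.polyfit(x, y, deg=1)
prediction = a + b * x

sse = np.sum((y - prediction) ** 2)   # variation left unexplained
sst = np.sum((y - y.mean()) ** 2)     # total variation around the mean
r_squared = 1 - sse / sst

# For one predictor with an intercept, R^2 equals the squared correlation.
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)
```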

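7. Plotting the prediction

Next we use the predicted values to draw the regression line on the scatter plot.

import matplotlib.pyplot as plt

The pyplot module of matplotlib is imported as plt so that the line can be drawn.

data_frame.plot(kind="scatter", x="RM", y="Price", figsize=(6,6), color="black", xlim = (4,8), ylim = (10, 45))

The plot function draws the scatter plot.

plt.plot(data_frame["RM"],prediction,color="blue")

This draws the regression line on top of the scatter plot. The call needs data_frame["RM"] to be a one-dimensional Series. The KeyError shown in the output below, 'Key length (2) exceeds index depth (1)', is raised when the columns are a MultiIndex, which is what assigning a list-wrapped array such as [boston_house_prices.feature_names] to data_frame.columns produces; with a flat Index the regression line is drawn as intended.

The following draws the regression line together with the scatter plot.

In:	import matplotlib.pyplot as plt

In:

data_frame.plot(kind = "scatter", x = "RM", y = "Price", figsize = (6,6),
color = "black", xlim = (4,8), ylim = (10, 45))

# Plot regression line
plt.plot(data_frame["RM"],prediction,color="blue")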

Out:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-50-6e6fd7a3c664> in <module>
      3 
      4 #Plot regressin line
----> 5 plt.plot(data_frame["RM"],prediction,color="blue")

C:\Anaconda3\lib\site-packages\matplotlib\pyplot.py in plot(scalex, scaley, data, *args, **kwargs)
   2811     return gca().plot(
   2812         *args, scalex=scalex, scaley=scaley, **({"data": data} if data
-> 2813         is not None else {}), **kwargs)
   2814 
   2815 

C:\Anaconda3\lib\site-packages\matplotlib\__init__.py in inner(ax, data, *args, **kwargs)
   1808                         "the Matplotlib list!)" % (label_namer, func.__name__),
   1809                         RuntimeWarning, stacklevel=2)
-> 1810             return func(ax, *args, **kwargs)
   1811 
   1812         inner.__doc__ = _add_data_doc(inner.__doc__,

C:\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py in plot(self, scalex, scaley, *args, **kwargs)
   1609         kwargs = cbook.normalize_kwargs(kwargs, mlines.Line2D._alias_map)
   1610 
-> 1611         for line in self._get_lines(*args, **kwargs):
   1612             self.add_line(line)
   1613             lines.append(line)

C:\Anaconda3\lib\site-packages\matplotlib\axes\_base.py in _grab_next_args(self, *args, **kwargs)
    391                 this += args[0],
    392                 args = args[1:]
--> 393             yield from self._plot_args(this, kwargs)
    394 
    395 

C:\Anaconda3\lib\site-packages\matplotlib\axes\_base.py in _plot_args(self, tup, kwargs)
    363 
    364         if len(tup) == 2:
--> 365             x = _check_1d(tup[0])
    366             y = _check_1d(tup[-1])
    367         else:

C:\Anaconda3\lib\site-packages\matplotlib\cbook\__init__.py in _check_1d(x)
   1375     else:
   1376         try:
-> 1377             x[:, None]
   1378             return x
   1379         except (IndexError, TypeError):

C:\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2684             return self._getitem_frame(key)
   2685         elif is_mi_columns:
-> 2686             return self._getitem_multilevel(key)
   2687         else:
   2688             return self._getitem_column(key)

C:\Anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_multilevel(self, key)
   2728 
   2729     def _getitem_multilevel(self, key):
-> 2730         loc = self.columns.get_loc(key)
   2731         if isinstance(loc, (slice, Series, np.ndarray, Index)):
   2732             new_columns = self.columns[loc]

C:\Anaconda3\lib\site-packages\pandas\core\indexes\multi.py in get_loc(self, key, method)
   2246         if self.nlevels < keylen:
   2247             raise KeyError('Key length ({0}) exceeds index depth ({1})'
-> 2248                            ''.format(keylen, self.nlevels))
   2249 
   2250         if keylen == self.nlevels and self.is_unique:

KeyError: 'Key length (2) exceeds index depth (1)'

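8. Performance evaluation

Next we evaluate the performance of the regression model we built.

from sklearn.metrics import mean_squared_error

mean_squared_error lives in sklearn.metrics, so it is imported first.

print('score = ', linear_regression.score(X = pd.DataFrame(data_frame["RM"]), y = data_frame["Price"]))

The score function compares the predictions with the actual values to evaluate performance; for a regression model it returns the coefficient of determination. The independent variable "RM" is converted to a two-dimensional DataFrame and passed as X, and the dependent variable "Price" is passed as y.

print('Mean_Squared_Error = ', mean_squared_error(data_frame["Price"], prediction))

The mean_squared_error function provided by sklearn computes the mean squared error. Its arguments are the actual values stored in "Price" and the predictions stored in the prediction variable.

print('RMSE = ', mean_squared_error(data_frame["Price"], prediction)**0.5)

This computes the RMSE by taking the square root of the mean squared error from the previous line. The square root is written as **0.5.

The following evaluates the performance.

In:	from sklearn.metrics import mean_squared_error

In:

print('score = ', linear_regression.score(X = pd.DataFrame(data_frame["RM"]), y = data_frame["Price"]))

print('Mean_Squared_Error = ', mean_squared_error(data_frame["Price"], prediction))

print('RMSE = ', mean_squared_error(data_frame["Price"], prediction)**0.5)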

Out:

score = 0.4835254559913343
Mean_Squared_Error = 43.60055177116956
RMSE = 6.603071389222561
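The score printed here matches the R_squared value computed by hand earlier, and the RMSE is simply the square root of the MSE (43.60 ** 0.5 ≈ 6.603). The sketch below spells out both error measures with plain numpy on hypothetical actual and predicted values, without relying on sklearn.

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only.
actual = np.array([22.4, 20.6, 23.9, 22.0, 11.9])
predicted = np.array([25.3, 21.0, 28.8, 27.2, 20.2])

errors = actual - predicted
mse = np.mean(errors ** 2)     # mean of squared errors, i.e. SSE / n
rmse = np.sqrt(mse)            # back in the units of the target

print(mse, rmse)
```

Since Price is measured in thousands of dollars, an RMSE of about 6.6 means the single-variable model is typically off by roughly $6,600.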
