Random Forest

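Combining several machine learning models into one is called an ensemble method.
Two ensemble methods built on decision trees are especially well known: random forests and gradient boosting.
A random forest builds many decision trees and injects randomness into how each one is grown. The results of all the trees are then averaged to produce the final prediction.
Averaging reduces the tendency of a single decision tree to overfit, and it means a wider variety of features gets taken into account.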

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

cancer = load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target)

model = RandomForestClassifier(n_estimators=100)  # default was 10 in older scikit-learn, 100 since 0.22
model.fit(X_train, y_train)

train_score = model.score(X_train,y_train)
test_score = model.score(X_test,y_test)
display(train_score,test_score)

1.0

0.986013986013986

model  # show the model's parameters; see also help(model)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

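As the scores show, the random forest appears to perform slightly better than the kernel SVM.
The n_estimators parameter sets the number of trees. More trees tend to generalize better and score higher, at the cost of longer training time.
The "random" in random forest refers to two things. First, if the training set has 1000 samples, each tree is trained on 1000 samples drawn at random, and the same sample can be drawn more than once (a bootstrap sample: take a ball out of the bag, then put it back before drawing again). Second, the max_features parameter controls how many randomly chosen features are considered as split candidates at each node.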
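The two sources of randomness described above can be sketched with NumPy alone. This is only an illustration of the idea, not scikit-learn's internal code; the seed and the value max_features = 5 are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, used only for reproducibility
n_samples = 1000

# 1) Bootstrap sample: draw 1000 row indices with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
n_unique = np.unique(bootstrap_idx).size
print(len(bootstrap_idx), n_unique, n_unique / n_samples)

# 2) Random feature subset: candidates considered at a single split
n_features = 30   # the breast cancer data has 30 features
max_features = 5  # illustrative value, not a library default
candidate_features = rng.choice(n_features, size=max_features, replace=False)
print(candidate_features)
```

Because samples are drawn with replacement, a bootstrap sample contains on average only about 63% of the distinct original rows, so every tree sees a somewhat different training set.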

trees = model.estimators_
len(trees), trees[0]

(100,
 DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
             max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, presort=False,
             random_state=813200832, splitter='best'))

trees[0].predict(X_test)

array([0., 1., 1., 1., 0., 0., 1., 1., 0., 1., 1., 0., 0., 1., 1., 1., 1.,
       1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0., 1., 1., 1., 1., 1.,
       1., 0., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0.,
       0., 0., 0., 0., 0., 1., 0., 1., 0., 1., 1., 1., 1., 1., 0., 1., 1.,
       0., 0., 0., 1., 1., 0., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1.,
       0., 0., 1., 1., 1., 1., 0., 0., 1., 1., 0., 1., 1., 0., 0., 0., 1.,
       0., 1., 1., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 1.,
       1., 0., 0., 1., 1., 1., 0., 1., 1., 0., 0., 0., 0., 1., 1., 1., 0.,
       0., 1., 1., 1., 0., 1., 1.])

trees[-1].predict(X_test)

array([0., 1., 1., 1., 0., 0., 1., 0., 0., 1., 1., 0., 0., 1., 1., 1., 1.,
       1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 1., 0., 1., 1., 1., 1., 1.,
       1., 0., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0.,
       0., 1., 0., 0., 0., 1., 0., 1., 0., 0., 1., 1., 1., 1., 0., 1., 1.,
       0., 0., 0., 1., 1., 0., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1.,
       0., 0., 1., 1., 1., 1., 0., 0., 1., 1., 0., 1., 1., 0., 0., 0., 1.,
       0., 1., 0., 1., 0., 1., 0., 1., 1., 1., 1., 1., 0., 1., 0., 1., 1.,
       1., 0., 1., 1., 1., 1., 0., 1., 1., 0., 0., 0., 0., 1., 1., 1., 0.,
       1., 0., 1., 1., 0., 0., 1.])
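The first and last trees disagree on a handful of test samples, and the forest resolves such disagreements by averaging. A minimal sketch of that step follows, assuming a fresh model with fixed random_state values (the variable names forest, tree_probas and mean_proba are introduced here only for illustration).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Class probabilities from every tree, shape (n_trees, n_samples, n_classes)
tree_probas = np.stack([tree.predict_proba(X_test) for tree in forest.estimators_])

# Average over the trees
mean_proba = tree_probas.mean(axis=0)

# Compare the manual average with the forest's own probabilities
print(np.allclose(mean_proba, forest.predict_proba(X_test)))

# Taking the most probable class gives a manual ensemble prediction
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]
print((manual_pred == forest.predict(X_test)).mean())
```

The classifier averages predicted probabilities rather than counting hard votes, so a tree that is very confident about a sample weighs more than one that is unsure.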

X = cancer.data[:,[0,1]]
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X,y)

model = RandomForestClassifier(n_estimators=100, max_features=2, max_depth=5)  # n_estimators defaulted to 10 in older scikit-learn
model.fit(X_train, y_train)

display(model.score(X_train, y_train), model.score(X_test, y_test))

0.9436619718309859

0.8951048951048951

import mglearn

plt.figure(figsize=[12,10])
mglearn.plots.plot_2d_classification(model,X)
mglearn.discrete_scatter(X[:,0],X[:,1],y,alpha=0.3)

[<matplotlib.lines.Line2D at 0x18019171cc0>,
 <matplotlib.lines.Line2D at 0x18019171b00>]

plt.figure(figsize=[12,10])
for i in range(5):
    plt.subplot(2,3,i+1)
    mglearn.plots.plot_tree_partition(X,y,model.estimators_[i])

# Draw the sixth plot: the decision boundary of the whole forest
plt.subplot(2,3,6)
mglearn.plots.plot_2d_classification(model,X)
mglearn.discrete_scatter(X[:,0],X[:,1],y,alpha=0.2)

[<matplotlib.lines.Line2D at 0x18018fd4cc0>,
 <matplotlib.lines.Line2D at 0x18018fd4ba8>]

model = RandomForestClassifier(n_estimators=100, max_depth=5)  # n_estimators defaulted to 10 in older scikit-learn
model.fit(cancer.data, cancer.target)

# Features that mattered most when this random forest was built
weight = model.feature_importances_

plt.figure(figsize=[10,10])
plt.barh(range(30),weight)
plt.yticks(range(30), ['%s(%d)' %(s,i) for i,s in enumerate(cancer.feature_names)], va='bottom')
print('')
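As a complement to the bar chart, the importances can also be ranked and printed as text. This is a small sketch under the same settings as above, with random_state=0 added as an assumption so the ranking is repeatable.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

cancer = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
forest.fit(cancer.data, cancer.target)

importances = forest.feature_importances_

# Indices of the features sorted from most to least important
order = np.argsort(importances)[::-1]
for i in order[:5]:
    print(f"{cancer.feature_names[i]}: {importances[i]:.3f}")

# The importances are normalized, so they sum to 1
print(importances.sum())
```

The exact ranking can change from run to run when random_state is not fixed, so it is best read as a rough guide rather than a definitive ordering.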

조환희의 학습 블로그

Supervised Learning - Random Forest
