결정 트리 (Decision Tree)¶

앞에서 배운 k-NN, 로지스틱회귀, SVM 등은 기하학적 분포와 거리 개념에 기반한 분류 알고리즘입니다.
반면, 결정트리와 이에 기반한 알고리즘 들은 스무고개 놀이와 같이 계속 질문을 던져 가면서 예/아니오 를 판단하는 방식입니다. '예' 그룹인지 '아니오' 그룹인지가 중요하지 거리 개념은 기본적으로 없습니다.
예측 점수를 가장 높게 하고 싶을 때 우선 적용을 고려하는 것이 커널 SVM 이나 결정 트리 방식입니다. 현재 여러 머신러닝 대회에서 가장 결과 점수가 높은 방식이 결정트리에 기반한 알고리즘입니다.
결정트리는 거리를 재지 않기 때문에 정규화가 필요하지 않습니다.
결정트리의 도전과제는 가장 잘 두 그룹으로 분류하는 속성이 무엇인지 그리고 그 경계값을 어떻게 빨리 알아내는 가에 있습니다.
아래는 Iris 데이터를 결정트리로 분류한 결과 예시입니다.

결정트리는 데이터의 각 속성 중 하나의 속성을 선택해 판단기준을 세워 왼쪽, 오른쪽으로 가지치기를 합니다.
그래서 교재에도 나오지만, 한번 가지치기 할 때마다 수평, 수직 선을 그어 분류기준을 세웁니다.
결정트리는 한없이 깊어질 수 있기 때문에 기본값으로 적용시 훈련세트에 대해서 예측점수는 항상 100% 입니다. 즉 과대적합 되는 경향이 있습니다.
그렇기 때문에 가지의 깊이나 잎 수를 제한하는 등의 일반화 적용이 필요해 집니다.

위의 그림에서 각 네모박스 들을 노드(node) 라고 하고, 처음 노드를 루트노드(root node) 라고 합니다. 총 노드수는 15개 입니다.
그리고 몇단계로 가지치기를 하는지를 깊이(depth) 라고 하는데, 위의 그림은 총 5번을 수행했으므로 깊이가 5가 됩니다.
그리고 끝 부분에 있는 노드들을 잎노드(leaf node) 라고 하는데, 판단이 마지막으로 결정되는 곳입니다. 위 그림에서는 총 8개입니다.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
display(train_score, test_score)

1.0

0.951048951048951

시각화를 위해 속성을 2개로 제한해 보겠습니다.

col1 = 10
col2 = 25

X_train, X_test, y_train, y_test = train_test_split(cancer.data[:,[col1,col2]], cancer.target)

plt.scatter(X_train[:,0], X_train[:,1], c=y_train, alpha=0.3)
plt.colorbar()
plt.xlabel(cancer.feature_names[col1])
plt.ylabel(cancer.feature_names[col2])

Text(0,0.5,'worst compactness')

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
display(train_score, test_score)

1.0

0.8111888111888111

import mglearn

plt.figure(figsize=[10,8])
mglearn.plots.plot_2d_classification(model, X_train, cm='spring')
mglearn.discrete_scatter(X_train[:,0], X_train[:,1], y_train)

[<matplotlib.lines.Line2D at 0x29563728908>,
 <matplotlib.lines.Line2D at 0x29563728a20>]

로지스틱 회귀를 이용해 경계선을 구한 후 회전변환 해서 디시젼트리를 사용하면?

plt.figure(figsize=[10,8])
mglearn.plots.plot_2d_classification(model, X_train, cm='spring')
mglearn.discrete_scatter(X_test[:,0], X_test[:,1], y_test)
plt.title('Cancer Test Data')

Text(0.5,1,'Cancer Test Data')

def draw_result_map(X, y, model, cols=['',''], title=''):
    scale = 300
    xmax = X[:,0].max()+1
    xmin = X[:,0].min()-1
    ymax = X[:,1].max()+1
    ymin = X[:,1].min()-1

    xx = np.linspace(xmin,xmax,scale)
    yy = np.linspace(ymin,ymax,scale)
    data1, data2 = np.meshgrid(xx,yy)
    X_grid = np.c_[data1.ravel(), data2.ravel()]
    
    pred_y = model.predict(X_grid)

    CS = plt.imshow(pred_y.reshape(scale,scale), interpolation=None, origin='lower',
                    extent=[xmin,xmax,ymin,ymax], alpha=0.3, cmap='gray_r')

    #fig=plt.figure(figsize=[12,10])
    
    # draw X
    plt.scatter(X[:,0], X[:,1], c=y, s=60)

    plt.xlabel(cols[0])
    plt.ylabel(cols[1])
    #plt.colorbar(CS, shrink=0.3)
    plt.title(title,fontsize=20)

plt.figure(figsize=[12,10])

plt.subplot(1,2,1)
draw_result_map(X_train, y_train, model, [cancer.feature_names[col1],cancer.feature_names[col2]],
                'Decision Tree - Train Data')

plt.subplot(1,2,2)
draw_result_map(X_test, y_test, model, [cancer.feature_names[col1],cancer.feature_names[col2]],
                'Decision Tree - Test Data')

결정트리의 옵션들¶

결정트리의 기본은 모든 학습 데이터를 정확하게 분류할 때 까지 깊이가 깊어지므로 항상 학습 데이터에 대한 스코어가 100% 로 과적합됩니다.
아래에서 과적합을 피하기 위해 깊이를 제한하는 옵션인 max_depth 를 적용해 보겠습니다.

model

# gini 알고리즘을 사용하여 경계선 선택, 속도는 느리지만 엔트로피 알고리즘이 더 정확
# max_depth 몇층까지 내려갈거냐. 한번 분기가 1층

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target)

score_train = []
score_test = []

for depth in range(1,10):
    model = DecisionTreeClassifier(max_depth=depth)
    model.fit(X_train, y_train)
    score1 = model.score(X_train,y_train)
    score2 = model.score(X_test, y_test)
    score_train.append(score1)
    score_test.append(score2)

plt.plot(range(1,10), score_train, 'ro--')
plt.plot(range(1,10), score_test, 'bs-')
plt.legend(['train','test'])
plt.xlabel('max_depth')

Text(0.5,0,'max_depth')

col1 = 0
col2 = 1

X_train, X_test, y_train, y_test = train_test_split(cancer.data[:,[col1,col2]], cancer.target)

plt.figure(figsize=[14,16])
for depth in range(1,10):
    model = DecisionTreeClassifier(max_depth=depth)
    model.fit(X_train, y_train)
    
    plt.subplot(3,3,depth)
    plt.title('max_depth ='+str(depth))
    mglearn.plots.plot_2d_classification(model, X_train)
    mglearn.discrete_scatter(X_train[:,col1], X_train[:,col2], y_train, alpha=0.3)

plt.figure(figsize=[14,16])
for depth in range(1,10):
    model = DecisionTreeClassifier(max_depth=depth)
    model.fit(X_train, y_train)
    
    plt.subplot(3,3,depth)
    draw_result_map(X_train, y_train, model, [cancer.feature_names[col1],cancer.feature_names[col2]],
                'cancer(depth=%d)' % depth)

그래프 출력을 위한 graphviz 설치 방법¶

In anaconda prompt, run "pip install graphviz"
Install graphviz for windows (http://www.graphviz.org)
Add the graphviz dir(C:\Program Files (x86)\Graphviz2.38\bin) to windows PATH
Restart your anaconda prompt and jupyter notebook

model = DecisionTreeClassifier(max_depth=3)
model.fit(cancer.data, cancer.target)

from sklearn.tree import export_graphviz

export_graphviz(model,out_file='tree2.dot',class_names=cancer.target_names,
                feature_names=cancer.feature_names,impurity=False,filled=True)

import graphviz

with open('tree2.dot') as f: #tree2.dot으로 저장
    dot_graph=f.read()
display(graphviz.Source(dot_graph))

worst radius로 분리하기 좋구나

# save dot to png
import graphviz
dot=graphviz.Source(dot_graph)
dot.format='png'
dot.render(filename='tree2') #tree2.png 저장

'tree2.png'

model

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

help(DecisionTreeClassifier)

Help on class DecisionTreeClassifier in module sklearn.tree.tree:

class DecisionTreeClassifier(BaseDecisionTree, sklearn.base.ClassifierMixin)

Parameters

criterion : string, optional (default="gini") The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain.

splitter : string, optional (default="best") The strategy used to choose the split at each node. Supported strategies are "best" to choose the best split and "random" to choose the best random split.

max_depth : int or None, optional (default=None) The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_split : int, float, optional (default=2) The minimum number of samples required to split an internal node:

   - If int, then consider `min_samples_split` as the minimum number.
   - If float, then `min_samples_split` is a percentage and
     `ceil(min_samples_split * n_samples)` are the minimum
     number of samples for each split.

   .. versionchanged:: 0.18
      Added float values for percentages.

min_samples_leaf : int, float, optional (default=1) The minimum number of samples required to be at a leaf node:

   - If int, then consider `min_samples_leaf` as the minimum number.
   - If float, then `min_samples_leaf` is a percentage and
     `ceil(min_samples_leaf * n_samples)` are the minimum
     number of samples for each node.

   .. versionchanged:: 0.18
      Added float values for percentages.

min_weight_fraction_leaf : float, optional (default=0.) The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

max_features : int, float, string or None, optional (default=None) The number of features to consider when looking for the best split:

       - If int, then consider `max_features` features at each split.
       - If float, then `max_features` is a percentage and
         `int(max_features * n_features)` features are considered at each
         split.
       - If "auto", then `max_features=sqrt(n_features)`.
       - If "sqrt", then `max_features=sqrt(n_features)`.
       - If "log2", then `max_features=log2(n_features)`.
       - If None, then `max_features=n_features`.

   Note: the search for a split does not stop until at least one
   valid partition of the node samples is found, even if it requires to
   effectively inspect more than ``max_features`` features.

DecisionTreeClassifier 의 주목할 만한 옵션들을 알아보자.
max_leaf_nodes : 총 잎노드의 갯수를 제한한다
min_samples_split : 만일 10이라면, 노드의 샘플 갯수가 10이상이 되어야 분기를 한다.
min_samples_leaf : 10이라면, 잎노드의 샘플 갯수는 10이상이 되도록 만든다.
max_features : 분기를 할 때 고려하는 속성의 갯수이다. 만일 2라면 판단할 속성을 고를때 랜덤하게 속성 2개만 뽑아 그 중에서 기준을 세운다.
아래에서 확인해 보자.

help(model.tree_)

# tree_ 를 사용하면 전체 구조를 수치로서 알 수 있다.

class Tree(builtins.object)

node_count : int The number of nodes (internal nodes + leaves) in the tree.

capacity : int The current capacity (i.e., size) of the arrays, which is at least as great as node_count.

max_depth : int The maximal depth of the tree.

children_left : array of int, shape [node_count] children_left[i] holds the node id of the left child of node i. For leaves, children_left[i] == TREE_LEAF. Otherwise, children_left[i] > i. This child handles the case where X[:, feature[i]] <= threshold[i].

children_right : array of int, shape [node_count] children_right[i] holds the node id of the right child of node i. For leaves, children_right[i] == TREE_LEAF. Otherwise, children_right[i] > i. This child handles the case where X[:, feature[i]] > threshold[i].

feature : array of int, shape [node_count] feature[i] holds the feature to split on, for the internal node i.

threshold : array of double, shape [node_count] threshold[i] holds the threshold for the internal node i.

value : array of double, shape [node_count, n_outputs, max_n_classes] Contains the constant prediction value of each node.

impurity : array of double, shape [node_count] impurity[i] holds the impurity (i.e., the value of the splitting criterion) at node i.

n_node_samples : array of int, shape [node_count] n_node_samples[i] holds the number of training samples reaching node i.

weighted_n_node_samples : array of int, shape [node_count] weighted_n_node_samples[i] holds the weighted number of training samples reaching node i.

model = DecisionTreeClassifier(min_samples_leaf=10)
model.fit(cancer.data, cancer.target)

from sklearn.tree import export_graphviz

export_graphviz(model,out_file='tree1.dot',class_names=cancer.target_names,
                feature_names=cancer.feature_names,impurity=False,filled=True)

import graphviz

with open('tree1.dot') as f:
    dot_graph=f.read()
display(graphviz.Source(dot_graph))

model = DecisionTreeClassifier(min_samples_split=50)
model.fit(cancer.data, cancer.target)

from sklearn.tree import export_graphviz

export_graphviz(model,out_file='tree1.dot',class_names=cancer.target_names,
                feature_names=cancer.feature_names,impurity=False,filled=True)

import graphviz

with open('tree1.dot') as f:
    dot_graph=f.read()
display(graphviz.Source(dot_graph))

model = DecisionTreeClassifier(max_leaf_nodes=5)
model.fit(cancer.data, cancer.target)

from sklearn.tree import export_graphviz

export_graphviz(model,out_file='tree1.dot',class_names=cancer.target_names,
                feature_names=cancer.feature_names,impurity=False,filled=True)

import graphviz

with open('tree1.dot') as f:
    dot_graph=f.read()
display(graphviz.Source(dot_graph))

결정트리는 가장 잘 나누는 속성을 선택하고 경계값을 결정하기 위해 엔트로피(entropy) 라는 개념을 사용합니다.
$ E_{org} = - (p_1 log{p_1} + p_2 log{p_2}) $
$ E_{new} = - {N_1 \over N} (p_{11} log{p_{11}} + p_{12} log{p_{12}}) - {N_2 \over N} (p_{21} log{p_{21}} + p_{22} log{p_{22}}) $
$ E_{new} < E_{org} $
$ E_{new} = min_i(E_{new\_i}) $

엔트로피 : 혼잡도
자연계는 엔트로피 증가, 우리는 엔트로피 감소시키는게 목적
로그의 진수가 1보다 작으면 음수이므로 앞에 - 붙인다

1차원으로 축소
ㅁㅁㅇㅣㅁㅇㅁㅇㅇㅇ
ㅁㅁㅇㅁㅇㅣㅁㅇㅇㅇ
어디서 쪼개나

지도학습 - 그래디언트 부스팅(별표) (0)	2019.03.17
지도학습 - 랜덤포레스트 (0)	2019.03.17
유방암 데이터 분석 by SVM (0)	2019.03.12
지도학습 - kernel SVM (0)	2019.03.07
유방암 데이터 분석 (3)	2019.03.05

조환희의 학습 블로그

티스토리 뷰

지도학습 - 결정트리

결정 트리 (Decision Tree)¶

결정트리의 옵션들¶

그래프 출력을 위한 graphviz 설치 방법¶

'beginner > 파이썬 머신러닝 기초' 카테고리의 다른 글

티스토리툴바

« 2025/02 »
일	월	화	수	목	금	토
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28