
Chapter 2-10: Decision Trees

Decision Tree

  • The k-NN, logistic regression, and SVM classifiers covered earlier are all based on geometric notions of distribution and distance.
  • Decision trees, and the algorithms built on them, work instead like a game of twenty questions: they keep asking yes/no questions about the data. What matters is which answer group a sample falls into; there is fundamentally no notion of distance.
  • When the goal is the highest possible prediction score, kernel SVMs and decision trees are the first methods to consider; tree-based algorithms currently achieve the top scores in many machine-learning competitions.
  • Because a decision tree never measures distance, feature normalization is not required.
  • The challenge for a decision tree is to find, quickly, which feature splits the data into two groups best and where to place the threshold.
  • Below is an example of classifying the Iris data with a decision tree.
  • At each split, the tree picks one of the data's features, sets a decision criterion on it, and branches left or right.
  • As the textbook also shows, each split therefore draws a horizontal or vertical decision boundary.
  • A tree can grow arbitrarily deep, so with the default settings the training-set score is always 100%; in other words, it tends to overfit.
  • That is why regularization, such as limiting the depth or the number of leaves, becomes necessary.
  • In the figure above, each box is called a node, and the first one is the root node. The tree has 15 nodes in total.
  • The number of split levels is called the depth; the tree above splits 5 times, so its depth is 5.
  • The nodes at the very end, where the final decision is made, are called leaf nodes; the tree above has 8 of them. (A short code sketch for counting these quantities follows this list.)
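
A minimal sketch of counting those quantities with scikit-learn, fitting a small tree on the Iris data (the exact counts depend on the depth limit and the scikit-learn version, so they need not match the figure):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=5)   # cap the depth, as discussed above
tree.fit(iris.data, iris.target)

t = tree.tree_                               # numeric view of the fitted tree
n_leaves = (t.children_left == -1).sum()     # nodes without a left child are leaves
print('nodes:', t.node_count, 'depth:', t.max_depth, 'leaves:', n_leaves)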
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
In [2]:
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
display(train_score, test_score)
1.0
0.951048951048951
  • For visualization, let's restrict the data to two features.
In [3]:
col1 = 10
col2 = 25

X_train, X_test, y_train, y_test = train_test_split(cancer.data[:,[col1,col2]], cancer.target)
In [4]:
plt.scatter(X_train[:,0], X_train[:,1], c=y_train, alpha=0.3)
plt.colorbar()
plt.xlabel(cancer.feature_names[col1])
plt.ylabel(cancer.feature_names[col2])
Out[4]:
Text(0,0.5,'worst compactness')
In [5]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
display(train_score, test_score)
1.0
0.8111888111888111
In [6]:
import mglearn

plt.figure(figsize=[10,8])
mglearn.plots.plot_2d_classification(model, X_train, cm='spring')
mglearn.discrete_scatter(X_train[:,0], X_train[:,1], y_train)
Out[6]:
[<matplotlib.lines.Line2D at 0x29563728908>,
 <matplotlib.lines.Line2D at 0x29563728a20>]

What if we first found the decision boundary with logistic regression, rotated the feature space accordingly, and then used a decision tree?
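
A minimal, hypothetical sketch of that idea on the two selected features: rotate the data so the logistic-regression boundary becomes axis-aligned, then a depth-1 tree can reproduce it with a single split. (LogisticRegression and the rotation below are assumptions, not part of the original notebook.)

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

logreg = LogisticRegression().fit(X_train, y_train)
w = logreg.coef_[0]                      # normal vector of the linear boundary
theta = np.arctan2(w[1], w[0])           # its angle in the 2-D feature plane

# rotation matrix that turns w onto the x-axis
R = np.array([[ np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

X_train_rot = X_train @ R.T              # rotated features
X_test_rot  = X_test  @ R.T

tree = DecisionTreeClassifier(max_depth=1).fit(X_train_rot, y_train)
print(tree.score(X_train_rot, y_train), tree.score(X_test_rot, y_test))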

In [7]:
plt.figure(figsize=[10,8])
mglearn.plots.plot_2d_classification(model, X_train, cm='spring')
mglearn.discrete_scatter(X_test[:,0], X_test[:,1], y_test)
plt.title('Cancer Test Data')
Out[7]:
Text(0.5,1,'Cancer Test Data')
In [8]:
def draw_result_map(X, y, model, cols=['',''], title=''):
    scale = 300
    xmax = X[:,0].max()+1
    xmin = X[:,0].min()-1
    ymax = X[:,1].max()+1
    ymin = X[:,1].min()-1

    xx = np.linspace(xmin,xmax,scale)
    yy = np.linspace(ymin,ymax,scale)
    data1, data2 = np.meshgrid(xx,yy)
    X_grid = np.c_[data1.ravel(), data2.ravel()]
    
    pred_y = model.predict(X_grid)

    CS = plt.imshow(pred_y.reshape(scale,scale), interpolation='none', origin='lower',
                    extent=[xmin,xmax,ymin,ymax], alpha=0.3, cmap='gray_r')

    #fig=plt.figure(figsize=[12,10])
    
    # draw X
    plt.scatter(X[:,0], X[:,1], c=y, s=60)

    plt.xlabel(cols[0])
    plt.ylabel(cols[1])
    #plt.colorbar(CS, shrink=0.3)
    plt.title(title,fontsize=20)
In [9]:
plt.figure(figsize=[12,10])

plt.subplot(1,2,1)
draw_result_map(X_train, y_train, model, [cancer.feature_names[col1],cancer.feature_names[col2]],
                'Decision Tree - Train Data')

plt.subplot(1,2,2)
draw_result_map(X_test, y_test, model, [cancer.feature_names[col1],cancer.feature_names[col2]],
                'Decision Tree - Test Data')

Decision Tree Options

  • By default a decision tree keeps growing deeper until every training sample is classified correctly, so the training score is always 100% and the model overfits.
  • Below we apply max_depth, the option that limits the tree depth, to avoid this overfitting.
In [10]:
model

# criterion='gini' picks split boundaries using the Gini impurity; the entropy criterion is slower but can be more precise
# max_depth: how many levels the tree may go down; each split adds one level
Out[10]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
In [11]:
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target)

score_train = []
score_test = []

for depth in range(1,10):
    model = DecisionTreeClassifier(max_depth=depth)
    model.fit(X_train, y_train)
    score1 = model.score(X_train,y_train)
    score2 = model.score(X_test, y_test)
    score_train.append(score1)
    score_test.append(score2)
    
In [12]:
plt.plot(range(1,10), score_train, 'ro--')
plt.plot(range(1,10), score_test, 'bs-')
plt.legend(['train','test'])
plt.xlabel('max_depth')
Out[12]:
Text(0.5,0,'max_depth')
In [13]:
col1 = 0
col2 = 1

X_train, X_test, y_train, y_test = train_test_split(cancer.data[:,[col1,col2]], cancer.target)

plt.figure(figsize=[14,16])
for depth in range(1,10):
    model = DecisionTreeClassifier(max_depth=depth)
    model.fit(X_train, y_train)
    
    plt.subplot(3,3,depth)
    plt.title('max_depth ='+str(depth))
    mglearn.plots.plot_2d_classification(model, X_train)
    mglearn.discrete_scatter(X_train[:,col1], X_train[:,col2], y_train, alpha=0.3)
In [14]:
plt.figure(figsize=[14,16])
for depth in range(1,10):
    model = DecisionTreeClassifier(max_depth=depth)
    model.fit(X_train, y_train)
    
    plt.subplot(3,3,depth)
    draw_result_map(X_train, y_train, model, [cancer.feature_names[col1],cancer.feature_names[col2]],
                'cancer(depth=%d)' % depth)

How to install graphviz for rendering the tree graphs

  1. In anaconda prompt, run "pip install graphviz"
  2. Install graphviz for windows (http://www.graphviz.org)
  3. Add the graphviz bin directory (C:\Program Files (x86)\Graphviz2.38\bin) to the Windows PATH
  4. Restart your anaconda prompt and jupyter notebook
In [15]:
model = DecisionTreeClassifier(max_depth=3)
model.fit(cancer.data, cancer.target)

from sklearn.tree import export_graphviz

export_graphviz(model,out_file='tree2.dot',class_names=cancer.target_names,
                feature_names=cancer.feature_names,impurity=False,filled=True)
In [16]:
import graphviz

with open('tree2.dot') as f:  # tree2.dot was written by export_graphviz above
    dot_graph=f.read()
display(graphviz.Source(dot_graph))
[Rendered tree, max_depth=3: the root splits on worst radius <= 16.795 (569 samples, value = [212, 357]); its left child (379 samples, [33, 346]) splits on worst concave points <= 0.136, its right child (190 samples, [179, 11]) on worst texture <= 19.91; deeper splits use symmetry error, worst texture, worst concave points and worst smoothness, and each leaf is labeled benign or malignant.]

So worst radius is a good feature to split on.
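
That impression can be checked numerically with the fitted model's feature_importances_ (a quick sketch; the exact ranking depends on the tree settings):

# top features of the depth-3 tree fitted on the full cancer data above
importances = model.feature_importances_
for i in np.argsort(importances)[::-1][:5]:
    print(cancer.feature_names[i], round(importances[i], 3))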

In [17]:
# save dot to png
import graphviz
dot=graphviz.Source(dot_graph)
dot.format='png'
dot.render(filename='tree2')  # writes tree2.png
Out[17]:
'tree2.png'
In [20]:
model
Out[20]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

help(DecisionTreeClassifier)

Help on class DecisionTreeClassifier in module sklearn.tree.tree:

class DecisionTreeClassifier(BaseDecisionTree, sklearn.base.ClassifierMixin)

Parameters

criterion : string, optional (default="gini") The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain.

splitter : string, optional (default="best") The strategy used to choose the split at each node. Supported strategies are "best" to choose the best split and "random" to choose the best random split.

max_depth : int or None, optional (default=None) The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_split : int, float, optional (default=2) The minimum number of samples required to split an internal node:

   - If int, then consider `min_samples_split` as the minimum number.
   - If float, then `min_samples_split` is a percentage and
     `ceil(min_samples_split * n_samples)` are the minimum
     number of samples for each split.

   .. versionchanged:: 0.18
      Added float values for percentages.

min_samples_leaf : int, float, optional (default=1) The minimum number of samples required to be at a leaf node:

   - If int, then consider `min_samples_leaf` as the minimum number.
   - If float, then `min_samples_leaf` is a percentage and
     `ceil(min_samples_leaf * n_samples)` are the minimum
     number of samples for each node.

   .. versionchanged:: 0.18
      Added float values for percentages.

min_weight_fraction_leaf : float, optional (default=0.) The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

max_features : int, float, string or None, optional (default=None) The number of features to consider when looking for the best split:

       - If int, then consider `max_features` features at each split.
       - If float, then `max_features` is a percentage and
         `int(max_features * n_features)` features are considered at each
         split.
       - If "auto", then `max_features=sqrt(n_features)`.
       - If "sqrt", then `max_features=sqrt(n_features)`.
       - If "log2", then `max_features=log2(n_features)`.
       - If None, then `max_features=n_features`.

   Note: the search for a split does not stop until at least one
   valid partition of the node samples is found, even if it requires to
   effectively inspect more than ``max_features`` features.
  • Let's look at the notable options of DecisionTreeClassifier.
  • max_leaf_nodes : limits the total number of leaf nodes.
  • min_samples_split : if set to 10, a node is split only when it contains at least 10 samples.
  • min_samples_leaf : if set to 10, every leaf node must contain at least 10 samples.
  • max_features : the number of features considered at each split; if set to 2, the tree randomly picks 2 features at each split and chooses the criterion among them.
  • Let's check these below: first a quick comparison sketch, then the graphviz examples.
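
A minimal sketch (not from the original notebook) comparing these options on the full cancer data; the variable names and the fixed random_state are assumptions added for reproducibility:

Xtr, Xte, ytr, yte = train_test_split(cancer.data, cancer.target, random_state=0)

for kwargs in [{}, {'max_leaf_nodes': 5}, {'min_samples_split': 50},
               {'min_samples_leaf': 10}, {'max_features': 2}]:
    m = DecisionTreeClassifier(random_state=0, **kwargs)  # one option at a time
    m.fit(Xtr, ytr)
    print(kwargs, round(m.score(Xtr, ytr), 3), round(m.score(Xte, yte), 3))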

help(model.tree_)

# tree_ exposes the whole fitted tree structure as numeric arrays

class Tree(builtins.object)

node_count : int The number of nodes (internal nodes + leaves) in the tree.

capacity : int The current capacity (i.e., size) of the arrays, which is at least as great as node_count.

max_depth : int The maximal depth of the tree.

children_left : array of int, shape [node_count] children_left[i] holds the node id of the left child of node i. For leaves, children_left[i] == TREE_LEAF. Otherwise, children_left[i] > i. This child handles the case where X[:, feature[i]] <= threshold[i].

children_right : array of int, shape [node_count] children_right[i] holds the node id of the right child of node i. For leaves, children_right[i] == TREE_LEAF. Otherwise, children_right[i] > i. This child handles the case where X[:, feature[i]] > threshold[i].

feature : array of int, shape [node_count] feature[i] holds the feature to split on, for the internal node i.

threshold : array of double, shape [node_count] threshold[i] holds the threshold for the internal node i.

value : array of double, shape [node_count, n_outputs, max_n_classes] Contains the constant prediction value of each node.

impurity : array of double, shape [node_count] impurity[i] holds the impurity (i.e., the value of the splitting criterion) at node i.

n_node_samples : array of int, shape [node_count] n_node_samples[i] holds the number of training samples reaching node i.

weighted_n_node_samples : array of int, shape [node_count] weighted_n_node_samples[i] holds the weighted number of training samples reaching node i.
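A short sketch (an assumption, not a cell from the notebook) that walks those arrays of the depth-3 model fitted above and prints each node numerically:

t = model.tree_
for i in range(t.node_count):
    if t.children_left[i] == -1:          # TREE_LEAF is -1, so this node is a leaf
        print('node %d: leaf, samples=%d, value=%s'
              % (i, t.n_node_samples[i], t.value[i][0]))
    else:
        print('node %d: %s <= %.3f -> left %d, right %d'
              % (i, cancer.feature_names[t.feature[i]], t.threshold[i],
                 t.children_left[i], t.children_right[i]))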

In [18]:
model = DecisionTreeClassifier(min_samples_leaf=10)
model.fit(cancer.data, cancer.target)

from sklearn.tree import export_graphviz

export_graphviz(model,out_file='tree1.dot',class_names=cancer.target_names,
                feature_names=cancer.feature_names,impurity=False,filled=True)

import graphviz

with open('tree1.dot') as f:
    dot_graph=f.read()
display(graphviz.Source(dot_graph))
[Rendered tree, min_samples_leaf=10: the root again splits on worst radius <= 16.795 (569 samples, [212, 357]); further splits use worst concave points, area error, smoothness error, worst texture, mean smoothness, mean concavity and mean texture, and every leaf now contains at least 10 samples.]
In [21]:
model = DecisionTreeClassifier(min_samples_split=50)
model.fit(cancer.data, cancer.target)

from sklearn.tree import export_graphviz

export_graphviz(model,out_file='tree1.dot',class_names=cancer.target_names,
                feature_names=cancer.feature_names,impurity=False,filled=True)

import graphviz

with open('tree1.dot') as f:
    dot_graph=f.read()
display(graphviz.Source(dot_graph))
[Rendered tree, min_samples_split=50: the root splits on worst radius <= 16.795; nodes with fewer than 50 samples, such as the 46-sample node with value [28, 18], are no longer split and become leaves.]
In [22]:
model = DecisionTreeClassifier(max_leaf_nodes=5)
model.fit(cancer.data, cancer.target)

from sklearn.tree import export_graphviz

export_graphviz(model,out_file='tree1.dot',class_names=cancer.target_names,
                feature_names=cancer.feature_names,impurity=False,filled=True)

import graphviz

with open('tree1.dot') as f:
    dot_graph=f.read()
display(graphviz.Source(dot_graph))
[Rendered tree, max_leaf_nodes=5: after splits on worst radius <= 16.795, worst concave points <= 0.136, mean texture <= 16.11 and worst texture <= 25.67, the tree stops with exactly 5 leaves.]
  • To choose the feature that separates the data best and to set its threshold, a decision tree uses the concept of entropy.

    $ E_{org} = - (p_1 \log p_1 + p_2 \log p_2) $
    $ E_{new} = - {N_1 \over N} (p_{11} \log p_{11} + p_{12} \log p_{12}) - {N_2 \over N} (p_{21} \log p_{21} + p_{22} \log p_{22}) $
    $ E_{new} < E_{org} $
    $ E_{new} = \min_i(E_{new,i}) $

Entropy: a measure of disorder.
In nature entropy increases; here the goal is to decrease it with every split.
Because the probabilities inside the log are less than 1, each log term is negative, so a minus sign is placed in front.

Project the data onto one dimension, e.g.
  ■ ■ ○ | ■ ○ ■ ○ ○ ○
  ■ ■ ○ ■ ○ | ■ ○ ○ ○
Which cut point gives the lowest entropy?
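
A minimal numeric check of these formulas (using base-2 logs; the node counts are taken from the max_depth=3 tree rendered above):

import numpy as np

def entropy(counts):
    """Entropy of a node from its class counts, e.g. [212, 357]."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # skip empty classes to avoid log(0)
    return -(p * np.log2(p)).sum()

e_org = entropy([212, 357])                                            # root node, 569 samples
e_new = 379/569 * entropy([33, 346]) + 190/569 * entropy([179, 11])   # after 'worst radius <= 16.795'
print(e_org, e_new)                                                    # e_new < e_org: the split reduces entropy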
