k-NN Classification (Nearest Neighbors)

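k-NN is one of the best-known classification algorithms.
NN stands for Nearest Neighbors, that is, the closest points, and k is the number of nearest neighbors to consider.
To classify a point, k-NN compares the target values (classes) of the training points closest to it and assigns the class that appears most often among them.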

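Note: the measure used to judge whether points are near or far is called the distance. Different ways of measuring distance lead to different variants of the method, but for now think of it as the ordinary (Euclidean) distance between coordinates.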
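
To make the idea of "distance" concrete, here is a minimal sketch that computes two common distances between two points with NumPy. The points `a` and `b` are made up for illustration and do not come from the iris data used later.

```python
import numpy as np

# Two illustrative points (assumed values, not from the post's dataset)
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# Euclidean distance: straight-line distance between the coordinates
euclidean = np.sqrt(((a - b) ** 2).sum())

# Manhattan distance: sum of absolute coordinate differences
manhattan = np.abs(a - b).sum()

print(euclidean, manhattan)
```

The same pair of points is 5 apart under one measure and 7 apart under the other, which is why the choice of distance can change which neighbors count as "nearest".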

References:
scikit-learn manual http://scikit-learn.org/stable/modules/neighbors.html
Wikipedia https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

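k-NN is intuitive and algorithmically simple, which makes it one of the most popular classification algorithms.
Looking closer, though, it is not quite that simple: there are many options, such as how distance is measured and how each point is weighted.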

(Source: Wikipedia)

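The figure above illustrates k-NN.
There are 6 points of the blue class and 5 of the red class. Should the green point be predicted as red or as blue?
With k=3, we compare the 3 points nearest to the green point. Red points are the majority, so the green point is classified as red.
With k=5, 3 of the 5 nearest points are blue, so the green point is classified as blue.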
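
The voting rule described above can be sketched in a few lines of NumPy. This is only an illustration under assumed data: the function `knn_predict` and the toy points are hypothetical, arranged so that the 3 nearest points are mostly red and the 5 nearest are mostly blue. It is not how scikit-learn implements k-NN internally.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k):
    """Hypothetical helper: majority vote among the k nearest training points."""
    # Euclidean distance from the query to every training point
    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Count the class labels among those neighbors and take the most common
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy data (assumed): two red points close to the query, three blue points farther away
X_toy = np.array([[1.0, 0.0], [0.0, 1.5], [2.0, 0.0], [0.0, 2.5], [3.0, 0.0]])
y_toy = np.array(['red', 'red', 'blue', 'blue', 'blue'])
query = np.array([0.0, 0.0])

pred_k3 = knn_predict(X_toy, y_toy, query, 3)
pred_k5 = knn_predict(X_toy, y_toy, query, 5)
print(pred_k3, pred_k5)
```

As in the figure, the same query point gets a different class depending only on k.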

(Source: wikipedia.org)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris

iris = load_iris()

iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
pd.plotting.scatter_matrix(iris_df, c=iris.target, s=60, alpha=0.8, figsize=[12,12])

From the scatter matrix above, we pick petal width and sepal length and apply k-NN. Only two features are used so that the result can be visualized.
random_state is set in train_test_split() so that the split, and therefore the result, is the same on every run.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

col1 = 3 # petal width
col2 = 0 # sepal length

X = iris.data[:,[col1, col2]] # keep only two features
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2018)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((112, 2), (38, 2), (112,), (38,))
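
The effect of random_state can be checked directly. The sketch below uses made-up arrays (`X_demo`, `y_demo`) with the same number of rows as iris and splits them twice with the same seed. It assumes the default test size of train_test_split, which is what the call above relies on as well.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with 150 rows and 2 columns (assumed, only to mimic the shapes above)
X_demo = np.arange(150 * 2).reshape(150, 2)
y_demo = np.arange(150)

# Two independent calls with the same random_state
split_a = train_test_split(X_demo, y_demo, random_state=2018)
split_b = train_test_split(X_demo, y_demo, random_state=2018)

# Every returned array should match between the two calls
same_split = all(np.array_equal(p, q) for p, q in zip(split_a, split_b))
print(same_split)
print(split_a[0].shape, split_a[1].shape)
```

Dropping random_state makes each call shuffle differently, so scores can change from run to run.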

plt.scatter(X_train[:,0], X_train[:,1], c=y_train)
plt.xlabel(iris.feature_names[col1])
plt.ylabel(iris.feature_names[col2])
plt.colorbar()

Now we apply k-NN, starting with k=1.
The k-NN model is sklearn.neighbors.KNeighborsClassifier. (http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)
The more complicated part in the middle of the source below draws the decision regions. There is no need to understand all of it right away.

model = KNeighborsClassifier(n_neighbors=1) # 1!!!
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
display(score)

0.9210526315789473

import mglearn

plt.figure(figsize=[10,8])
mglearn.plots.plot_2d_classification(model, X_train, fill=True, eps=0.5, alpha=0.4)
mglearn.discrete_scatter(X_train[:,0], X_train[:,1], y_train)

# draw boundary
scale = 100
xmax = X_train[:,0].max()+1
xmin = X_train[:,0].min()-1
ymax = X_train[:,1].max()+1
ymin = X_train[:,1].min()-1

xx = np.linspace(xmin,xmax,scale)
yy = np.linspace(ymin,ymax,scale)
data1, data2 = np.meshgrid(xx,yy)
X_grid = np.c_[data1.ravel(), data2.ravel()]
pred_y = model.predict(X_grid)

fig=plt.figure(figsize=[10,10])

plt.imshow(pred_y.reshape(scale,scale), interpolation=None, origin='lower',
                extent=[xmin,xmax,ymin,ymax], alpha=0.5, cmap='gray_r')

# draw X_train
plt.scatter(X_train[:,0], X_train[:,1], c=y_train)
#plt.scatter(X_test[:,0], X_test[:,1], c=y_test)
plt.colorbar()

plt.xlabel(iris.feature_names[col1])
plt.ylabel(iris.feature_names[col2])
plt.title('1-NN : iris (petal width vs sepal length)',fontsize=20)

# draw boundary
scale = 100
xmax = X_train[:,0].max()+1
xmin = X_train[:,0].min()-1
ymax = X_train[:,1].max()+1
ymin = X_train[:,1].min()-1

xx = np.linspace(xmin,xmax,scale)
yy = np.linspace(ymin,ymax,scale)
data1, data2 = np.meshgrid(xx,yy)
X_grid = np.c_[data1.ravel(), data2.ravel()]
pred_y = model.predict(X_grid)

fig=plt.figure(figsize=[10,10])

plt.imshow(pred_y.reshape(scale,scale), interpolation=None, origin='lower',
                extent=[xmin,xmax,ymin,ymax], alpha=0.5, cmap='gray_r')

# draw X_test (the training-set scatter is commented out)
#plt.scatter(X_train[:,0], X_train[:,1], c=y_train)
plt.scatter(X_test[:,0], X_test[:,1], c=y_test)
plt.colorbar()

plt.xlabel(iris.feature_names[col1])
plt.ylabel(iris.feature_names[col2])
plt.title('1-NN : iris (petal width vs sepal length)',fontsize=20)

Below is the case k=3.

model = KNeighborsClassifier(n_neighbors=3) # 3!!!
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
display(score)

# draw boundary
scale = 300
xmax = X_train[:,0].max()+1
xmin = X_train[:,0].min()-1
ymax = X_train[:,1].max()+1
ymin = X_train[:,1].min()-1

xx = np.linspace(xmin,xmax,scale)
yy = np.linspace(ymin,ymax,scale)
data1, data2 = np.meshgrid(xx,yy)
X_grid = np.c_[data1.ravel(), data2.ravel()]
pred_y = model.predict(X_grid)

fig=plt.figure(figsize=[12,10])

plt.imshow(pred_y.reshape(scale,scale), interpolation=None, origin='lower',
                extent=[xmin,xmax,ymin,ymax], alpha=0.5, cmap='gray_r')

# draw X_test (the training-set scatter is commented out)
#plt.scatter(X_train[:,0], X_train[:,1], c=y_train)
plt.scatter(X_test[:,0], X_test[:,1], c=y_test)

plt.xlabel(iris.feature_names[col1])
plt.ylabel(iris.feature_names[col2])
plt.title('3-NN : iris (petal width vs sepal length)',fontsize=20)

1.0

Below is the case k=5.

X_train, X_test, y_train, y_test = train_test_split(X, y) # no random_state here, so the split (and the score) changes on every run

model = KNeighborsClassifier(n_neighbors=5) # 5!!!
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
display(score)

# draw boundary
scale = 300
xmax = X_train[:,0].max()+1
xmin = X_train[:,0].min()-1
ymax = X_train[:,1].max()+1
ymin = X_train[:,1].min()-1

xx = np.linspace(xmin,xmax,scale)
yy = np.linspace(ymin,ymax,scale)
data1, data2 = np.meshgrid(xx,yy)
X_grid = np.c_[data1.ravel(), data2.ravel()]
pred_y = model.predict(X_grid)

fig=plt.figure(figsize=[12,10])

plt.imshow(pred_y.reshape(scale,scale), interpolation=None, origin='lower',
                extent=[xmin,xmax,ymin,ymax], alpha=0.5, cmap='gray_r')

# draw X_train
plt.scatter(X_train[:,0], X_train[:,1], c=y_train)

plt.xlabel(iris.feature_names[col1])
plt.ylabel(iris.feature_names[col2])
plt.title('5-NN : iris (petal width vs sepal length)',fontsize=20)

0.9210526315789473

With k=5, some points are not predicted correctly. To find out which ones, let's plot the test set.

# draw boundary
scale = 300
xmax = X_train[:,0].max()+1
xmin = X_train[:,0].min()-1
ymax = X_train[:,1].max()+1
ymin = X_train[:,1].min()-1

xx = np.linspace(xmin,xmax,scale)
yy = np.linspace(ymin,ymax,scale)
data1, data2 = np.meshgrid(xx,yy)
X_grid = np.c_[data1.ravel(), data2.ravel()]
pred_y = model.predict(X_grid)

fig=plt.figure(figsize=[12,10])

plt.imshow(pred_y.reshape(scale,scale), interpolation=None, origin='lower',
                extent=[xmin,xmax,ymin,ymax], alpha=0.5, cmap='gray_r')

# draw X_test!!!
plt.scatter(X_test[:,0], X_test[:,1], c=y_test)

plt.xlabel(iris.feature_names[col1])
plt.ylabel(iris.feature_names[col2])
plt.title('5-NN : test data',fontsize=20)

Misclassified points are visible near the decision boundary (three of them, according to the output below). Their details are printed below.

pred_y = model.predict(X_test)
pred_y

array([1, 1, 1, 0, 0, 2, 2, 2, 2, 2, 2, 0, 1, 1, 0, 1, 2, 0, 2, 2, 2, 0,
       2, 0, 2, 2, 0, 0, 2, 0, 0, 0, 2, 1, 1, 2, 2, 2])

y_test

array([1, 1, 1, 0, 0, 1, 2, 2, 1, 2, 2, 0, 1, 1, 0, 1, 2, 0, 2, 2, 2, 0,
       2, 0, 1, 2, 0, 0, 2, 0, 0, 0, 2, 1, 1, 2, 2, 2])

display(np.where(pred_y != y_test)) # indices of the misclassified points
display(X_test[pred_y != y_test], y_test[pred_y != y_test]) # coordinates and true target values of those points
display(pred_y[pred_y != y_test]) # predicted values, to compare with the true values

(array([ 5,  8, 24], dtype=int64),)

array([[1.7, 6.7],
       [1.5, 6. ],
       [1.5, 6.2]])

array([1, 1, 1])

array([2, 2, 2])

(pred_y==y_test).mean() # how the score is computed

0.9210526315789473
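
The manual score above should agree with what scikit-learn reports. The sketch below rebuilds a small model on the same two iris features and compares three ways of getting the accuracy. `accuracy_score` from sklearn.metrics does not appear in the original post and is added here only for comparison; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X = iris.data[:, [3, 0]]   # petal width, sepal length (same columns as in the post)
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2018)
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
pred_y = model.predict(X_test)

manual = (pred_y == y_test).mean()          # fraction of correct predictions
by_score = model.score(X_test, y_test)      # mean accuracy reported by the model
by_metric = accuracy_score(y_test, pred_y)  # same quantity from sklearn.metrics

print(manual, by_score, by_metric)
```

For a classifier, score() is simply the mean accuracy, so all three values should match.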

help(KNeighborsClassifier)

Help on class KNeighborsClassifier in module sklearn.neighbors.classification:

class KNeighborsClassifier(sklearn.neighbors.base.NeighborsBase, sklearn.neighbors.base.KNeighborsMixin, sklearn.neighbors.base.SupervisedIntegerMixin, sklearn.base.ClassifierMixin)
 |  KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)
 |  
 |  Classifier implementing the k-nearest neighbors vote.
 |  
 |  Read more in the :ref:`User Guide <classification>`.
 |  
 |  Parameters
 |  ----------
 |  n_neighbors : int, optional (default = 5)
 |      Number of neighbors to use by default for :meth:`kneighbors` queries.
 |  
 |  weights : str or callable, optional (default = 'uniform') # weighting by distance
 |      weight function used in prediction.  Possible values:
 |  
 |      - 'uniform' : uniform weights.  All points in each neighborhood
 |        are weighted equally.
 |      - 'distance' : weight points by the inverse of their distance.
 |        in this case, closer neighbors of a query point will have a
 |        greater influence than neighbors which are further away.
 |      - [callable] : a user-defined function which accepts an
 |        array of distances, and returns an array of the same shape
 |        containing the weights.
 |  
 |  algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
 |      Algorithm used to compute the nearest neighbors:
 |  
 |      - 'ball_tree' will use :class:`BallTree`
 |      - 'kd_tree' will use :class:`KDTree`
 |      - 'brute' will use a brute-force search.
 |      - 'auto' will attempt to decide the most appropriate algorithm
 |        based on the values passed to :meth:`fit` method.
 |  
 |      Note: fitting on sparse input will override the setting of
 |      this parameter, using brute force.
 |  
 |  leaf_size : int, optional (default = 30)
 |      Leaf size passed to BallTree or KDTree.  This can affect the
 |      speed of the construction and query, as well as the memory
 |      required to store the tree.  The optimal value depends on the
 |      nature of the problem.
 |  
 |  p : integer, optional (default = 2)
 |      Power parameter for the Minkowski metric. When p = 1, this is
 |      equivalent to using manhattan_distance (l1), and euclidean_distance
 |      (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
 |  
 |  metric : string or callable, default 'minkowski'
 |      the distance metric to use for the tree.  The default metric is
 |      minkowski, and with p=2 is equivalent to the standard Euclidean
 |      metric. See the documentation of the DistanceMetric class for a
 |      list of available metrics.
 |  
 |  metric_params : dict, optional (default = None)
 |      Additional keyword arguments for the metric function.
 |  
 |  n_jobs : int, optional (default = 1)  # parallel computation, for machines with multiple CPUs
 |      The number of parallel jobs to run for neighbors search.
 |      If ``-1``, then the number of jobs is set to the number of CPU cores.
 |      Doesn't affect :meth:`fit` method.
 |  
 |  Examples
 |  --------
 |  >>> X = [[0], [1], [2], [3]]
 |  >>> y = [0, 0, 1, 1]
 |  >>> from sklearn.neighbors import KNeighborsClassifier
 |  >>> neigh = KNeighborsClassifier(n_neighbors=3)
 |  >>> neigh.fit(X, y) # doctest: +ELLIPSIS
 |  KNeighborsClassifier(...)
 |  >>> print(neigh.predict([[1.1]]))
 |  [0]
 |  >>> print(neigh.predict_proba([[0.9]]))
 |  [[ 0.66666667  0.33333333]]
 |  
 |  See also
 |  --------
 |  RadiusNeighborsClassifier
 |  KNeighborsRegressor
 |  RadiusNeighborsRegressor
 |  NearestNeighbors
 |  
 |  Notes
 |  -----
 |  See :ref:`Nearest Neighbors <neighbors>` in the online documentation
 |  for a discussion of the choice of ``algorithm`` and ``leaf_size``.
 |  
 |  .. warning::
 |  
 |     Regarding the Nearest Neighbors algorithms, if it is found that two
 |     neighbors, neighbor `k+1` and `k`, have identical distances
 |     but different labels, the results will depend on the ordering of the
 |     training data.
 |  
 |  https://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
 |  
 |  Method resolution order:
 |      KNeighborsClassifier
 |      sklearn.neighbors.base.NeighborsBase
 |      abc.NewBase
 |      sklearn.base.BaseEstimator
 |      sklearn.neighbors.base.KNeighborsMixin
 |      sklearn.neighbors.base.SupervisedIntegerMixin
 |      sklearn.base.ClassifierMixin
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  predict(self, X)
 |      Predict the class labels for the provided data
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape (n_query, n_features),                 or (n_query, n_indexed) if metric == 'precomputed'
 |          Test samples.
 |      
 |      Returns
 |      -------
 |      y : array of shape [n_samples] or [n_samples, n_outputs]
 |          Class labels for each data sample.
 |  
 |  predict_proba(self, X)
 |      Return probability estimates for the test data X.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape (n_query, n_features),                 or (n_query, n_indexed) if metric == 'precomputed'
 |          Test samples.
 |      
 |      Returns
 |      -------
 |      p : array of shape = [n_samples, n_classes], or a list of n_outputs
 |          of such arrays if n_outputs > 1.
 |          The class probabilities of the input samples. Classes are ordered
 |          by lexicographic order.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset()
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.BaseEstimator:
 |  
 |  __getstate__(self)
 |  
 |  __repr__(self)
 |      Return repr(self).
 |  
 |  __setstate__(self, state)
 |  
 |  get_params(self, deep=True)
 |      Get parameters for this estimator.
 |      
 |      Parameters
 |      ----------
 |      deep : boolean, optional
 |          If True, will return the parameters for this estimator and
 |          contained subobjects that are estimators.
 |      
 |      Returns
 |      -------
 |      params : mapping of string to any
 |          Parameter names mapped to their values.
 |  
 |  set_params(self, **params)
 |      Set the parameters of this estimator.
 |      
 |      The method works on simple estimators as well as on nested objects
 |      (such as pipelines). The latter have parameters of the form
 |      ``<component>__<parameter>`` so that it's possible to update each
 |      component of a nested object.
 |      
 |      Returns
 |      -------
 |      self
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.base.BaseEstimator:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.neighbors.base.KNeighborsMixin:
 |  
 |  kneighbors(self, X=None, n_neighbors=None, return_distance=True)
 |      Finds the K-neighbors of a point.
 |      
 |      Returns indices of and distances to the neighbors of each point.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape (n_query, n_features),                 or (n_query, n_indexed) if metric == 'precomputed'
 |          The query point or points.
 |          If not provided, neighbors of each indexed point are returned.
 |          In this case, the query point is not considered its own neighbor.
 |      
 |      n_neighbors : int
 |          Number of neighbors to get (default is the value
 |          passed to the constructor).
 |      
 |      return_distance : boolean, optional. Defaults to True.
 |          If False, distances will not be returned
 |      
 |      Returns
 |      -------
 |      dist : array
 |          Array representing the lengths to points, only present if
 |          return_distance=True
 |      
 |      ind : array
 |          Indices of the nearest points in the population matrix.
 |      
 |      Examples
 |      --------
 |      In the following example, we construct a NeighborsClassifier
 |      class from an array representing our data set and ask who's
 |      the closest point to [1,1,1]
 |      
 |      >>> samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
 |      >>> from sklearn.neighbors import NearestNeighbors
 |      >>> neigh = NearestNeighbors(n_neighbors=1)
 |      >>> neigh.fit(samples) # doctest: +ELLIPSIS
 |      NearestNeighbors(algorithm='auto', leaf_size=30, ...)
 |      >>> print(neigh.kneighbors([[1., 1., 1.]])) # doctest: +ELLIPSIS
 |      (array([[ 0.5]]), array([[2]]...))
 |      
 |      As you can see, it returns [[0.5]], and [[2]], which means that the
 |      element is at distance 0.5 and is the third element of samples
 |      (indexes start at 0). You can also query for multiple points:
 |      
 |      >>> X = [[0., 1., 0.], [1., 0., 1.]]
 |      >>> neigh.kneighbors(X, return_distance=False) # doctest: +ELLIPSIS
 |      array([[1],
 |             [2]]...)
 |  
 |  kneighbors_graph(self, X=None, n_neighbors=None, mode='connectivity')
 |      Computes the (weighted) graph of k-Neighbors for points in X
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape (n_query, n_features),                 or (n_query, n_indexed) if metric == 'precomputed'
 |          The query point or points.
 |          If not provided, neighbors of each indexed point are returned.
 |          In this case, the query point is not considered its own neighbor.
 |      
 |      n_neighbors : int
 |          Number of neighbors for each sample.
 |          (default is value passed to the constructor).
 |      
 |      mode : {'connectivity', 'distance'}, optional
 |          Type of returned matrix: 'connectivity' will return the
 |          connectivity matrix with ones and zeros, in 'distance' the
 |          edges are Euclidean distance between points.
 |      
 |      Returns
 |      -------
 |      A : sparse matrix in CSR format, shape = [n_samples, n_samples_fit]
 |          n_samples_fit is the number of samples in the fitted data
 |          A[i, j] is assigned the weight of edge that connects i to j.
 |      
 |      Examples
 |      --------
 |      >>> X = [[0], [3], [1]]
 |      >>> from sklearn.neighbors import NearestNeighbors
 |      >>> neigh = NearestNeighbors(n_neighbors=2)
 |      >>> neigh.fit(X) # doctest: +ELLIPSIS
 |      NearestNeighbors(algorithm='auto', leaf_size=30, ...)
 |      >>> A = neigh.kneighbors_graph(X)
 |      >>> A.toarray()
 |      array([[ 1.,  0.,  1.],
 |             [ 0.,  1.,  1.],
 |             [ 1.,  0.,  1.]])
 |      
 |      See also
 |      --------
 |      NearestNeighbors.radius_neighbors_graph
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.neighbors.base.SupervisedIntegerMixin:
 |  
 |  fit(self, X, y)
 |      Fit the model using X as training data and y as target values
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix, BallTree, KDTree}
 |          Training data. If array or matrix, shape [n_samples, n_features],
 |          or [n_samples, n_samples] if metric='precomputed'.
 |      
 |      y : {array-like, sparse matrix}
 |          Target values of shape = [n_samples] or [n_samples, n_outputs]
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.ClassifierMixin:
 |  
 |  score(self, X, y, sample_weight=None)
 |      Returns the mean accuracy on the given test data and labels.
 |      
 |      In multi-label classification, this is the subset accuracy
 |      which is a harsh metric since you require for each sample that
 |      each label set be correctly predicted.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape = (n_samples, n_features)
 |          Test samples.
 |      
 |      y : array-like, shape = (n_samples) or (n_samples, n_outputs)
 |          True labels for X.
 |      
 |      sample_weight : array-like, shape = [n_samples], optional
 |          Sample weights.
 |      
 |      Returns
 |      -------
 |      score : float
 |          Mean accuracy of self.predict(X) wrt. y.

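In the help text above, the options to look at carefully are weights and metric.
The weights option gives more weight to points that are closer. With k=3 and distances (1, 2, 3), inverse-distance weights are (1/1, 1/2, 1/3). If the first point is class1 and the second and third points are class2, then 1/1 > 1/2 + 1/3, so the point is classified as class1.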
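
The arithmetic in this weights example can be reproduced with a tiny one-dimensional dataset. The data below are assumed for illustration: three training points at distances 1, 2 and 3 from the query, the first labeled 1 and the other two labeled 2.

```python
from sklearn.neighbors import KNeighborsClassifier

# Training points at distances 1, 2, 3 from the query at 0 (assumed toy data)
X_w = [[1.0], [2.0], [3.0]]
y_w = [1, 2, 2]
query = [[0.0]]

# Plain majority vote: class 2 has two of the three votes
uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X_w, y_w)
# Inverse-distance weights: class 1 gets 1/1, class 2 gets 1/2 + 1/3
weighted = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X_w, y_w)

pred_uniform = uniform.predict(query)[0]
pred_weighted = weighted.predict(query)[0]
print(pred_uniform, pred_weighted)
```

With uniform weights the two farther points win the vote, while with distance weights the single closest point outweighs them.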
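The metric option sets how the distance between two points is measured. Applying a measure completely different from the usual distance can produce quite unusual models. (For example, what happens if points that are too close are given infinite distance? Or if a sin function is applied to the usual distance?)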
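
As a small experiment with the metric option, the sketch below classifies one query point with three different distance measures. The two training points are assumed values chosen so that the nearest neighbor changes with the measure. The function `chebyshev` is a hypothetical user-defined metric, passed as a callable that takes two 1-D arrays and returns a single number.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two training points with different classes (assumed toy data)
X_m = np.array([[3.0, 0.0], [2.0, 2.0]])
y_m = np.array([0, 1])
query = np.array([[0.0, 0.0]])

def chebyshev(u, v):
    """Hypothetical custom metric: largest absolute coordinate difference."""
    return np.abs(u - v).max()

# p=2 is the Euclidean distance, p=1 the Manhattan distance (see the help text)
euclid = KNeighborsClassifier(n_neighbors=1, p=2).fit(X_m, y_m)
manhattan = KNeighborsClassifier(n_neighbors=1, p=1).fit(X_m, y_m)
# A callable metric, used here with a brute-force neighbor search
custom = KNeighborsClassifier(n_neighbors=1, algorithm='brute',
                              metric=chebyshev).fit(X_m, y_m)

pred_euclid = euclid.predict(query)[0]
pred_manhattan = manhattan.predict(query)[0]
pred_custom = custom.predict(query)[0]
print(pred_euclid, pred_manhattan, pred_custom)
```

The point (2, 2) is nearest under the Euclidean and Chebyshev measures, but (3, 0) is nearest under the Manhattan measure, so the predicted class depends on the metric alone.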

Characteristics of k-NN

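As k grows, the decision boundary becomes smoother. Why is that?
A small k tends to overfit and a large k tends to underfit.
In the example above, calling fit() mostly just stores the training data (at most it builds a search index, see the algorithm option in the help text) and does very little heavy work. Training the model costs almost nothing. => advantage
To predict a single point, however, its distance to the training points has to be compared, in the worst case to all of them. This is expensive, so prediction can take a long time. => disadvantage
Ties can occur in k-NN. When they do, the algorithm resolves them internally. The help text above also warns that when neighbors at identical distances have different labels, the result depends on the ordering of the training data.
The code below shows how the training-set and test-set scores change as k varies.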

train_scores = []
test_scores = []

for i in range(1,31):
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(X_train, y_train)

    score1 = model.score(X_train, y_train)
    score2 = model.score(X_test, y_test)
    
    train_scores.append(score1)
    test_scores.append(score2)
    
plt.plot(range(1,31),train_scores,'bo-',label='train scores')
plt.plot(range(1,31),test_scores,'r*-',label='test scores')
plt.ylim(0.8,1.)
plt.legend(loc='lower center')

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
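
The tie case mentioned among the characteristics above is easy to reproduce. In the sketch below (toy data, assumed for illustration) the query sits exactly between two training points of different classes, so with k=2 each class gets one vote. Which class predict() returns is decided inside the library, so the example only inspects the vote shares.

```python
from sklearn.neighbors import KNeighborsClassifier

# Two training points of different classes, equally far from the query (assumed toy data)
X_t = [[0.0], [2.0]]
y_t = [0, 1]

tie_model = KNeighborsClassifier(n_neighbors=2).fit(X_t, y_t)

# Each class receives one of the two votes
proba = tie_model.predict_proba([[1.0]])[0]
# The predicted label is whatever the library's internal tie-breaking picks
pred = tie_model.predict([[1.0]])[0]
print(proba, pred)
```

Choosing an odd k avoids vote ties in two-class problems, though ties can still occur with three or more classes.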
