지도학습 개요¶

기본 개념¶

지도학습(supervised learning)은 샘플 데이터가 있고, 각 샘플마다 레이블(타겟값 혹은 목표값)이 지정되어 있는 학습 방법이다.
기존에 가지고 있는 데이터와 레이블을 가지고 학습을 하고 나면, 새로운 샘플 데이터에 대해 레이블을 예측할 수 있다.
레이블이 연속적이거나 실수값인 경우는 회귀(regression) 문제가 되고, 레이블이 예/아니오, 고양이/개/원숭이 와 같이 여러 클래스 중에 하나인 경우를 분류(classification) 문제라고 한다.
분류의 경우에 예/아니오 와 같이 둘 중에 하나를 예측하는 것을 이진 분류 라고 하고, 고양이/개/원숭이 와 같이 셋 이상의 클래스를 구분하는 것을 다중 분류 라고 한다.

중요한 것은 새로운 데이터에 대해 잘 예측하는 것이다¶

주어진 데이터에 대해 잘 들어맞는 아무리 훌륭한 예측 모델을 만들었다 하더라도, 새로운 데이터에 대한 예측 능력이 떨어진다면 아무런 의미가 없게 된다.
머신러닝은 기본적으로 주어진 데이터와 새로운 데이터가 비슷한 패턴을 가질 것이라는 가정에 기반하고 있다. 다만 그렇게 되려면 주어진 데이터가 발생 가능한 모든 데이터를 대표할 수 있는 샘플들로 이루어져야 할 것이다. 하지만 이것은 현실적으로 불가능하다.
이러한 문제는 주어진 데이터가 과연 일반성을 가지는가 라는 질문이 된다. 주어진 데이터가 일반성을 가지기 위한 가장 좋은 방법은 데이터를 아주 많이 확보하는 것이다. 발생 가능한 샘플 수가 10,000개 인데, 7,000개 정도의 데이터를 확보했다면 아주 정확한 모델을 만들 수 있는 것은 자명하다. 하지만 현실적으로 충분한 데이터를 확보하는 것은 머신러닝에서 가장 어려운 문제이다.
또한 충분한 데이터를 확보했다 하더라도, 그 데이터 안에는 오류인 데이터도 있고 아주 드물게 발생하는 이상치도 포함되어 있을 수 밖에 없다. 그러므로 주어진 데이터가 많던 적던 너무 주어진 데이터에 딱 들어맞는 예측모델을 만드는 것은 예측 능력을 떨어트릴 수 있다.
이와 같이, 너무 지나치게 훈련 데이터에 치우친 복잡한 모델은 과대적합(overfitting) 되었다고 하고, 너무 지나치게 예측 모델을 단순하게 만들어 훈련 데이터의 특징을 제대로 나타내지 못할 때 과소적합(underfitting) 이라고 한다.

그러므로 우리의 목표인 새로운 데이터에 대해 잘 예측하기 위해서는, 가능한 한 많은 데이터를 수집하되 너무 지나치게 확보한 데이터에 치우친 복잡한 예측모델을 만들지 않고 최대한 일반화 할 수 있는 예측모델을 만들 수 있는 능력을 가져야 한다.
이를 위해 가장 좋은 방법은 충분한 데이터로 예측 모델을 만든 다음, 이 모델로 새로 발생하는 많은 데이터로 테스트를 해 보는 것이다. 하지만 새로운 데이터를 확보하는 것은 미래의 일이어서 많은 시간이 필요한 일이다. 그리고 대부분의 문제에 있어서는 즉시 결과를 볼 수 있어야 한다.
이러한 과대적합/과소적합/일반성 의 문제를 해결하기 위해, 주어진 데이터에서 일부분은 훈련세트 로 하고, 나머지는 테스트세트 로 구성하여 훈련세트로 예측모델을 만든 다음에 테스트세트로 이 예측모델을 검증 하는 절차를 밟게 된다.

참고 : 1장에서 배운 train_test_split() 함수가 이에 해당한다.
이럴 경우 훈련세트는 과거에 발생해서 내가 확보하고 있는 데이터에 해당하고, 테스트세트는 미래에 새로 발생할 데이터에 해당한다고 하겠다.
앞에서 충분한 데이터량을 확보하는 것이 가장 중요하다고 했다. 그렇다면 얼마나 충분한 데이터를 확보해야 할까? 이것은 해결하려는 문제마다 다르고 데이터의 성격에 따라 다르기 때문에 일반적인 답을 하기는 쉽지 않다. 그리고 중요한 연구 이슈이기도 하다. 한 가지 고려할 점은 속성(열)의 갯수가 늘어날 때 마다 그 만큼 데이터량도 늘어나야 한다는 점이다.
결론적으로, 충분히 복잡하면서도 가능한 단순한 예측모델이 필요하다.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris

iris = load_iris()

dir(iris)

['DESCR', 'data', 'feature_names', 'target', 'target_names']

iris.data.shape, iris.target.shape

((150, 4), (150,))

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                        shuffle=False)

X_train.shape # (112, 4)
X_test.shape # (38, 4)

(38, 4)

X_train[y_train==2].shape
np.bincount(y_train)

array([50, 50, 12], dtype=int64)

np.bincount(y_test)

array([ 0,  0, 38], dtype=int64)

help(train_test_split)

Help on function train_test_split in module sklearn.model_selection._split:

train_test_split(*arrays, **options)
    Split arrays or matrices into random train and test subsets
    
    Quick utility that wraps input validation and
    ``next(ShuffleSplit().split(X, y))`` and application to input data
    into a single call for splitting (and optionally subsampling) data in a
    oneliner.
    
    Read more in the :ref:`User Guide <cross_validation>`.
    
    Parameters
    ----------
    *arrays : sequence of indexables with same length / shape[0]
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.
    
    test_size : float, int, None, optional
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, the value is set to the
        complement of the train size. By default, the value is set to 0.25.
        The default will change in version 0.21. It will remain 0.25 only
        if ``train_size`` is unspecified, otherwise it will complement
        the specified ``train_size``.
    
    train_size : float, int, or None, default None
        If float, should be between 0.0 and 1.0 and represent the
        proportion of the dataset to include in the train split. If
        int, represents the absolute number of train samples. If None,
        the value is automatically set to the complement of the test size.
    
    random_state : int, RandomState instance or None, optional (default=None)
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.
    
    shuffle : boolean, optional (default=True)
        Whether or not to shuffle the data before splitting. If shuffle=False
        then stratify must be None.
    
    stratify : array-like or None (default is None)
        If not None, data is split in a stratified fashion, using this as
        the class labels.
    
    Returns
    -------
    splitting : list, length=2 * len(arrays)
        List containing train-test split of inputs.
    
        .. versionadded:: 0.16
            If the input is sparse, the output will be a
            ``scipy.sparse.csr_matrix``. Else, output type is the same as the
            input type.
    
    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import train_test_split
    >>> X, y = np.arange(10).reshape((5, 2)), range(5)
    >>> X
    array([[0, 1],
           [2, 3],
           [4, 5],
           [6, 7],
           [8, 9]])
    >>> list(y)
    [0, 1, 2, 3, 4]
    
    >>> X_train, X_test, y_train, y_test = train_test_split(
    ...     X, y, test_size=0.33, random_state=42)
    ...
    >>> X_train
    array([[4, 5],
           [0, 1],
           [6, 7]])
    >>> y_train
    [2, 0, 3]
    >>> X_test
    array([[2, 3],
           [8, 9]])
    >>> y_test
    [1, 4]
    
    >>> train_test_split(y, shuffle=False)
    [[0, 1, 2], [3, 4]]

stratify > 원본 데이터를 비율대로 나눠준다.

Numpy를 활용한 수치근사법 (0)	2019.02.25
지도학습 - k-NN분류 (0)	2019.02.25
머신러닝 기초_비용함수 (0)	2019.02.22
머신러닝 기초 _ 거리 (0)	2019.02.22
머신러닝 기초_iris활용 (0)	2019.02.22

일	월	화	수	목	금	토
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

조환희의 학습 블로그

티스토리 뷰

지도학습개요

지도학습 개요¶

기본 개념¶

중요한 것은 새로운 데이터에 대해 잘 예측하는 것이다¶

'beginner > 파이썬 머신러닝 기초' 카테고리의 다른 글

티스토리툴바