
1. Intro¶

Decision Tree:¶

An algorithm that learns the rules hidden in the data and automatically builds a tree of classification rules.
(Put more simply, it automatically finds the if/else conditions that make up the prediction rules.)

Structure of a Decision Tree¶

Root node: the starting point
Leaf node: the decided class value
Decision node / internal node: a rule condition for classification, built from the features of the data set

A Decision Tree for the Titanic example¶

The figure above sketches the structure of a Decision Tree using the Titanic example; its nodes can be classified as follows.

Is Passenger Male? : root node
Age < 18?, 3rd Class?, Embarked from Southampton? : decision nodes
Died, Survived : leaf nodes

However, a Decision Tree with many rules means a more complicated way of classifying,
and this easily leads to overfitting.
(The deeper the tree, the more prone it is to overfitting, which can lower prediction performance.)

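To reach high performance with as few decision nodes as possible,
the rule at each decision node should place as many samples as possible into the class they belong to.
This requires splitting the data so that the resulting subsets are as uniform as possible.
(That is, each subset produced by a split should represent one class well.)

A decision node looks for the condition that splits the data into subsets with high uniformity,
and the same procedure is repeated on each subset until the final class is predicted.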

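By default, scikit-learn splits the data using the Gini index (criterion='gini').

※ Gini coefficient: in economics it measures inequality, where 0 means perfect equality and 1 means perfect inequality.

In machine learning, the Gini index of a node measures how mixed its class labels are: it is 0 when every sample belongs to the same class and grows as the classes become more evenly mixed.
In other words, lower diversity means higher uniformity, so a lower Gini index means a more uniform node, and the tree splits on the condition that yields the lowest Gini index in the resulting subsets.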
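
The idea of choosing the split that gives the most uniform subsets can be sketched in a few lines. This is only an illustration under simplifying assumptions (a single numeric feature, Gini impurity defined as 1 - Σ p_k²); the helper names gini_impurity and best_threshold are hypothetical and are not part of scikit-learn.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def best_threshold(feature, labels):
    """Return (threshold, weighted_gini) of the best `feature <= threshold` split."""
    values = np.unique(feature)                   # sorted distinct values
    candidates = (values[:-1] + values[1:]) / 2   # midpoints between neighbours
    best = (None, float("inf"))
    for t in candidates:
        left, right = labels[feature <= t], labels[feature > t]
        # impurity of the two children, weighted by their share of the samples
        weighted = (len(left) * gini_impurity(left)
                    + len(right) * gini_impurity(right)) / len(labels)
        if weighted < best[1]:
            best = (float(t), weighted)
    return best

pure = np.array([0, 0, 0, 0])
mixed = np.array([0, 0, 1, 1])
print(gini_impurity(pure))    # 0.0 -> perfectly uniform node
print(gini_impurity(mixed))   # 0.5 -> two classes mixed half and half

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_threshold(x, y))   # (6.5, 0.0): both children are pure
```

A real tree repeats this search over every feature at every node, which is why the uniform (low-impurity) subsets end up near the leaves.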

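2. Pros and Cons of Decision Trees¶

Pros¶

Easy and intuitive.
Preprocessing such as feature scaling and normalization has little effect on the result.

Cons¶

As rules are added and subtrees are created, the model becomes more complex and tends to overfit.
→ The size of the tree needs to be limited in advance by tuning.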

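3. DecisionTreeClassifier Parameters¶

Parameter	Description
min_samples_split	Minimum number of samples required to split a node; used to control overfitting. Default = 2; the smaller the value, the more nodes are split and the higher the risk of overfitting.
min_samples_leaf	Minimum number of samples required at a leaf node; used together with min_samples_split to control overfitting. With imbalanced data a class may have very few samples, so this may need to be set small.
max_features	Maximum number of features considered when searching for the best split. Default = None, which uses every feature in the data set. An int is taken as a number of features and a float as a fraction of them. 'sqrt' (or 'auto' in older scikit-learn versions) selects √(number of features); 'log2' selects log2(number of features).
max_depth	Maximum depth of the tree. Default = None, which keeps splitting until every leaf holds a single class or a node has fewer samples than min_samples_split. Deeper trees can overfit, so this needs to be controlled.
max_leaf_nodes	Maximum number of leaf nodes.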
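
As a rough illustration of how these parameters constrain the tree, the sketch below fits the iris data with one parameter changed at a time and reports the resulting depth and number of leaves. The specific values (3, 20, 5, 4) and random_state=0 are arbitrary choices for the demonstration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One constraint at a time, compared against the unconstrained default tree
settings = {
    "default": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_split=20": {"min_samples_split": 20},
    "min_samples_leaf=5": {"min_samples_leaf": 5},
    "max_leaf_nodes=4": {"max_leaf_nodes": 4},
}

results = {}
for name, kwargs in settings.items():
    clf = DecisionTreeClassifier(random_state=0, **kwargs).fit(X, y)
    # get_depth() and get_n_leaves() describe the size of the fitted tree
    results[name] = (clf.get_depth(), clf.get_n_leaves())
    print(f"{name:22s} depth={results[name][0]}  leaves={results[name][1]}")
```

Each constraint should give a tree no larger than its limit allows, for example a depth of at most 3 for max_depth=3 and at most 4 leaves for max_leaf_nodes=4.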

4. Visualizing a Decision Tree Model¶

Decision Tree visualization with scikit-learn's iris data set¶

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# Create a DecisionTreeClassifier
dt_clf = DecisionTreeClassifier(random_state=156)

# Load the iris data and split it into training and test sets
iris_data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris_data.data, iris_data.target, test_size=0.2, random_state=11)

# Train the DecisionTreeClassifier
dt_clf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=156,
            splitter='best')

from sklearn.tree import export_graphviz

# export_graphviz() writes the tree to the tree.dot file given as out_file
export_graphviz(dt_clf, out_file="tree.dot", class_names = iris_data.target_names, 
                feature_names = iris_data.feature_names, impurity=True, filled=True)

print('===============Decision Tree visualization without a max_depth limit==================')
import graphviz
# Graphviz reads the tree.dot file created above and renders it
with open("tree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)

===============Decision Tree visualization without a max_depth limit==================

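A node with a condition such as petal length (cm) <= 2.45 is a rule node that creates child nodes; a node without a condition is a leaf node.
gini: the Gini index of the class distribution given in value = [ ]
samples: the number of samples that reach this node
value = [ ]: the number of samples per class (in this example 0: Setosa, 1: Versicolor, 2: Virginica)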
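
When Graphviz is not installed, the same rule/leaf structure can be inspected as plain text. The sketch below uses sklearn.tree.export_text on a deliberately shallow tree; max_depth=2 is an assumption made here only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=156).fit(iris.data, iris.target)

# Lines containing a "<=" or ">" condition are rule nodes; lines with "class:" are leaf nodes
tree_text = export_text(clf, feature_names=list(iris.feature_names))
print(tree_text)
```

The text output does not show gini, samples, or value, so it complements rather than replaces the Graphviz rendering.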

# Create a DecisionTreeClassifier (limited to max_depth = 3)
dt_clf = DecisionTreeClassifier(max_depth=3 ,random_state=156)
dt_clf.fit(X_train, y_train)

# export_graphviz() writes the tree to the tree.dot file given as out_file
export_graphviz(dt_clf, out_file="tree.dot", class_names = iris_data.target_names, 
                feature_names = iris_data.feature_names, impurity=True, filled=True)

print('===============Decision Tree visualization with max_depth=3==================')
import graphviz
# Graphviz reads the tree.dot file created above and renders it
with open("tree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)

===============Decision Tree visualization with max_depth=3==================

Limiting max_depth to 3 produces a more compact tree than the first one.

# Create a DecisionTreeClassifier (min_samples_split raised to 4)
dt_clf = DecisionTreeClassifier(min_samples_split=4 ,random_state=156)
dt_clf.fit(X_train, y_train)

# export_graphviz() writes the tree to the tree.dot file given as out_file
export_graphviz(dt_clf, out_file="tree.dot", class_names = iris_data.target_names, 
                feature_names = iris_data.feature_names, impurity=True, filled=True)

print('===============Decision Tree visualization with min_samples_split=4==================')
import graphviz
# Graphviz reads the tree.dot file created above and renders it
with open("tree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)

===============Decision Tree visualization with min_samples_split=4==================

Nodes with samples = 3 are no longer split, unlike in the first tree, even when they contain different class values, so the tree has become shallower.

# Create a DecisionTreeClassifier (min_samples_leaf raised to 4)
dt_clf = DecisionTreeClassifier(min_samples_leaf=4 ,random_state=156)
dt_clf.fit(X_train, y_train)

# export_graphviz() writes the tree to the tree.dot file given as out_file
export_graphviz(dt_clf, out_file="tree.dot", class_names = iris_data.target_names, 
                feature_names = iris_data.feature_names, impurity=True, filled=True)

print('===============Decision Tree visualization with min_samples_leaf=4==================')
import graphviz
# Graphviz reads the tree.dot file created above and renders it
with open("tree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)

===============Decision Tree visualization with min_samples_leaf=4==================

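A leaf node, which has no children, gives the decided class value; min_samples_leaf sets the minimum number of samples a leaf node must hold.

Compared with the tree above, the leaf nodes that previously held 3 or fewer samples have changed so that every leaf holds at least 4 samples.
As a result, the tree is simpler than the first one.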

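Feature Importance visualization¶

The importance of each feature in determining the learned rules is available through the feature_importances_ attribute of the DecisionTreeClassifier object.
→ It returns an ndarray whose values follow the order of the features.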

import seaborn as sns
import numpy as np
%matplotlib inline

# Extract the feature importances
print("Feature Importances:\n{0}\n".format(np.round(dt_clf.feature_importances_, 3)))

# Map each feature name to its importance
for name, value in zip(iris_data.feature_names, dt_clf.feature_importances_):
    print('{0}: {1:.3f}'.format(name, value))
    
# Visualize the feature importances
sns.barplot(x=dt_clf.feature_importances_, y=iris_data.feature_names)

Feature Importances:
[0.006 0.    0.546 0.448]

sepal length (cm): 0.006
sepal width (cm): 0.000
petal length (cm): 0.546
petal width (cm): 0.448

<matplotlib.axes._subplots.AxesSubplot at 0x1a1eaa0c88>
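
Two properties of feature_importances_ are easy to check directly: the values are normalized so that they sum to 1, and they can be ranked with argsort. The sketch below refits a tree on the full iris data (an assumption made to keep the snippet self-contained), so its numbers will not match the output above exactly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(min_samples_leaf=4, random_state=156).fit(iris.data, iris.target)

importances = clf.feature_importances_
order = np.argsort(importances)[::-1]   # feature indices, most important first
ranked = [(iris.feature_names[i], float(importances[i])) for i in order]

for name, value in ranked:
    print(f"{name}: {value:.3f}")
print("sum of importances:", round(float(importances.sum()), 6))
```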

5. Overfitting in Decision Trees¶

Visualizing the overfitting problem with an artificial data set¶

from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

plt.title("3 Class values with 2 Features Sample Data Creation")

# Generate sample classification data with 2 features (for 2D plotting) and 3 classes
X_features, y_labels = make_classification(n_features=2, n_redundant=0, n_informative=2,
                                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Plot the 2 features on 2D axes, with each class value shown in a different color
plt.scatter(X_features[:, 0], X_features[:, 1], marker='o', c=y_labels, s=25, edgecolor = 'k', cmap='rainbow')

<matplotlib.collections.PathCollection at 0x1a1f91a940>

First, leave every tree parameter at its default and check how the data is classified.

# Function that visualizes the decision boundary of a classifier
def visualize_boundary(model, X, y):
    fig,ax = plt.subplots()
    
    # Draw the training data as a scatter plot
    ax.scatter(X[:, 0], X[:, 1], c=y, s=25, cmap='rainbow', edgecolor='k',
               clim=(y.min(), y.max()), zorder=3)
    ax.axis('tight')
    ax.axis('off')
    xlim_start , xlim_end = ax.get_xlim()
    ylim_start , ylim_end = ax.get_ylim()
    
    # Train the model on the training data passed in as arguments
    model.fit(X, y)
    # Predict for every coordinate of a meshgrid covering the plot area
    xx, yy = np.meshgrid(np.linspace(xlim_start,xlim_end, num=200),np.linspace(ylim_start,ylim_end, num=200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    
    # Use contourf() to draw the class boundaries
    n_classes = len(np.unique(y))
    contours = ax.contourf(xx, yy, Z, alpha=0.3,
                           levels=np.arange(n_classes + 1) - 0.5,
                           cmap='rainbow', clim=(y.min(), y.max()),
                           zorder=1)

# Train a Decision Tree with no constraints (all default values) and visualize its decision boundary
dt_clf = DecisionTreeClassifier().fit(X_features, y_labels)
visualize_boundary(dt_clf, X_features, y_labels)

/anaconda3/lib/python3.7/site-packages/matplotlib/contour.py:1000: UserWarning: The following kwargs were not used by contour: 'clim'
  s)

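The very thin regions above correspond to outliers; splitting until even these outliers are classified has produced many decision boundaries.
→ A model like this tends to lose accuracy sharply on data that differs even slightly from the training set.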

# Train a Decision Tree with min_samples_leaf = 6 and visualize its decision boundary
dt_clf = DecisionTreeClassifier(min_samples_leaf=6).fit(X_features, y_labels)
visualize_boundary(dt_clf, X_features, y_labels)

/anaconda3/lib/python3.7/site-packages/matplotlib/contour.py:1000: UserWarning: The following kwargs were not used by contour: 'clim'
  s)

Compared with the default run above, the model reacts far less to outliers and classifies with more general rules.
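
The visual impression can also be checked numerically by holding out part of the data. The sketch below is an assumption-laden illustration: it reuses the same make_classification settings but adds a 70/30 train/test split and random_state=0 for the trees, neither of which appears in the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

scores = {}
for name, kwargs in {"default": {}, "min_samples_leaf=6": {"min_samples_leaf": 6}}.items():
    clf = DecisionTreeClassifier(random_state=0, **kwargs).fit(X_tr, y_tr)
    # (training accuracy, held-out accuracy) for each setting
    scores[name] = (accuracy_score(y_tr, clf.predict(X_tr)),
                    accuracy_score(y_te, clf.predict(X_te)))
    print(f"{name:20s} train={scores[name][0]:.3f}  test={scores[name][1]:.3f}")
```

An unconstrained tree is expected to fit the training split perfectly; a large gap between its training and held-out accuracy is the numerical counterpart of the thin regions in the plot.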

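Parameter tuning to reduce Decision Tree overfitting¶

(1) Lower max_depth to limit the depth of the tree
(2) Raise min_samples_split to increase the number of samples required to split a node
(3) Raise min_samples_leaf to increase the number of samples required at a leaf node
(4) Lower max_features to limit the number of features considered for each split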

6. Decision Tree Practice¶

Human Activity Recognition data set¶

https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

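Data collected by having 30 people wear a smartphone with sensors and recording many features related to their movements
→ the collected feature set is used to predict which activity was being performed

features_info.txt and README.txt: brief descriptions of the data set and its features
features.txt: the feature names
activity_labels.txt: descriptions of the activity label values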

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

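# Function that builds the data set
def get_human_dataset():
    
    # The data files are whitespace-separated, so pass a whitespace regex as sep to read_csv
    feature_name_df = pd.read_csv('human_activity/features.txt', sep=r'\s+',
                                  header=None, names=['column_index', 'column_name'])
    # features.txt contains repeated names, which recent pandas versions reject in `names`,
    # so append a running counter (_1, _2, ...) to every repeat to keep column names unique
    dup_count = feature_name_df.groupby('column_name').cumcount()
    feature_name_df['column_name'] = feature_name_df['column_name'].where(
        dup_count == 0, feature_name_df['column_name'] + '_' + dup_count.astype(str))
    # Convert the feature names to a list so they can be used as DataFrame column names
    feature_name = feature_name_df.iloc[:, 1].values.tolist()
    
    # Load the training and test feature data as DataFrames,
    # using feature_name as the column names
    X_train = pd.read_csv('human_activity/train/X_train.txt', sep=r'\s+', names=feature_name)
    X_test = pd.read_csv('human_activity/test/X_test.txt', sep=r'\s+', names=feature_name)
    
    # Load the training and test labels as DataFrames with a single column named action
    y_train = pd.read_csv('human_activity/train/y_train.txt', sep=r'\s+', names=['action'])
    y_test = pd.read_csv('human_activity/test/y_test.txt', sep=r'\s+', names=['action'])
    
    # Return all of the loaded training/test DataFrames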
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = get_human_dataset()

print('## Training feature data set info()')
X_train.info()

## Training feature data set info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7352 entries, 0 to 7351
Columns: 561 entries, tBodyAcc-mean()-X to angle(Z,gravityMean)
dtypes: float64(561)
memory usage: 31.5 MB

The training data set has 7352 records and 561 features.

X_train.head(3)

y_train['action'].value_counts()

6    1407
5    1374
4    1286
1    1226
2    1073
3     986
Name: action, dtype: int64

The labels take the values 1, 2, 3, 4, 5, and 6 and are fairly evenly distributed.

Prediction with default DecisionTreeClassifier parameters¶

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Set random_state so that repeated runs give the same predictions
dt_clf = DecisionTreeClassifier(random_state=156)
dt_clf.fit(X_train, y_train)
pred = dt_clf.predict(X_test)
accuracy = accuracy_score(y_test, pred)
print('Decision Tree prediction accuracy: {0:.4f}'.format(accuracy))

# Print the hyperparameters of the DecisionTreeClassifier
print('\nDecisionTreeClassifier default hyperparameters:\n', dt_clf.get_params())

Decision Tree prediction accuracy: 0.8548

DecisionTreeClassifier default hyperparameters:
 {'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'presort': False, 'random_state': 156, 'splitter': 'best'}

Training with every parameter left at its default gave an accuracy of about 85.48%.

Effect of max_depth on Decision Tree accuracy¶

from sklearn.model_selection import GridSearchCV

params = {
    'max_depth' : [6, 8, 10, 12, 16, 20, 24]
         }

grid_cv = GridSearchCV(dt_clf, param_grid=params, scoring='accuracy', cv=5, verbose=1,
                       return_train_score=True)  # required for the mean_train_score column used below
grid_cv.fit(X_train, y_train)
print('GridSearchCV best mean accuracy: {:.4f}'.format(grid_cv.best_score_))
print('GridSearchCV best hyperparameters: ', grid_cv.best_params_)

# Build a DataFrame from the cv_results_ attribute of the GridSearchCV object
scores_df = pd.DataFrame(grid_cv.cv_results_)
scores_df[['rank_test_score', 'params','mean_train_score', 'mean_test_score',  'split0_test_score',
           'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score']]

Fitting 5 folds for each of 7 candidates, totalling 35 fits

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  35 out of  35 | elapsed:  1.3min finished

GridSearchCV best mean accuracy: 0.8526
GridSearchCV best hyperparameters:  {'max_depth': 8}

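As max_depth grows, the training accuracy (mean_train_score) keeps rising, but the mean cross-validation accuracy (mean_test_score) is highest at max_depth = 8.
→ Setting max_depth too high can actually lower performance because of overfitting.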
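
The gap between training and validation scores can be read directly from cv_results_. The sketch below is a self-contained illustration on synthetic data from make_classification (an assumption, since the HAR files are not bundled with scikit-learn); the depth grid and data shape are arbitrary.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 500 samples, 20 features, 3 classes
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [2, 4, 8, 16]},
                    scoring="accuracy", cv=5,
                    return_train_score=True)  # train scores are not stored by default
grid.fit(X, y)

summary = pd.DataFrame(grid.cv_results_)[
    ["param_max_depth", "mean_train_score", "mean_test_score"]].copy()
# A growing gap between the two means is a sign of overfitting
summary["gap"] = summary["mean_train_score"] - summary["mean_test_score"]
print(summary)
```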

# Measure performance for each max_depth on the separate test data set instead of with GridSearch
max_depths = [6, 8, 10, 12, 16, 20, 24]

for depth in max_depths:
    dt_clf = DecisionTreeClassifier(max_depth=depth, random_state=156)
    dt_clf.fit(X_train, y_train)
    pred = dt_clf.predict(X_test)
    accuracy = accuracy_score(y_test, pred)
    print('max_depth = {0} accuracy: {1:.4f}'.format(depth, accuracy))

max_depth = 6 accuracy: 0.8558
max_depth = 8 accuracy: 0.8707
max_depth = 10 accuracy: 0.8673
max_depth = 12 accuracy: 0.8646
max_depth = 16 accuracy: 0.8575
max_depth = 20 accuracy: 0.8548
max_depth = 24 accuracy: 0.8548

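Here too, max_depth = 8 gives the highest accuracy.
→ When max_depth grows too large, the model overfits and performance drops. In other words, a simpler model with a lower depth can be more effective than an overly complex one.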

Tuning performance by varying max_depth and min_samples_split together¶

params = {
    'max_depth' : [6, 8, 10, 12, 16, 20, 24],
    'min_samples_split' : [16, 24]
}

grid_cv = GridSearchCV(dt_clf, param_grid=params, scoring='accuracy', cv=5, verbose=1,
                       return_train_score=True)  # required for the mean_train_score column used below
grid_cv.fit(X_train, y_train)
print('GridSearchCV best mean accuracy: {:.4f}'.format(grid_cv.best_score_))
print('GridSearchCV best hyperparameters: ', grid_cv.best_params_)

# Build a DataFrame from the cv_results_ attribute of the GridSearchCV object
scores_df = pd.DataFrame(grid_cv.cv_results_)
scores_df[['rank_test_score', 'params','mean_train_score', 'mean_test_score',  'split0_test_score', 
           'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score']]

Fitting 5 folds for each of 14 candidates, totalling 70 fits

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  70 out of  70 | elapsed:  2.7min finished

GridSearchCV best mean accuracy: 0.8550
GridSearchCV best hyperparameters:  {'max_depth': 8, 'min_samples_split': 16}

With max_depth = 8 and min_samples_split = 16, the mean cross-validation accuracy was highest at about 85.5%.

Predicting with these parameters

best_df_clf = grid_cv.best_estimator_
pred1 = best_df_clf.predict(X_test)
accuracy = accuracy_score(y_test, pred1)
print('Decision Tree prediction accuracy: {0:.4f}'.format(accuracy))

Decision Tree prediction accuracy: 0.8717

With max_depth = 8 and min_samples_split = 16, the model reached an accuracy of about 87.17% on the test data set.

Visualizing the importance of each feature in the Decision Tree: feature_importances_¶

import seaborn as sns

feature_importance_values = best_df_clf.feature_importances_
# Convert to a Series so the values are easy to sort and plot
feature_importances = pd.Series(feature_importance_values, index=X_train.columns)
# Sort the Series by importance and keep the top 20
feature_top20 = feature_importances.sort_values(ascending=False)[:20]

plt.figure(figsize=[8, 6])
plt.title('Feature Importances Top 20')
sns.barplot(x=feature_top20, y=feature_top20.index)
plt.show()

Output of X_train.head(3):
	tBodyAcc-mean()-X	tBodyAcc-mean()-Y	tBodyAcc-mean()-Z	tBodyAcc-std()-X	tBodyAcc-std()-Y	tBodyAcc-std()-Z	tBodyAcc-mad()-X	tBodyAcc-mad()-Y	tBodyAcc-mad()-Z	tBodyAcc-max()-X	...	fBodyBodyGyroJerkMag-meanFreq()	fBodyBodyGyroJerkMag-skewness()	fBodyBodyGyroJerkMag-kurtosis()	angle(tBodyAccMean,gravity)	angle(tBodyAccJerkMean),gravityMean)	angle(tBodyGyroMean,gravityMean)	angle(tBodyGyroJerkMean,gravityMean)	angle(X,gravityMean)	angle(Y,gravityMean)	angle(Z,gravityMean)
0	0.288585	-0.020294	-0.132905	-0.995279	-0.983111	-0.913526	-0.995112	-0.983185	-0.923527	-0.934724	...	-0.074323	-0.298676	-0.710304	-0.112754	0.030400	-0.464761	-0.018446	-0.841247	0.179941	-0.058627
1	0.278419	-0.016411	-0.123520	-0.998245	-0.975300	-0.960322	-0.998807	-0.974914	-0.957686	-0.943068	...	0.158075	-0.595051	-0.861499	0.053477	-0.007435	-0.732626	0.703511	-0.844788	0.180289	-0.054317
2	0.279653	-0.019467	-0.113462	-0.995380	-0.967187	-0.978944	-0.996520	-0.963668	-0.977469	-0.938692	...	0.414503	-0.390748	-0.760104	-0.118559	0.177899	0.100699	0.808529	-0.848933	0.180637	-0.049118

scores_df output of the GridSearchCV run over max_depth:
	rank_test_score	params	mean_train_score	mean_test_score	split0_test_score	split1_test_score	split2_test_score	split3_test_score	split4_test_score
0	4	{'max_depth': 6}	0.944848	0.850925	0.814111	0.873555	0.819728	0.865895	0.881471
1	1	{'max_depth': 8}	0.982693	0.852557	0.820896	0.827328	0.855102	0.868618	0.891008
2	4	{'max_depth': 10}	0.993403	0.850925	0.799864	0.813052	0.863265	0.891082	0.887602
3	7	{'max_depth': 12}	0.997212	0.844124	0.795115	0.813052	0.848980	0.877468	0.886240
4	2	{'max_depth': 16}	0.999660	0.852149	0.799864	0.822570	0.853061	0.887679	0.897820
5	3	{'max_depth': 20}	0.999966	0.851605	0.803256	0.822570	0.856463	0.877468	0.898501
6	6	{'max_depth': 24}	1.000000	0.850245	0.796472	0.822570	0.856463	0.877468	0.898501

scores_df output of the GridSearchCV run over max_depth and min_samples_split:
	rank_test_score	params	mean_train_score	mean_test_score	split0_test_score	split1_test_score	split2_test_score	split3_test_score	split4_test_score
0	10	{'max_depth': 6, 'min_samples_split': 16}	0.944202	0.847797	0.814111	0.868797	0.819728	0.866576	0.869891
1	12	{'max_depth': 6, 'min_samples_split': 24}	0.943589	0.846708	0.809362	0.868797	0.819728	0.865895	0.869891
2	1	{'max_depth': 8, 'min_samples_split': 16}	0.979802	0.855005	0.806649	0.830727	0.860544	0.874745	0.902589
3	4	{'max_depth': 8, 'min_samples_split': 24}	0.978204	0.851469	0.807327	0.830727	0.857143	0.872022	0.890327
4	3	{'max_depth': 10, 'min_samples_split': 16}	0.987419	0.852829	0.805292	0.817131	0.866667	0.884275	0.891008
5	2	{'max_depth': 10, 'min_samples_split': 24}	0.984188	0.854189	0.810719	0.819850	0.869388	0.881552	0.889646
6	14	{'max_depth': 12, 'min_samples_split': 16}	0.989391	0.845892	0.798507	0.811013	0.851020	0.884275	0.884877
7	13	{'max_depth': 12, 'min_samples_split': 24}	0.985753	0.846300	0.791723	0.820530	0.855782	0.880871	0.882834
8	11	{'max_depth': 16, 'min_samples_split': 16}	0.990445	0.847252	0.801221	0.815772	0.858503	0.876787	0.884196
9	5	{'max_depth': 16, 'min_samples_split': 24}	0.986739	0.849565	0.805970	0.821210	0.854422	0.878148	0.888283
10	8	{'max_depth': 20, 'min_samples_split': 16}	0.990445	0.848749	0.798507	0.815772	0.858503	0.876787	0.894414
11	6	{'max_depth': 20, 'min_samples_split': 24}	0.986739	0.849293	0.805292	0.821210	0.854422	0.878148	0.887602
12	8	{'max_depth': 24, 'min_samples_split': 16}	0.990445	0.848749	0.798507	0.815772	0.858503	0.876787	0.894414
13	6	{'max_depth': 24, 'min_samples_split': 24}	0.986739	0.849293	0.805292	0.821210	0.854422	0.878148	0.887602


[Chapter 4. Classification] Decision Tree Classifier