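Pandas : provides a wide range of functions for processing and transforming 2-D data made up of rows and columns
- DataFrame : a data structure that holds 2-D data consisting of multiple rows and columns
- Index : a key value that identifies each record, much like a PK in an RDBMS; both DataFrame and Series have an Index as their key (unlike a PK, pandas does not enforce uniqueness)
- Series : its biggest difference from a DataFrame is that it is a data structure with only one column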

Getting started with pandas - file loading, basic API¶

# Import the pandas module
import pandas as pd

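pandas provides convenient APIs for loading files in various formats into a DataFrame
- read_csv() (comma-separated files), read_table() (tab-separated files), read_fwf() (fixed-width files), etc.
Arguments of read_csv()
- sep : the delimiter character; the default is a comma (e.g., sep = '\t' for tab-separated files)
- filepath_or_buffer : the file path / if only a file name (or another relative path) is given, it is resolved against the current working directory, which for a notebook is usually the directory the notebook runs in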
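The sep argument described above can be tried without any file on disk. The following is a minimal sketch that assumes a tiny made-up tab-separated text (tsv_text and sample_df are hypothetical names, not part of the Titanic data) and reads it through io.StringIO.

```python
import io
import pandas as pd

# Hypothetical tab-separated content standing in for a file on disk.
tsv_text = "PassengerId\tPclass\tName\n1\t3\tBraund\n2\t1\tCumings\n"

# sep='\t' tells read_csv to split on tabs instead of the default comma.
sample_df = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(sample_df.shape)
print(list(sample_df.columns))
```

With the default sep, the same text would be read as a single column because it contains no commas.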

titanic_df = pd.read_csv('Titanic/input/train.csv')
print('type of titanic_df: ', type(titanic_df))

type of titanic_df:  <class 'pandas.core.frame.DataFrame'>

DataFrame.head() : returns the first few rows of the DataFrame; by default the first 5 rows

titanic_df.head(3)

DataFrame.shape : the number of rows and columns of the DataFrame, as a tuple

titanic_df.shape

(891, 12)

DataFrame.info() : prints the total number of rows, the data type of each column, and the non-null count of each column

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

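DataFrame.describe() : shows the percentile distribution (25%, 50%, 75%), mean, standard deviation, maximum, and minimum of the numeric columns (object-type columns are excluded from the output by default)

→ gives a rough picture of the distribution of each numeric column
→ count : the number of non-null values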

titanic_df.describe()

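Series.value_counts() : returns the number of occurrences of each distinct value in the given column

→ here value_counts() is called on a Series (a single column); DataFrame.value_counts() was only added in pandas 1.1, so on older versions it cannot be called on a DataFrame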
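As a small self-contained illustration of value_counts(), the sketch below uses a made-up six-element Series (pclass is a hypothetical stand-in for the Pclass column) and also shows the normalize=True option, which returns proportions instead of counts.

```python
import pandas as pd

# Hypothetical stand-in for the Pclass column.
pclass = pd.Series([3, 1, 3, 2, 3, 1], name='Pclass')

# Counts per distinct value, sorted from most to least frequent.
counts = pclass.value_counts()
print(counts)

# normalize=True returns the share of each value instead of the raw count.
ratios = pclass.value_counts(normalize=True)
print(ratios)
```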

titanic_df['Pclass'].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

type() : shows the data type of the object inside the parentheses

type(titanic_df['Pclass'])

pandas.core.series.Series

A Series is a data set made up of an index and exactly one column

titanic_df['Pclass'].head()

0    3
1    1
2    3
3    1
4    3
Name: Pclass, dtype: int64

→ the 0, 1, 2, 3, 4 on the left are the index values (identical to the index of the original DataFrame), and the 3, 1, 3, 1, 3 on the right are the data values of the Series

value_counts = titanic_df['Pclass'].value_counts()
print(type(value_counts))
print(value_counts)

<class 'pandas.core.series.Series'>
3    491
1    216
2    184
Name: Pclass, dtype: int64

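→ the object returned by value_counts() is also a Series, so the 3, 1, 2 on the left are its index
→ as shown above, an index does not have to be sequential values such as 0, 1, 2, 3 ...; any column values can serve as an index as long as they are unique (strings are allowed too)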
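To make the point about non-sequential indexes concrete, here is a minimal sketch with a string index. The values and labels are made up for illustration.

```python
import pandas as pd

# A Series whose index consists of unique strings rather than 0, 1, 2, ...
fares = pd.Series([7.25, 71.28, 7.92], index=['a', 'b', 'c'])

print(fares.index)   # the string labels
print(fares['b'])    # look up a value by its label
```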

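Converting between DataFrame, list, dictionary, and NumPy ndarray¶

A DataFrame can be created from various kinds of data such as Python lists, dictionaries, and NumPy ndarrays, and it can be converted back into them as well
Many scikit-learn APIs accept a DataFrame as an argument, but in most cases a NumPy ndarray is the basic input type

NumPy ndarray, list, dictionary --> DataFrame¶

Unlike an ndarray or a list, a DataFrame has column names
→ a DataFrame is created by passing the data as an ndarray, list, or dictionary together with a list of column names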

import numpy as np

col_name1 = ['col1']
list1 = [1, 2, 3]
array1 = np.array(list1)
print('array1 shape :', array1.shape)

# Create a DataFrame from the list
df_list1 = pd.DataFrame(list1, columns=col_name1)
print('\nDataFrame from a 1-D list: \n', df_list1)

# Create a DataFrame from the NumPy ndarray
df_array1 = pd.DataFrame(array1, columns=col_name1)
print('\nDataFrame from a 1-D ndarray: \n', df_array1)

array1 shape : (3,)

DataFrame from a 1-D list: 
    col1
0     1
1     2
2     3

DataFrame from a 1-D ndarray: 
    col1
0     1
1     2
2     3

In the example above the DataFrame was built from 1-D data, so only a single column name is possible

# Create 2-D DataFrames from a 2x3 list and a 2x3 ndarray

# Three column names are needed
col_name2 = ['col1', 'col2', 'col3']

# Build the 2x3 list and ndarray, then convert each to a DataFrame
list2 = [[1, 2, 3],
         [11, 12, 13]]
array2 = np.array(list2)
print('array2 shape: ', array2.shape)

df_list2 = pd.DataFrame(list2, columns=col_name2)
print('\nDataFrame from a 2-D list: \n', df_list2)

df_array2 = pd.DataFrame(array2, columns=col_name2)
print('\nDataFrame from a 2-D ndarray: \n', df_array2)

array2 shape:  (2, 3)

DataFrame from a 2-D list: 
    col1  col2  col3
0     1     2     3
1    11    12    13

DataFrame from a 2-D ndarray: 
    col1  col2  col3
0     1     2     3
1    11    12    13

Dictionary : each key becomes a column name, and each value becomes the data of that column

# Keys map to column names (strings); values map to the column data as a list (or ndarray)
dic = {'col1' : [1, 11], 'col2' : [2, 22], 'col3' : np.array([3, 33])}
df_dict = pd.DataFrame(dic)
print('DataFrame from a dictionary: \n', df_dict)

DataFrame from a dictionary: 
    col1  col2  col3
0     1     2     3
1    11    22    33

DataFrame --> NumPy ndarray, list, dictionary¶

DataFrame.values : converts the DataFrame to a NumPy ndarray
→ important because many machine learning packages use the NumPy ndarray as their basic data type
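Besides the .values attribute used in these notes, pandas 0.24 and later also provide DataFrame.to_numpy(), which the pandas documentation recommends for new code. The sketch below rebuilds a small frame shaped like df_dict (the variable names are illustrative) and checks that both routes give the same array.

```python
import numpy as np
import pandas as pd

# A small frame with the same layout as the dictionary example in these notes.
df = pd.DataFrame({'col1': [1, 11], 'col2': [2, 22], 'col3': [3, 33]})

arr_values = df.values        # attribute used throughout these notes
arr_to_numpy = df.to_numpy()  # method form, available since pandas 0.24

print(type(arr_to_numpy), arr_to_numpy.shape)
print(np.array_equal(arr_values, arr_to_numpy))
```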

# Convert the DataFrame to an ndarray
array3 = df_dict.values
print('df_dict.values type: ', type(array3), '\ndf_dict.values shape: ', array3.shape)
print(array3)

df_dict.values type:  <class 'numpy.ndarray'> 
df_dict.values shape:  (2, 3)
[[ 1  2  3]
 [11 22 33]]

→ the ndarray is created from the values only, without the DataFrame's column names

→ calling tolist() on the ndarray converted above turns it into a list

# Convert the DataFrame to a list
list3 = array3.tolist()
print('df_dict.values.tolist() type: ', type(list3))
print(list3)

df_dict.values.tolist() type:  <class 'list'>
[[1, 2, 3], [11, 22, 33]]

DataFrame.to_dict() : converts the DataFrame to a dictionary; passing 'list' as the argument returns the dictionary values as lists

# Convert the DataFrame to a dictionary
dict3 = df_dict.to_dict('list')
print('df_dict.to_dict() type: ', type(dict3))
dict3

df_dict.to_dict() type:  <class 'dict'>

{'col1': [1, 11], 'col2': [2, 22], 'col3': [3, 33]}

Creating and modifying DataFrame columns¶

Columns can be created and modified with the [ ] operator

# Add a new column named Age_0 to the Titanic DataFrame and set every row to 0
titanic_df['Age_0'] = 0
titanic_df.head(3)

# Create Age_by_10 by multiplying each Age value by 10
titanic_df['Age_by_10'] = titanic_df['Age']*10

# Create Family_No as Parch + SibSp + 1
titanic_df['Family_No'] = titanic_df['Parch'] + titanic_df['SibSp']+1

titanic_df.head(3)

Existing columns in the DataFrame can be updated just as easily in the same way

# Add 100 to every value in the Age_by_10 column
titanic_df['Age_by_10'] = titanic_df['Age_by_10'] + 100
titanic_df.head(3)

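Deleting data from a DataFrame¶

drop() : used to delete rows or columns

DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

→ the most important parameters are labels, axis, and inplace

labels : the index label(s) or column name(s) to delete; pass a list to delete several at once
axis = 0 : drop rows --> labels are interpreted as index values
axis = 1 : drop columns --> pass the column name(s) as labels together with axis=1 to drop those columns
- columns are dropped most often, but rows are dropped when, for example, removing outliers
inplace : the default is False, which leaves the original DataFrame untouched and returns a new DataFrame with the data removed. With True, the data is removed from the original and nothing (None) is returned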
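The behaviour of labels, axis, and inplace can be checked on a tiny frame. This is a minimal sketch with made-up column names a, b, c; it also uses the columns= keyword from the signature above as an alternative to labels plus axis=1.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# inplace=False (default): a new DataFrame is returned, df itself is unchanged.
no_b = df.drop('b', axis=1)

# axis=0: the labels are interpreted as index values, so row 0 is dropped.
no_row0 = df.drop([0], axis=0)

# inplace=True: df itself is modified and the call returns None.
result = df.drop(columns=['c'], inplace=True)

print(list(no_b.columns), list(no_row0.index), list(df.columns), result)
```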

titanic_drop_df = titanic_df.drop('Age_0', axis=1)
titanic_drop_df.head(3)

'Age_0' has been removed from titanic_drop_df, but checking titanic_df shows it is still there

titanic_df.head(3)

# Set inplace=True to delete from the original DataFrame as well
drop_result = titanic_df.drop(['Age_0', 'Age_by_10', 'Family_No'], axis=1, inplace=True)
print('value returned by drop with inplace=True: ', drop_result)
titanic_df.head(3)

value returned by drop with inplace=True:  None

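Because inplace=True returns nothing, code like the following must be avoided (titanic_df is rebound to None, so the original DataFrame is simply lost)

# Anti-pattern, do not run: the result of an inplace drop is None
titanic_df = titanic_df.drop(['Age_0', 'Age_by_10', 'Family_No'], axis=1, inplace=True)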

### Set axis=0 to delete the rows with index 0, 1, 2
titanic_df.drop([0, 1, 2], axis=0, inplace=True)
titanic_df.head(3)

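Index object¶

Like a PK in an RDBMS, the pandas Index object identifies each record of a DataFrame or Series

The Index object alone can be extracted through DataFrame.index or Series.index
→ the values of the returned Index object (Index.values) form a 1-D NumPy ndarray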

# Reload the original file
titanic_df = pd.read_csv('Titanic/input/train.csv')

# Extract the Index object
index = titanic_df.index

# The values of the Index object are a 1-D ndarray
print('array data of the Index object:')
index.values

array data of the Index object:

array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,
       182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
       195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,
       208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220,
       221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233,
       234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246,
       247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259,
       260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272,
       273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285,
       286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298,
       299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311,
       312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324,
       325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337,
       338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350,
       351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363,
       364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376,
       377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389,
       390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402,
       403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415,
       416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428,
       429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441,
       442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454,
       455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467,
       468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480,
       481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493,
       494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506,
       507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519,
       520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532,
       533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545,
       546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558,
       559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571,
       572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584,
       585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597,
       598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610,
       611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623,
       624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636,
       637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649,
       650, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662,
       663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675,
       676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688,
       689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701,
       702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714,
       715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 726, 727,
       728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740,
       741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753,
       754, 755, 756, 757, 758, 759, 760, 761, 762, 763, 764, 765, 766,
       767, 768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779,
       780, 781, 782, 783, 784, 785, 786, 787, 788, 789, 790, 791, 792,
       793, 794, 795, 796, 797, 798, 799, 800, 801, 802, 803, 804, 805,
       806, 807, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 818,
       819, 820, 821, 822, 823, 824, 825, 826, 827, 828, 829, 830, 831,
       832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844,
       845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857,
       858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870,
       871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883,
       884, 885, 886, 887, 888, 889, 890])

Like an ndarray, the Index supports single-value access and slicing

print(type(index.values))
print(index.values.shape)
print(index[:5].values)
print(index.values[:5])
print(index[6])

<class 'numpy.ndarray'>
(891,)
[0 1 2 3 4]
[0 1 2 3 4]
6

Once created, an Index object cannot be modified in place.

# An Index is immutable, so assigning to one of its elements as below is not allowed
index[0] = 5

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-26-ace474fffa0d> in <module>
      1 # An Index is immutable, so assigning to one of its elements as below is not allowed
----> 2 index[0] = 5

/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   3936 
   3937     def __setitem__(self, key, value):
-> 3938         raise TypeError("Index does not support mutable operations")
   3939 
   3940     def __getitem__(self, key):

TypeError: Index does not support mutable operations
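The error above can be reproduced safely inside try/except. The sketch below uses a small made-up frame and also shows that, although single elements cannot be changed, the whole index can be replaced by assigning a new sequence of labels to DataFrame.index.

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30]})
idx = df.index

# Element-wise assignment on an Index raises TypeError.
try:
    idx[0] = 5
    mutated = True
except TypeError as err:
    mutated = False
    print('TypeError:', err)

# Replacing the entire index at once is allowed.
df.index = ['a', 'b', 'c']
print(df)
```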

A Series object contains an Index object, but when an operation is applied to the Series the Index is left out of the computation (it is used purely for identification)

series_fare = titanic_df['Fare']
print('Fare Series max: ', series_fare.max())
print('Fare Series sum: ', series_fare.sum())
print('sum() of Fare Series: ', sum(series_fare))
print('Fare Series + 3: \n', (series_fare+3).head(3))

Fare Series max:  512.3292
Fare Series sum:  28693.9493
sum() of Fare Series:  28693.949299999967
Fare Series + 3: 
 0    10.2500
1    74.2833
2    10.9250
Name: Fare, dtype: float64

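reset_index() : assigns a new sequential integer index and adds the existing index as a new column named index
→ mainly used when the existing index is not a sequential integer index and needs to be turned back into one
- calling reset_index on a Series returns a DataFrame (with drop=True it stays a Series)
- with the parameter drop=True, the existing index is discarded instead of being kept as a new column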
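The difference between the default behaviour and drop=True is easiest to see side by side. A minimal sketch on a made-up frame with a non-sequential index:

```python
import pandas as pd

# A frame whose index is not 0, 1, 2 (for example after filtering rows).
df = pd.DataFrame({'v': [10, 20, 30]}, index=[5, 7, 9])

kept = df.reset_index()              # old index preserved in a column named 'index'
dropped = df.reset_index(drop=True)  # old index thrown away

print(kept)
print(dropped)
```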

titanic_reset_df = titanic_df.reset_index(inplace=False)
titanic_reset_df.head(3)

print('### before reset_index ###')
value_counts = titanic_df['Pclass'].value_counts()
print(value_counts)
print('type of value_counts: ', type(value_counts))
new_value_counts = value_counts.reset_index(inplace=False)
print('\n### after reset_index ###')
print(new_value_counts)
print('type of new_value_counts: ', type(new_value_counts))

### before reset_index ###
3    491
1    216
2    184
Name: Pclass, dtype: int64
type of value_counts:  <class 'pandas.core.series.Series'>

### after reset_index ###
   index  Pclass
0      3     491
1      1     216
2      2     184
type of new_value_counts:  <class 'pandas.core.frame.DataFrame'>

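Data selection and filtering¶

Data can be extracted with [ ], iloc[ ], and loc[ ] (the older ix[ ] indexer was deprecated in pandas 0.20 and removed in pandas 1.0, so it is not covered here)

(1) DataFrame[ ]¶

DataFrame['column name'] : retrieves the data of that column; to retrieve several columns, pass a list of column names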

# Extract a single column
titanic_df['Pclass'].head(3)

0    3
1    1
2    3
Name: Pclass, dtype: int64

# Extract multiple columns
titanic_df[['Survived', 'Pclass', 'Name']].head(3)

DataFrame[boolean condition] : returns the rows that satisfy the condition via boolean indexing; a handy, frequently used method

titanic_df[titanic_df['Pclass']==3].head(3)

(2) DataFrame.iloc[ row position , column position ]¶

Position-based indexing: data is retrieved by specifying the integer positions of the row and column

titanic_df.iloc[0, 4]

'male'

Slicing by position also works

titanic_df.iloc[4:7, 2:5]

Specific positions can also be given as lists

titanic_df.iloc[[4,5,6], [1, 2, 5]]

# Return everything
titanic_df[:]

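(3) DataFrame.loc[ row index , column name ]¶

Label-based extraction: put the index value in the row position (a number if the index is numeric, a string if it is a string) and the column name in the column position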

titanic_df.loc[3, 'Name']

'Futrelle, Mrs. Jacques Heath (Lily May Peel)'

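Caution :¶

Normally, a slice with ' : ' covers 'start ~ end-1', but

slicing with loc[ ] covers 'start ~ end' (the end label is included)

→ because it is label-based indexing (and labels are not necessarily numeric), subtracting 1 from the end is not meaningful

→ be careful with ranges when the index is numeric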

# Result with loc, label-based indexing (rows 221, 222, 223, 224)
titanic_df.loc[221:224, ]

# Result with iloc, position-based indexing (rows 221, 222, 223)
titanic_df.iloc[221:224, ]
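Since the outputs of the two slices are not shown here, the difference can be verified on a small frame with a default integer index (df, by_label, and by_pos are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'v': range(10)})

by_label = df.loc[2:4]   # label-based: end label 4 is included
by_pos = df.iloc[2:4]    # position-based: end position 4 is excluded

print(len(by_label), by_label.index.tolist())
print(len(by_pos), by_pos.index.tolist())
```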

(4) Boolean indexing¶

Available in DataFrame[ ] and DataFrame.loc[ ]

titanic_boolean = titanic_df[titanic_df['Age']>60]
print(type(titanic_boolean))
print(titanic_boolean.shape)
titanic_boolean.head(3)

<class 'pandas.core.frame.DataFrame'>
(22, 12)

Applying boolean indexing to [ ] also returns a DataFrame, so just the desired columns can be selected from the result

titanic_df[titanic_df['Age']>60][['Name', 'Pclass','Survived']].head(3)

The same can be done with loc[ ]

titanic_df.loc[titanic_df['Age']>60,['Name', 'Pclass', 'Survived']].head(3)

Boolean indexing also works with several conditions combined (wrap each condition in parentheses)

and condition : &
or condition : |
not condition : ~

titanic_df[ (titanic_df['Age']>60) & (titanic_df['Pclass']==1) & (titanic_df['Sex']=='female')]

The individual conditions can be assigned to variables and then combined to perform the same boolean indexing

cond1 = titanic_df['Age'] > 60
cond2 = titanic_df['Pclass'] == 1
cond3 = titanic_df['Sex'] == 'female'

titanic_df[cond1 & cond2 & cond3]
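The examples above only combine conditions with &. The sketch below, on a made-up four-row frame, shows | and ~ as well. Each comparison is wrapped in parentheses because &, |, and ~ bind more tightly than comparison operators in Python.

```python
import pandas as pd

df = pd.DataFrame({'Age': [70, 30, 65, 10], 'Pclass': [1, 3, 3, 1]})

# or: rows where either condition holds
old_or_first = df[(df['Age'] > 60) | (df['Pclass'] == 1)]

# not: rows where the condition does NOT hold
not_old = df[~(df['Age'] > 60)]

print(old_or_first.index.tolist())
print(not_old.index.tolist())
```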

Sorting, aggregate functions, and GroupBy¶

Sorting a DataFrame or Series - sort_values()¶

Main parameters : by, ascending, inplace
- by : sorts by the given column(s)
- ascending : True -> ascending / False -> descending (default is True)
- inplace : False -> keeps the original DataFrame and returns the sorted data / True -> sorts the original DataFrame itself (default is False)

titanic_sorted = titanic_df.sort_values(by=['Name'])
titanic_sorted.head(3)

To sort by several columns, pass the column names to by as a list

titanic_sorted = titanic_df.sort_values(by=['Pclass', 'Name'], ascending = False)
titanic_sorted.head(3)

Applying aggregation functions : min(), max(), sum(), count()¶

Calling one directly on a DataFrame applies the aggregation to every column

# count() excludes nulls
titanic_df.count()

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

titanic_df[['Age', 'Fare']].mean()

Age     29.699118
Fare    32.204208
dtype: float64

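Applying groupby()¶

Passing a column to the by parameter of groupby() groups the data by that column
Calling groupby on a DataFrame returns a DataFrameGroupBy object, which is not itself a DataFrame; a result only materializes once an aggregation is applied to it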

titanic_groupby = titanic_df.groupby(by='Pclass')
titanic_groupby

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11e225208>

Unlike SQL's GROUP BY, applying an aggregation function to the result of groupby() applies it to every column except the grouping column

titanic_groupby = titanic_df.groupby('Pclass').count()
titanic_groupby

=> in SQL, applying an aggregation function to several columns requires listing every target column in the SELECT clause

  Select count(PassengerId), count(Survived), count(Name) ..... 
  from titanic 
  group by Pclass

To apply an aggregation function to specific columns only, select those columns on the DataFrameGroupBy object returned by groupby() and then apply the function

titanic_groupby = titanic_df.groupby('Pclass')[['PassengerId', 'Survived']].count()
titanic_groupby

To apply several aggregation functions at once, pass them as arguments to agg() on the DataFrameGroupBy object

titanic_df.groupby('Pclass')['Age'].agg(['max','min'])

To apply different aggregation functions to different columns, pass a dictionary to agg
```
=> agg({'col1': 'mean', 'col2': 'max', 'col3': 'min'})
```

agg_format = { 'Age' : 'max', 'SibSp' : 'mean', 'Fare' : 'max'}
titanic_df.groupby('Pclass').agg(agg_format)

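Handling missing data¶

In pandas, a NULL value is represented as NumPy's NaN
Most machine learning algorithms do not handle NaN values, so they have to be replaced with other values
Also, NaN values are excluded from aggregate computations such as the mean and the sum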
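That NaN is skipped by aggregations can be confirmed on a three-element Series. This is a minimal sketch with made-up values; skipna=False is shown only to contrast with the default behaviour.

```python
import numpy as np
import pandas as pd

age = pd.Series([20.0, np.nan, 40.0])

print(age.mean())    # NaN is skipped: (20 + 40) / 2
print(age.sum())     # NaN is skipped: 20 + 40
print(age.count())   # number of non-NaN values

# With skipna=False the NaN propagates into the result instead.
print(age.mean(skipna=False))
```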

isnull() or isna() : checks whether each value is NaN

titanic_df.isnull().head()

isnull().sum() counts the number of NaN values in each column

titanic_df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

fillna() : replaces missing data

titanic_df['Cabin'] = titanic_df['Cabin'].fillna('C000')
titanic_df.head(3)

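Note that the data only actually changes if the value returned by fillna() is assigned back, or if the inplace=True parameter is added (on a selected column such as titanic_df['Age'], inplace=True is unreliable in recent pandas versions, so assigning the result back is the safer choice)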
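The point about assigning the result back can be demonstrated on a small made-up frame: calling fillna() and discarding its return value leaves the data unchanged.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Cabin': ['C85', np.nan, np.nan]})

# The returned Series is discarded, so df is not modified.
df['Cabin'].fillna('C000')
missing_before = int(df['Cabin'].isnull().sum())

# Assigning the result back is what actually updates the column.
df['Cabin'] = df['Cabin'].fillna('C000')
missing_after = int(df['Cabin'].isnull().sum())

print(missing_before, missing_after)
```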

# Replace NaN in the Age column with the mean, and NaN in the Embarked column with 'S'
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())
titanic_df['Embarked'] = titanic_df['Embarked'].fillna('S')
titanic_df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

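Transforming data with apply and lambda expressions¶

Useful when more complex data transformations are needed
lambda expressions are Python's support for functional-style programming

# Define a function that returns the square of its input
def get_square(a):
    return a**2

print('3 squared is: ', get_square(3))

3 squared is:  9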

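The function above first declares the function name and input argument, as in def get_square(a):, then processes the argument inside the function and returns the result with return

A lambda condenses the function declaration and its body into a single-line expression

# Convert the function above to a lambda expression
lambda_square = lambda x : x**2
print('3 squared is: ', lambda_square(3))

3 squared is:  9

In lambda x : x**2, the x before ' : ' is the input argument and the x**2 after it is the expression computed from that argument; its result is returned when the lambda is called

To apply the lambda to several input values, combine it with the map() function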

a = [1, 2, 3]
squares = map(lambda x : x**2, a)
list(squares)

[1, 4, 9]

The lambda expressions used with a pandas DataFrame are plain Python lambdas; they are used by passing the lambda to apply

# Store the length of each Name string in a separate column, Name_len
titanic_df['Name_len']  = titanic_df['Name'].apply(lambda x : len(x))
titanic_df[['Name', 'Name_len']].head(3)

# Use if/else to add a new column Child_Adult: 'Child' if the age is 15 or under, otherwise 'Adult'
titanic_df['Child_Adult'] = titanic_df['Age'].apply(lambda x : 'Child' if x <= 15 else 'Adult')
titanic_df[['Age', 'Child_Adult']].head(8)

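When using if else in a lambda expression, note that the return value is written before the if condition; for else, the return value simply follows else

This is because the return value must sit on the right-hand side of the ' : ' in the lambda expression
So it is not lambda x : if x <=15 'Child' else 'Adult' but lambda x : 'Child' if x<=15 else 'Adult'

A lambda supports only if and else, not else if (elif); to get the effect of else if, nest another if else inside ( ) in the else branch
- the nested if else inside else( ) also puts the return value before the if
- however, this approach makes the code hard to read when there are many else if conditions

### Create an 'Age_Cat' column: 15 or under is Child, over 15 up to 60 is Adult, over 60 is Elderly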
titanic_df['Age_Cat'] = titanic_df['Age'].apply(lambda x : 'Child' if x<=15 else('Adult' if x<=60 else 'Elderly'))
titanic_df['Age_Cat'].value_counts()

Adult      786
Child       83
Elderly     22
Name: Age_Cat, dtype: int64

### Define a function that assigns finer-grained age categories
def get_category(age):
    cat = ''
    if age <= 5 : 
        cat = 'Baby'
    elif age <= 12 :
        cat = 'Child'
    elif age <= 18 :
        cat = 'Teenager'
    elif age <= 25 :
        cat = 'Student'
    elif age <= 35 :
        cat = 'Young Adult'
    elif age <= 60 :
        cat = 'Adult'
    else :
        cat = 'Elderly'
    return cat

# Call the get_category( ) function defined above from inside the lambda
# get_category(x) takes an 'Age' value as input and returns the matching cat
titanic_df['Age_cat'] = titanic_df['Age'].apply(lambda x : get_category(x))
titanic_df[['Age', 'Age_cat']].head()

# Equivalent: pass the function to apply directly, without wrapping it in a lambda
titanic_df['Age_cat'] = titanic_df['Age'].apply(get_category)
titanic_df[['Age', 'Age_cat']].head()
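As an alternative to a chain of elif branches, binning by thresholds can also be done with pd.cut. This is a different technique from the apply approach used in these notes and is shown only as a sketch. The bin edges mirror the thresholds in get_category, while the labels for the last two bins ('Adult' up to 60, 'Elderly' above 60) and the sample ages are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ages, one per category.
ages = pd.Series([3, 10, 17, 22, 30, 50, 70])

# Right-inclusive bins: (-inf, 5], (5, 12], (12, 18], (18, 25], (25, 35], (35, 60], (60, inf)
bins = [-np.inf, 5, 12, 18, 25, 35, 60, np.inf]
labels = ['Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Elderly']

# pd.cut assigns each age the label of the bin it falls into.
age_cat = pd.cut(ages, bins=bins, labels=labels)
print(age_cat.tolist())
```

pd.cut returns a categorical Series, and NaN ages stay NaN instead of being assigned to a bin.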

Output of titanic_df.describe():

       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

Output of titanic_df.groupby('Pclass')['Age'].agg(['max','min']):

         max   min
Pclass
1       80.0  0.92
2       70.0  0.67
3       74.0  0.42

Output of titanic_df.groupby('Pclass').agg(agg_format):

         Age     SibSp      Fare
Pclass
1       80.0  0.416667  512.3292
2       70.0  0.402174   73.5000
3       74.0  0.615071   69.5500


Data analysis and machine learning notes

[ Chapter 1. Intro ] Pandas overview