데이터 전처리¶
- 결측치 처리
이상치 처리
데이터 : data_atype.zip (수업노트에서 다운로드)
라이브러리 및 데이터 불러오기¶
In [1]:
import pandas as pd
X_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_train.csv')
y_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\y_train.csv')
X_test = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_test.csv')
결측치¶
In [2]:
# 데이터 샘플 확인
X_train.sample()
Out[2]:
id | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2042 | 12973 | 42.0 | Private | 195124 | 7th-8th | 4 | Married-spouse-absent | Prof-specialty | Other-relative | White | Male | 0 | 0 | 35.0 | Puerto-Rico |
In [3]:
# 결측치 컬럼 확인
X_train.isnull().sum()
Out[3]:
id 0 age 12 workclass 1662 fnlwgt 0 education 0 education.num 0 marital.status 0 occupation 1668 relationship 0 race 0 sex 0 capital.gain 0 capital.loss 0 hours.per.week 13 native.country 537 dtype: int64
In [4]:
# 데이터 타입 확인
X_train.info()
# 결측치가 있는 컬럼의 타입 확인 - 수치형인지 범주형(object)인지
<class 'pandas.core.frame.DataFrame'> RangeIndex: 29304 entries, 0 to 29303 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 29304 non-null int64 1 age 29292 non-null float64 2 workclass 27642 non-null object 3 fnlwgt 29304 non-null int64 4 education 29304 non-null object 5 education.num 29304 non-null int64 6 marital.status 29304 non-null object 7 occupation 27636 non-null object 8 relationship 29304 non-null object 9 race 29304 non-null object 10 sex 29304 non-null object 11 capital.gain 29304 non-null int64 12 capital.loss 29304 non-null int64 13 hours.per.week 29291 non-null float64 14 native.country 28767 non-null object dtypes: float64(2), int64(5), object(8) memory usage: 3.4+ MB
In [6]:
# workclass 컬럼 고유 값 개수
X_train.describe(include='O')
X_train['workclass'].value_counts()
Out[6]:
Private 20451 Self-emp-not-inc 2292 Local-gov 1863 State-gov 1170 Self-emp-inc 991 Federal-gov 856 Without-pay 13 Never-worked 6 Name: workclass, dtype: int64
In [7]:
# occupation 컬럼 고유 값 개수
X_train['occupation'].value_counts()
# 결측치를 채우기에는 workclass처럼 최빈값으로 하기에는 무리가 있겠다 판단
Out[7]:
Prof-specialty 3683 Craft-repair 3677 Exec-managerial 3632 Adm-clerical 3411 Sales 3295 Other-service 2993 Machine-op-inspct 1822 Transport-moving 1434 Handlers-cleaners 1223 Farming-fishing 891 Tech-support 842 Protective-serv 590 Priv-house-serv 135 Armed-Forces 8 Name: occupation, dtype: int64
In [10]:
# native.country 컬럼 고유 값 개수
X_train['native.country'].value_counts()
Out[10]:
United-States 26240 Mexico 576 Philippines 178 Germany 120 Canada 109 Puerto-Rico 98 India 91 El-Salvador 91 Cuba 87 England 79 Jamaica 77 South 75 China 70 Italy 68 Vietnam 64 Dominican-Republic 63 Japan 56 Columbia 55 Guatemala 53 Poland 50 Taiwan 50 Haiti 40 Iran 37 Portugal 33 Nicaragua 32 Greece 27 Ecuador 26 France 26 Peru 24 Ireland 22 Hong 19 Thailand 17 Cambodia 17 Laos 17 Yugoslavia 15 Trinadad&Tobago 14 Hungary 13 Honduras 13 Scotland 12 Outlying-US(Guam-USVI-etc) 12 Holand-Netherlands 1 Name: native.country, dtype: int64
범주형 변수 결측치¶
- 삭제
- 최빈값
- 없는값
삭제¶
In [11]:
# X_train과 X_test 데이터 크기 확인
X_train.shape, X_test.shape
Out[11]:
((29304, 15), (3257, 15))
In [12]:
# 결측치 확인
X_train.isnull().sum()
Out[12]:
id 0 age 12 workclass 1662 fnlwgt 0 education 0 education.num 0 marital.status 0 occupation 1668 relationship 0 race 0 sex 0 capital.gain 0 capital.loss 0 hours.per.week 13 native.country 537 dtype: int64
In [14]:
# 결측치가 있는 데이터(행) 전체 삭제 및 확인 dropna() #기본값 axis=0
df = X_train.dropna()
df.isnull().sum()
Out[14]:
id 0 age 0 workclass 0 fnlwgt 0 education 0 education.num 0 marital.status 0 occupation 0 relationship 0 race 0 sex 0 capital.gain 0 capital.loss 0 hours.per.week 0 native.country 0 dtype: int64
In [15]:
df.shape
Out[15]:
(27096, 15)
In [16]:
# 특정컬럼에 결측치가 있으면 데이터(행) 삭제 subset=['native.country']
df = X_train.dropna(subset=['native.country'])
df.isnull().sum()
Out[16]:
id 0 age 12 workclass 1640 fnlwgt 0 education 0 education.num 0 marital.status 0 occupation 1646 relationship 0 race 0 sex 0 capital.gain 0 capital.loss 0 hours.per.week 13 native.country 0 dtype: int64
In [17]:
# 특정컬럼에 결측치가 있으면 데이터(행) 삭제 subset=['native.country', 'workclass']
df = X_train.dropna(subset=['native.country', 'workclass'])
df.isnull().sum()
Out[17]:
id 0 age 12 workclass 0 fnlwgt 0 education 0 education.num 0 marital.status 0 occupation 6 relationship 0 race 0 sex 0 capital.gain 0 capital.loss 0 hours.per.week 13 native.country 0 dtype: int64
In [18]:
# 결측치가 있는 컬럼 삭제 dropna(axis=1)
df= X_train.dropna(axis=1)
df.isnull().sum()
# 작업형1에서는 나올수도 있음
Out[18]:
id 0 fnlwgt 0 education 0 education.num 0 marital.status 0 relationship 0 race 0 sex 0 capital.gain 0 capital.loss 0 dtype: int64
In [19]:
# 결측치가 많은 특정 컬럼 삭제 drop(['workclass'], axis=1)
df= X_train.drop(['workclass'], axis=1)
df.isnull().sum()
Out[19]:
id 0 age 12 fnlwgt 0 education 0 education.num 0 marital.status 0 occupation 1668 relationship 0 race 0 sex 0 capital.gain 0 capital.loss 0 hours.per.week 13 native.country 537 dtype: int64
In [21]:
# 중복값 제거 drop_duplicates()
print(X_train.shape)
df = X_train.drop_duplicates()
df.shape
(29304, 15)
Out[21]:
(29304, 15)
In [ ]:
# 특정 컬럼의 중복값 제거
# df.drop_duplicates(subset=['A'])
# df.drop_duplicates(subset=['A', 'B'], keep='last')
# 기본적으로 뒤에 나오는 값을 삭제
# last 설정시 뒤에 값을 살리고 앞의 값을 삭제
채우기¶
In [23]:
# 최빈값
m = X_train['workclass'].mode()[0]
X_train['workclass'] = X_train['workclass'].fillna(m)
X_train.isnull().sum()
# 결측치를 채울때 넣는 컬럼 지정해주기 주의!!
Out[23]:
id 0 age 12 workclass 0 fnlwgt 0 education 0 education.num 0 marital.status 0 occupation 1668 relationship 0 race 0 sex 0 capital.gain 0 capital.loss 0 hours.per.week 13 native.country 537 dtype: int64
In [24]:
# 결측값을 새로운 카테고리로 생성 X_train['occupation']
X_train['occupation'] = X_train['occupation'].fillna('X')
X_train.isnull().sum()
# 무엇으로 채울지는 정답은 없음
# 머신러닝을 돌리면서 적절한 값을 찾아가야함
Out[24]:
id 0 age 12 workclass 0 fnlwgt 0 education 0 education.num 0 marital.status 0 occupation 0 relationship 0 race 0 sex 0 capital.gain 0 capital.loss 0 hours.per.week 13 native.country 537 dtype: int64
결측치 처리¶
- workclass 최빈값
- native.country 최빈값
- occupation 별도의 카테고리로
In [25]:
# 데이터 불러오기
X_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_train.csv')
y_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\y_train.csv')
X_test = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_test.csv')
In [30]:
# X_train데이터
X_train['workclass'] = X_train['workclass'].fillna(X_train['workclass'].mode()[0])
X_train['native.country'] = X_train['native.country'].fillna(X_train['native.country'].mode()[0])
X_train['occupation'] = X_train['occupation'].fillna('X')
### 최빈값을 사용하려 할때 .mode()[0] 꼭 뒤에 인덱스 0 추가!!
In [31]:
# X_test데이터
### 결측치를 처리할 때 반드시 test데이터도 같이 처리해 주기!!
X_test['workclass'] = X_test['workclass'].fillna(X_test['workclass'].mode()[0])
X_test['native.country'] = X_test['native.country'].fillna(X_test['native.country'].mode()[0])
X_test['occupation'] = X_test['occupation'].fillna('X')
In [32]:
# 결측치 확인
X_train.isnull().sum(), X_test.isnull().sum()
Out[32]:
(id 0 age 12 workclass 0 fnlwgt 0 education 0 education.num 0 marital.status 0 occupation 0 relationship 0 race 0 sex 0 capital.gain 0 capital.loss 0 hours.per.week 13 native.country 0 dtype: int64, id 0 age 6 workclass 0 fnlwgt 0 education 0 education.num 0 marital.status 0 occupation 0 relationship 0 race 0 sex 0 capital.gain 0 capital.loss 0 hours.per.week 9 native.country 0 dtype: int64)
수치형 변수 결측치¶
- 평균값
- 중앙값
- 최대값
- 최소값
- 그룹별 00 값
In [33]:
#결측치 확인
X_train.isnull().sum()
Out[33]:
id 0 age 12 workclass 0 fnlwgt 0 education 0 education.num 0 marital.status 0 occupation 0 relationship 0 race 0 sex 0 capital.gain 0 capital.loss 0 hours.per.week 13 native.country 0 dtype: int64
In [34]:
# 평균값
X_train['age'].mean()
Out[34]:
38.553222722927764
In [36]:
# 중앙값
X_train['age'].median()
Out[36]:
37.0
In [37]:
# 최대값
X_train['age'].max()
Out[37]:
90.0
In [38]:
# 최소값
X_train['age'].min()
Out[38]:
-38.0
In [44]:
# age 컬럼 평균값으로 채우기
X_train['age'] = X_train['age'].fillna(X_train['age'].mean())
X_test['age'] = X_test['age'].fillna(X_test['age'].mean())
X_train.isnull().sum()
Out[44]:
id 0 age 0 workclass 0 fnlwgt 0 education 0 education.num 0 marital.status 0 occupation 0 relationship 0 race 0 sex 0 capital.gain 0 capital.loss 0 hours.per.week 0 native.country 0 dtype: int64
In [45]:
# 주당 근무시간 중앙값으로 채우기
X_train['hours.per.week'] = X_train['hours.per.week'].fillna(X_train['hours.per.week'].median())
X_test['hours.per.week'] = X_test['hours.per.week'].fillna(X_test['hours.per.week'].mean())
X_train.isnull().sum()
Out[45]:
id 0 age 0 workclass 0 fnlwgt 0 education 0 education.num 0 marital.status 0 occupation 0 relationship 0 race 0 sex 0 capital.gain 0 capital.loss 0 hours.per.week 0 native.country 0 dtype: int64
In [42]:
# X_train 결측치 확인
X_train.isnull().sum()
Out[42]:
id 0 age 0 workclass 0 fnlwgt 0 education 0 education.num 0 marital.status 0 occupation 0 relationship 0 race 0 sex 0 capital.gain 0 capital.loss 0 hours.per.week 0 native.country 0 dtype: int64
In [46]:
# X_test 결측치 확인
X_test.isnull().sum()
Out[46]:
id 0 age 0 workclass 0 fnlwgt 0 education 0 education.num 0 marital.status 0 occupation 0 relationship 0 race 0 sex 0 capital.gain 0 capital.loss 0 hours.per.week 0 native.country 0 dtype: int64
이상치 처리¶
- 이상한 값 삭제
In [47]:
# X_train 통계 확인
X_train.describe()
Out[47]:
id | age | fnlwgt | education.num | capital.gain | capital.loss | hours.per.week | |
---|---|---|---|---|---|---|---|
count | 29304.000000 | 29304.000000 | 2.930400e+04 | 29304.000000 | 29304.000000 | 29304.000000 | 29304.000000 |
mean | 16264.027880 | 38.553223 | 1.897488e+05 | 10.080842 | 1093.858722 | 86.744506 | 40.434036 |
std | 9384.518323 | 13.626020 | 1.055250e+05 | 2.570824 | 7477.435640 | 401.518928 | 12.321306 |
min | 0.000000 | -38.000000 | 1.228500e+04 | 1.000000 | 0.000000 | 0.000000 | 1.000000 |
25% | 8145.750000 | 28.000000 | 1.177890e+05 | 9.000000 | 0.000000 | 0.000000 | 40.000000 |
50% | 16253.500000 | 37.000000 | 1.783765e+05 | 10.000000 | 0.000000 | 0.000000 | 40.000000 |
75% | 24374.250000 | 48.000000 | 2.370682e+05 | 12.000000 | 0.000000 | 0.000000 | 45.000000 |
max | 32560.000000 | 90.000000 | 1.484705e+06 | 16.000000 | 99999.000000 | 4356.000000 | 99.000000 |
In [48]:
# age가 음수인 데이터
X_train[X_train['age']<=0]
# 두가지 처리방법
# 1) 음수데이터 삭제
# 2) 양수데이터만 추출
Out[48]:
id | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
39 | 29188 | -33.0 | Private | 263561 | Some-college | 10 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 60.0 | United-States |
79 | 14325 | -38.0 | Private | 22245 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 60.0 | United-States |
26161 | 4292 | -25.0 | Private | 200681 | Some-college | 10 | Never-married | X | Own-child | White | Male | 0 | 0 | 40.0 | United-States |
In [49]:
# age가 1이상인 데이터만 살림
print(X_train.shape)
X_train = X_train[X_train['age']>0]
print(X_train.shape)
(29304, 15) (29301, 15)
In [50]:
# IQR로 이상치 확인
cols = ['age','fnlwgt','education.num', 'capital.gain', 'capital.loss', 'hours.per.week']
for col in cols:
Q1 = X_train[col].quantile(.25)
Q3 = X_train[col].quantile(.75)
IQR = Q3 - Q1
min_iqr = Q1-1.5*IQR
max_iqr = Q3+1.5*IQR
cnt=sum((X_train[col] < min_iqr) | (X_train[col] > max_iqr))
print(f'{col}의 이상치:{cnt}개 입니다.')
# 시각화가 안되는 시험에서는 이상치를 확인하기 어려움
# 이상치가 주어진다면 누가봐도 이상치거나
# 이상치 기준이 주어질 것으로 예상
age의 이상치:121개 입니다. fnlwgt의 이상치:892개 입니다. education.num의 이상치:1077개 입니다. capital.gain의 이상치:2459개 입니다. capital.loss의 이상치:1359개 입니다. hours.per.week의 이상치:8104개 입니다.
전처리 주의 사항***¶
- 이상치, 결측치에 대해 test 데이터(행) 삭제 불가
- 예) test 데이터 100개가 주어지고 100개로 평가를 하는데 임의로 10개를 삭제해버리면 채점을 할 수가 없음
- train 데이터는 학습/훈련용 데이터임. 데이터가 많을 경우 임의로 소수 데이터(행) 삭제해도 무방함
- test, train 컬럼은 삭제/추가 가능. 단, train과 컬럼수와 명이 일치해야 함 (y(target) 제외)
In [ ]:
'빅데이터 분석 기사 공부' 카테고리의 다른 글
머신러닝_분류모델(범주형 데이터) (0) | 2023.11.06 |
---|---|
피처엔지니어링 (0) | 2023.11.06 |
데이터 불러오기 및 EDA (0) | 2023.11.06 |
머신러닝 프로세스 (0) | 2023.11.06 |
시계열 데이터 (0) | 2023.10.17 |