데이터 전처리¶

결측치 처리
이상치 처리
데이터 : data_atype.zip (수업노트에서 다운로드)

라이브러리 및 데이터 불러오기¶

In [1]:

import pandas as pd
X_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_train.csv')
y_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\y_train.csv')
X_test = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_test.csv')

결측치¶

In [2]:

# 데이터 샘플 확인
X_train.sample()

Out[2]:

	id	age	workclass	fnlwgt	education	education.num	marital.status	occupation	relationship	race	sex	capital.gain	capital.loss	hours.per.week	native.country
2042	12973	42.0	Private	195124	7th-8th	4	Married-spouse-absent	Prof-specialty	Other-relative	White	Male	0	0	35.0	Puerto-Rico

In [3]:

# 결측치 컬럼 확인
X_train.isnull().sum()

Out[3]:

id                   0
age                 12
workclass         1662
fnlwgt               0
education            0
education.num        0
marital.status       0
occupation        1668
relationship         0
race                 0
sex                  0
capital.gain         0
capital.loss         0
hours.per.week      13
native.country     537
dtype: int64

In [4]:

# 데이터 타입 확인
X_train.info()
# 결측치가 있는 컬럼의 타입 확인 - 수치형인지 범주형(object)인지

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29304 entries, 0 to 29303
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              29304 non-null  int64  
 1   age             29292 non-null  float64
 2   workclass       27642 non-null  object 
 3   fnlwgt          29304 non-null  int64  
 4   education       29304 non-null  object 
 5   education.num   29304 non-null  int64  
 6   marital.status  29304 non-null  object 
 7   occupation      27636 non-null  object 
 8   relationship    29304 non-null  object 
 9   race            29304 non-null  object 
 10  sex             29304 non-null  object 
 11  capital.gain    29304 non-null  int64  
 12  capital.loss    29304 non-null  int64  
 13  hours.per.week  29291 non-null  float64
 14  native.country  28767 non-null  object 
dtypes: float64(2), int64(5), object(8)
memory usage: 3.4+ MB

In [6]:

# workclass 컬럼 고유 값 개수
X_train.describe(include='O')
X_train['workclass'].value_counts()

Out[6]:

Private             20451
Self-emp-not-inc     2292
Local-gov            1863
State-gov            1170
Self-emp-inc          991
Federal-gov           856
Without-pay            13
Never-worked            6
Name: workclass, dtype: int64

In [7]:

# occupation 컬럼 고유 값 개수
X_train['occupation'].value_counts()
# 결측치를 채우기에는 workclass처럼 최빈값으로 하기에는 무리가 있겠다 판단

Out[7]:

Prof-specialty       3683
Craft-repair         3677
Exec-managerial      3632
Adm-clerical         3411
Sales                3295
Other-service        2993
Machine-op-inspct    1822
Transport-moving     1434
Handlers-cleaners    1223
Farming-fishing       891
Tech-support          842
Protective-serv       590
Priv-house-serv       135
Armed-Forces            8
Name: occupation, dtype: int64

In [10]:

# native.country 컬럼 고유 값 개수
X_train['native.country'].value_counts()

Out[10]:

United-States                 26240
Mexico                          576
Philippines                     178
Germany                         120
Canada                          109
Puerto-Rico                      98
India                            91
El-Salvador                      91
Cuba                             87
England                          79
Jamaica                          77
South                            75
China                            70
Italy                            68
Vietnam                          64
Dominican-Republic               63
Japan                            56
Columbia                         55
Guatemala                        53
Poland                           50
Taiwan                           50
Haiti                            40
Iran                             37
Portugal                         33
Nicaragua                        32
Greece                           27
Ecuador                          26
France                           26
Peru                             24
Ireland                          22
Hong                             19
Thailand                         17
Cambodia                         17
Laos                             17
Yugoslavia                       15
Trinadad&Tobago                  14
Hungary                          13
Honduras                         13
Scotland                         12
Outlying-US(Guam-USVI-etc)       12
Holand-Netherlands                1
Name: native.country, dtype: int64

범주형 변수 결측치¶

삭제
최빈값
없는값

삭제¶

In [11]:

# X_train과 X_test 데이터 크기 확인
X_train.shape, X_test.shape

Out[11]:

((29304, 15), (3257, 15))

In [12]:

# 결측치 확인
X_train.isnull().sum()

Out[12]:

id                   0
age                 12
workclass         1662
fnlwgt               0
education            0
education.num        0
marital.status       0
occupation        1668
relationship         0
race                 0
sex                  0
capital.gain         0
capital.loss         0
hours.per.week      13
native.country     537
dtype: int64

In [14]:

# 결측치가 있는 데이터(행) 전체 삭제 및 확인 dropna() #기본값 axis=0
df = X_train.dropna()
df.isnull().sum()

Out[14]:

id                0
age               0
workclass         0
fnlwgt            0
education         0
education.num     0
marital.status    0
occupation        0
relationship      0
race              0
sex               0
capital.gain      0
capital.loss      0
hours.per.week    0
native.country    0
dtype: int64

In [15]:

df.shape

Out[15]:

(27096, 15)

In [16]:

# 특정컬럼에 결측치가 있으면 데이터(행) 삭제 subset=['native.country']
df = X_train.dropna(subset=['native.country'])
df.isnull().sum()

Out[16]:

id                   0
age                 12
workclass         1640
fnlwgt               0
education            0
education.num        0
marital.status       0
occupation        1646
relationship         0
race                 0
sex                  0
capital.gain         0
capital.loss         0
hours.per.week      13
native.country       0
dtype: int64

In [17]:

# 특정컬럼에 결측치가 있으면 데이터(행) 삭제 subset=['native.country', 'workclass']
df = X_train.dropna(subset=['native.country', 'workclass'])
df.isnull().sum()

Out[17]:

id                 0
age               12
workclass          0
fnlwgt             0
education          0
education.num      0
marital.status     0
occupation         6
relationship       0
race               0
sex                0
capital.gain       0
capital.loss       0
hours.per.week    13
native.country     0
dtype: int64

In [18]:

# 결측치가 있는 컬럼 삭제 dropna(axis=1)
df= X_train.dropna(axis=1)
df.isnull().sum()
# 작업형1에서는 나올수도 있음

Out[18]:

id                0
fnlwgt            0
education         0
education.num     0
marital.status    0
relationship      0
race              0
sex               0
capital.gain      0
capital.loss      0
dtype: int64

In [19]:

# 결측치가 많은 특정 컬럼 삭제 drop(['workclass'], axis=1)
df= X_train.drop(['workclass'], axis=1)
df.isnull().sum()

Out[19]:

id                   0
age                 12
fnlwgt               0
education            0
education.num        0
marital.status       0
occupation        1668
relationship         0
race                 0
sex                  0
capital.gain         0
capital.loss         0
hours.per.week      13
native.country     537
dtype: int64

In [21]:

# 중복값 제거 drop_duplicates()
print(X_train.shape)
df = X_train.drop_duplicates()
df.shape

(29304, 15)

Out[21]:

(29304, 15)

In [ ]:

# 특정 컬럼의 중복값 제거
# df.drop_duplicates(subset=['A'])
# df.drop_duplicates(subset=['A', 'B'], keep='last')  
# 기본적으로 뒤에 나오는 값을 삭제
# last 설정시 뒤에 값을 살리고 앞의 값을 삭제

채우기¶

In [23]:

# 최빈값
m = X_train['workclass'].mode()[0]
X_train['workclass'] = X_train['workclass'].fillna(m)
X_train.isnull().sum()
# 결측치를 채울때 넣는 컬럼 지정해주기 주의!!

Out[23]:

id                   0
age                 12
workclass            0
fnlwgt               0
education            0
education.num        0
marital.status       0
occupation        1668
relationship         0
race                 0
sex                  0
capital.gain         0
capital.loss         0
hours.per.week      13
native.country     537
dtype: int64

In [24]:

# 결측값을 새로운 카테고리로 생성 X_train['occupation']
X_train['occupation'] = X_train['occupation'].fillna('X')
X_train.isnull().sum()
# 무엇으로 채울지는 정답은 없음
# 머신러닝을 돌리면서 적절한 값을 찾아가야함

Out[24]:

id                  0
age                12
workclass           0
fnlwgt              0
education           0
education.num       0
marital.status      0
occupation          0
relationship        0
race                0
sex                 0
capital.gain        0
capital.loss        0
hours.per.week     13
native.country    537
dtype: int64

결측치 처리¶

workclass 최빈값
native.country 최빈값
occupation 별도의 카테고리로

In [25]:

# 데이터 불러오기
X_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_train.csv')
y_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\y_train.csv')
X_test = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_test.csv')

In [30]:

# X_train데이터
X_train['workclass'] = X_train['workclass'].fillna(X_train['workclass'].mode()[0])
X_train['native.country'] = X_train['native.country'].fillna(X_train['native.country'].mode()[0])
X_train['occupation'] = X_train['occupation'].fillna('X')
### 최빈값을 사용하려 할때 .mode()[0] 꼭 뒤에 인덱스 0  추가!!

In [31]:

# X_test데이터
### 결측치를 처리할 때 반드시 test데이터도 같이 처리해 주기!!
X_test['workclass'] = X_test['workclass'].fillna(X_test['workclass'].mode()[0])
X_test['native.country'] = X_test['native.country'].fillna(X_test['native.country'].mode()[0])
X_test['occupation'] = X_test['occupation'].fillna('X')

In [32]:

# 결측치 확인
X_train.isnull().sum(), X_test.isnull().sum()

Out[32]:

(id                 0
 age               12
 workclass          0
 fnlwgt             0
 education          0
 education.num      0
 marital.status     0
 occupation         0
 relationship       0
 race               0
 sex                0
 capital.gain       0
 capital.loss       0
 hours.per.week    13
 native.country     0
 dtype: int64,
 id                0
 age               6
 workclass         0
 fnlwgt            0
 education         0
 education.num     0
 marital.status    0
 occupation        0
 relationship      0
 race              0
 sex               0
 capital.gain      0
 capital.loss      0
 hours.per.week    9
 native.country    0
 dtype: int64)

수치형 변수 결측치¶

평균값
중앙값
최대값
최소값
그룹별 00 값

In [33]:

#결측치 확인
X_train.isnull().sum()

Out[33]:

id                 0
age               12
workclass          0
fnlwgt             0
education          0
education.num      0
marital.status     0
occupation         0
relationship       0
race               0
sex                0
capital.gain       0
capital.loss       0
hours.per.week    13
native.country     0
dtype: int64

In [34]:

# 평균값
X_train['age'].mean()

Out[34]:

38.553222722927764

In [36]:

# 중앙값
X_train['age'].median()

Out[36]:

37.0

In [37]:

# 최대값
X_train['age'].max()

Out[37]:

90.0

In [38]:

# 최소값
X_train['age'].min()

Out[38]:

-38.0

In [44]:

# age 컬럼 평균값으로 채우기
X_train['age'] = X_train['age'].fillna(X_train['age'].mean())
X_test['age'] = X_test['age'].fillna(X_test['age'].mean())
X_train.isnull().sum()

Out[44]:

id                0
age               0
workclass         0
fnlwgt            0
education         0
education.num     0
marital.status    0
occupation        0
relationship      0
race              0
sex               0
capital.gain      0
capital.loss      0
hours.per.week    0
native.country    0
dtype: int64

In [45]:

# 주당 근무시간 중앙값으로 채우기
X_train['hours.per.week'] = X_train['hours.per.week'].fillna(X_train['hours.per.week'].median())
X_test['hours.per.week'] = X_test['hours.per.week'].fillna(X_test['hours.per.week'].mean())
X_train.isnull().sum()

Out[45]:

id                0
age               0
workclass         0
fnlwgt            0
education         0
education.num     0
marital.status    0
occupation        0
relationship      0
race              0
sex               0
capital.gain      0
capital.loss      0
hours.per.week    0
native.country    0
dtype: int64

In [42]:

# X_train 결측치 확인
X_train.isnull().sum()

Out[42]:

id                0
age               0
workclass         0
fnlwgt            0
education         0
education.num     0
marital.status    0
occupation        0
relationship      0
race              0
sex               0
capital.gain      0
capital.loss      0
hours.per.week    0
native.country    0
dtype: int64

In [46]:

# X_test 결측치 확인
X_test.isnull().sum()

Out[46]:

id                0
age               0
workclass         0
fnlwgt            0
education         0
education.num     0
marital.status    0
occupation        0
relationship      0
race              0
sex               0
capital.gain      0
capital.loss      0
hours.per.week    0
native.country    0
dtype: int64

이상치 처리¶

이상한 값 삭제

In [47]:

# X_train 통계 확인
X_train.describe()

Out[47]:

	id	age	fnlwgt	education.num	capital.gain	capital.loss	hours.per.week
count	29304.000000	29304.000000	2.930400e+04	29304.000000	29304.000000	29304.000000	29304.000000
mean	16264.027880	38.553223	1.897488e+05	10.080842	1093.858722	86.744506	40.434036
std	9384.518323	13.626020	1.055250e+05	2.570824	7477.435640	401.518928	12.321306
min	0.000000	-38.000000	1.228500e+04	1.000000	0.000000	0.000000	1.000000
25%	8145.750000	28.000000	1.177890e+05	9.000000	0.000000	0.000000	40.000000
50%	16253.500000	37.000000	1.783765e+05	10.000000	0.000000	0.000000	40.000000
75%	24374.250000	48.000000	2.370682e+05	12.000000	0.000000	0.000000	45.000000
max	32560.000000	90.000000	1.484705e+06	16.000000	99999.000000	4356.000000	99.000000

In [48]:

# age가 음수인 데이터
X_train[X_train['age']<=0]
# 두가지 처리방법
# 1) 음수데이터 삭제
# 2) 양수데이터만 추출

Out[48]:

	id	age	workclass	fnlwgt	education	education.num	marital.status	occupation	relationship	race	sex	hours.per.week	native.country
39	29188	-33.0	Private	263561	Some-college	10	Married-civ-spouse	Craft-repair	Husband	White	Male	60.0	United-States
79	14325	-38.0	Private	22245	HS-grad	9	Married-civ-spouse	Exec-managerial	Husband	White	Male	60.0	United-States
26161	4292	-25.0	Private	200681	Some-college	10	Never-married	X	Own-child	White	Male	40.0	United-States

In [49]:

# age가 1이상인 데이터만 살림
print(X_train.shape)
X_train = X_train[X_train['age']>0]
print(X_train.shape)

(29304, 15)
(29301, 15)

In [50]:

# IQR로 이상치 확인
cols = ['age','fnlwgt','education.num', 'capital.gain', 'capital.loss', 'hours.per.week']
for col in cols:
    Q1 = X_train[col].quantile(.25)
    Q3 = X_train[col].quantile(.75)
    IQR = Q3 - Q1
    min_iqr = Q1-1.5*IQR
    max_iqr = Q3+1.5*IQR
    cnt=sum((X_train[col] < min_iqr) | (X_train[col] > max_iqr))
    print(f'{col}의 이상치:{cnt}개 입니다.')

# 시각화가 안되는 시험에서는 이상치를 확인하기 어려움
# 이상치가 주어진다면 누가봐도 이상치거나
# 이상치 기준이 주어질 것으로 예상

age의 이상치:121개 입니다.
fnlwgt의 이상치:892개 입니다.
education.num의 이상치:1077개 입니다.
capital.gain의 이상치:2459개 입니다.
capital.loss의 이상치:1359개 입니다.
hours.per.week의 이상치:8104개 입니다.

전처리 주의 사항***¶

이상치, 결측치에 대해 test 데이터(행) 삭제 불가
- 예) test 데이터 100개가 주어지고 100개로 평가를 하는데 임의로 10개를 삭제해버리면 채점을 할 수가 없음
- train 데이터는 학습/훈련용 데이터임. 데이터가 많을 경우 임의로 소수 데이터(행) 삭제해도 무방함
test, train 컬럼은 삭제/추가 가능. 단, train과 컬럼수와 명이 일치해야 함 (y(target) 제외)

In [ ]:

머신러닝_분류모델(범주형 데이터) (0)	2023.11.06
피처엔지니어링 (0)	2023.11.06
데이터 불러오기 및 EDA (0)	2023.11.06
머신러닝 프로세스 (0)	2023.11.06
시계열 데이터 (0)	2023.10.17

Recording me

데이터 전처리

데이터 전처리¶

라이브러리 및 데이터 불러오기¶

결측치¶

범주형 변수 결측치¶

삭제¶

채우기¶

결측치 처리¶

수치형 변수 결측치¶

이상치 처리¶

전처리 주의 사항***¶

'빅데이터 분석 기사 공부' 카테고리의 다른 글

티스토리툴바

데이터 전처리

데이터 전처리¶

라이브러리 및 데이터 불러오기¶

결측치¶

범주형 변수 결측치¶

삭제¶

채우기¶

결측치 처리¶

수치형 변수 결측치¶

이상치 처리¶

전처리 주의 사항***¶

'빅데이터 분석 기사 공부' 카테고리의 다른 글

관련글

티스토리툴바