Feature Engineering¶
- Scaling
- Encoding
- Data: data_atype.zip (download from the course notes)
In [2]:
# Load the data
import pandas as pd
X_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_train.csv')
y_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\y_train.csv')
X_test = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_test.csv')
In [3]:
# Preview the data
X_train.head()
Out[3]:
id | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3331 | 34.0 | State-gov | 177331 | Some-college | 10 | Married-civ-spouse | Prof-specialty | Husband | Black | Male | 4386 | 0 | 40.0 | United-States |
1 | 19749 | 58.0 | Private | 290661 | HS-grad | 9 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 40.0 | United-States |
2 | 1157 | 48.0 | Private | 125933 | Some-college | 10 | Widowed | Exec-managerial | Unmarried | Black | Female | 0 | 1669 | 38.0 | United-States |
3 | 693 | 58.0 | Private | 100313 | Some-college | 10 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 1902 | 40.0 | United-States |
4 | 12522 | 41.0 | Private | 195661 | Some-college | 10 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 0 | 54.0 | United-States |
Data Preprocessing (from the previous session)¶
In [4]:
# X_train data
X_train['workclass'] = X_train['workclass'].fillna(X_train['workclass'].mode()[0])
X_train['native.country'] = X_train['native.country'].fillna(X_train['native.country'].mode()[0])
X_train['occupation'] = X_train['occupation'].fillna("X")
X_train['age'] = X_train['age'].fillna(int(X_train['age'].mean()))
X_train['hours.per.week'] = X_train['hours.per.week'].fillna(X_train['hours.per.week'].median())
# X_test data
X_test['workclass'] = X_test['workclass'].fillna(X_test['workclass'].mode()[0])
X_test['native.country'] = X_test['native.country'].fillna(X_test['native.country'].mode()[0])
X_test['occupation'] = X_test['occupation'].fillna("X")
X_test['age'] = X_test['age'].fillna(int(X_train['age'].mean()))
X_test['hours.per.week'] = X_test['hours.per.week'].fillna(X_train['hours.per.week'].median())
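Note that `age` and `hours.per.week` above are filled in the test data using statistics computed from the train data; as a general rule, imputation statistics should come from the training split only and be reused for the test split. A minimal sketch of that pattern on toy data (the column values here are hypothetical, not the exam data):

```python
import pandas as pd

train = pd.DataFrame({'workclass': ['Private', 'Private', None, 'State-gov']})
test = pd.DataFrame({'workclass': [None, 'Private']})

# Compute the mode once on the train data, then reuse it for both splits
train_mode = train['workclass'].mode()[0]
train['workclass'] = train['workclass'].fillna(train_mode)
test['workclass'] = test['workclass'].fillna(train_mode)

print(test['workclass'].tolist())
```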
In [5]:
# Check for missing values
X_train.isnull().sum()
Out[5]:
id                0
age               0
workclass         0
fnlwgt            0
education         0
education.num     0
marital.status    0
occupation        0
relationship      0
race              0
sex               0
capital.gain      0
capital.loss      0
hours.per.week    0
native.country    0
dtype: int64
Separating Numeric and Categorical Data¶
In [6]:
# Check the data types
X_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29304 entries, 0 to 29303
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              29304 non-null  int64  
 1   age             29304 non-null  float64
 2   workclass       29304 non-null  object 
 3   fnlwgt          29304 non-null  int64  
 4   education       29304 non-null  object 
 5   education.num   29304 non-null  int64  
 6   marital.status  29304 non-null  object 
 7   occupation      29304 non-null  object 
 8   relationship    29304 non-null  object 
 9   race            29304 non-null  object 
 10  sex             29304 non-null  object 
 11  capital.gain    29304 non-null  int64  
 12  capital.loss    29304 non-null  int64  
 13  hours.per.week  29304 non-null  float64
 14  native.country  29304 non-null  object 
dtypes: float64(2), int64(5), object(8)
memory usage: 3.4+ MB
In [12]:
# Split into numeric and categorical columns
n_train = X_train.select_dtypes(exclude='object').copy()
n_test = X_test.select_dtypes(exclude='object').copy()
c_train = X_train.select_dtypes(include='object').copy()
c_test = X_test.select_dtypes(include='object').copy()

# Wrap this in a function so the data can be reloaded fresh each time
def get_nc_data():
    X_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_train.csv')
    y_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\y_train.csv')
    X_test = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_test.csv')
    n_train = X_train.select_dtypes(exclude='object').copy()
    n_test = X_test.select_dtypes(exclude='object').copy()
    c_train = X_train.select_dtypes(include='object').copy()
    c_test = X_test.select_dtypes(include='object').copy()
    return n_train, n_test, c_train, c_test

# n_train, n_test, c_train, c_test = get_nc_data()  # reload the data
In [8]:
# Inspect the data (numeric columns)
n_train.head(2)
Out[8]:
id | age | fnlwgt | education.num | capital.gain | capital.loss | hours.per.week | |
---|---|---|---|---|---|---|---|
0 | 3331 | 34.0 | 177331 | 10 | 4386 | 0 | 40.0 |
1 | 19749 | 58.0 | 290661 | 9 | 0 | 0 | 40.0 |
In [9]:
# Inspect the data (categorical columns)
c_train.head(2)
Out[9]:
workclass | education | marital.status | occupation | relationship | race | sex | native.country | |
---|---|---|---|---|---|---|---|---|
0 | State-gov | Some-college | Married-civ-spouse | Prof-specialty | Husband | Black | Male | United-States |
1 | Private | HS-grad | Married-civ-spouse | Craft-repair | Husband | White | Male | United-States |
Scaling¶
- Tree-based models are largely insensitive to the scale of the inputs
- Models such as linear regression and logistic regression are affected by input scaling
In [10]:
# Columns to scale (excluding id)
cols = ['age', 'fnlwgt', 'education.num', 'capital.gain', 'capital.loss', 'hours.per.week']
In [11]:
# Min-max scaling with MinMaxScaler (all values scaled to between 0 and 1)
# n_train, n_test, c_train, c_test = get_nc_data()  # reload the data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
display(n_train.head(2))
n_train[cols] = scaler.fit_transform(n_train[cols])
# fit: learn the scaling parameters, transform: apply them
n_test[cols] = scaler.transform(n_test[cols])
# The test data is never fitted; it is only transformed with the parameters learned from the train data
display(n_train.head(2))
# display() cannot be used in the exam environment (it is Jupyter-specific)
id | age | fnlwgt | education.num | capital.gain | capital.loss | hours.per.week | |
---|---|---|---|---|---|---|---|
0 | 3331 | 34.0 | 177331 | 10 | 4386 | 0 | 40.0 |
1 | 19749 | 58.0 | 290661 | 9 | 0 | 0 | 40.0 |
id | age | fnlwgt | education.num | capital.gain | capital.loss | hours.per.week | |
---|---|---|---|---|---|---|---|
0 | 3331 | 0.5625 | 0.112092 | 0.600000 | 0.04386 | 0.0 | 0.397959 |
1 | 19749 | 0.7500 | 0.189060 | 0.533333 | 0.00000 | 0.0 | 0.397959 |
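Under the hood, MinMaxScaler maps each column to (x - min) / (max - min). A quick sanity check of that formula against a manual computation, on toy values rather than the exam data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[34.0], [58.0], [17.0], [90.0]])  # toy "age" column
scaled = MinMaxScaler().fit_transform(x)
manual = (x - x.min()) / (x.max() - x.min())  # (x - min) / (max - min)

print(scaled.ravel())
```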
In [14]:
# Standardization with StandardScaler (Z-score normalization: transforms each column to mean 0 and standard deviation 1)
n_train, n_test, c_train, c_test = get_nc_data()  # reload the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
display(n_train.head(2))
n_train[cols] = scaler.fit_transform(n_train[cols])
n_test[cols] = scaler.transform(n_test[cols])
display(n_train.head(2))
id | age | fnlwgt | education.num | capital.gain | capital.loss | hours.per.week | |
---|---|---|---|---|---|---|---|
0 | 3331 | 34.0 | 177331 | 10 | 4386 | 0 | 40.0 |
1 | 19749 | 58.0 | 290661 | 9 | 0 | 0 | 40.0 |
id | age | fnlwgt | education.num | capital.gain | capital.loss | hours.per.week | |
---|---|---|---|---|---|---|---|
0 | 3331 | -0.334094 | -0.117678 | -0.031447 | 0.440284 | -0.216045 | -0.035235 |
1 | 19749 | 1.426912 | 0.956304 | -0.420434 | -0.146290 | -0.216045 | -0.035235 |
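StandardScaler computes (x - mean) / std. One detail worth knowing: scikit-learn uses the population standard deviation (ddof=0), whereas pandas' `Series.std()` defaults to the sample one (ddof=1). A check on toy values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[34.0], [58.0], [17.0], [90.0]])  # toy "age" column
scaled = StandardScaler().fit_transform(x)
# sklearn uses the population standard deviation (ddof=0)
manual = (x - x.mean()) / x.std(ddof=0)

print(scaled.ravel())
```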
In [15]:
# Robust scaling: uses the median and interquartile range, which minimizes the influence of outliers
n_train, n_test, c_train, c_test = get_nc_data()  # reload the data
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
display(n_train.head(2))
n_train[cols] = scaler.fit_transform(n_train[cols])
n_test[cols] = scaler.transform(n_test[cols])
display(n_train.head(2))
id | age | fnlwgt | education.num | capital.gain | capital.loss | hours.per.week | |
---|---|---|---|---|---|---|---|
0 | 3331 | 34.0 | 177331 | 10 | 4386 | 0 | 40.0 |
1 | 19749 | 58.0 | 290661 | 9 | 0 | 0 | 40.0 |
id | age | fnlwgt | education.num | capital.gain | capital.loss | hours.per.week | |
---|---|---|---|---|---|---|---|
0 | 3331 | -0.15 | -0.008765 | 0.000000 | 4386.0 | 0.0 | 0.0 |
1 | 19749 | 1.05 | 0.941358 | -0.333333 | 0.0 | 0.0 | 0.0 |
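RobustScaler computes (x - median) / IQR, where IQR = Q3 - Q1. This also explains why `capital.gain` and `capital.loss` look almost untouched in the output above: most of their values are 0, so the IQR is 0, and scikit-learn falls back to a scale of 1 in that case. A check of the formula on toy values:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[34.0], [58.0], [17.0], [90.0], [48.0]])  # toy column
scaled = RobustScaler().fit_transform(x)
q1, med, q3 = np.percentile(x, [25, 50, 75])
manual = (x - med) / (q3 - q1)  # (x - median) / IQR

print(scaled.ravel())
```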
In [17]:
# Log transform example
X_train['fnlwgt'].hist()
Out[17]:
<Axes: >
In [18]:
# Compare before and after the log transform
import numpy as np
print(X_train['fnlwgt'][:3])
np.log1p(X_train['fnlwgt'])[:3]
0    177331
1    290661
2    125933
Name: fnlwgt, dtype: int64
Out[18]:
0    12.085779
1    12.579916
2    11.743513
Name: fnlwgt, dtype: float64
In [19]:
# Visualize after the log transform
np.log1p(X_train['fnlwgt']).hist()
Out[19]:
<Axes: >
In [20]:
# np.exp (each value comes back as the original + 1; the exact inverse of np.log1p is np.expm1)
np.exp(np.log1p(X_train['fnlwgt']))
Out[20]:
0        177332.0
1        290662.0
2        125934.0
3        100314.0
4        195662.0
           ...
29299     47169.0
29300    231794.0
29301    201436.0
29302    137723.0
29303    406979.0
Name: fnlwgt, Length: 29304, dtype: float64
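`np.exp` undoes `np.log`, but since `np.log1p(x)` computes log(x + 1), its exact inverse is `np.expm1`; that is why every value above comes back one larger than the original. A quick round-trip check on the first few fnlwgt values:

```python
import numpy as np

x = np.array([177331.0, 290661.0, 125933.0])  # first few fnlwgt values

exact = np.expm1(np.log1p(x))   # exact round trip back to x
off_by_one = np.exp(np.log1p(x))  # comes back as x + 1

print(exact)
print(off_by_one)
```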
Encoding¶
- Label encoding
- One-hot encoding
In [21]:
# Check the categorical data (before encoding)
c_train.head()
Out[21]:
workclass | education | marital.status | occupation | relationship | race | sex | native.country | |
---|---|---|---|---|---|---|---|---|
0 | State-gov | Some-college | Married-civ-spouse | Prof-specialty | Husband | Black | Male | United-States |
1 | Private | HS-grad | Married-civ-spouse | Craft-repair | Husband | White | Male | United-States |
2 | Private | Some-college | Widowed | Exec-managerial | Unmarried | Black | Female | United-States |
3 | Private | Some-college | Married-civ-spouse | Protective-serv | Husband | White | Male | United-States |
4 | Private | Some-college | Married-civ-spouse | Transport-moving | Husband | White | Male | United-States |
In [26]:
# Names of the object (categorical) columns
cols = list(c_train.columns)
cols
# cols = list(X_train.columns[X_train.dtypes == object])
Out[26]:
['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']
In [28]:
# Label encoding
n_train, n_test, c_train, c_test = get_nc_data()  # reload the data
from sklearn.preprocessing import LabelEncoder
for col in cols:
    le = LabelEncoder()
    c_train[col] = le.fit_transform(c_train[col])
    c_test[col] = le.transform(c_test[col])
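One caveat with this loop: `le.transform(c_test[col])` raises a ValueError if the test data contains a label the encoder never saw during `fit`. A minimal sketch of the failure and of the workaround the notebook relies on (fitting on train and test values combined), using hypothetical category values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.Series(['Private', 'State-gov', 'Private'])
test = pd.Series(['Private', 'Never-worked'])  # category unseen in train

le = LabelEncoder()
le.fit(train)
try:
    le.transform(test)
    failed = False
except ValueError:  # 'Never-worked' was not seen during fit
    failed = True

# Workaround: fit the encoder on train and test values combined
le_all = LabelEncoder().fit(pd.concat([train, test]))
encoded_test = le_all.transform(test)
print(failed, list(encoded_test))
```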
In [30]:
# Check the categorical data (after encoding)
c_train.head()
Out[30]:
workclass | education | marital.status | occupation | relationship | race | sex | native.country | |
---|---|---|---|---|---|---|---|---|
0 | 6 | 15 | 2 | 9 | 0 | 2 | 1 | 38 |
1 | 3 | 11 | 2 | 2 | 0 | 4 | 1 | 38 |
2 | 3 | 15 | 6 | 3 | 4 | 2 | 0 | 38 |
3 | 3 | 15 | 2 | 10 | 0 | 4 | 1 | 38 |
4 | 3 | 15 | 2 | 13 | 0 | 4 | 1 | 38 |
In [32]:
# One-hot encoding - supported directly by pandas
n_train, n_test, c_train, c_test = get_nc_data()  # reload the data
display(c_train.head())
c_train = pd.get_dummies(c_train[cols])
c_test = pd.get_dummies(c_test[cols])
display(c_train.head())
workclass | education | marital.status | occupation | relationship | race | sex | native.country | |
---|---|---|---|---|---|---|---|---|
0 | State-gov | Some-college | Married-civ-spouse | Prof-specialty | Husband | Black | Male | United-States |
1 | Private | HS-grad | Married-civ-spouse | Craft-repair | Husband | White | Male | United-States |
2 | Private | Some-college | Widowed | Exec-managerial | Unmarried | Black | Female | United-States |
3 | Private | Some-college | Married-civ-spouse | Protective-serv | Husband | White | Male | United-States |
4 | Private | Some-college | Married-civ-spouse | Transport-moving | Husband | White | Male | United-States |
workclass_Federal-gov | workclass_Local-gov | workclass_Never-worked | workclass_Private | workclass_Self-emp-inc | workclass_Self-emp-not-inc | workclass_State-gov | workclass_Without-pay | education_10th | education_11th | ... | native.country_Portugal | native.country_Puerto-Rico | native.country_Scotland | native.country_South | native.country_Taiwan | native.country_Thailand | native.country_Trinadad&Tobago | native.country_United-States | native.country_Vietnam | native.country_Yugoslavia | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
5 rows × 99 columns
Combining the Data¶
In [33]:
# Recombine the separated data
# n_train, n_test, c_train, c_test = get_nc_data()  # reload the data
X_train = pd.concat([n_train, c_train], axis=1)
# The default is axis=0, which stacks frames vertically
# Here the frames must be joined side by side, hence axis=1
X_test = pd.concat([n_test, c_test], axis=1)
print(X_train.shape, X_test.shape)
X_train.head()
# Comparing shapes after one-hot encoding, the train and test data end up with different column counts
# Fix: concatenate train and test first, one-hot encode, then split them back apart
# Even without one-hot encoding, label encoding raises an error when the test data contains a category never seen in the train data
# So for label encoding too, concatenate train and test before encoding (stacking vertically in that case)
(29304, 106) (3257, 102)
Out[33]:
id | age | fnlwgt | education.num | capital.gain | capital.loss | hours.per.week | workclass_Federal-gov | workclass_Local-gov | workclass_Never-worked | ... | native.country_Portugal | native.country_Puerto-Rico | native.country_Scotland | native.country_South | native.country_Taiwan | native.country_Thailand | native.country_Trinadad&Tobago | native.country_United-States | native.country_Vietnam | native.country_Yugoslavia | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3331 | 34.0 | 177331 | 10 | 4386 | 0 | 40.0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 19749 | 58.0 | 290661 | 9 | 0 | 0 | 40.0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 1157 | 48.0 | 125933 | 10 | 0 | 1669 | 38.0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 693 | 58.0 | 100313 | 10 | 0 | 1902 | 40.0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 12522 | 41.0 | 195661 | 10 | 0 | 0 | 54.0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
5 rows × 106 columns
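Besides concatenating before encoding, another common way to reconcile mismatched dummy columns is `DataFrame.reindex`: encode the splits separately, then force the test frame onto the train frame's columns, filling any missing dummies with 0. A sketch on toy data (column values are hypothetical):

```python
import pandas as pd

train = pd.DataFrame({'color': ['red', 'blue', 'red']})
test = pd.DataFrame({'color': ['red', 'green']})  # 'green' unseen in train

tr = pd.get_dummies(train)
# Align test to the train columns; dummies absent from test are filled with 0,
# and dummies that only exist in test (color_green) are dropped
te = pd.get_dummies(test).reindex(columns=tr.columns, fill_value=0)

print(list(te.columns))
```

Dropping the unseen `color_green` column mirrors how a model trained only on the train dummies would treat that category: as all zeros.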
In [40]:
# Reload the data
import pandas as pd
X_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_train.csv')
y_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\y_train.csv')
X_test = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_test.csv')
In [41]:
# Concatenate train and test, encode, then split back apart
cols = list(X_train.columns[X_train.dtypes == object])
print(X_train.shape, X_test.shape)
all_df = pd.concat([X_train, X_test])
# the default axis=0 stacks vertically
all_df = pd.get_dummies(all_df[cols])
# Split back apart
line = int(X_train.shape[0])
X_train = all_df.iloc[:line, :].copy()
X_test = all_df.iloc[line:, :].copy()
print(X_train.shape, X_test.shape)
# After one-hot encoding this way, train and test have the same number of columns
(29304, 15) (3257, 15)
(29304, 99) (3257, 99)
In [39]:
X_train.head()
Out[39]:
workclass_Federal-gov | workclass_Local-gov | workclass_Never-worked | workclass_Private | workclass_Self-emp-inc | workclass_Self-emp-not-inc | workclass_State-gov | workclass_Without-pay | education_10th | education_11th | ... | native.country_Portugal | native.country_Puerto-Rico | native.country_Scotland | native.country_South | native.country_Taiwan | native.country_Thailand | native.country_Trinadad&Tobago | native.country_United-States | native.country_Vietnam | native.country_Yugoslavia | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
5 rows × 99 columns
Summary¶
In [51]:
# Split the data
# Split into numeric and categorical columns
n_train = X_train.select_dtypes(exclude='object').copy()
n_test = X_test.select_dtypes(exclude='object').copy()
c_train = X_train.select_dtypes(include='object').copy()
c_test = X_test.select_dtypes(include='object').copy()

# Wrap this in a function so the data can be reloaded fresh each time
def get_nc_data():
    X_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_train.csv')
    y_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\y_train.csv')
    X_test = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_test.csv')
    n_train = X_train.select_dtypes(exclude='object').copy()
    n_test = X_test.select_dtypes(exclude='object').copy()
    c_train = X_train.select_dtypes(include='object').copy()
    c_test = X_test.select_dtypes(include='object').copy()
    return n_train, n_test, c_train, c_test

n_train, n_test, c_train, c_test = get_nc_data()  # reload the data
In [52]:
# Numeric columns - min-max scaling
cols = ['age', 'fnlwgt', 'education.num', 'capital.gain', 'capital.loss', 'hours.per.week']
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
n_train[cols] = scaler.fit_transform(n_train[cols])
n_test[cols] = scaler.transform(n_test[cols])
In [54]:
# Label encoding
cols = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']
from sklearn.preprocessing import LabelEncoder
for col in cols:
    le = LabelEncoder()
    c_train[col] = le.fit_transform(c_train[col])
    c_test[col] = le.transform(c_test[col])
In [55]:
# Recombine the separated data
X_train = pd.concat([n_train, c_train], axis = 1)
X_test = pd.concat([n_test, c_test], axis = 1)
print(X_train.shape, X_test.shape)
X_train.head()
(29304, 15) (3257, 15)
Out[55]:
id | age | fnlwgt | education.num | capital.gain | capital.loss | hours.per.week | workclass | education | marital.status | occupation | relationship | race | sex | native.country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3331 | 0.562500 | 0.112092 | 0.600000 | 0.04386 | 0.000000 | 0.397959 | 6 | 15 | 2 | 9 | 0 | 2 | 1 | 38 |
1 | 19749 | 0.750000 | 0.189060 | 0.533333 | 0.00000 | 0.000000 | 0.397959 | 3 | 11 | 2 | 2 | 0 | 4 | 1 | 38 |
2 | 1157 | 0.671875 | 0.077184 | 0.600000 | 0.00000 | 0.383150 | 0.377551 | 3 | 15 | 6 | 3 | 4 | 2 | 0 | 38 |
3 | 693 | 0.750000 | 0.059785 | 0.600000 | 0.00000 | 0.436639 | 0.397959 | 3 | 15 | 2 | 10 | 0 | 4 | 1 | 38 |
4 | 12522 | 0.617188 | 0.124541 | 0.600000 | 0.00000 | 0.000000 | 0.540816 | 3 | 15 | 2 | 13 | 0 | 4 | 1 | 38 |
In [57]:
# Check the data
# (note: NaN can reappear here, e.g. in hours.per.week, because get_nc_data() reloads the raw files and skips the earlier imputation)
X_test.head()
Out[57]:
id | age | fnlwgt | education.num | capital.gain | capital.loss | hours.per.week | workclass | education | marital.status | occupation | relationship | race | sex | native.country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 11574 | 0.601562 | 0.069118 | 0.800000 | 0.0 | 0.0 | 0.397959 | 6 | 9 | 4 | 3 | 1 | 4 | 0 | 38 |
1 | 15847 | 0.593750 | 0.164239 | 0.600000 | 0.0 | 0.0 | 0.397959 | 3 | 15 | 3 | 9 | 3 | 2 | 0 | 38 |
2 | 17655 | 0.640625 | 0.029278 | 0.533333 | 0.0 | 0.0 | NaN | 6 | 11 | 4 | 2 | 1 | 4 | 1 | 38 |
3 | 19790 | 0.664062 | 0.010697 | 0.800000 | 0.0 | 0.0 | 0.500000 | 3 | 9 | 2 | 3 | 0 | 4 | 1 | 38 |
4 | 31812 | 0.781250 | 0.118394 | 0.533333 | 0.0 | 0.0 | 0.397959 | 8 | 11 | 4 | 14 | 1 | 4 | 1 | 38 |