라이브러리 및 데이터 불러오기 그리고 심플 EDA¶

데이터: 성인 인구조사 소득 예측¶

id : 고유의 id 값
age : 나이
workclass : 고용 형태
fnlwgt : 사람 대표성을 나타내는 가중치 (final weight의 약자)
education : 교육 수준
education_num : 교육 수준 수치
marital_status: 결혼 상태
occupation : 업종
relationship : 가족 관계
race : 인종
sex : 성별
capital_gain : 양도 소득
capital_loss : 양도 손실
hours_per_week : 주당 근무 시간
native_country : 국적
income : 수익(예측해야 하는 값)
- "> 50K" : 1
- "<= 50K" : 0

라이브러리 불러오기¶

In [2]:

# 판다스 라이브러리
import pandas as pd

데이터 불러오기 (3개 일때)¶

3개의 데이터가 주어지는 경우
data_atype.zip (수업노트에서 다운로드)

In [4]:

# 데이터 불러오기
X_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_train.csv')
y_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\y_train.csv')
X_test = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_test.csv')

EDA¶

In [5]:

# 데이터 샘플(상위 5개)
X_train.head()

Out[5]:

	id	age	workclass	fnlwgt	education	education.num	marital.status	occupation	relationship	race	sex	capital.gain	capital.loss	hours.per.week	native.country
0	3331	34.0	State-gov	177331	Some-college	10	Married-civ-spouse	Prof-specialty	Husband	Black	Male	4386	0	40.0	United-States
1	19749	58.0	Private	290661	HS-grad	9	Married-civ-spouse	Craft-repair	Husband	White	Male	0	0	40.0	United-States
2	1157	48.0	Private	125933	Some-college	10	Widowed	Exec-managerial	Unmarried	Black	Female	0	1669	38.0	United-States
3	693	58.0	Private	100313	Some-college	10	Married-civ-spouse	Protective-serv	Husband	White	Male	0	1902	40.0	United-States
4	12522	41.0	Private	195661	Some-college	10	Married-civ-spouse	Transport-moving	Husband	White	Male	0	0	54.0	United-States

In [6]:

# 데이터 샘플(하위 5개)
X_train.tail()

Out[6]:

	id	age	workclass	fnlwgt	education	education.num	marital.status	occupation	relationship	race	sex	hours.per.week	native.country
29299	15999	28.0	Private	47168	10th	6	Never-married	Machine-op-inspct	Own-child	White	Female	40.0	United-States
29300	21604	44.0	Local-gov	231793	Doctorate	16	Married-spouse-absent	Prof-specialty	Unmarried	White	Female	38.0	United-States
29301	26839	41.0	Local-gov	201435	HS-grad	9	Married-civ-spouse	Transport-moving	Husband	Black	Male	40.0	United-States
29302	16681	43.0	Private	137722	HS-grad	9	Married-civ-spouse	Handlers-cleaners	Husband	White	Male	40.0	United-States
29303	16069	22.0	Private	406978	Bachelors	13	Never-married	Exec-managerial	Other-relative	White	Female	40.0	United-States

In [8]:

# 데이터 샘플(랜덤)
X_train.sample(3)

Out[8]:

	id	age	workclass	fnlwgt	education	education.num	marital.status	occupation	relationship	race	sex	hours.per.week	native.country
25338	13469	20.0	Private	279763	11th	7	Never-married	Craft-repair	Not-in-family	Black	Male	25.0	United-States
19030	21822	45.0	State-gov	90803	Some-college	10	Married-civ-spouse	Adm-clerical	Husband	White	Male	40.0	United-States
16768	28390	36.0	Private	156352	9th	5	Never-married	Handlers-cleaners	Own-child	White	Male	40.0	United-States

In [9]:

# 데이터 크기
X_train.shape
### 괄호 없음 주의!!!

Out[9]:

(29304, 15)

In [10]:

# 타입
X_train.info()
# float, int : 수치형
# object : 범주형

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29304 entries, 0 to 29303
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              29304 non-null  int64  
 1   age             29292 non-null  float64
 2   workclass       27642 non-null  object 
 3   fnlwgt          29304 non-null  int64  
 4   education       29304 non-null  object 
 5   education.num   29304 non-null  int64  
 6   marital.status  29304 non-null  object 
 7   occupation      27636 non-null  object 
 8   relationship    29304 non-null  object 
 9   race            29304 non-null  object 
 10  sex             29304 non-null  object 
 11  capital.gain    29304 non-null  int64  
 12  capital.loss    29304 non-null  int64  
 13  hours.per.week  29291 non-null  float64
 14  native.country  28767 non-null  object 
dtypes: float64(2), int64(5), object(8)
memory usage: 3.4+ MB

In [11]:

# 수치형 컬럼 통계값 확인(train)
X_train.describe()
# age에 -값 -> 이상치인가?
# 각 컬럼별로 통계값들을 가지고 결측치, 이상치 대략 판별 / 데이터 이해

Out[11]:

	id	age	fnlwgt	education.num	capital.gain	capital.loss	hours.per.week
count	29304.000000	29292.000000	2.930400e+04	29304.000000	29304.000000	29304.000000	29291.000000
mean	16264.027880	38.553223	1.897488e+05	10.080842	1093.858722	86.744506	40.434229
std	9384.518323	13.628811	1.055250e+05	2.570824	7477.435640	401.518928	12.324036
min	0.000000	-38.000000	1.228500e+04	1.000000	0.000000	0.000000	1.000000
25%	8145.750000	28.000000	1.177890e+05	9.000000	0.000000	0.000000	40.000000
50%	16253.500000	37.000000	1.783765e+05	10.000000	0.000000	0.000000	40.000000
75%	24374.250000	48.000000	2.370682e+05	12.000000	0.000000	0.000000	45.000000
max	32560.000000	90.000000	1.484705e+06	16.000000	99999.000000	4356.000000	99.000000

In [12]:

# 수치형 컬럼 통계값 확인(test)
X_test.describe()

Out[12]:

	id	age	fnlwgt	education.num	capital.gain	capital.loss	hours.per.week
count	3257.000000	3251.000000	3.257000e+03	3257.000000	3257.000000	3257.000000	3248.000000
mean	16423.704943	38.802830	1.900447e+05	10.079214	931.804728	92.336199	40.468288
std	9535.416746	13.917588	1.057902e+05	2.590118	6496.962999	415.732721	12.598546
min	3.000000	17.000000	1.882700e+04	1.000000	0.000000	0.000000	1.000000
25%	8078.000000	28.000000	1.186520e+05	9.000000	0.000000	0.000000	40.000000
50%	16626.000000	37.000000	1.783190e+05	10.000000	0.000000	0.000000	40.000000
75%	24743.000000	48.000000	2.364360e+05	12.000000	0.000000	0.000000	45.000000
max	32559.000000	90.000000	1.033222e+06	16.000000	99999.000000	3900.000000	99.000000

In [15]:

# 범주형 컬럼 통계값 확인 (train)
X_train.describe(include='object')
X_train.describe(include='O')   # 대문자 O
# 둘중 아무거나 해도됨

# top : 최빈값
# freq : 최빈값의 빈도수

Out[15]:

	workclass	education	marital.status	occupation	relationship	race	sex	native.country
count	27642	29304	29304	27636	29304	29304	29304	28767
unique	8	16	7	14	6	5	2	41
top	Private	HS-grad	Married-civ-spouse	Prof-specialty	Husband	White	Male	United-States
freq	20451	9449	13466	3683	11845	25022	19578	26240

In [17]:

# 범주형 컬럼 통계값 확인 (test)
X_test.describe(include = 'object')

# train 데이터와 test 데이터의 unique 값이 같은지 확인
# unique 값이 같아도 값이 다를 수 있지만 일단 수치 확인
# country 차이가 있음 확인

Out[17]:

	workclass	education	marital.status	occupation	relationship	race	sex	native.country
count	3083	3257	3257	3082	3257	3257	3257	3211
unique	8	16	7	14	6	5	2	37
top	Private	HS-grad	Married-civ-spouse	Prof-specialty	Husband	White	Male	United-States
freq	2245	1052	1510	457	1348	2794	2212	2930

In [18]:

# 결측치 (train)
X_train.isnull().sum()

Out[18]:

id                   0
age                 12
workclass         1662
fnlwgt               0
education            0
education.num        0
marital.status       0
occupation        1668
relationship         0
race                 0
sex                  0
capital.gain         0
capital.loss         0
hours.per.week      13
native.country     537
dtype: int64

In [20]:

# 결측치 (test)
X_test.isnull().sum()

Out[20]:

id                  0
age                 6
workclass         174
fnlwgt              0
education           0
education.num       0
marital.status      0
occupation        175
relationship        0
race                0
sex                 0
capital.gain        0
capital.loss        0
hours.per.week      9
native.country     46
dtype: int64

In [21]:

# y_train 샘플 확인
y_train.head()

Out[21]:

	id	income
0	3331	>50K
1	19749	<=50K
2	1157	<=50K
3	693	>50K
4	12522	<=50K

In [23]:

# target(label)별 개수 확인
y_train['income'].value_counts()

Out[23]:

<=50K    22263
>50K      7041
Name: income, dtype: int64

In [24]:

# 상관관계
X_train.corr()
# 시험에서는 시각화를 지원하지 않아 이렇게 상관관계를 파악

C:\Users\Master\AppData\Local\Temp\ipykernel_16564\3366806260.py:2: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  X_train.corr()

Out[24]:

	id	age	fnlwgt	education.num	capital.gain	capital.loss	hours.per.week
id	1.000000	-0.077323	0.003550	-0.090382	-0.221315	-0.358569	-0.056766
age	-0.077323	1.000000	-0.075860	0.035072	0.077384	0.059372	0.068504
fnlwgt	0.003550	-0.075860	1.000000	-0.039059	0.000265	-0.009576	-0.018948
education.num	-0.090382	0.035072	-0.039059	1.000000	0.125894	0.079391	0.147519
capital.gain	-0.221315	0.077384	0.000265	0.125894	1.000000	-0.031605	0.081179
capital.loss	-0.358569	0.059372	-0.009576	0.079391	-0.031605	1.000000	0.053508
hours.per.week	-0.056766	0.068504	-0.018948	0.147519	0.081179	0.053508	1.000000

데이터 불러오기 (2개 일때)¶

2개의 데이터가 주어지는 경우
data_btype.zip (수업노트에서 다운로드)

In [27]:

# 데이터 불러오기
train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_btype\train.csv')
test = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_btype\test.csv')

In [31]:

# 데이터 크기
train.shape, test.shape
# 시험에서 데이터가 2개가 주어질지 3개가 주어질지 모름 -> 모두 연습 필요

Out[31]:

((29304, 16), (3257, 15))

In [29]:

# 데이터 샘플
train.sample(3)
# income이 3개일때랑 다르게 train에 함께 들어가 있음

Out[29]:

	id	age	workclass	fnlwgt	education	education.num	marital.status	occupation	relationship	race	sex	capital.gain	capital.loss	hours.per.week	native.country	income
19682	26519	21.0	Private	89991	Some-college	10	Never-married	Adm-clerical	Own-child	White	Female	0	0	40.0	United-States	<=50K
20395	410	52.0	Private	139671	Some-college	10	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	1977	50.0	United-States	>50K
3233	3149	52.0	State-gov	135388	HS-grad	9	Married-civ-spouse	Craft-repair	Husband	White	Male	5013	0	40.0	United-States	<=50K

In [32]:

# 예측 해야할 값
train['income'].value_counts()

Out[32]:

<=50K    22263
>50K      7041
Name: income, dtype: int64

In [33]:

# 조건
cond0 = train['income'] == '<=50K'  # 0 하위 소득
cond1 = train['income'] == '>50K'   # 1 상위 소득

cond_female = train['sex'] == 'Female'
cond_male = train['sex'] == "Male"

In [35]:

# 남성과 여성의 수
len(train[cond_male]), len(train[cond_female])

Out[35]:

(19578, 9726)

In [36]:

# 남성 중 0과 1 (인원 수)
len(train[cond0&cond_male]), len(train[cond1&cond_male])

Out[36]:

(13602, 5976)

In [37]:

# 남성 중 0과 1 (비율)
len(train[cond0&cond_male])/len(train[cond_male]), len(train[cond1&cond_male])/len(train[cond_male])

Out[37]:

(0.6947594238430892, 0.3052405761569108)

In [38]:

# 여성 중 0과 1 (인원 수)
len(train[cond0&cond_female]), len(train[cond1&cond_female])

Out[38]:

(8661, 1065)

In [39]:

# 여성 중 0과 1 (비율)
len(train[cond0&cond_female])/len(train[cond_female]), len(train[cond1&cond_female])/len(train[cond_female])

Out[39]:

(0.8904996915484269, 0.1095003084515731)

추가적인 EDA¶

성별에 따른 소득 수준
결혼 유무에 따른 소득 수준
직업에 따른 소득 수준
가족 관계에 따른 소득 수준
국적/백인에 따른 소득 수준 (불균형 미국, 백인 데이터가 대부분임)

주어진 데이터 합치기, 분리¶

In [ ]:

# 데이터를 자유롭게 사용하기 위해 합치거나 분리함

In [40]:

# 데이터 크기
train.shape, X_train.shape, y_train.shape
# X_train과 y_train을 합쳤을때 train데이터와 컬럼수가 맞지 않음

Out[40]:

((29304, 16), (29304, 15), (29304, 2))

In [42]:

# 데이터 확인
y_train.head(1)
# y_train에 id컬럼이 중복으로 들어가있음

Out[42]:

	id	income
0	3331	>50K

In [43]:

# X_train y_train 합치는 것 예시
df = pd.concat([X_train, y_train['income']], axis=1)
df.shape

Out[43]:

(29304, 16)

In [44]:

# 데이터 확인
df.head()

Out[44]:

	id	age	workclass	fnlwgt	education	education.num	marital.status	occupation	relationship	race	sex	capital.gain	capital.loss	hours.per.week	native.country	income
0	3331	34.0	State-gov	177331	Some-college	10	Married-civ-spouse	Prof-specialty	Husband	Black	Male	4386	0	40.0	United-States	>50K
1	19749	58.0	Private	290661	HS-grad	9	Married-civ-spouse	Craft-repair	Husband	White	Male	0	0	40.0	United-States	<=50K
2	1157	48.0	Private	125933	Some-college	10	Widowed	Exec-managerial	Unmarried	Black	Female	0	1669	38.0	United-States	<=50K
3	693	58.0	Private	100313	Some-college	10	Married-civ-spouse	Protective-serv	Husband	White	Male	0	1902	40.0	United-States	>50K
4	12522	41.0	Private	195661	Some-college	10	Married-civ-spouse	Transport-moving	Husband	White	Male	0	0	54.0	United-States	<=50K

In [45]:

# train 분리 예시
X_tr = train.iloc[:,:-1].copy()
y_tr = train.iloc[:, [0,-1]].copy()
X_tr.shape, y_tr.shape
# copy를 써주지 않으면 warning이 뜰수있음 - 복사시 쓰는 것을 습관들이기

Out[45]:

((29304, 15), (29304, 2))

In [46]:

# 데이터 확인
y_tr.head()

Out[46]:

	id	income
0	3331	>50K
1	19749	<=50K
2	1157	<=50K
3	693	>50K
4	12522	<=50K

In [47]:

X_tr.head()

Out[47]:

	id	age	workclass	fnlwgt	education	education.num	marital.status	occupation	relationship	race	sex	capital.gain	capital.loss	hours.per.week	native.country
0	3331	34.0	State-gov	177331	Some-college	10	Married-civ-spouse	Prof-specialty	Husband	Black	Male	4386	0	40.0	United-States
1	19749	58.0	Private	290661	HS-grad	9	Married-civ-spouse	Craft-repair	Husband	White	Male	0	0	40.0	United-States
2	1157	48.0	Private	125933	Some-college	10	Widowed	Exec-managerial	Unmarried	Black	Female	0	1669	38.0	United-States
3	693	58.0	Private	100313	Some-college	10	Married-civ-spouse	Protective-serv	Husband	White	Male	0	1902	40.0	United-States
4	12522	41.0	Private	195661	Some-college	10	Married-civ-spouse	Transport-moving	Husband	White	Male	0	0	54.0	United-States

In [ ]:

피처엔지니어링 (0)	2023.11.06
데이터 전처리 (0)	2023.11.06
머신러닝 프로세스 (0)	2023.11.06
시계열 데이터 (0)	2023.10.17
판다스 기초 (0)	2023.10.14

Recording me

데이터 불러오기 및 EDA

라이브러리 및 데이터 불러오기 그리고 심플 EDA¶

데이터: 성인 인구조사 소득 예측¶

라이브러리 불러오기¶

데이터 불러오기 (3개 일때)¶

EDA¶

데이터 불러오기 (2개 일때)¶

추가적인 EDA¶

주어진 데이터 합치기, 분리¶

'빅데이터 분석 기사 공부' 카테고리의 다른 글

티스토리툴바

데이터 불러오기 및 EDA

라이브러리 및 데이터 불러오기 그리고 심플 EDA¶

데이터: 성인 인구조사 소득 예측¶

라이브러리 불러오기¶

데이터 불러오기 (3개 일때)¶

EDA¶

데이터 불러오기 (2개 일때)¶

추가적인 EDA¶

주어진 데이터 합치기, 분리¶

'빅데이터 분석 기사 공부' 카테고리의 다른 글

관련글

티스토리툴바