라이브러리 및 데이터 불러오기 그리고 심플 EDA¶
데이터: 성인 인구조사 소득 예측¶
- id : 고유의 id 값
- age : 나이
- workclass : 고용 형태
- fnlwgt : 사람 대표성을 나타내는 가중치 (final weight의 약자)
- education : 교육 수준
- education_num : 교육 수준 수치
- marital_status: 결혼 상태
- occupation : 업종
- relationship : 가족 관계
- race : 인종
- sex : 성별
- capital_gain : 양도 소득
- capital_loss : 양도 손실
- hours_per_week : 주당 근무 시간
- native_country : 국적
- income : 수익(예측해야 하는 값)
- "> 50K" : 1
- "<= 50K" : 0
라이브러리 불러오기¶
In [2]:
# 판다스 라이브러리
import pandas as pd
데이터 불러오기 (3개 일때)¶
- 3개의 데이터가 주어지는 경우
- data_atype.zip (수업노트에서 다운로드)
In [4]:
# 데이터 불러오기
X_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_train.csv')
y_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\y_train.csv')
X_test = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_test.csv')
EDA¶
In [5]:
# 데이터 샘플(상위 5개)
X_train.head()
Out[5]:
id | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3331 | 34.0 | State-gov | 177331 | Some-college | 10 | Married-civ-spouse | Prof-specialty | Husband | Black | Male | 4386 | 0 | 40.0 | United-States |
1 | 19749 | 58.0 | Private | 290661 | HS-grad | 9 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 40.0 | United-States |
2 | 1157 | 48.0 | Private | 125933 | Some-college | 10 | Widowed | Exec-managerial | Unmarried | Black | Female | 0 | 1669 | 38.0 | United-States |
3 | 693 | 58.0 | Private | 100313 | Some-college | 10 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 1902 | 40.0 | United-States |
4 | 12522 | 41.0 | Private | 195661 | Some-college | 10 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 0 | 54.0 | United-States |
In [6]:
# 데이터 샘플(하위 5개)
X_train.tail()
Out[6]:
id | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29299 | 15999 | 28.0 | Private | 47168 | 10th | 6 | Never-married | Machine-op-inspct | Own-child | White | Female | 0 | 0 | 40.0 | United-States |
29300 | 21604 | 44.0 | Local-gov | 231793 | Doctorate | 16 | Married-spouse-absent | Prof-specialty | Unmarried | White | Female | 0 | 0 | 38.0 | United-States |
29301 | 26839 | 41.0 | Local-gov | 201435 | HS-grad | 9 | Married-civ-spouse | Transport-moving | Husband | Black | Male | 0 | 0 | 40.0 | United-States |
29302 | 16681 | 43.0 | Private | 137722 | HS-grad | 9 | Married-civ-spouse | Handlers-cleaners | Husband | White | Male | 0 | 0 | 40.0 | United-States |
29303 | 16069 | 22.0 | Private | 406978 | Bachelors | 13 | Never-married | Exec-managerial | Other-relative | White | Female | 0 | 0 | 40.0 | United-States |
In [8]:
# 데이터 샘플(랜덤)
X_train.sample(3)
Out[8]:
id | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
25338 | 13469 | 20.0 | Private | 279763 | 11th | 7 | Never-married | Craft-repair | Not-in-family | Black | Male | 0 | 0 | 25.0 | United-States |
19030 | 21822 | 45.0 | State-gov | 90803 | Some-college | 10 | Married-civ-spouse | Adm-clerical | Husband | White | Male | 0 | 0 | 40.0 | United-States |
16768 | 28390 | 36.0 | Private | 156352 | 9th | 5 | Never-married | Handlers-cleaners | Own-child | White | Male | 0 | 0 | 40.0 | United-States |
In [9]:
# 데이터 크기
X_train.shape
### 괄호 없음 주의!!!
Out[9]:
(29304, 15)
In [10]:
# 타입
X_train.info()
# float, int : 수치형
# object : 범주형
<class 'pandas.core.frame.DataFrame'> RangeIndex: 29304 entries, 0 to 29303 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 29304 non-null int64 1 age 29292 non-null float64 2 workclass 27642 non-null object 3 fnlwgt 29304 non-null int64 4 education 29304 non-null object 5 education.num 29304 non-null int64 6 marital.status 29304 non-null object 7 occupation 27636 non-null object 8 relationship 29304 non-null object 9 race 29304 non-null object 10 sex 29304 non-null object 11 capital.gain 29304 non-null int64 12 capital.loss 29304 non-null int64 13 hours.per.week 29291 non-null float64 14 native.country 28767 non-null object dtypes: float64(2), int64(5), object(8) memory usage: 3.4+ MB
In [11]:
# 수치형 컬럼 통계값 확인(train)
X_train.describe()
# age에 -값 -> 이상치인가?
# 각 컬럼별로 통계값들을 가지고 결측치, 이상치 대략 판별 / 데이터 이해
Out[11]:
id | age | fnlwgt | education.num | capital.gain | capital.loss | hours.per.week | |
---|---|---|---|---|---|---|---|
count | 29304.000000 | 29292.000000 | 2.930400e+04 | 29304.000000 | 29304.000000 | 29304.000000 | 29291.000000 |
mean | 16264.027880 | 38.553223 | 1.897488e+05 | 10.080842 | 1093.858722 | 86.744506 | 40.434229 |
std | 9384.518323 | 13.628811 | 1.055250e+05 | 2.570824 | 7477.435640 | 401.518928 | 12.324036 |
min | 0.000000 | -38.000000 | 1.228500e+04 | 1.000000 | 0.000000 | 0.000000 | 1.000000 |
25% | 8145.750000 | 28.000000 | 1.177890e+05 | 9.000000 | 0.000000 | 0.000000 | 40.000000 |
50% | 16253.500000 | 37.000000 | 1.783765e+05 | 10.000000 | 0.000000 | 0.000000 | 40.000000 |
75% | 24374.250000 | 48.000000 | 2.370682e+05 | 12.000000 | 0.000000 | 0.000000 | 45.000000 |
max | 32560.000000 | 90.000000 | 1.484705e+06 | 16.000000 | 99999.000000 | 4356.000000 | 99.000000 |
In [12]:
# 수치형 컬럼 통계값 확인(test)
X_test.describe()
Out[12]:
id | age | fnlwgt | education.num | capital.gain | capital.loss | hours.per.week | |
---|---|---|---|---|---|---|---|
count | 3257.000000 | 3251.000000 | 3.257000e+03 | 3257.000000 | 3257.000000 | 3257.000000 | 3248.000000 |
mean | 16423.704943 | 38.802830 | 1.900447e+05 | 10.079214 | 931.804728 | 92.336199 | 40.468288 |
std | 9535.416746 | 13.917588 | 1.057902e+05 | 2.590118 | 6496.962999 | 415.732721 | 12.598546 |
min | 3.000000 | 17.000000 | 1.882700e+04 | 1.000000 | 0.000000 | 0.000000 | 1.000000 |
25% | 8078.000000 | 28.000000 | 1.186520e+05 | 9.000000 | 0.000000 | 0.000000 | 40.000000 |
50% | 16626.000000 | 37.000000 | 1.783190e+05 | 10.000000 | 0.000000 | 0.000000 | 40.000000 |
75% | 24743.000000 | 48.000000 | 2.364360e+05 | 12.000000 | 0.000000 | 0.000000 | 45.000000 |
max | 32559.000000 | 90.000000 | 1.033222e+06 | 16.000000 | 99999.000000 | 3900.000000 | 99.000000 |
In [15]:
# 범주형 컬럼 통계값 확인 (train)
X_train.describe(include='object')
X_train.describe(include='O') # 대문자 O
# 둘중 아무거나 해도됨
# top : 최빈값
# freq : 최빈값의 빈도수
Out[15]:
workclass | education | marital.status | occupation | relationship | race | sex | native.country | |
---|---|---|---|---|---|---|---|---|
count | 27642 | 29304 | 29304 | 27636 | 29304 | 29304 | 29304 | 28767 |
unique | 8 | 16 | 7 | 14 | 6 | 5 | 2 | 41 |
top | Private | HS-grad | Married-civ-spouse | Prof-specialty | Husband | White | Male | United-States |
freq | 20451 | 9449 | 13466 | 3683 | 11845 | 25022 | 19578 | 26240 |
In [17]:
# 범주형 컬럼 통계값 확인 (test)
X_test.describe(include = 'object')
# train 데이터와 test 데이터의 unique 값이 같은지 확인
# unique 값이 같아도 값이 다를 수 있지만 일단 수치 확인
# country 차이가 있음 확인
Out[17]:
workclass | education | marital.status | occupation | relationship | race | sex | native.country | |
---|---|---|---|---|---|---|---|---|
count | 3083 | 3257 | 3257 | 3082 | 3257 | 3257 | 3257 | 3211 |
unique | 8 | 16 | 7 | 14 | 6 | 5 | 2 | 37 |
top | Private | HS-grad | Married-civ-spouse | Prof-specialty | Husband | White | Male | United-States |
freq | 2245 | 1052 | 1510 | 457 | 1348 | 2794 | 2212 | 2930 |
In [18]:
# 결측치 (train)
X_train.isnull().sum()
Out[18]:
id 0 age 12 workclass 1662 fnlwgt 0 education 0 education.num 0 marital.status 0 occupation 1668 relationship 0 race 0 sex 0 capital.gain 0 capital.loss 0 hours.per.week 13 native.country 537 dtype: int64
In [20]:
# 결측치 (test)
X_test.isnull().sum()
Out[20]:
id 0 age 6 workclass 174 fnlwgt 0 education 0 education.num 0 marital.status 0 occupation 175 relationship 0 race 0 sex 0 capital.gain 0 capital.loss 0 hours.per.week 9 native.country 46 dtype: int64
In [21]:
# y_train 샘플 확인
y_train.head()
Out[21]:
id | income | |
---|---|---|
0 | 3331 | >50K |
1 | 19749 | <=50K |
2 | 1157 | <=50K |
3 | 693 | >50K |
4 | 12522 | <=50K |
In [23]:
# target(label)별 개수 확인
y_train['income'].value_counts()
Out[23]:
<=50K 22263 >50K 7041 Name: income, dtype: int64
In [24]:
# 상관관계
X_train.corr()
# 시험에서는 시각화를 지원하지 않아 이렇게 상관관계를 파악
C:\Users\Master\AppData\Local\Temp\ipykernel_16564\3366806260.py:2: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning. X_train.corr()
Out[24]:
id | age | fnlwgt | education.num | capital.gain | capital.loss | hours.per.week | |
---|---|---|---|---|---|---|---|
id | 1.000000 | -0.077323 | 0.003550 | -0.090382 | -0.221315 | -0.358569 | -0.056766 |
age | -0.077323 | 1.000000 | -0.075860 | 0.035072 | 0.077384 | 0.059372 | 0.068504 |
fnlwgt | 0.003550 | -0.075860 | 1.000000 | -0.039059 | 0.000265 | -0.009576 | -0.018948 |
education.num | -0.090382 | 0.035072 | -0.039059 | 1.000000 | 0.125894 | 0.079391 | 0.147519 |
capital.gain | -0.221315 | 0.077384 | 0.000265 | 0.125894 | 1.000000 | -0.031605 | 0.081179 |
capital.loss | -0.358569 | 0.059372 | -0.009576 | 0.079391 | -0.031605 | 1.000000 | 0.053508 |
hours.per.week | -0.056766 | 0.068504 | -0.018948 | 0.147519 | 0.081179 | 0.053508 | 1.000000 |
데이터 불러오기 (2개 일때)¶
- 2개의 데이터가 주어지는 경우
- data_btype.zip (수업노트에서 다운로드)
In [27]:
# 데이터 불러오기
train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_btype\train.csv')
test = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_btype\test.csv')
In [31]:
# 데이터 크기
train.shape, test.shape
# 시험에서 데이터가 2개가 주어질지 3개가 주어질지 모름 -> 모두 연습 필요
Out[31]:
((29304, 16), (3257, 15))
In [29]:
# 데이터 샘플
train.sample(3)
# income이 3개일때랑 다르게 train에 함께 들어가 있음
Out[29]:
id | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
19682 | 26519 | 21.0 | Private | 89991 | Some-college | 10 | Never-married | Adm-clerical | Own-child | White | Female | 0 | 0 | 40.0 | United-States | <=50K |
20395 | 410 | 52.0 | Private | 139671 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 1977 | 50.0 | United-States | >50K |
3233 | 3149 | 52.0 | State-gov | 135388 | HS-grad | 9 | Married-civ-spouse | Craft-repair | Husband | White | Male | 5013 | 0 | 40.0 | United-States | <=50K |
In [32]:
# 예측 해야할 값
train['income'].value_counts()
Out[32]:
<=50K 22263 >50K 7041 Name: income, dtype: int64
In [33]:
# 조건
cond0 = train['income'] == '<=50K' # 0 하위 소득
cond1 = train['income'] == '>50K' # 1 상위 소득
cond_female = train['sex'] == 'Female'
cond_male = train['sex'] == "Male"
In [35]:
# 남성과 여성의 수
len(train[cond_male]), len(train[cond_female])
Out[35]:
(19578, 9726)
In [36]:
# 남성 중 0과 1 (인원 수)
len(train[cond0&cond_male]), len(train[cond1&cond_male])
Out[36]:
(13602, 5976)
In [37]:
# 남성 중 0과 1 (비율)
len(train[cond0&cond_male])/len(train[cond_male]), len(train[cond1&cond_male])/len(train[cond_male])
Out[37]:
(0.6947594238430892, 0.3052405761569108)
In [38]:
# 여성 중 0과 1 (인원 수)
len(train[cond0&cond_female]), len(train[cond1&cond_female])
Out[38]:
(8661, 1065)
In [39]:
# 여성 중 0과 1 (비율)
len(train[cond0&cond_female])/len(train[cond_female]), len(train[cond1&cond_female])/len(train[cond_female])
Out[39]:
(0.8904996915484269, 0.1095003084515731)
추가적인 EDA¶
- 성별에 따른 소득 수준
- 결혼 유무에 따른 소득 수준
- 직업에 따른 소득 수준
- 가족 관계에 따른 소득 수준
- 국적/백인에 따른 소득 수준 (불균형 미국, 백인 데이터가 대부분임)
주어진 데이터 합치기, 분리¶
In [ ]:
# 데이터를 자유롭게 사용하기 위해 합치거나 분리함
In [40]:
# 데이터 크기
train.shape, X_train.shape, y_train.shape
# X_train과 y_train을 합쳤을때 train데이터와 컬럼수가 맞지 않음
Out[40]:
((29304, 16), (29304, 15), (29304, 2))
In [42]:
# 데이터 확인
y_train.head(1)
# y_train에 id컬럼이 중복으로 들어가있음
Out[42]:
id | income | |
---|---|---|
0 | 3331 | >50K |
In [43]:
# X_train y_train 합치는 것 예시
df = pd.concat([X_train, y_train['income']], axis=1)
df.shape
Out[43]:
(29304, 16)
In [44]:
# 데이터 확인
df.head()
Out[44]:
id | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3331 | 34.0 | State-gov | 177331 | Some-college | 10 | Married-civ-spouse | Prof-specialty | Husband | Black | Male | 4386 | 0 | 40.0 | United-States | >50K |
1 | 19749 | 58.0 | Private | 290661 | HS-grad | 9 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 40.0 | United-States | <=50K |
2 | 1157 | 48.0 | Private | 125933 | Some-college | 10 | Widowed | Exec-managerial | Unmarried | Black | Female | 0 | 1669 | 38.0 | United-States | <=50K |
3 | 693 | 58.0 | Private | 100313 | Some-college | 10 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 1902 | 40.0 | United-States | >50K |
4 | 12522 | 41.0 | Private | 195661 | Some-college | 10 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 0 | 54.0 | United-States | <=50K |
In [45]:
# train 분리 예시
X_tr = train.iloc[:,:-1].copy()
y_tr = train.iloc[:, [0,-1]].copy()
X_tr.shape, y_tr.shape
# copy를 써주지 않으면 warning이 뜰수있음 - 복사시 쓰는 것을 습관들이기
Out[45]:
((29304, 15), (29304, 2))
In [46]:
# 데이터 확인
y_tr.head()
Out[46]:
id | income | |
---|---|---|
0 | 3331 | >50K |
1 | 19749 | <=50K |
2 | 1157 | <=50K |
3 | 693 | >50K |
4 | 12522 | <=50K |
In [47]:
X_tr.head()
Out[47]:
id | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3331 | 34.0 | State-gov | 177331 | Some-college | 10 | Married-civ-spouse | Prof-specialty | Husband | Black | Male | 4386 | 0 | 40.0 | United-States |
1 | 19749 | 58.0 | Private | 290661 | HS-grad | 9 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 40.0 | United-States |
2 | 1157 | 48.0 | Private | 125933 | Some-college | 10 | Widowed | Exec-managerial | Unmarried | Black | Female | 0 | 1669 | 38.0 | United-States |
3 | 693 | 58.0 | Private | 100313 | Some-college | 10 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 1902 | 40.0 | United-States |
4 | 12522 | 41.0 | Private | 195661 | Some-college | 10 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 0 | 54.0 | United-States |
In [ ]: