Data: data_atype.zip (download from the class notes)

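Machine Learning¶

Problem definition, loading libraries/data
Exploratory data analysis (EDA)
Data preprocessing
Feature engineering
(Train/validation split)
Model selection/training/evaluation/tuning
Prediction
(Creating the csv)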

1. Baseline¶

Problem definition, loading libraries and data
Data preprocessing (simple bulk handling)
Model selection, training
Evaluation

2. Baseline¶

Train/validation split
Model selection, training
- Decision tree
- Random forest
- XGBoost
Evaluation

Problem 1¶

"<=50K -> 0"
">50K -> 1"
Metric: accuracy

In [2]:

# Load libraries and data
import pandas as pd
X_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_train.csv')
y_train = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\y_train.csv')
X_test = pd.read_csv(r'C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype\X_test.csv')

In [4]:

# Data shapes
X_train.shape, X_test.shape, y_train.shape

Out[4]:

((29304, 15), (3257, 15), (29304, 2))

In [5]:

# Data sample
X_train.head()

Out[5]:

	id	age	workclass	fnlwgt	education	education.num	marital.status	occupation	relationship	race	sex	capital.gain	capital.loss	hours.per.week	native.country
0	3331	34.0	State-gov	177331	Some-college	10	Married-civ-spouse	Prof-specialty	Husband	Black	Male	4386	0	40.0	United-States
1	19749	58.0	Private	290661	HS-grad	9	Married-civ-spouse	Craft-repair	Husband	White	Male	0	0	40.0	United-States
2	1157	48.0	Private	125933	Some-college	10	Widowed	Exec-managerial	Unmarried	Black	Female	0	1669	38.0	United-States
3	693	58.0	Private	100313	Some-college	10	Married-civ-spouse	Protective-serv	Husband	White	Male	0	1902	40.0	United-States
4	12522	41.0	Private	195661	Some-college	10	Married-civ-spouse	Transport-moving	Husband	White	Male	0	0	54.0	United-States

In [8]:

# Check the target
# The target column is not known yet, so look at all columns first
y_train.head()

Out[8]:

	id	income
0	3331	>50K
1	19749	<=50K
2	1157	<=50K
3	693	>50K
4	12522	<=50K

In [7]:

# Confirm that the target named in the problem is 'income', then count each class
y_train['income'].value_counts()

Out[7]:

<=50K    22263
>50K      7041
Name: income, dtype: int64

In [9]:

# Check dtypes
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29304 entries, 0 to 29303
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              29304 non-null  int64  
 1   age             29292 non-null  float64
 2   workclass       27642 non-null  object 
 3   fnlwgt          29304 non-null  int64  
 4   education       29304 non-null  object 
 5   education.num   29304 non-null  int64  
 6   marital.status  29304 non-null  object 
 7   occupation      27636 non-null  object 
 8   relationship    29304 non-null  object 
 9   race            29304 non-null  object 
 10  sex             29304 non-null  object 
 11  capital.gain    29304 non-null  int64  
 12  capital.loss    29304 non-null  int64  
 13  hours.per.week  29291 non-null  float64
 14  native.country  28767 non-null  object 
dtypes: float64(2), int64(5), object(8)
memory usage: 3.4+ MB

In [13]:

# Numeric data
n_train = X_train[X_train.columns[X_train.dtypes != 'object'][1:]]  # [1:] also drops id (the first numeric column)
n_train.head()

Out[13]:

	age	fnlwgt	education.num	capital.gain	capital.loss	hours.per.week
0	34.0	177331	10	4386	0	40.0
1	58.0	290661	9	0	0	40.0
2	48.0	125933	10	0	1669	38.0
3	58.0	100313	10	0	1902	40.0
4	41.0	195661	10	0	0	54.0

In [17]:

n_test = X_test[X_test.columns[X_test.dtypes != 'object'][1:]]  # [1:] also drops id (the first numeric column)
n_test.head()

Out[17]:

	age	fnlwgt	education.num	hours.per.week
0	39.0	114055	13	40.0
1	38.0	254114	10	40.0
2	44.0	55395	9	NaN
3	47.0	28035	13	50.0
4	62.0	186611	9	40.0
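The `[1:]` slice above depends on `id` being the first non-object column, and it is applied to the train and test frames separately, so nothing guarantees that both end up with the same columns. Below is a minimal sketch of one alternative, assuming it is acceptable to pick the numeric columns from the training frame and reuse that list for the test frame. The toy frames and the `toy_*` / `num_cols` names are made up for illustration and are not the real data.

```python
import pandas as pd

# Toy stand-ins for X_train / X_test (hypothetical values, not the real data)
toy_train = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [34.0, 58.0, None],
    "workclass": ["Private", None, "State-gov"],
    "capital.gain": [4386, 0, 0],
})
toy_test = pd.DataFrame({
    "id": [4, 5],
    "age": [39.0, 44.0],
    "workclass": ["Private", "Private"],
    "capital.gain": [0, 100],
})

# Pick numeric columns from the training frame and drop id by name
num_cols = toy_train.select_dtypes(include="number").columns.drop("id")

# Reuse the same column list for both frames so they always match
toy_n_train = toy_train[num_cols]
toy_n_test = toy_test[num_cols]

print(list(toy_n_train.columns))
print(list(toy_n_test.columns))
```

Dropping `id` by name rather than by position keeps the selection correct even if the column order changes.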

In [14]:

# Summary statistics for the numeric data
n_train.describe()

Out[14]:

	age	fnlwgt	education.num	capital.gain	capital.loss	hours.per.week
count	29292.000000	2.930400e+04	29304.000000	29304.000000	29304.000000	29291.000000
mean	38.553223	1.897488e+05	10.080842	1093.858722	86.744506	40.434229
std	13.628811	1.055250e+05	2.570824	7477.435640	401.518928	12.324036
min	-38.000000	1.228500e+04	1.000000	0.000000	0.000000	1.000000
25%	28.000000	1.177890e+05	9.000000	0.000000	0.000000	40.000000
50%	37.000000	1.783765e+05	10.000000	0.000000	0.000000	40.000000
75%	48.000000	2.370682e+05	12.000000	0.000000	0.000000	45.000000
max	90.000000	1.484705e+06	16.000000	99999.000000	4356.000000	99.000000

In [15]:

# Missing values
n_train.isnull().sum()

Out[15]:

age               12
fnlwgt             0
education.num      0
capital.gain       0
capital.loss       0
hours.per.week    13
dtype: int64

In [18]:

# Simple missing-value handling
### The test data must always be handled the same way
n_train = n_train.fillna(0)
n_test = n_test.fillna(0)

In [19]:

# Confirm that no missing values remain
n_train.isnull().sum(), n_test.isnull().sum()

Out[19]:

(age               0
 fnlwgt            0
 education.num     0
 capital.gain      0
 capital.loss      0
 hours.per.week    0
 dtype: int64,
 age               0
 fnlwgt            0
 education.num     0
 capital.gain      0
 capital.loss      0
 hours.per.week    0
 dtype: int64)
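Filling with 0 is the quickest option for a baseline, but it produces values such as age 0 or 0 hours per week. As a hedged alternative (not what this post does), here is a minimal sketch that fills missing values with the median computed on the training data only and applies the same medians to the test data. The toy frames are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for n_train / n_test (hypothetical values)
toy_train = pd.DataFrame({
    "age": [20.0, 30.0, None, 40.0],
    "hours.per.week": [40.0, None, 50.0, 60.0],
})
toy_test = pd.DataFrame({
    "age": [None, 25.0],
    "hours.per.week": [38.0, None],
})

# Medians are computed from the training data only (NaN is skipped)
medians = toy_train.median()

# fillna with a Series fills each column with its matching value
toy_train_filled = toy_train.fillna(medians)
toy_test_filled = toy_test.fillna(medians)

print(toy_train_filled)
print(toy_test_filled)
```

Using the training medians for both frames keeps the test data from influencing the preprocessing.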

In [ ]:

# The baseline skips any further preprocessing and feature engineering

In [24]:

# Convert the target values
# <=50K -> 0
# >50K -> 1
y_train['income'] = y_train['income'].replace('<=50K', 0)
y_train['income'] = y_train['income'].replace('>50K', 1)
y_train.head()

Out[24]:

	id	income
0	3331	1
1	19749	0
2	1157	0
3	693	1
4	12522	0

In [ ]:

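# Alternative way to build the target, from the original string labels.
# Run this instead of the replace() cell above: once the column holds 0/1,
# the comparison with '>50K' is False everywhere and y would be all zeros.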
y = (y_train['income'] == '>50K').astype(int)
y[:3]
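The two encodings above are alternatives, not steps to chain. A minimal sketch on a made-up three-row Series shows both options and what happens when the comparison is applied to labels that were already converted:

```python
import pandas as pd

# Made-up labels standing in for y_train['income']
labels = pd.Series(["<=50K", ">50K", "<=50K"], name="income")

# Option 1: explicit mapping from string label to integer
encoded_map = labels.map({"<=50K": 0, ">50K": 1})

# Option 2: boolean comparison on the original strings
encoded_cmp = (labels == ">50K").astype(int)

# Comparing already-encoded integers with a string matches nothing
encoded_twice = (encoded_map == ">50K").astype(int)

print(encoded_map.tolist())
print(encoded_cmp.tolist())
print(encoded_twice.tolist())
```

Checking `value_counts()` after encoding is a cheap way to catch the all-zeros case.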

In [ ]:

# Check the data

Machine learning model¶

In [31]:

# Random forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(n_train, y_train['income'])
pred = rf.predict(n_test)
# That is all a baseline needs: fit, then predict

In [ ]:

# Check the data size
len(X_test)

In [36]:

# Inspect the predictions and create the csv file
pred[:10]
submit = pd.DataFrame(
            {
                'id':X_test['id'],
                'income':pred
            })
submit.to_csv('00000.csv', index = False)
### Without index=False the index is written as an extra column, so always set it!
# Write and save the file in the format given in the example
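Reading the saved file back is a quick way to confirm that it has the expected columns and row count before submitting. A minimal sketch, assuming a two-column `id` / `income` format as above; the ids, predictions, and file name are placeholders, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical ids and predictions standing in for X_test['id'] and pred
toy_submit = pd.DataFrame({"id": [101, 102, 103], "income": [0, 1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_submit.csv")
    toy_submit.to_csv(path, index=False)
    # Read the file back to confirm its shape and columns
    check = pd.read_csv(path)

print(check.shape)
print(list(check.columns))
```

If `index=False` were omitted, the file read back would contain an extra unnamed column.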

In [ ]:

# Check the data (y_train)

In [37]:

# Evaluation with accuracy (the test-taker has no access to y_test)
from sklearn.metrics import accuracy_score
y_test = pd.read_csv(r"C:\Users\Master\Desktop\데이터자격증\빅분기_실기\퇴근후딴짓\data_atype_y\y_test.csv")
ans = (y_test['income'] == '>50K').astype(int)
accuracy_score(ans, pred)
# accuracy_score(y_true, y_pred)

Out[37]:

0.8087196806877495

The test-taker cannot run the evaluation above, so hold out validation data and evaluate on that instead¶
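For Problem 1 the same idea can be sketched with accuracy as the metric. The example below uses synthetic data from `make_classification` in place of `n_train` and `y_train['income']`, so the printed score says nothing about the real dataset. The split ratio and `random_state` values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for n_train / y_train['income']
X_toy, y_toy = make_classification(n_samples=500, n_features=6, random_state=0)

# Hold out 20% of the labelled data as a validation set
X_tr_toy, X_val_toy, y_tr_toy, y_val_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

model = RandomForestClassifier(random_state=0)
model.fit(X_tr_toy, y_tr_toy)

# Accuracy on data the model did not see during training
val_acc = accuracy_score(y_val_toy, model.predict(X_val_toy))
print(val_acc)
```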

Problem 2¶

"<=50K -> 0"
">50K -> 1"
Metric: roc_auc. The value to predict is a probability.

Validation split¶

In [88]:

# Split into training and validation data
from sklearn.model_selection import train_test_split
#train_test_split(X_train, y_train, test_size=0.1, random_state=2022)
# random_state sets the random seed
X_tr, X_val, y_tr, y_val = train_test_split(n_train, y_train.iloc[:,1], test_size=0.1, random_state=2022)
# Remember the order of the returned variables

In [89]:

# Data shapes
X_tr.shape, X_val.shape, y_tr.shape, y_val.shape

Out[89]:

((26373, 6), (2931, 6), (26373,), (2931,))
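The target here is imbalanced (roughly three `<=50K` rows for every `>50K` row), so a purely random split can shift the class ratio slightly between the two parts. One optional variation, not used in this post, is the `stratify` argument of `train_test_split`. A minimal sketch on made-up labels with a 75/25 ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 75% class 0, 25% class 1 (made-up data)
y_toy = np.array([0] * 750 + [1] * 250)
X_toy = np.arange(1000).reshape(-1, 1)

# Return order is X_train, X_test, y_train, y_test
X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=2022, stratify=y_toy
)

# With stratify, both parts keep the original share of class 1
print(y_a.mean(), y_b.mean())
```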

In [90]:

# Decision tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_tr, y_tr)
pred1 = dt.predict_proba(X_val)
# Use predict_proba when probabilities are required
pred1
# Each row holds [P(class 0), P(class 1)], i.e. [P('<=50K'), P('>50K')]

Out[90]:

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [1., 0.],
       [0., 1.]])
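The column order of `predict_proba` follows the fitted model's `classes_` attribute, so looking up the column of class 1 there is safer than assuming it. A minimal sketch on a four-row toy dataset (made up for illustration); it also shows why a fully grown decision tree tends to return only 0.0 and 1.0, since its leaves end up pure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny separable toy data
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)

# Columns of predict_proba are ordered like clf.classes_
print(clf.classes_)
pos_col = list(clf.classes_).index(1)

# Probability of the positive class only
pos_proba = proba[:, pos_col]
print(pos_proba)
```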

In [91]:

# Random forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_tr, y_tr)
pred2 = rf.predict_proba(X_val)
# Use predict_proba when probabilities are required
pred2
# Each row holds [P(class 0), P(class 1)], i.e. [P('<=50K'), P('>50K')]

Out[91]:

array([[0.96, 0.04],
       [1.  , 0.  ],
       [0.56, 0.44],
       ...,
       [0.92, 0.08],
       [0.98, 0.02],
       [0.01, 0.99]])

In [92]:

!pip install xgboost

Requirement already satisfied: xgboost in c:\users\master\anaconda3\lib\site-packages (2.0.0)
Requirement already satisfied: scipy in c:\users\master\anaconda3\lib\site-packages (from xgboost) (1.10.0)
Requirement already satisfied: numpy in c:\users\master\anaconda3\lib\site-packages (from xgboost) (1.23.5)

In [93]:

# XGBoost
from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(X_tr, y_tr)
pred3 = xgb.predict_proba(X_val)
# Use predict_proba when probabilities are required
pred3[:,1]
# Rows of pred3 hold [P('<=50K'), P('>50K')]; [:,1] keeps only P('>50K')

Out[93]:

array([0.03814121, 0.00552378, 0.47223645, ..., 0.15650894, 0.03893219,
       0.99036294], dtype=float32)

In [95]:

# Compare ROC AUC on the validation data (decision tree, random forest, XGBoost)
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_val, pred1[:,1]))
print(roc_auc_score(y_val, pred2[:,1]))
print(roc_auc_score(y_val, pred3[:,1]))

0.6981986068117305
0.8495374307532011
0.8859038904196574
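ROC AUC measures how well the scores rank positives above negatives, which is why the class-1 probability is passed in rather than a 0/1 label. A minimal sketch on four made-up rows compares the two. The 0.5 threshold is an arbitrary choice for the illustration.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.4, 0.9]  # made-up predicted probabilities of class 1
hard = [1 if s >= 0.5 else 0 for s in scores]  # thresholded labels: [0, 1, 0, 1]

# Probabilities keep the ranking information
auc_scores = roc_auc_score(y_true, scores)

# Hard labels collapse the ranking into two levels
auc_hard = roc_auc_score(y_true, hard)

print(auc_scores)  # 0.75
print(auc_hard)    # 0.5
```

Three of the four positive/negative pairs are ranked correctly by the probabilities (0.75), while the thresholded labels leave two pairs tied and one reversed (0.5).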

In [96]:

# Predict on the test data with the best model (XGBoost) and create the csv file
pred3 = xgb.predict_proba(n_test)
submit = pd.DataFrame(
            {
                'id':X_test['id'],
                'income':pred3[:,1]
            })
submit.to_csv('1111.csv', index = False)

Evaluation¶

This part is not available to the test-taker

In [ ]:

from sklearn.metrics import roc_auc_score
y_test = pd.read_csv("y_test.csv")
ans = (y_test['income'] != '<=50K').astype(int)
# pred holds 1-D class labels from Problem 1; the test-set probabilities are in pred3
roc_auc_score(ans, pred3[:,1])
