Binary Classification
In [ ]:
import warnings
warnings.filterwarnings('ignore')
Data for Practice
- pandas DataFrame
- Default.csv
In [ ]:
import pandas as pd
DF = pd.read_csv('https://raw.githubusercontent.com/rusita-ai/pyData/master/Default.csv')
DF.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 default 10000 non-null object
1 student 10000 non-null object
2 balance 10000 non-null float64
3 income 10000 non-null float64
dtypes: float64(2), object(2)
memory usage: 312.6+ KB
In [ ]:
DF.head()
Out[ ]:
| | default | student | balance | income |
|---|---|---|---|---|
| 0 | No | No | 729.526495 | 44361.62507 |
| 1 | No | Yes | 817.180407 | 12106.13470 |
| 2 | No | No | 1073.549164 | 31767.13895 |
| 3 | No | No | 529.250605 | 35704.49394 |
| 4 | No | No | 785.655883 | 38463.49588 |
I. Exploratory Data Analysis
1) Frequency Analysis
In [ ]:
DF.default.value_counts()
Out[ ]:
No 9667
Yes 333
Name: default, dtype: int64
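The counts above show a strong class imbalance: 9,667 "No" versus 333 "Yes", roughly 29:1. A quick way to see this as proportions is `value_counts(normalize=True)`. As a self-contained sketch, a stand-in Series with the same counts is used here; on the real data the equivalent call would be `DF.default.value_counts(normalize=True)`:

```python
import pandas as pd

# Stand-in Series mirroring the class counts above (9667 'No', 333 'Yes')
s = pd.Series(['No'] * 9667 + ['Yes'] * 333)

# normalize=True turns raw counts into proportions
ratios = s.value_counts(normalize=True)
print(ratios)  # No ~0.9667, Yes ~0.0333
```

With only ~3.3% positives, overall accuracy alone will be misleading, which is why the later sections look at per-class precision and recall.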
2) Visualizing the Distribution
In [ ]:
import matplotlib.pyplot as plt
plt.figure(figsize = (9, 6))
plt.boxplot((DF[DF.default == 'No'].balance,
             DF[DF.default == 'Yes'].balance),
            labels = ('No', 'Yes'))
plt.show()
II. Data Preprocessing
1) X, y
In [ ]:
X = DF[['balance']]
y = DF['default']
2) Train/Test Split
- 7:3
In [ ]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.3,
random_state = 2045)
print('Train Data : ', X_train.shape, y_train.shape)
print('Test Data : ', X_test.shape, y_test.shape)
Train Data : (7000, 1) (7000,)
Test Data : (3000, 1) (3000,)
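Given the class imbalance seen earlier, a plain random split can leave the test set with a slightly different "Yes" ratio than the training set. A common variation, not used in this notebook but worth knowing, is to pass `stratify=y` so both splits preserve the class proportions. A minimal sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 970 'No' vs 30 'Yes' (stand-ins for the real y)
X = np.arange(1000).reshape(-1, 1)
y = np.array(['No'] * 970 + ['Yes'] * 30)

# stratify=y keeps the ~3% 'Yes' ratio in both the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=2045)
print(len(y_te), (y_te == 'Yes').sum())  # 300 test rows, ~9 of them 'Yes'
```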
III. Model
1) Build the Model with Train_Data
In [ ]:
from sklearn.linear_model import LogisticRegression
Model_lr = LogisticRegression()
Model_lr.fit(X_train, y_train)
Out[ ]:
LogisticRegression()
2) Apply the Model to Test_Data
In [ ]:
y_hat = Model_lr.predict(X_test)
- y_hat
In [ ]:
y_hat
Out[ ]:
array(['No', 'No', 'No', ..., 'No', 'No', 'No'], dtype=object)
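`predict` applies a 0.5 threshold to the class probabilities, and on data this imbalanced most probabilities sit well below it, which is why the array above is dominated by 'No'. The probabilities themselves can be inspected with `predict_proba`. A self-contained sketch on synthetic data standing in for balance/default (balance expressed in $1000s here only to keep the solver well behaved):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: low balances mostly 'No', high balances mostly 'Yes'
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.8, 0.3, 200),
                    rng.normal(1.9, 0.3, 40)]).reshape(-1, 1)
y = np.array(['No'] * 200 + ['Yes'] * 40)

model = LogisticRegression().fit(X, y)

# Columns of predict_proba follow model.classes_ (sorted alphabetically)
proba = model.predict_proba(X[:3])
print(model.classes_)  # ['No' 'Yes']
print(proba)           # each row sums to 1
```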
IV. Model Evaluation
1) Confusion Matrix
- With "No" (non-default) as the positive label
In [ ]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_hat)
Out[ ]:
array([[2888,    8],
       [  72,   32]])
- With "Yes" (default) as the positive label
In [ ]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_hat, labels = ['Yes', 'No'])
Out[ ]:
array([[  32,   72],
       [   8, 2888]])
2) Accuracy, Precision, Recall – "No"
In [ ]:
from sklearn.metrics import accuracy_score, precision_score, recall_score
print(accuracy_score(y_test, y_hat))
print(precision_score(y_test, y_hat, pos_label="No"))
print(recall_score(y_test, y_hat, pos_label="No"))
0.9733333333333334
0.9756756756756757
0.9972375690607734
3) Accuracy, Precision, Recall – "Yes"
In [ ]:
from sklearn.metrics import accuracy_score, precision_score, recall_score
print(accuracy_score(y_test, y_hat))
print(precision_score(y_test, y_hat, pos_label="Yes"))
print(recall_score(y_test, y_hat, pos_label="Yes"))
0.9733333333333334
0.8
0.3076923076923077
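These numbers can be verified by hand against the "Yes"-oriented confusion matrix above, reading off TP = 32, FN = 72, FP = 8, TN = 2888:

```python
# Counts read off the 'Yes'-oriented confusion matrix above
TP, FN, FP, TN = 32, 72, 8, 2888

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # (32+2888)/3000 ≈ 0.9733
precision = TP / (TP + FP)                    # 32/40  = 0.8
recall    = TP / (TP + FN)                    # 32/104 ≈ 0.3077
print(accuracy, precision, recall)
```

The model rarely flags a defaulter (low recall), even though the flags it does raise are usually correct (decent precision), a typical symptom of training on imbalanced classes.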
4) F1_Score – "No" (non-default)
In [ ]:
from sklearn.metrics import f1_score
f1_score(y_test, y_hat, pos_label="No")
Out[ ]:
0.9863387978142076
5) F1_Score – "Yes" (default)
In [ ]:
from sklearn.metrics import f1_score
f1_score(y_test, y_hat, pos_label="Yes")
Out[ ]:
0.4444444444444444
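The F1 score is the harmonic mean of precision and recall, which explains why the strong 0.8 precision for "Yes" is dragged down to 0.444 by the 0.308 recall:

```python
# 'Yes'-class precision and recall from the sections above
precision, recall = 0.8, 32 / 104

# Harmonic mean: F1 = 2PR / (P + R); dominated by the smaller of the two
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # ≈ 0.4444
```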
6) Classification Report
In [ ]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_hat,
target_names = ('No', 'Yes'),
digits = 5))
              precision    recall  f1-score   support

          No    0.97568   0.99724   0.98634      2896
         Yes    0.80000   0.30769   0.44444       104

    accuracy                        0.97333      3000
   macro avg    0.88784   0.65246   0.71539      3000
weighted avg    0.96959   0.97333   0.96755      3000