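[ML] Evaluating a Binary Classification Estimator

Evaluating a Binary Classifier

We use a logistic model for binary classification. The dataset used to build the model is pima-indians-diabetes.csv, which can be downloaded from Kaggle. Because the file is in .csv format, it can be loaded with pandas.read_csv("path").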

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.metrics import f1_score, confusion_matrix, precision_recall_curve, roc_curve
from sklearn.preprocessing import StandardScaler, Binarizer
from sklearn.linear_model import LogisticRegression

data = pd.read_csv('pima-indians-diabetes.csv')
data.head(3)

	preg	plas	pres	skin	mass	pedi	age	class
0	6	148	72	35	33.6	0.627	50	1
1	1	85	66	29	26.6	0.351	31	0
2	8	183	64	0	23.3	0.672	32	1

The label consists of two classes, with the following frequencies.

data["class"].value_counts()

class
0    500
1    268
Name: count, dtype: int64

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   preg    768 non-null    int64  
 1   plas    768 non-null    int64  
 2   pres    768 non-null    int64  
 3   skin    768 non-null    int64  
 4   test    768 non-null    int64  
 5   mass    768 non-null    float64
 6   pedi    768 non-null    float64
 7   age     768 non-null    int64  
 8   class   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

The output above shows that the data contain no null values.

We build the logistic model with the sklearn.linear_model.LogisticRegression() class. Before fitting the model, we split the data into training and test sets.

Xtr, Xte, ytr, yte=train_test_split(data.iloc[:,:-1], data["class"], test_size=0.3)

lrClf=LogisticRegression(max_iter=10000, random_state=3).fit(Xtr, ytr)
predTr=lrClf.predict(Xtr)
predTr[:3]

array([0, 0, 0], dtype=int64)
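The split above is random and is not seeded, so the numbers in this post change from run to run. The following is a minimal sketch of a reproducible, stratified split. It runs on synthetic stand-in data with the same class counts as the real file, and the choices stratify=y and random_state=0 are assumptions made for illustration, not settings used in this post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 500 negatives and 268 positives,
# matching the class counts reported above. The features are random noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))
y = np.array([0] * 500 + [1] * 268)

# stratify=y keeps the class ratio in both subsets;
# random_state fixes the shuffle so the split is repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

print(len(y_tr), len(y_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

With an imbalanced label such as this one, stratification keeps the share of positives roughly equal in the training and test sets, which makes precision and recall on the two sets easier to compare.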

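To evaluate this model we examine the confusion matrix, accuracy, precision, recall, F1 score, and the ROC curve. The following function computes all of these except the ROC curve, which is treated later.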

def clf_eval(y, y_pre):
    # Confusion matrix: rows are observed classes, columns are predicted classes
    confMat=confusion_matrix(y, y_pre)
    acc=accuracy_score(y, y_pre)
    preci=precision_score(y, y_pre)
    recall=recall_score(y, y_pre)
    f1=f1_score(y, y_pre)
    # Collect the four scalar metrics in a labeled Series
    re=pd.Series([acc, preci, recall, f1], index=["accuracy","precision","recall", "F1 score"])
    return(confMat, re)

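Accuracy, precision, recall, and F1 score are defined as follows.

Table 1. Contingency table
Observed\Predicted	P	N
P	TP	FN
N	FP	TN

Accuracy: the proportion of all samples that are classified correctly.
Precision: the proportion of predicted positives (targets) that are actually positive. Intuitively, it measures the classifier's ability not to label a negative (non-target) sample as positive.
Recall: the proportion of actual positives (targets) that are predicted as positive. Intuitively, it measures the classifier's ability to find the positive samples.
F1 score: the harmonic mean of precision and recall.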

\begin{align}\text{Accuracy}&=\frac{\text{TP + TN}}{\text{TP + FP + TN + FN}}\\\tag{Eq. 1}\text{Precision}&=\frac{\text{TP}}{\text{TP + FP}}\\\text{Recall}&=\frac{\text{TP}}{\text{TP + FN}}\\\text{F1-score}&=\frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}} \end{align}

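The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the values. For example, the harmonic mean of two numbers a and b is computed as in Eq. 2.

$$\tag{Eq. 2}\text{harmonic mean}=\frac{2}{\frac{1}{a} + \frac{1}{b}}=\frac{2ab}{a+b}$$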
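To see why the F1 score uses the harmonic mean rather than the arithmetic mean, the short sketch below compares the two. The precision and recall values are made up for illustration, and harmonic_mean is a hypothetical helper, not a function from any library.

```python
def harmonic_mean(a, b):
    """Reciprocal of the arithmetic mean of 1/a and 1/b (both must be positive)."""
    return 2 / (1 / a + 1 / b)

# Made-up values: a classifier with high precision but very low recall.
precision, recall = 0.9, 0.1

arithmetic = (precision + recall) / 2
f1 = harmonic_mean(precision, recall)

print(f"arithmetic mean = {arithmetic:.3f}")
print(f"harmonic mean (F1) = {f1:.3f}")
```

The harmonic mean is pulled toward the smaller of the two values, so a classifier cannot earn a high F1 score by doing well on only one of precision and recall.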

train_eval=clf_eval(ytr, predTr)
train_eval

(array([[313,  37],
        [ 78, 109]], dtype=int64),
 accuracy     0.785847
 precision    0.746575
 recall       0.582888
 F1 score     0.654655
 dtype: float64)
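As a check on the definitions above, the four metrics can be recomputed by hand from the training confusion matrix just shown. Note that scikit-learn orders rows and columns by label (0, then 1), so the matrix reads [[TN, FP], [FN, TP]], whereas the contingency table above lists P first. The sketch below uses plain Python only.

```python
# Training confusion matrix reported above, in scikit-learn's layout:
# rows are observed classes (0, 1), columns are predicted classes (0, 1).
conf_mat = [[313, 37],
            [78, 109]]
(tn, fp), (fn, tp) = conf_mat

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.6f} precision={precision:.6f} "
      f"recall={recall:.6f} f1={f1:.6f}")
```

The printed values agree with the Series returned by clf_eval for the training data.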

preTe=lrClf.predict(Xte)
test_eval=clf_eval(yte, preTe)
test_eval

(array([[131,  19],
        [ 32,  49]], dtype=int64),
 accuracy     0.779221
 precision    0.720588
 recall       0.604938
 F1 score     0.657718
 dtype: float64)

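We now plot precision and recall against the threshold. A binary classifier can be written as Eq. 3.

$$ \tag{Eq. 3} f(x_i)=\begin{cases}1 & \hat{p}(Y=1 | x_i) > c\\ 0& \text{otherwise}\end{cases}$$

In other words, as Eq. 3 shows, the model predicts 1 (P) when the estimated probability of class 1 given the features x_i exceeds the threshold c, and 0 (N) otherwise. Most scikit-learn classifiers provide a predict_proba() method that returns the predicted probabilities of classes 0 and 1. Linear classifiers such as SGDClassifier() provide a decision_function() method that returns the value of the underlying linear model. Both outputs are per-sample scores, and the classification boundary is set by comparing them with a threshold. For example, precision_recall_curve(y, y_score) returns the precision and recall obtained as the threshold varies. Its y_score argument is either the probability of class 1 taken from predict_proba() or the output of decision_function().

precision_recall_curve(y, y_score) returns precision, recall, and thresholds, where thresholds holds the distinct values of y_score sorted in ascending order. The precision and recall arrays each have one more element than thresholds, because the function appends a final point (precision 1, recall 0) that has no corresponding threshold.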
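The thresholding rule above can also be applied by hand, which makes the trade-off between precision and recall visible before any curve is drawn. The following is a minimal sketch on made-up labels and scores. The helpers predict_with_threshold and precision_recall are hypothetical names introduced here and are not part of scikit-learn.

```python
import numpy as np

# Made-up labels and positive-class scores, for illustration only.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90])

def predict_with_threshold(scores, c):
    """Predict 1 when the score exceeds the threshold c, otherwise 0."""
    return (scores > c).astype(int)

def precision_recall(y, y_pred):
    """Return (precision, recall) computed from TP, FP, and FN counts."""
    tp = np.sum((y == 1) & (y_pred == 1))
    fp = np.sum((y == 0) & (y_pred == 1))
    fn = np.sum((y == 1) & (y_pred == 0))
    return tp / (tp + fp), tp / (tp + fn)

for c in (0.3, 0.5, 0.75):
    p, r = precision_recall(y_true, predict_with_threshold(y_score, c))
    print(f"c={c}: precision={p:.3f}, recall={r:.3f}")
```

Raising the threshold can only remove predicted positives, so recall never increases, while precision may move in either direction. In the post's own setting the same idea would apply to the second column of predict_proba().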

preTrProba=lrClf.predict_proba(Xtr)
preTrProba[:3]

array([[0.88212769, 0.11787231],
       [0.7058086 , 0.2941914 ],
       [0.95943764, 0.04056236]])

ytr_score=preTrProba[:,1]
precisions, recalls, thresholds=precision_recall_curve( ytr, ytr_score)

np.sort(ytr_score)[:5]

array([0.00224104, 0.00408697, 0.00825182, 0.01150737, 0.01381425])

thresholds[:5]

array([0.00224104, 0.00408697, 0.00825182, 0.01150737, 0.01381425])

precisions.shape, recalls.shape, thresholds.shape

((538,), (538,), (537,))

plt.figure(figsize=(4,3))
limit=thresholds.shape[0]
plt.plot(thresholds, precisions[:limit], color="b", label="precision")
plt.plot(thresholds, recalls[:limit], color="r", label="recall")
plt.xlabel("threshold")
plt.ylabel('values')
plt.xticks(np.arange(0, 1.1, 0.1))
plt.yticks(np.arange(0, 1.1, 0.1))
plt.legend(loc="best")
plt.show()

Wrapping the steps above in a function gives the following.

def precisionRecallFigure(y, y_score):
    precisions, recalls, thresholds=precision_recall_curve(y, y_score)
    plt.figure(figsize=(4,3))
    limit=thresholds.shape[0]
    plt.plot(thresholds, precisions[:limit], color="b", label="precision")
    plt.plot(thresholds, recalls[:limit], color="r", label="recall")
    plt.xlabel("threshold")
    plt.ylabel('values')
    plt.xticks(np.arange(0, 1.1, 0.1))
    plt.yticks(np.arange(0, 1.1, 0.1))
    plt.legend(loc="best")
    plt.show()
    re=pd.DataFrame([precisions, recalls], index=["precision", "recall"]).T
    return(re)

preTeProba=lrClf.predict_proba(Xte)
yte_score=preTeProba[:,1]
yte_eval=precisionRecallFigure(yte, yte_score)

yte_eval.head(3)

	precision	recall
0	0.350649	1.0
1	0.352174	1.0
2	0.353712	1.0

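The ROC (Receiver Operating Characteristic) curve plots sensitivity (recall) against the proportion of actual negatives that are predicted as positive (1 - specificity). The function roc_curve(y, y_score, drop_intermediate=True) returns 1 - specificity, sensitivity, and the thresholds.

Sensitivity, also called recall, is the proportion of actual positives that are predicted as positive.
Specificity is the proportion of actual negatives that are predicted as negative.
FPR (false positive rate) is the proportion of actual negatives that are predicted as positive (false positives).

From the contingency table in Table 1, sensitivity and specificity are computed as in Eq. 4.

\begin{align}\text{sensitivity(recall)}&=\frac{\text{TP}}{\text{TP}+\text{FN}}\\\tag{Eq. 4}\text{specificity}&=\frac{\text{TN}}{\text{TN}+\text{FP}}\\\text{FPR}&=1-\text{specificity}\\&=\frac{\text{FP}}{\text{TN}+\text{FP}} \end{align}

The returned thresholds are the values of y_score sorted in descending order. The first threshold, thresh[0], is set above the largest score (in the output below it equals max(y_score) + 1; newer scikit-learn releases may report np.inf here instead), so no sample is predicted positive there and both TP and FP are 0. As the threshold decreases, more samples are predicted positive, so sensitivity and 1 - specificity are returned in ascending order. The drop_intermediate argument of roc_curve() controls whether some thresholds are dropped. Its default is True, in which case they are removed from the result.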

fpr, sen, thresh=roc_curve(ytr, ytr_score, drop_intermediate=False)
thresh.shape, fpr.shape, sen.shape

((538,), (538,), (538,))

thresh[:10]

array([1.98998104, 0.98998104, 0.98106614, 0.9614324 , 0.9580388 ,
       0.95603086, 0.95416965, 0.95221597, 0.94948189, 0.94771266])

np.sort(ytr_score)[::-1][:10]

array([0.98998104, 0.98106614, 0.9614324 , 0.9580388 , 0.95603086,
       0.95416965, 0.95221597, 0.94948189, 0.94771266, 0.93416477])
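The points that roc_curve() returns can be reproduced by sweeping the threshold from high to low and counting. The sketch below does this with NumPy on made-up data. roc_points is a hypothetical helper written for illustration, and it follows the convention that a score greater than or equal to the threshold counts as a positive prediction.

```python
import numpy as np

# Made-up labels and positive-class scores, for illustration only.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90])

def roc_points(y, scores):
    """Return (fpr, tpr, thresholds); thresholds are the distinct scores in
    descending order, and fpr/tpr carry one extra leading point at (0, 0)."""
    thresholds = np.sort(np.unique(scores))[::-1]
    n_pos, n_neg = np.sum(y == 1), np.sum(y == 0)
    fpr, tpr = [0.0], [0.0]        # threshold above every score: no positives predicted
    for c in thresholds:
        pred = scores >= c          # score >= threshold counts as predicted positive
        tpr.append(np.sum(pred & (y == 1)) / n_pos)
        fpr.append(np.sum(pred & (y == 0)) / n_neg)
    return np.array(fpr), np.array(tpr), thresholds

fpr, tpr, thresholds = roc_points(y_true, y_score)
for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
```

The curve starts at (0, 0), where nothing is predicted positive, and ends at (1, 1), where everything is. Both coordinates are non-decreasing along the way, which is why the arrays come back in ascending order.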

plt.figure(figsize=(4,3))
plt.plot(fpr, sen, color="b", label="ROC")
plt.plot([0,1], [0,1], color="r", ls="dotted", label="random")
plt.xlabel("1-specificity")
plt.ylabel('sensitivity')
plt.xticks(np.arange(0, 1.1, 0.1))
plt.yticks(np.arange(0, 1.1, 0.1))
plt.legend(loc="best")
plt.show()

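The whole plot has area 1, and the estimator is evaluated by the share of that area lying under the ROC curve (Area Under the Curve, AUC). The dotted diagonal corresponds to an AUC of 0.5, the value expected from a random binary classifier such as a coin toss. A useful estimator (classifier) generally lies above this line. The AUC of the ROC curve above can be obtained with roc_auc_score(y, y_score).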

auc=roc_auc_score(ytr, ytr_score)
auc

0.8389915966386554
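The AUC can be read in two equivalent ways: as the area under the ROC curve, and as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The second reading is a standard property of the AUC rather than something stated in this post. The sketch below checks both on made-up data, using sklearn.metrics.auc for the trapezoidal area.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Made-up labels and positive-class scores, for illustration only.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90])

# 1) Area under the ROC curve by the trapezoidal rule.
fpr, tpr, _ = roc_curve(y_true, y_score)
area_trapezoid = auc(fpr, tpr)

# 2) The value reported by roc_auc_score.
area_direct = roc_auc_score(y_true, y_score)

# 3) Fraction of (positive, negative) pairs ranked correctly; ties count as half.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
diff = pos[:, None] - neg[None, :]
area_pairs = np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

print(area_trapezoid, area_direct, area_pairs)
```

All three numbers agree, which explains why 0.5 is the baseline: a scorer that ranks positives above negatives only half of the time is no better than a coin toss.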

The steps above can be wrapped in a function.

def rocFigure(y, y_score):
    fpr, sen, thresh=roc_curve(y, y_score)
    plt.figure(figsize=(4,3))
    plt.plot(fpr, sen, color="b", label="ROC")
    plt.plot([0,1], [0,1], color="r", ls="dotted", label="random")
    plt.xlabel("1-specificity")
    plt.ylabel('sensitivity')
    plt.xticks(np.arange(0, 1.1, 0.1))
    plt.yticks(np.arange(0, 1.1, 0.1))
    plt.legend(loc="best")
    plt.show()
    # Compute the AUC from the function arguments, not from the global training data
    auc=roc_auc_score(y, y_score)
    return({"AUC":auc})

rocFigure(ytr, ytr_score)

{'AUC': 0.8389915966386554}

The test set can be evaluated in the same way with rocFigure(yte, yte_score).
