Machine Learning 05: Classification Model Evaluation Methods#

Data Preparation#

The dataset contains 10 feature columns and one class-label column:

  • Columns 1–6: the customer's recent billing history. (features)

  • Column 7: the customer's age. (feature)

  • Column 8: the customer's sex. (feature)

  • Column 9: the customer's education level. (feature)

  • Column 10: the customer's marital status. (feature)

  • Column 11: the customer's cardholder risk status. (class label: LOW, HIGH)

import pandas as pd

df = pd.read_csv("../data/credit_risk_train.csv")  # Load the data file
df.head()
   BILL_1  BILL_2  BILL_3  BILL_4  BILL_5  BILL_6  AGE     SEX        EDUCATION  MARRIAGE  RISK
0       0       0       0       0       0       0   37  Female  Graduate School   Married   LOW
1    8525    5141    5239    7911   17890   10000   25    Male      High School    Single  HIGH
2     628     662     596     630     664     598   39    Male  Graduate School   Married  HIGH
3    4649    3964    3281     934     467   12871   41  Female  Graduate School    Single  HIGH
4   46300   10849    8857    9658    9359    9554   55  Female      High School   Married  HIGH
df["RISK"].unique()
array(['LOW', 'HIGH'], dtype=object)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

# Map the class labels to numeric values for later computation
df.RISK = df.RISK.replace({"LOW": 0, "HIGH": 1})

# Select the feature columns
train_data = df.iloc[:, :-1]

# One-hot encode the string-typed features
train_data = pd.get_dummies(train_data)

# Standardize the features
train_data = scale(train_data)

# Class labels
train_target = df['RISK']

# Split the data: 70% training, 30% test
X_train, X_test, y_train, y_test = train_test_split(
    train_data, train_target, test_size=0.3, random_state=0)

X_train.shape, X_test.shape, y_train.shape, y_test.shape
((14000, 16), (6000, 16), (14000,), (6000,))
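To see what `pd.get_dummies` and `scale` are doing here, the following is a minimal sketch on a toy frame (the column values are hypothetical, standing in for the credit data):

```python
import pandas as pd
from sklearn.preprocessing import scale

# Toy frame with one numeric and one string column (hypothetical values)
toy = pd.DataFrame({
    "BILL_1": [100, 200, 300],
    "SEX": ["Female", "Male", "Male"],
})

# Numeric columns pass through; each string column becomes one
# indicator column per category: SEX -> SEX_Female, SEX_Male
encoded = pd.get_dummies(toy)
print(list(encoded.columns))  # ['BILL_1', 'SEX_Female', 'SEX_Male']

# Standardize every column to zero mean and unit variance
scaled = scale(encoded)
print(scaled.shape)  # (3, 3)
```

This is why the 10 original columns expand to 16 after encoding: each categorical column contributes one column per distinct value.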

Logistic Regression Implementation#

from sklearn.linear_model import LogisticRegression

# Define the logistic regression model
model = LogisticRegression(solver='lbfgs')

# Fit the model on the training data
model.fit(X_train, y_train)
LogisticRegression()
y_pred = model.predict(X_test)  # Predict from the test-set features
y_pred
array([1, 1, 1, ..., 1, 1, 1])

Confusion Matrix#

Define the positive and negative classes. Here we take HIGH as the positive class and LOW as the negative class, which gives the following confusion matrix:

Credit Risk    Predicted HIGH        Predicted LOW
Actual HIGH    True Positive (TP)    False Negative (FN)
Actual LOW     False Positive (FP)   True Negative (TN)

Meaning of the table:

  • TP: positive samples predicted as positive → correct prediction

  • TN: negative samples predicted as negative → correct prediction

  • FP: negative samples predicted as positive → incorrect prediction

  • FN: positive samples predicted as negative → incorrect prediction
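The four cells can be read off directly with sklearn's `confusion_matrix`. A minimal sketch on hypothetical toy labels (1 = HIGH, 0 = LOW), not the credit data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = positive (HIGH), 0 = negative (LOW)
y_true = np.array([1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0])

# sklearn orders rows/columns by label value, so the flattened
# matrix is (TN, FP, FN, TP)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)  # 2 2 1 1
```

All of the metrics below are simple ratios of these four counts.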

Accuracy#

\[ Accuracy = \frac{TP+TN}{TP+TN+FP+FN} \]
import numpy as np

def get_accuracy(test_labels, pred_labels):
    # Implements the accuracy formula above
    correct = np.sum(test_labels == pred_labels)  # number of correct predictions
    n = len(test_labels)  # total number of test samples
    acc = correct / n
    return acc

get_accuracy(y_test, y_pred)
0.7678333333333334
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)  # Pass the true and predicted labels
0.7678333333333334
model.score(X_test, y_test)
0.7678333333333334

Precision#

The fraction of samples predicted positive that are actually positive.

\[ Precision = \frac{TP}{TP+FP} \]
from sklearn.metrics import precision_score

precision_score(y_test, y_pred)
0.7678333333333334
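The formula can be checked against `precision_score` by hand. A small sketch on hypothetical toy labels (not the credit data):

```python
import numpy as np
from sklearn.metrics import precision_score

y_true = np.array([1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives: 2
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives: 1
print(tp / (tp + fp))                   # 0.666...
print(precision_score(y_true, y_pred))  # same value
```

Note that on the credit model above, precision happens to equal accuracy: the predictions are all 1 (HIGH), so TP + FP is the whole test set.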

Recall#

The fraction of actual positive samples that are correctly predicted as positive.

\[ Recall = \frac{TP}{TP+FN} \]

from sklearn.metrics import recall_score

recall_score(y_test, y_pred)
1.0
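Likewise, recall is just TP over TP + FN, sketched on the same kind of hypothetical toy labels:

```python
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives: 2
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives: 1
print(tp / (tp + fn))                # 0.666...
print(recall_score(y_true, y_pred))  # same value
```

The recall of 1.0 above is another symptom of an all-positive predictor: with no predictions of 0, FN is zero by construction.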

F1 Score#

The F1 score is the harmonic mean of precision and recall.

\[ F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \]

from sklearn.metrics import f1_score

f1_score(y_test, y_pred)
0.8686716319411709
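The harmonic-mean relation can be verified directly against `f1_score` on hypothetical toy labels:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)

# Harmonic mean of precision and recall
print(2 * p * r / (p + r))
print(f1_score(y_true, y_pred))  # matches the hand computation
```

The credit model's numbers are consistent with this too: 2 × 0.7678 × 1.0 / (0.7678 + 1.0) ≈ 0.8687.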

ROC Curve#

A classification model usually sets a threshold: samples scoring above it are predicted positive, and those below it negative. Lowering the threshold assigns more samples to the positive class, which improves the recognition of positives but also causes more negatives to be misclassified as positive. The ROC curve visualizes this trade-off and is used to judge how good a classifier is.

X axis (false positive rate):

\[ FPR = \frac{FP}{FP+TN} \]

Y axis (true positive rate):

\[ TPR = \frac{TP}{TP+FN} \]

Code example:

import numpy as np
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
%matplotlib inline


# Generate some example data
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# Compute the ROC curve points
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
[Figure: ROC curve for the example data, with the diagonal chance line]
from sklearn.metrics import roc_curve, auc
from matplotlib import pyplot as plt
%matplotlib inline


# Compute decision scores for the test samples
y_score = model.decision_function(X_test)

fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
[Figure: ROC curve of the logistic regression model on the test set]

AUC Computation#

Industrial datasets are usually large, so a common approach is to compute AUC from scores grouped into buckets, as follows:

def auc(data):
    """Compute AUC from score buckets."""
    # Sort the buckets by score, descending
    data_sort = sorted(data.items(), key=lambda x: x[0], reverse=True)
    ack = [x[1][0] for x in data_sort]  # impressions per bucket
    clk = [x[1][1] for x in data_sort]  # clicks (positives) per bucket
    sample_num = sum(ack)
    pos = sum(clk)
    neg = sample_num - pos
    if pos < 1 or neg < 1:
        return 0
    # Sweep the threshold down, accumulating (FPR, TPR) points
    roc_arr = []
    tp = fp = 0
    for i, j in zip(ack, clk):
        tp += j
        fp += (i - j)
        roc_arr.append((float(fp) / neg, float(tp) / pos))
    # Integrate the area under the curve
    auc = 0
    prev_x = 0
    for x, y in roc_arr:
        auc += (x - prev_x) * y
        prev_x = x
    return round(auc, 5)

# Score buckets: [impressions, clicks]
data = {
    "0.1000": [1, 0],
    "0.4000": [1, 0],
    "0.3500": [1, 1],
    "0.8000": [1, 1]
}

auc(data)
0.75
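As a sanity check, expanding the four buckets back into per-sample labels and scores gives the same value with sklearn's `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# The same four samples as the score buckets above
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

print(roc_auc_score(y_true, y_scores))  # 0.75, matching auc(data)
```

The bucketed version trades a little resolution (samples within a bucket share one score) for not having to sort every individual sample, which matters at industrial scale.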