WeChat Official Account: 尤而小屋 · Author: Peter · Editor: Peter
Hi everyone, I'm Peter~
This is the second article on the UCI credit-default dataset: binary classification modeling with LightGBM. The main steps are exploratory data analysis, missing-value checks, feature preprocessing (encoding and scaling), PCA dimensionality reduction, and LightGBM modeling with evaluation.
Step one, as usual, is importing the libraries needed for data processing and modeling:
In 1:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 100)
from IPython.display import display_html
import plotly_express as px
import plotly.graph_objects as go
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams["font.sans-serif"] = ["SimHei"]  # font that supports Chinese labels
plt.rcParams["axes.unicode_minus"] = False    # render minus signs correctly with this font
import seaborn as sns
%matplotlib inline
import missingno as ms
import gc
from datetime import datetime
from sklearn.model_selection import train_test_split,StratifiedKFold,GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from imblearn.under_sampling import ClusterCentroids
from imblearn.over_sampling import KMeansSMOTE, SMOTE
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, auc
from sklearn.metrics import roc_auc_score,precision_recall_curve, confusion_matrix,classification_report
# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn import tree
from pydotplus import graph_from_dot_data
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from catboost import CatBoostClassifier
import lightgbm as lgb
import xgboost as xgb
from scipy import stats
import warnings
warnings.filterwarnings("ignore")
In 2:
df = pd.read_csv("UCI.csv")
df.head()
Out2:
1. Overall data size
The dataset contains 30,000 records and 25 fields:
In 3:
df.shape
Out3:
(30000, 25)
2. Field information
In 4:
df.columns # all column names
Out4:
Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
'default.payment.next.month'],
dtype='object')
Counts of each dtype:
In 5:
df.dtypes # dtype of each column
Out5:
ID int64
LIMIT_BAL float64
SEX int64
EDUCATION int64
MARRIAGE int64
AGE int64
PAY_0 int64
PAY_2 int64
PAY_3 int64
PAY_4 int64
PAY_5 int64
PAY_6 int64
BILL_AMT1 float64
BILL_AMT2 float64
BILL_AMT3 float64
BILL_AMT4 float64
BILL_AMT5 float64
BILL_AMT6 float64
PAY_AMT1 float64
PAY_AMT2 float64
PAY_AMT3 float64
PAY_AMT4 float64
PAY_AMT5 float64
PAY_AMT6 float64
default.payment.next.month int64
dtype: object
In 6:
df.dtypes.value_counts() # count columns per dtype (pd.value_counts is deprecated)
Out6:
float64 13
int64 12
Name: count, dtype: int64
All 25 columns are numeric, split almost evenly between the two dtypes. The last column, default.payment.next.month, is the target we want to predict.
The fields have the following meanings (per the UCI codebook):

- ID: client ID
- LIMIT_BAL: amount of given credit (NT dollars)
- SEX: gender (1 = male, 2 = female)
- EDUCATION: education level (1 = graduate school, 2 = university, 3 = high school, 4 = others)
- MARRIAGE: marital status (1 = married, 2 = single, 3 = others)
- AGE: age in years
- PAY_0 to PAY_6: repayment status in the past months (-1 = paid duly; 1, 2, ... = months of payment delay)
- BILL_AMT1 to BILL_AMT6: bill statement amounts in the past months
- PAY_AMT1 to PAY_AMT6: amounts of the previous payments
- default.payment.next.month: the target (1 = default, 0 = no default)
3. Descriptive statistics (partial view of the fields)
In 7:
df.describe().T # many columns, so the transposed view is easier to read
4. Overall field info
In 8:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 30000 non-null int64
1 LIMIT_BAL 30000 non-null float64
2 SEX 30000 non-null int64
3 EDUCATION 30000 non-null int64
4 MARRIAGE 30000 non-null int64
5 AGE 30000 non-null int64
6 PAY_0 30000 non-null int64
7 PAY_2 30000 non-null int64
8 PAY_3 30000 non-null int64
9 PAY_4 30000 non-null int64
10 PAY_5 30000 non-null int64
11 PAY_6 30000 non-null int64
12 BILL_AMT1 30000 non-null float64
13 BILL_AMT2 30000 non-null float64
14 BILL_AMT3 30000 non-null float64
15 BILL_AMT4 30000 non-null float64
16 BILL_AMT5 30000 non-null float64
17 BILL_AMT6 30000 non-null float64
18 PAY_AMT1 30000 non-null float64
19 PAY_AMT2 30000 non-null float64
20 PAY_AMT3 30000 non-null float64
21 PAY_AMT4 30000 non-null float64
22 PAY_AMT5 30000 non-null float64
23 PAY_AMT6 30000 non-null float64
24 default.payment.next.month 30000 non-null int64
dtypes: float64(13), int64(12)
memory usage: 5.7 MB
For convenience, rename the original default.payment.next.month column to Label:
In 9:
df.rename(columns={"default.payment.next.month":"Label"},inplace=True)
Count the missing values in each column:
In 10:
df.isnull().sum().sort_values(ascending=False)
Out10:
ID 0
BILL_AMT2 0
PAY_AMT6 0
PAY_AMT5 0
PAY_AMT4 0
PAY_AMT3 0
PAY_AMT2 0
PAY_AMT1 0
BILL_AMT6 0
BILL_AMT5 0
BILL_AMT4 0
BILL_AMT3 0
BILL_AMT1 0
LIMIT_BAL 0
PAY_6 0
PAY_5 0
PAY_4 0
PAY_3 0
PAY_2 0
PAY_0 0
AGE 0
MARRIAGE 0
EDUCATION 0
SEX 0
Label 0
dtype: int64
In 11:
# missing-value count per column
total = df.isnull().sum().sort_values(ascending=False)
In 12:
# missing-value percentage per column
percent = (df.isnull().sum() / df.isnull().count() * 100).sort_values(ascending=False)
percent
Out12:
ID 0.0
BILL_AMT2 0.0
PAY_AMT6 0.0
PAY_AMT5 0.0
PAY_AMT4 0.0
PAY_AMT3 0.0
PAY_AMT2 0.0
PAY_AMT1 0.0
BILL_AMT6 0.0
BILL_AMT5 0.0
BILL_AMT4 0.0
BILL_AMT3 0.0
BILL_AMT1 0.0
LIMIT_BAL 0.0
PAY_6 0.0
PAY_5 0.0
PAY_4 0.0
PAY_3 0.0
PAY_2 0.0
PAY_0 0.0
AGE 0.0
MARRIAGE 0.0
EDUCATION 0.0
SEX 0.0
Label 0.0
dtype: float64
Concatenate the counts and percentages for a complete missing-value summary:
In 13:
pd.concat([total, percent],axis=1,keys=["Total","Percent"]).T
In 14:
ms.bar(df,color="blue")
plt.show()
Axis labels can be rotated if needed:
In 15:
# ms.matrix(df, labels=True,label_rotation=45)
# plt.show()
Now for a detailed exploration of the individual fields:
In 16:
df.columns
Out16:
Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'Label'],
dtype='object')
The ID column carries no information for modeling, so drop it:
In 17:
df.drop("ID",inplace=True,axis=1)
Look at the clients' personal attributes: credit limit, education, marital status, and age:
In 18:
df[['LIMIT_BAL', 'EDUCATION', 'MARRIAGE', 'AGE']].describe()
Out18:
|       | LIMIT_BAL      | EDUCATION    | MARRIAGE     | AGE          |
|-------|----------------|--------------|--------------|--------------|
| count | 30000.000000   | 30000.000000 | 30000.000000 | 30000.000000 |
| mean  | 167484.322667  | 1.853133     | 1.551867     | 35.485500    |
| std   | 129747.661567  | 0.790349     | 0.521970     | 9.217904    |
| min   | 10000.000000   | 0.000000     | 0.000000     | 21.000000    |
| 25%   | 50000.000000   | 1.000000     | 1.000000     | 28.000000    |
| 50%   | 140000.000000  | 2.000000     | 2.000000     | 34.000000    |
| 75%   | 240000.000000  | 2.000000     | 2.000000     | 41.000000    |
| max   | 1000000.000000 | 6.000000     | 3.000000     | 79.000000    |
In 19:
df["EDUCATION"].value_counts().sort_values(ascending=False)
Out19:
EDUCATION
2 14030
1 10585
3 4917
5 280
4 123
6 51
0 14
Name: count, dtype: int64
The most common education level is university (EDUCATION = 2).
In 20:
df["MARRIAGE"].value_counts().sort_values(ascending=False)
Out20:
MARRIAGE
2 15964
1 13659
3 323
0 54
Name: count, dtype: int64
The most common marital status is MARRIAGE = 2, which in this dataset's codebook means single (1 = married, 2 = single).
Distribution of LIMIT_BAL:
In 21:
df["LIMIT_BAL"].value_counts().sort_values(ascending=False)
Out21:
LIMIT_BAL
50000.0 3365
20000.0 1976
30000.0 1610
80000.0 1567
200000.0 1528
...
800000.0 2
1000000.0 1
327680.0 1
760000.0 1
690000.0 1
Name: count, Length: 81, dtype: int64
The most frequent credit limit is 50,000.
In 22:
plt.figure(figsize = (14,6))
plt.title('Density Plot of LIMIT_BAL')
sns.set_color_codes("pastel")
sns.histplot(df['LIMIT_BAL'], kde=True, bins=200)  # distplot was removed in recent seaborn
plt.show()
Repayment status in each of the previous months:
In 23:
df[["PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6"]].describe()
Out23:
|       | PAY_0        | PAY_2        | PAY_3        | PAY_4        | PAY_5        | PAY_6        |
|-------|--------------|--------------|--------------|--------------|--------------|--------------|
| count | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 |
| mean  | -0.016700    | -0.133767    | -0.166200    | -0.220667    | -0.266200    | -0.291100    |
| std   | 1.123802     | 1.197186     | 1.196868     | 1.169139     | 1.133187     | 1.149988     |
| min   | -2.000000    | -2.000000    | -2.000000    | -2.000000    | -2.000000    | -2.000000    |
| 25%   | -1.000000    | -1.000000    | -1.000000    | -1.000000    | -1.000000    | -1.000000    |
| 50%   | 0.000000     | 0.000000     | 0.000000     | 0.000000     | 0.000000     | 0.000000     |
| 75%   | 0.000000     | 0.000000     | 0.000000     | 0.000000     | 0.000000     | 0.000000     |
| max   | 8.000000     | 8.000000     | 8.000000     | 8.000000     | 8.000000     | 8.000000     |
Comparing repayment status across the two classes:
In 24:
repay = df[['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'Label']]
repay = pd.melt(repay,
id_vars="Label",
var_name="Payment Status",
value_name="Delay(Month)"
)
repay.head()
Out24:
|   | Label | Payment Status | Delay(Month) |
|---|-------|----------------|--------------|
| 0 | 1     | PAY_0          | 2            |
| 1 | 1     | PAY_0          | -1           |
| 2 | 0     | PAY_0          | 0            |
| 3 | 0     | PAY_0          | 0            |
| 4 | 0     | PAY_0          | -1           |
In 25:
fig = px.box(repay, x="Payment Status", y="Delay(Month)",color="Label")
fig.show()
Bill amounts for each month:
In 26:
df[['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']].describe()
Out26:
Comparison between defaulting and non-defaulting clients:
In 27:
df.columns
Out27:
Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',
'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'Label'],
dtype='object')
In 28:
BILL_AMTS = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']
plt.figure(figsize=(12, 6))
for i, col in enumerate(BILL_AMTS):
    plt.subplot(2, 3, i + 1)
    sns.kdeplot(df.loc[df["Label"] == 0, col], label="NO DEFAULT", color="red", fill=True)
    sns.kdeplot(df.loc[df["Label"] == 1, col], label="DEFAULT", color="blue", fill=True)
    plt.xlim(-40000, 200000)
    plt.ylabel("")
    plt.xlabel(col, fontsize=12)
    plt.legend()
plt.tight_layout()
plt.show()
Payment amounts in each of the previous months:
In 29:
df[['PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']].describe()
Out29:
|       | PAY_AMT1      | PAY_AMT2     | PAY_AMT3     | PAY_AMT4      | PAY_AMT5      | PAY_AMT6      |
|-------|---------------|--------------|--------------|---------------|---------------|---------------|
| count | 30000.000000  | 3.000000e+04 | 30000.00000  | 30000.000000  | 30000.000000  | 30000.000000  |
| mean  | 5663.580500   | 5.921163e+03 | 5225.68150   | 4826.076867   | 4799.387633   | 5215.502567   |
| std   | 16563.280354  | 2.304087e+04 | 17606.96147  | 15666.159744  | 15278.305679  | 17777.465775  |
| min   | 0.000000      | 0.000000e+00 | 0.00000      | 0.000000      | 0.000000      | 0.000000      |
| 25%   | 1000.000000   | 8.330000e+02 | 390.00000    | 296.000000    | 252.500000    | 117.750000    |
| 50%   | 2100.000000   | 2.009000e+03 | 1800.00000   | 1500.000000   | 1500.000000   | 1500.000000   |
| 75%   | 5006.000000   | 5.000000e+03 | 4505.00000   | 4013.250000   | 4031.500000   | 4000.000000   |
| max   | 873552.000000 | 1.684259e+06 | 896040.00000 | 621000.000000 | 426529.000000 | 528666.000000 |
In 30:
PAY_AMTS = ['PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
plt.figure(figsize=(12, 6))
for i, col in enumerate(PAY_AMTS):
    plt.subplot(2, 3, i + 1)
    sns.kdeplot(df.loc[df["Label"] == 0, col], label="NO DEFAULT", color="red", fill=True)
    sns.kdeplot(df.loc[df["Label"] == 1, col], label="DEFAULT", color="blue", fill=True)
    plt.xlim(-10000, 70000)
    plt.ylabel("")
    plt.xlabel(col, fontsize=12)
    plt.legend()
plt.tight_layout()
plt.show()
Compare the number of defaulting vs. non-defaulting clients (default.payment.next.month was renamed to Label):
In 31:
df["Label"].value_counts()
Out31:
Label
0 23364
1 6636
Name: count, dtype: int64
In 32:
label = df["Label"].value_counts()
df_label = pd.DataFrame(label).reset_index()
df_label
Out32:
|   | Label | count |
|---|-------|-------|
| 0 | 0     | 23364 |
| 1 | 1     | 6636  |
In 33:
# plt.figure(figsize = (6,6))
# plt.title('Not Default = 0 & Default = 1')
# sns.set_color_codes("pastel")
# sns.barplot(x = 'Label', y="count", data=df_label)
# locs, labels = plt.xticks()
# plt.show()
In 34:
plt.figure(figsize = (5,5))
graph = sns.countplot(x="Label", data=df, palette=["red","blue"])
i = 0
for p in graph.patches:
    h = p.get_height()
    percentage = round(100 * df["Label"].value_counts()[i] / len(df), 2)
    str_percentage = f"{percentage} %"
    graph.text(p.get_x() + p.get_width() / 2., h - 100, str_percentage, ha="center")
    i += 1
plt.title("class distribution")
plt.xticks([0,1], ["Non-Default","Default"])
plt.xlabel("Default Payment Next Month",fontsize=12)
plt.ylabel("Number of Clients")
plt.show()
The two classes are clearly very imbalanced.
In 36:
value_counts = df['Label'].value_counts()
# percentage of each class
percentages = value_counts / len(df)
# bar chart of the class counts with matplotlib
plt.bar(value_counts.index, value_counts.values)
# annotate each bar with its percentage (anchor the text at the bar height, not at the ratio)
for i, (cnt, v) in enumerate(zip(value_counts.values, percentages.values)):
    plt.text(i, cnt + 1, f'{v*100:.2f}%', ha='center', va="bottom")
# title and axis labels
plt.title("Class Distribution")
plt.xticks([0, 1], ["Non-Default", "Default"])
plt.xlabel("Default Payment Next Month", fontsize=12)
plt.ylabel("Number of Clients")
plt.show()
In 37:
numeric = ['LIMIT_BAL','AGE','PAY_0','PAY_2',
'PAY_3','PAY_4','PAY_5','PAY_6',
'BILL_AMT1','BILL_AMT2','BILL_AMT3',
'BILL_AMT4','BILL_AMT5','BILL_AMT6'] # numeric fields used for the correlation analysis
numeric
Out37:
['LIMIT_BAL',
'AGE',
'PAY_0',
'PAY_2',
'PAY_3',
'PAY_4',
'PAY_5',
'PAY_6',
'BILL_AMT1',
'BILL_AMT2',
'BILL_AMT3',
'BILL_AMT4',
'BILL_AMT5',
'BILL_AMT6']
In 38:
corr = df[numeric].corr()
corr.head()
Out38:
Heatmap of the correlation matrix:
In 39:
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.figure(figsize=(12,10))
sns.heatmap(corr,
mask=mask,
vmin=-1,
vmax=1,
center=0,
square=True,
cbar_kws={'shrink': .5},
annot=True,
annot_kws={'size': 10},
cmap="Blues")
plt.show()
In 40:
plt.figure(figsize=(12,10))
pair_plot = sns.pairplot(df[['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6','Label']],
hue='Label',
diag_kind='kde',
corner=True)
pair_plot._legend.remove()
To check whether our data is Gaussian, we use a qualitative graphical method called the quantile-quantile (Q-Q) plot.
In a Q-Q plot, the quantiles of the variable are plotted against the expected quantiles of the normal distribution. If the variable is normally distributed, the points should fall along the 45-degree diagonal.
In 41:
sns.set_color_codes('pastel')  # plot style
fig, axs = plt.subplots(5, 3, figsize=(18,18))  # figure size and 5x3 subplot grid
numeric = ['LIMIT_BAL','AGE','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5',
'BILL_AMT6','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6']
i, j = 0, 0
for f in numeric:
    if j == 3:
        j = 0
        i = i + 1
    stats.probplot(df[f],  # all values of one field
                   dist='norm',  # compare against a normal distribution
                   sparams=(df[f].mean(), df[f].std()),
                   plot=axs[i, j])  # subplot position
    axs[i, j].get_lines()[0].set_marker('.')
    axs[i, j].grid()
    axs[i, j].get_lines()[1].set_linewidth(3.0)
    j = j + 1
fig.tight_layout()
axs[4,2].set_visible(False)
plt.show()
Encoding the categorical fields:
In 42:
df["EDUCATION"].value_counts()
Out42:
EDUCATION
2 14030
1 10585
3 4917
5 280
4 123
6 51
0 14
Name: count, dtype: int64
In 43:
df["GRAD_SCHOOL"] = (df["EDUCATION"] == 1).astype("category")
df["UNIVERSITY"] = (df["EDUCATION"] == 2).astype("category")
df["HIGH_SCHOOL"] = (df["EDUCATION"] == 3).astype("category")
df.drop("EDUCATION",axis=1,inplace=True)
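The boolean flags above are a manual one-hot encoding; `pd.get_dummies` produces the same kind of indicator columns in one call. A small sketch on a hypothetical toy column:

```python
import pandas as pd

# toy EDUCATION column (1 = graduate school, 2 = university, 3 = high school)
edu = pd.DataFrame({"EDUCATION": [1, 2, 3, 2, 1]})

# one indicator column per distinct category value
dummies = pd.get_dummies(edu["EDUCATION"], prefix="EDU")
print(dummies.columns.tolist())
```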
In 44:
df['MALE'] = (df['SEX'] == 1).astype('category')
df.drop('SEX', axis=1, inplace=True)
In 45:
df['MARRIED'] = (df['MARRIAGE'] == 1).astype('category')
df.drop('MARRIAGE', axis=1, inplace=True)
In 46:
# split features and target
y = df['Label']
X = df.drop('Label', axis=1, inplace=False)
Split the data, preserving the class proportions in y:
In 47:
# train/test split
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, random_state=24, stratify=y)
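The effect of `stratify=y` can be sketched on a hypothetical toy target with the same ~78/22 imbalance: both splits end up with nearly identical positive rates.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy target with roughly the same 78/22 imbalance as Label
y_demo = np.array([0] * 78 + [1] * 22)
X_demo = np.arange(100).reshape(-1, 1)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=24, stratify=y_demo)
# stratification keeps the positive rate nearly identical in both splits
print(round(ytr.mean(), 3), round(yte.mean(), 3))
```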
Min-max normalization:
In 48:
mm = MinMaxScaler()
X_train_norm = X_train_raw.copy()
X_test_norm = X_test_raw.copy()
In 49:
# LIMIT_BAL + AGE
X_train_norm['LIMIT_BAL'] = mm.fit_transform(X_train_raw['LIMIT_BAL'].values.reshape(-1, 1))
X_test_norm['LIMIT_BAL'] = mm.transform(X_test_raw['LIMIT_BAL'].values.reshape(-1, 1))
X_train_norm['AGE'] = mm.fit_transform(X_train_raw['AGE'].values.reshape(-1, 1))
X_test_norm['AGE'] = mm.transform(X_test_raw['AGE'].values.reshape(-1, 1))
In 50:
pay_list = ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
for pay in pay_list:
    X_train_norm[pay] = mm.fit_transform(X_train_raw[pay].values.reshape(-1, 1))
    X_test_norm[pay] = mm.transform(X_test_raw[pay].values.reshape(-1, 1))
In 51:
for i in range(1, 7):
    X_train_norm['BILL_AMT' + str(i)] = mm.fit_transform(X_train_raw['BILL_AMT' + str(i)].values.reshape(-1, 1))
    X_test_norm['BILL_AMT' + str(i)] = mm.transform(X_test_raw['BILL_AMT' + str(i)].values.reshape(-1, 1))
    X_train_norm['PAY_AMT' + str(i)] = mm.fit_transform(X_train_raw['PAY_AMT' + str(i)].values.reshape(-1, 1))
    X_test_norm['PAY_AMT' + str(i)] = mm.transform(X_test_raw['PAY_AMT' + str(i)].values.reshape(-1, 1))
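The repeated per-column fit/transform loops above can be consolidated with sklearn's `ColumnTransformer`, which fits one scaler per listed column on the training data and reuses it for the test data. A minimal sketch on a hypothetical toy frame (two of the scaled columns):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# toy frame standing in for X_train_raw
X_demo = pd.DataFrame({"LIMIT_BAL": [10000.0, 50000.0, 240000.0],
                       "AGE": [21.0, 35.0, 79.0]})

ct = ColumnTransformer([("mm", MinMaxScaler(), ["LIMIT_BAL", "AGE"])],
                       remainder="passthrough")
X_scaled = ct.fit_transform(X_demo)  # fit on train only; call ct.transform on test
print(X_scaled.min(), X_scaled.max())
```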
Standardization:
In 52:
ss = StandardScaler()
X_train_std = X_train_raw.copy()
X_test_std = X_test_raw.copy()
X_train_std['LIMIT_BAL'] = ss.fit_transform(X_train_raw['LIMIT_BAL'].values.reshape(-1, 1))
X_test_std['LIMIT_BAL'] = ss.transform(X_test_raw['LIMIT_BAL'].values.reshape(-1, 1))
X_train_std['AGE'] = ss.fit_transform(X_train_raw['AGE'].values.reshape(-1, 1))
X_test_std['AGE'] = ss.transform(X_test_raw['AGE'].values.reshape(-1, 1))
In 53:
pay_list = ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
for pay in pay_list:
    X_train_std[pay] = ss.fit_transform(X_train_raw[pay].values.reshape(-1, 1))
    X_test_std[pay] = ss.transform(X_test_raw[pay].values.reshape(-1, 1))
In 54:
for i in range(1, 7):
    X_train_std['BILL_AMT' + str(i)] = ss.fit_transform(X_train_raw['BILL_AMT' + str(i)].values.reshape(-1, 1))
    X_test_std['BILL_AMT' + str(i)] = ss.transform(X_test_raw['BILL_AMT' + str(i)].values.reshape(-1, 1))
    X_train_std['PAY_AMT' + str(i)] = ss.fit_transform(X_train_raw['PAY_AMT' + str(i)].values.reshape(-1, 1))
    X_test_std['PAY_AMT' + str(i)] = ss.transform(X_test_raw['PAY_AMT' + str(i)].values.reshape(-1, 1))
Plot the distributions of the scaled data:
In 55:
sns.set_color_codes('deep')
numeric = ['LIMIT_BAL','AGE','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5',
'BILL_AMT6','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6']
fig, axs = plt.subplots(1, 2, figsize=(24,6))
sns.boxplot(data=X_train_norm[numeric], ax=axs[0])
axs[0].set_title('Boxplot of normalized numeric features')
axs[0].set_xticklabels(labels=numeric, rotation=25)
axs[0].set_xlabel(' ')
sns.boxplot(data=X_train_std[numeric], ax=axs[1])
axs[1].set_title('Boxplot of standardized numeric features')
axs[1].set_xticklabels(labels=numeric, rotation=25)
axs[1].set_xlabel(' ')
fig.tight_layout()
plt.show()
In 56:
pc = len(X_train_norm.columns.values)  # number of features: 25
pca = PCA(n_components=pc)  # keep all principal components
pca.fit(X_train_norm)
sns.reset_orig()
sns.set_color_codes('pastel')  # plot colors
plt.figure(figsize=(8, 4))  # figure size
plt.grid()  # show the grid
plt.title('Explained Variance of Principal Components')  # title
plt.plot(pca.explained_variance_ratio_, marker='o')  # per-component explained variance
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o')  # cumulative explained variance
plt.legend(["Individual Explained Variance", "Cumulative Explained Variance"])  # legend
plt.xlabel('Principal Component Indexes')  # axis labels
plt.ylabel('Explained Variance Ratio')
plt.tight_layout()  # compact layout
plt.axvline(12, 0, ls='--')  # dashed vertical line at x = 12
plt.show()  # display the figure
What each part of the code does:

- `pc = len(X_train_norm.columns.values)`: number of features in the training set, 25 here.
- `pca = PCA(n_components=pc)`: create a PCA object keeping all 25 components.
- `pca.fit(X_train_norm)`: fit PCA on the training set `X_train_norm`.
- `sns.reset_orig()` and `sns.set_color_codes('pastel')`: reset seaborn to its defaults, then switch the color codes to the pastel palette.
- `plt.figure(figsize=(8,4))`: create a new 8x4 figure.
- `plt.grid()`: show the grid.
- `plt.title('Explained Variance of Principal Components')`: set the title.
- `plt.plot(pca.explained_variance_ratio_, marker='o')`: plot the explained-variance ratio of each individual component.
- `plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o')`: plot the cumulative explained-variance ratio.
- `plt.legend([...])`: add a legend distinguishing the individual and cumulative curves.
- `plt.xlabel(...)` / `plt.ylabel(...)`: label the axes.
- `plt.tight_layout()`: adjust the layout so the figure is compact.
- `plt.axvline(12, 0, ls='--')`: draw a dashed vertical line at x = 12, marking the number of components we will keep.
- `plt.show()`: display the figure.

By construction, principal components are sorted in decreasing order of the variance they explain.
In 57:
cumsum = np.cumsum(pca.explained_variance_ratio_)  # cumulative explained variance
cumsum
Out57:
array([0.44924877, 0.6321187 , 0.8046163 , 0.87590932, 0.92253799,
0.95438576, 0.96762706, 0.97773098, 0.9842774 , 0.98824928,
0.99088299, 0.99280785, 0.99444757, 0.99576128, 0.99690533,
0.99781622, 0.99844676, 0.99890236, 0.99924315, 0.99955744,
0.9997182 , 0.99983861, 0.99992993, 1. , 1. ])
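From these cumulative ratios, the component count for a target variance threshold can be read off programmatically. A small sketch using the first six values printed above (note also that `PCA(n_components=0.90)` accepts a float threshold directly):

```python
import numpy as np

# first six cumulative explained-variance ratios from the output above
cumsum_demo = np.array([0.4492, 0.6321, 0.8046, 0.8759, 0.9225, 0.9544])

# index of the first value reaching 90%, converted to a component count
n_components = int(np.argmax(cumsum_demo >= 0.90)) + 1
print(n_components)
```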
In 58:
indexes = ['PC' + str(i) for i in range(1, pc+1)]
cumsum_df = pd.DataFrame(data=cumsum, index=indexes, columns=['var1'])
cumsum_df.head()
Out58:
|     | var1     |
|-----|----------|
| PC1 | 0.449249 |
| PC2 | 0.632119 |
| PC3 | 0.804616 |
| PC4 | 0.875909 |
| PC5 | 0.922538 |
In 59:
# round to 4 decimals
cumsum_df['var2'] = pd.Series([round(val, 4) for val in cumsum_df['var1']],
                              index=cumsum_df.index)
# format as percentage strings
cumsum_df['Cumulative Explained Variance'] = pd.Series(["{0:.2f}%".format(val * 100) for val in cumsum_df['var2']],
                                                       index=cumsum_df.index)
cumsum_df.head()
Out59:
In 60:
cumsum_df = cumsum_df.drop(['var1','var2'], axis=1, inplace=False)
cumsum_df.T.iloc[:,:15]
In 61:
pc = 12
pca = PCA(n_components=pc)
pca.fit(X_train_norm)
X_train = pd.DataFrame(pca.transform(X_train_norm))
X_test = pd.DataFrame(pca.transform(X_test_norm))
# set column names
X_train.columns = ['PC' + str(i) for i in range(1, pc+1)]
X_test.columns = ['PC' + str(i) for i in range(1, pc+1)]
X_train.head()
Out61:
k-fold cross-validation: the data is split into k folds; in each round, k-1 folds are used for training and the remaining fold for validation, rotating until every fold has served as the validation set once.
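The k-fold idea can be sketched with `StratifiedKFold` (imported at the top), which additionally preserves the class ratio in every fold. A minimal example on synthetic data, with `LogisticRegression` standing in for the LightGBM model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic imbalanced data standing in for the real training split
X_demo, y_demo = make_classification(n_samples=500, weights=[0.78, 0.22], random_state=24)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=24)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo,
                         cv=skf, scoring="roc_auc")  # one AUC per fold
print(scores.round(3), scores.mean().round(3))
```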
1. Confusion matrix
$$\begin{array}{ccc}
 & \text{Predicted Negative} & \text{Predicted Positive} \\
\hline
\text{Actual Negative} & \text{TN} & \text{FP} \\
\text{Actual Positive} & \text{FN} & \text{TP}
\end{array}$$
2. Accuracy
$$\text { Accuracy }=\frac{T P+T N}{T P+F P+T N+F N}$$
3. Precision
$$\text { Precision, } p=\frac{T P}{T P+F P}$$
4. Recall
$$\text { Recall, } r=\frac{T P}{T P+F N}$$
5. F1 score
$${ F1_{score} }=\frac{2}{\frac{1}{r}+\frac{1}{p}}=\frac{2 r p}{r+p}$$
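The formulas above can be checked numerically against sklearn on a tiny hypothetical label vector:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# hypothetical true labels and predictions for ten clients
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
p = tp / (tp + fp)        # precision
r = tp / (tp + fn)        # recall
f1 = 2 * r * p / (r + p)  # harmonic mean of precision and recall
print(p, r, f1)
```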
In 62:
# train the model
lgb_clf = lgb.LGBMClassifier()
lgb_clf.fit(X_train, y_train)
[LightGBM] [Info] Number of positive: 4977, number of negative: 17523
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000619 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3060
[LightGBM] [Info] Number of data points in the train set: 22500, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.221200 -> initscore=-1.258687
[LightGBM] [Info] Start training from score -1.258687
Out62:
LGBMClassifier
LGBMClassifier()
In 63:
# predict on the test set
y_pred = lgb_clf.predict(X_test)
y_pred
Out63:
array([1, 0, 0, ..., 0, 0, 0], dtype=int64)
Baseline accuracy:
In 64:
acc = accuracy_score(y_test, y_pred)
print("Model accuracy:", acc)
Model accuracy: 0.8130666666666667
Classification report:
In 65:
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.84 0.94 0.89 5841
1 0.64 0.36 0.46 1659
accuracy 0.81 7500
macro avg 0.74 0.65 0.67 7500
weighted avg 0.79 0.81 0.79 7500
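Because the classes are imbalanced (78/22), accuracy alone is a weak summary: predicting "no default" for everyone already scores about 78%. ROC AUC on the predicted probabilities (for the real model, `lgb_clf.predict_proba(X_test)[:, 1]`) is more informative; a tiny hypothetical example:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# hypothetical labels and predicted default probabilities for five clients
y_true = np.array([0, 0, 1, 1, 0])
y_proba = np.array([0.1, 0.4, 0.35, 0.8, 0.2])

# AUC = fraction of (positive, negative) pairs ranked correctly; here 5 of 6
auc_value = roc_auc_score(y_true, y_proba)
print(auc_value)
```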
Confusion matrix of the model:
In 66:
# compute the confusion matrix
cm = confusion_matrix(y_test, y_pred)
# wrap it in a DataFrame to get labeled axes
cm_df = pd.DataFrame(cm, index=['Non-Defaulters', 'Defaulters'], columns=['Non-Defaulters', 'Defaulters'])
# heatmap of the confusion matrix with seaborn
plt.figure(figsize=(8, 5))
sns.heatmap(cm_df, annot=True, cmap='Blues', fmt='d')
plt.title('Confusion Matrix')
plt.xlabel('Predicted Value')
plt.ylabel('True Value')
plt.show()
Originality statement: this article is published on the Tencent Cloud Developer Community with the author's authorization; do not reproduce without permission.
For removal of infringing content, contact cloudcommunity@tencent.com.