Using cardholders' transaction behavior data, we build a risk-control model for credit card fraud detection. When a new transaction occurs, the model judges whether it is normal cardholder behavior or someone fraudulently using the card. Because the dataset has been reduced with PCA, sensitive details of the original records are hidden while most of the information in the data is preserved. A deep neural network has poor interpretability, and on these PCA-processed features it can easily overfit; adding the network weights to the loss function as a penalty term helps suppress overfitting.
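Concretely, the weight penalty described here (implemented at the end of this write-up with a coefficient of 0.001) amounts to training on

    loss_total = cross_entropy(label, y) + 0.001 · Σ_i ½‖W_i‖²

where the W_i are the dense-layer weight matrices; tf.nn.l2_loss computes the ½‖W‖² term for each matrix.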
The dataset can be downloaded from the Kaggle website (a login is required).
It contains 284,807 transactions made by European cardholders over two days in September 2013, of which 492 (0.172%) are fraudulent. The features were mapped by PCA into the numerical attributes V1, V2, ..., V28; only the transaction time and amount were left untransformed. The output variable Class is binary: 1 marks a fraudulent transaction and 0 a normal one.
Output of data_df.info():
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time 284807 non-null float64
V1 284807 non-null float64
V2 284807 non-null float64
V3 284807 non-null float64
V4 284807 non-null float64
V5 284807 non-null float64
V6 284807 non-null float64
V7 284807 non-null float64
V8 284807 non-null float64
V9 284807 non-null float64
V10 284807 non-null float64
V11 284807 non-null float64
V12 284807 non-null float64
V13 284807 non-null float64
V14 284807 non-null float64
V15 284807 non-null float64
V16 284807 non-null float64
V17 284807 non-null float64
V18 284807 non-null float64
V19 284807 non-null float64
V20 284807 non-null float64
V21 284807 non-null float64
V22 284807 non-null float64
V23 284807 non-null float64
V24 284807 non-null float64
V25 284807 non-null float64
V26 284807 non-null float64
V27 284807 non-null float64
V28 284807 non-null float64
Amount 284807 non-null float64
Class 284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
Output of data_df.describe().T (pandas wraps the wide table into two blocks):
count mean std min 25% \
Time 284807.0 9.481386e+04 47488.145955 0.000000 54201.500000
V1 284807.0 3.919560e-15 1.958696 -56.407510 -0.920373
V2 284807.0 5.688174e-16 1.651309 -72.715728 -0.598550
V3 284807.0 -8.769071e-15 1.516255 -48.325589 -0.890365
V4 284807.0 2.782312e-15 1.415869 -5.683171 -0.848640
V5 284807.0 -1.552563e-15 1.380247 -113.743307 -0.691597
V6 284807.0 2.010663e-15 1.332271 -26.160506 -0.768296
V7 284807.0 -1.694249e-15 1.237094 -43.557242 -0.554076
V8 284807.0 -1.927028e-16 1.194353 -73.216718 -0.208630
V9 284807.0 -3.137024e-15 1.098632 -13.434066 -0.643098
V10 284807.0 1.768627e-15 1.088850 -24.588262 -0.535426
V11 284807.0 9.170318e-16 1.020713 -4.797473 -0.762494
V12 284807.0 -1.810658e-15 0.999201 -18.683715 -0.405571
V13 284807.0 1.693438e-15 0.995274 -5.791881 -0.648539
V14 284807.0 1.479045e-15 0.958596 -19.214325 -0.425574
V15 284807.0 3.482336e-15 0.915316 -4.498945 -0.582884
V16 284807.0 1.392007e-15 0.876253 -14.129855 -0.468037
V17 284807.0 -7.528491e-16 0.849337 -25.162799 -0.483748
V18 284807.0 4.328772e-16 0.838176 -9.498746 -0.498850
V19 284807.0 9.049732e-16 0.814041 -7.213527 -0.456299
V20 284807.0 5.085503e-16 0.770925 -54.497720 -0.211721
V21 284807.0 1.537294e-16 0.734524 -34.830382 -0.228395
V22 284807.0 7.959909e-16 0.725702 -10.933144 -0.542350
V23 284807.0 5.367590e-16 0.624460 -44.807735 -0.161846
V24 284807.0 4.458112e-15 0.605647 -2.836627 -0.354586
V25 284807.0 1.453003e-15 0.521278 -10.295397 -0.317145
V26 284807.0 1.699104e-15 0.482227 -2.604551 -0.326984
V27 284807.0 -3.660161e-16 0.403632 -22.565679 -0.070840
V28 284807.0 -1.206049e-16 0.330083 -15.430084 -0.052960
Amount 284807.0 8.834962e+01 250.120109 0.000000 5.600000
Class 284807.0 1.727486e-03 0.041527 0.000000 0.000000
50% 75% max
Time 84692.000000 139320.500000 172792.000000
V1 0.018109 1.315642 2.454930
V2 0.065486 0.803724 22.057729
V3 0.179846 1.027196 9.382558
V4 -0.019847 0.743341 16.875344
V5 -0.054336 0.611926 34.801666
V6 -0.274187 0.398565 73.301626
V7 0.040103 0.570436 120.589494
V8 0.022358 0.327346 20.007208
V9 -0.051429 0.597139 15.594995
V10 -0.092917 0.453923 23.745136
V11 -0.032757 0.739593 12.018913
V12 0.140033 0.618238 7.848392
V13 -0.013568 0.662505 7.126883
V14 0.050601 0.493150 10.526766
V15 0.048072 0.648821 8.877742
V16 0.066413 0.523296 17.315112
V17 -0.065676 0.399675 9.253526
V18 -0.003636 0.500807 5.041069
V19 0.003735 0.458949 5.591971
V20 -0.062481 0.133041 39.420904
V21 -0.029450 0.186377 27.202839
V22 0.006782 0.528554 10.503090
V23 -0.011193 0.147642 22.528412
V24 0.040976 0.439527 4.584549
V25 0.016594 0.350716 7.519589
V26 -0.052139 0.240952 3.517346
V27 0.001342 0.091045 31.612198
V28 0.011244 0.078280 33.847808
Amount 22.000000 77.165000 25691.160000
Class 0.000000 0.000000 1.000000
Since the features are mostly PCA components, little additional preprocessing is needed.
One notable property of the dataset is that positive (fraud) samples are extremely rare, so positive and negative samples should be balanced when sampling training batches.
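As a minimal sketch of this balanced sampling (the full training script below does exactly the same inside its loop; the random arrays here are only stand-ins for the real feature matrices):

import numpy as np

neg_data = np.random.randn(284315, 29)   # stand-in for the normal-transaction features
pos_data = np.random.randn(492, 29)      # stand-in for the fraud-transaction features

neg_ind = np.random.randint(0, neg_data.shape[0], 32)   # 32 random normal rows
pos_ind = np.random.randint(0, pos_data.shape[0], 32)   # 32 random fraud rows
batch_x = np.concatenate([neg_data[neg_ind], pos_data[pos_ind]])   # (64, 29) balanced batch

batch_y = np.zeros([64, 2])   # one-hot labels: column 0 = normal, column 1 = fraud
batch_y[:32, 0] = 1
batch_y[32:, 1] = 1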
import numpy as np
import pandas as pd
np.set_printoptions(suppress=True)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
data_df = pd.read_csv("/Users/wangsen/ai/13/homework/creditcard.csv")
# data_df.info()
# print(data_df.describe().T)
#print(data_df.Time.head(100))
data_df = data_df.drop("Time", axis=1)            # Time is not used as a feature
neg_df = data_df[data_df.Class == 0]              # normal transactions
pos_df = data_df[data_df.Class == 1]              # fraud transactions
#print(neg_df.head())
#print(pos_df.head())
neg_data = neg_df.drop('Class', axis=1).values    # (284315, 29): V1..V28 + Amount
pos_data = pos_df.drop('Class', axis=1).values    # (492, 29)
print("neg_data shape:",neg_data.shape)
print("pos_data shape:",pos_data.shape)
import tensorflow as tf                                      # TensorFlow 1.x graph-mode API
X = tf.placeholder(dtype=tf.float32, shape=[None, 29])       # 29 input features
label = tf.placeholder(dtype=tf.float32, shape=[None, 2])    # one-hot labels: [normal, fraud]
net = tf.layers.dense(X, 16, tf.nn.relu)
net = tf.layers.dense(net, 256, tf.nn.relu)
net = tf.layers.dense(net, 256, tf.nn.relu)
net = tf.layers.dense(net, 256, tf.nn.relu)
net = tf.layers.dense(net, 256, tf.nn.relu)
y = tf.layers.dense(net, 2, None)                            # logits for the two classes
#y = tf.nn.softmax(y)
loss = tf.losses.softmax_cross_entropy(label,y)
#loss = tf.reduce_mean(tf.square(label-y))
#loss = tf.reduce_sum(-label*tf.log(y))
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(y, 1), tf.argmax(label, 1)), tf.float32))
train_step = tf.train.AdamOptimizer(0.0001).minimize(loss)
# train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
neg_high = neg_data.shape[0]
pos_high = pos_data.shape[0]
# every training batch uses the same fixed labels: first 32 rows normal, last 32 fraud
input_y = np.zeros([64, 2])
input_y[:32, 0] = 1
input_y[32:, 1] = 1
# balanced test set: 450 normal + 450 fraud transactions
test_x = np.concatenate([neg_data[10000:10000+450], pos_data[0:450]])
test_y = np.zeros([900, 2])
test_y[:450, 0] = 1
test_y[450:, 1] = 1
sess = tf.Session()
sess.run(tf.global_variables_initializer())
l = []   # loss history, recorded every 100 iterations
s = []   # accuracy history
for itr in range(10000):
    # each mini-batch: 32 random normal + 32 random fraud samples (balanced)
    neg_ind = np.random.randint(0, neg_high, 32)
    pos_ind = np.random.randint(0, pos_high, 32)
    input_x = np.concatenate([neg_data[neg_ind], pos_data[pos_ind]])
    _, loss_var = sess.run((train_step, loss), feed_dict={X: input_x, label: input_y})
    if itr % 100 == 0:
        accuracy_var = sess.run(accuracy, feed_dict={X: test_x, label: test_y})
        print("iter:%d accuracy:%f loss:%f" % (itr, accuracy_var, loss_var))
        s.append(accuracy_var)
        l.append(loss_var)
import matplotlib.pyplot as plt
plt.plot(l, color="red")     # loss
plt.plot(s, color="green")   # accuracy
plt.show()
'''
neg_data shape: (284315, 29)
pos_data shape: (492, 29)
'''
iter:9900 accuracy:0.988889 loss:0.062907
Loss (red) and accuracy (green) during training.
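As a hypothetical follow-up (not part of the original script), the trained session could also be used to score new transactions; new_x below is only a stand-in for an (n, 29) array with the same V1..V28 + Amount columns:

probs_op = tf.nn.softmax(y)                      # convert logits to class probabilities (build once)
new_x = test_x[:5]                               # stand-in for freshly observed transactions
probs = sess.run(probs_op, feed_dict={X: new_x})
pred = probs.argmax(axis=1)                      # 0 = normal, 1 = fraud (same order as the one-hot labels)
print(pred, probs[:, 1])                         # predicted class and estimated fraud probability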
To penalize large weights, collect the squared norm of every dense-layer weight matrix (the "kernel" variables; biases are left unpenalized) and add it to the loss:
loss_w = [tf.nn.l2_loss(var) for var in tf.trainable_variables() if "kernel" in var.name]
print("variables:", tf.trainable_variables())
weights_norm = tf.reduce_sum(loss_w)
loss = tf.losses.softmax_cross_entropy(label, y) + 0.001*weights_norm   # 0.001 is the penalty coefficient
variables: [<tf.Variable 'dense/kernel:0' shape=(29, 16) dtype=float32_ref>, <tf.Variable 'dense/bias:0' shape=(16,) dtype=float32_ref>, <tf.Variable 'dense_1/kernel:0' shape=(16, 256) dtype=float32_ref>, <tf.Variable 'dense_1/bias:0' shape=(256,) dtype=float32_ref>, <tf.Variable 'dense_2/kernel:0' shape=(256, 256) dtype=float32_ref>, <tf.Variable 'dense_2/bias:0' shape=(256,) dtype=float32_ref>, <tf.Variable 'dense_3/kernel:0' shape=(256, 256) dtype=float32_ref>, <tf.Variable 'dense_3/bias:0' shape=(256,) dtype=float32_ref>, <tf.Variable 'dense_4/kernel:0' shape=(256, 256) dtype=float32_ref>, <tf.Variable 'dense_4/bias:0' shape=(256,) dtype=float32_ref>, <tf.Variable 'dense_5/kernel:0' shape=(256, 2) dtype=float32_ref>, <tf.Variable 'dense_5/bias:0' shape=(2,) dtype=float32_ref>]
iter:4900 accuracy:0.972222 loss:0.291221 weight:197.521942
Loss and accuracy after adding the L2 penalty term.
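A sketch of how these lines slot into the earlier script, under the assumption that they replace the original loss definition after the network is built and before the optimizer is created (the weight: value in the log above suggests weights_norm was also fetched during training):

# after y = tf.layers.dense(net, 2, None), replace the earlier loss/train_step with:
loss_w = [tf.nn.l2_loss(var) for var in tf.trainable_variables() if "kernel" in var.name]
weights_norm = tf.reduce_sum(loss_w)          # sum of 0.5*||W||^2 over all dense-layer kernels
loss = tf.losses.softmax_cross_entropy(label, y) + 0.001 * weights_norm
train_step = tf.train.AdamOptimizer(0.0001).minimize(loss)

# inside the training loop, also fetch the penalty so it can be logged:
_, loss_var, w_var = sess.run((train_step, loss, weights_norm),
                              feed_dict={X: input_x, label: input_y})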