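WeChat official account: 机器学习杂货店 Author: Peter Editor: Peter
This series collects the key points of the book Deep Learning with Python (《Python深度学习》) and is shared purely as study notes.
This is the third post: how to use Keras to solve a multiclass classification problem.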
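Compared with binary classification, two differences matter most in a multiclass problem: the output layer has one softmax unit per class (46 here), and the loss is categorical_crossentropy rather than binary_crossentropy.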
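Environment: Python 3.9.13 + Keras 2.12.0 + TensorFlow 2.12.0
The Reuters dataset is one of the most commonly used datasets in machine learning. It consists of newswire text and is mainly used for text classification. The data comes from the Reuters news agency and covers a large number of news articles labeled with 46 topic classes.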
In 1:
import numpy as np
np.random.seed(1234)
import warnings
warnings.filterwarnings("ignore")
In 2:
from keras.datasets import reuters
In 3:
# Keep only the 10,000 most frequent words
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)
In 4:
train_data[:2]
Out4:
array([list([1, 2, 2, 8, 43, 10, 447, 5, 25, 207, 270, 5, 3095, 111, 16, 369, 186, 90, 67, 7, 89, 5, 19, 102, 6, 19, 124, 15, 90, 67, 84, 22, 482, 26, 7, 48, 4, 49, 8, 864, 39, 209, 154, 6, 151, 6, 83, 11, 15, 22, 155, 11, 15, 7, 48, 9, 4579, 1005, 504, 6, 258, 6, 272, 11, 15, 22, 134, 44, 11, 15, 16, 8, 197, 1245, 90, 67, 52, 29, 209, 30, 32, 132, 6, 109, 15, 17, 12]),
list([1, 3267, 699, 3434, 2295, 56, 2, 7511, 9, 56, 3906, 1073, 81, 5, 1198, 57, 366, 737, 132, 20, 4093, 7, 2, 49, 2295, 2, 1037, 3267, 699, 3434, 8, 7, 10, 241, 16, 855, 129, 231, 783, 5, 4, 587, 2295, 2, 2, 775, 7, 48, 34, 191, 44, 35, 1795, 505, 17, 12])],
dtype=object)
In 5:
len(train_data), len(test_data)
Out5:
(8982, 2246)
Inspect the labels: there are 46 classes in total.
In 6:
train_labels[:20]
Out6:
array([ 3, 4, 3, 4, 4, 4, 4, 3, 3, 16, 3, 3, 4, 4, 19, 8, 16,
3, 3, 21], dtype=int64)
In 7:
test_labels[:20]
Out7:
array([ 3, 10, 1, 4, 4, 3, 3, 3, 3, 3, 5, 4, 1, 3, 1, 11, 23,
3, 19, 3], dtype=int64)
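Converting between words and indices:
In 8:
word_index = reuters.get_word_index()
reverse_word_index = {value: key for key, value in word_index.items()} # invert the mapping: index -> word
# Indices are offset by 3; anything that cannot be looked up is shown as "?"
decoded_review = ' '.join([reverse_word_index.get(i-3, "?") for i in train_data[0]])
decoded_review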
Out8:
'? ? ? said as a result of its december acquisition of space co it expects earnings per share in 1987 of 1 15 to 1 30 dlrs per share up from 70 cts in 1986 the company said pretax net should rise to nine to 10 mln dlrs from six mln dlrs in 1986 and rental operation revenues to 19 to 22 mln dlrs from 12 5 mln dlrs it said cash flow per share this year should be 2 50 to three dlrs reuter 3'
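The `i-3` offset is needed because `reuters.load_data` by default reserves indices 0, 1, and 2 for padding, the start-of-sequence marker, and unknown words, so real words begin at index 3. A minimal sketch with a made-up three-word vocabulary (the `toy_word_index` dict and the `decode` helper are hypothetical, not part of Keras):

```python
# Hypothetical toy vocabulary in the same format as reuters.get_word_index():
# word -> rank, where rank 1 is the most frequent word.
toy_word_index = {"the": 1, "of": 2, "said": 3}

# Invert it: rank -> word.
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

def decode(encoded, reverse_index, index_from=3):
    """Map encoded indices back to words; reserved or unseen indices become '?'."""
    return " ".join(reverse_index.get(i - index_from, "?") for i in encoded)

# 1 is the start marker and 2 the unknown-word marker, so both decode to '?'.
# 6 -> rank 3 ("said"), 4 -> rank 1 ("the"), 5 -> rank 2 ("of").
decoded = decode([1, 2, 6, 4, 5], toy_reverse_index)
print(decoded)
```

This also explains why the decoded newswire above starts with `? ? ?`: its first three indices are 1, 2, and 2.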
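Vectorizing the data:
In 9:
# The same vectorization function as before
import numpy as np
def vectorszie(seq, dim=10000):
    """
    seq: input sequences (lists of word indices)
    dim: vector dimension, 10000 by default
    """
    results = np.zeros((len(seq), dim))  # all-zero matrix of shape (len(seq), dim)
    for i, s in enumerate(seq):
        results[i, s] = 1.  # set the positions of the words that occur to 1; the rest stay 0
    return results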
In 10:
# Vectorize both datasets
x_train = vectorszie(train_data)
x_test = vectorszie(test_data)
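To make the encoding concrete, here is a small sketch of the same multi-hot idea on made-up input (the `multi_hot` name and the toy sequences are illustrative assumptions, not from the original post):

```python
import numpy as np

def multi_hot(sequences, dim):
    """Turn lists of word indices into rows of 0/1 flags of length dim."""
    out = np.zeros((len(sequences), dim))
    for row, indices in enumerate(sequences):
        out[row, indices] = 1.0  # fancy indexing sets every listed column at once
    return out

# Two made-up "documents" over a 5-word vocabulary.
toy = multi_hot([[1, 3, 3], [0, 4]], dim=5)
print(toy)
```

Note that the repeated index 3 in the first sequence still yields a single 1: this representation records which words occur, not how often or in what order.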
Label vectorization, method 1: a custom one-hot encoding function
In 11:
# 1. Manual implementation
def to_one_hot(labels, dimension=46):
    results = np.zeros((len(labels), dimension))  # all-zero matrix of shape (len(labels), dimension)
    for i, label in enumerate(labels):
        results[i, label] = 1.  # mark this sample's class; the values are floats
    return results
# Call the function defined above
one_hot_train_labels = to_one_hot(train_labels)
one_hot_test_labels = to_one_hot(test_labels)
Label vectorization, method 2: Keras's built-in function
In 12:
# Built-in Keras helper
from keras.utils import to_categorical
one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)
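For intuition, one-hot encoding can also be written as indexing into an identity matrix. The sketch below uses made-up labels and a made-up class count of 5; it is an illustration of the idea, not how Keras implements `to_categorical`:

```python
import numpy as np

labels = np.array([3, 4, 3, 0])   # made-up integer labels
num_classes = 5                   # made-up; the Reuters data has 46

# Row k of the identity matrix is the one-hot vector for class k,
# so indexing the identity matrix with the labels one-hot encodes them.
one_hot = np.eye(num_classes, dtype="float32")[labels]

# Taking the argmax of each row recovers the original integer labels.
recovered = one_hot.argmax(axis=1)
print(one_hot)
print(recovered)
```

When `num_classes` is not passed, `to_categorical` infers the width from the largest label, so passing it explicitly is a reasonable safeguard if a split might be missing the highest class.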
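If we do not want to one-hot encode the class labels (46 possible values), we can keep them as integers and use the sparse loss sparse_categorical_crossentropy.
The usage is otherwise the same:
y_train = np.array(train_labels)
y_test = np.array(test_labels)
model.compile(
    optimizer='rmsprop',  # optimizer
    loss='sparse_categorical_crossentropy',  # sparse categorical loss
    metrics=['accuracy']  # evaluation metric
)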
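The two losses compute the same quantity and differ only in the label format they expect. A small hand calculation on one made-up prediction (all numbers below are illustrative) shows this:

```python
import math

probs = [0.1, 0.7, 0.2]     # made-up softmax output for one sample
true_class = 1              # integer label, as used by the sparse loss
one_hot = [0.0, 1.0, 0.0]   # the same label, one-hot encoded

# categorical_crossentropy: -sum_k y_k * log(p_k) over all classes
dense_loss = -sum(y * math.log(p) for y, p in zip(one_hot, probs))

# sparse_categorical_crossentropy: -log(p[true_class]), no one-hot needed
sparse_loss = -math.log(probs[true_class])

print(dense_loss, sparse_loss)
```

Because every term with `y_k = 0` vanishes, only the log-probability of the true class survives in both versions.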
In 13:
# Hold out the first 1,000 samples as a validation set
x_val = x_train[:1000]
part_x_train = x_train[1000:]
y_val = one_hot_train_labels[:1000]
part_y_train = one_hot_train_labels[1000:]
In 14:
from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(x_train.shape[1],))) # x_train.shape[1] = 10000
model.add(layers.Dense(64, activation="relu"))
model.add(layers.Dense(46, activation="softmax")) # 46 is the number of output classes
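Compared with the binary classification problem, three points deserve attention: the last layer is a Dense layer of size 46, one output per class; it uses a softmax activation, so the 46 outputs form a probability distribution; and the loss is categorical_crossentropy.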
In 15:
model.compile(optimizer='rmsprop',  # optimizer
              loss='categorical_crossentropy',  # multiclass cross-entropy
              metrics=['accuracy']  # evaluation metric
              )
In 16:
## Train the network
In 17:
history = model.fit(part_x_train,  # input
                    part_y_train,  # output
                    epochs=20,  # train for 20 epochs
                    batch_size=512,  # mini-batches of 512 samples per step
                    validation_data=(x_val, y_val)  # validation data
                    )
Epoch 1/20
16/16 [==============================] - 1s 26ms/step - loss: 2.6860 - accuracy: 0.4868 - val_loss: 1.8084 - val_accuracy: 0.6240
Epoch 2/20
16/16 [==============================] - 0s 14ms/step - loss: 1.5509 - accuracy: 0.6750 - val_loss: 1.3812 - val_accuracy: 0.6850
Epoch 3/20
16/16 [==============================] - 0s 14ms/step - loss: 1.2006 - accuracy: 0.7357 - val_loss: 1.1962 - val_accuracy: 0.7300
......
Epoch 18/20
16/16 [==============================] - 0s 14ms/step - loss: 0.1567 - accuracy: 0.9559 - val_loss: 0.9402 - val_accuracy: 0.8110
Epoch 19/20
16/16 [==============================] - 0s 14ms/step - loss: 0.1439 - accuracy: 0.9559 - val_loss: 0.9561 - val_accuracy: 0.8040
Epoch 20/20
16/16 [==============================] - 0s 13ms/step - loss: 0.1401 - accuracy: 0.9546 - val_loss: 0.9467 - val_accuracy: 0.8090
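The `16/16` in the log is the number of mini-batches per epoch. It follows from the training-set size and the batch size, as this small check shows (the evaluation batch size of 32 is the Keras default and is assumed here because the post does not set it):

```python
import math

train_samples = 8982 - 1000   # training set minus the 1,000 validation samples
batch_size = 512
steps_per_epoch = math.ceil(train_samples / batch_size)

test_samples = 2246
eval_batch_size = 32          # assumed: Keras default for evaluate/predict
eval_steps = math.ceil(test_samples / eval_batch_size)

print(steps_per_epoch, eval_steps)
```

The second number matches the `71/71` shown when the post calls `model.evaluate` and `model.predict` on the test set.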
In 18:
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 64) 640064
dense_1 (Dense) (None, 64) 4160
dense_2 (Dense) (None, 46) 2990
=================================================================
Total params: 647,214
Trainable params: 647,214
Non-trainable params: 0
_________________________________________________________________
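The parameter counts in the summary can be reproduced by hand: a Dense layer has one weight per input-output pair plus one bias per output unit. A short sketch (the `dense_params` helper is a hypothetical name used only for this check):

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

layer_sizes = [10000, 64, 64, 46]   # input dimension followed by each layer's units
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```

Almost all of the parameters sit in the first layer, because it connects the 10,000-dimensional input to 64 units.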
In 19:
x_test
Out19:
array([[0., 1., 1., ..., 0., 0., 0.],
[0., 1., 1., ..., 0., 0., 0.],
[0., 1., 1., ..., 0., 0., 0.],
...,
[0., 1., 0., ..., 0., 0., 0.],
[0., 1., 1., ..., 0., 0., 0.],
[0., 1., 1., ..., 0., 0., 0.]])
In 20:
one_hot_test_labels
Out20:
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 1., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
In 21:
# one_hot_test_labels: the test labels after one-hot encoding
model.evaluate(x_test, one_hot_test_labels)
71/71 [==============================] - 0s 1ms/step - loss: 1.0572 - accuracy: 0.7872
Out21:
[1.0572034120559692, 0.7871772050857544]
In 22:
his_dict = history.history # a dict
his_dict.keys()
Out22:
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
In 23:
import matplotlib.pyplot as plt
loss = his_dict["loss"]
val_loss = his_dict["val_loss"]
acc = his_dict["accuracy"]
val_acc = his_dict["val_accuracy"]
In 24:
epochs = range(1, len(loss) + 1) # x-axis values
# 1. Loss
plt.plot(epochs, loss, "bo", label="Training Loss")
plt.plot(epochs, val_loss, "b", label="Validation Loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.title("Training and Validation Loss")
plt.show()
Visualizing the accuracy:
In 25:
# 2. Accuracy
plt.clf() # clear the figure
plt.plot(epochs, acc, "bo", label="Training Acc")
plt.plot(epochs, val_acc, "b", label="Validation Acc")
plt.xlabel("Epochs")
plt.ylabel("Acc")
plt.legend()
plt.title("Training and Validation Acc")
plt.show()
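The training loss keeps decreasing, but the validation loss stops improving after about epoch 8. Likewise, training accuracy keeps rising, while validation accuracy is essentially flat after about epoch 9.
The model is clearly overfitting, so we retrain it from scratch for 9 epochs.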
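Instead of reading the turning point off the plot, the epoch can be picked programmatically from the recorded validation loss. A minimal sketch, using made-up loss values in place of `history.history["val_loss"]`:

```python
# Made-up validation losses for illustration; in the post these would come
# from history.history["val_loss"].
val_loss = [1.81, 1.38, 1.20, 1.05, 0.98, 0.93, 0.91, 0.90, 0.89, 0.91, 0.94]

# Index of the smallest validation loss, converted to a 1-based epoch number.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch)
```

An alternative that avoids retraining by hand is the `EarlyStopping` callback in `keras.callbacks`, which stops training once a monitored metric such as `val_loss` stops improving.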
In 26:
from keras.datasets import reuters
import numpy as np
# Keep only the 10,000 most frequent words
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)
def vectorszie(seq, dim=10000):
    """
    seq: input sequences (lists of word indices)
    dim: vector dimension, 10000 by default
    """
    results = np.zeros((len(seq), dim))  # all-zero matrix of shape (len(seq), dim)
    for i, s in enumerate(seq):
        results[i, s] = 1.  # set the positions of the words that occur to 1; the rest stay 0
    return results
# Vectorize both datasets
x_train = vectorszie(train_data)
x_test = vectorszie(test_data)
# One-hot encode the labels
from keras.utils import to_categorical
one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)
# Hold out the first 1,000 samples as a validation set
x_val = x_train[:1000]
part_x_train = x_train[1000:]
y_val = one_hot_train_labels[:1000]
part_y_train = one_hot_train_labels[1000:]
# Build the network
from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(x_train.shape[1],))) # x_train.shape[1] = 10000
model.add(layers.Dense(64, activation="relu"))
model.add(layers.Dense(46, activation="softmax")) # 46 is the number of output classes
# Compile the model
model.compile(optimizer='rmsprop',  # optimizer
              loss='categorical_crossentropy',  # multiclass cross-entropy
              metrics=['accuracy']  # evaluation metric
              )
# Train the model
model.fit(part_x_train,  # input
          part_y_train,  # output
          epochs=9,  # train for 9 epochs
          verbose=0,  # 0 = do not print training progress
          batch_size=512,  # mini-batches of 512 samples per step
          validation_data=(x_val, y_val)  # validation data
          )
# Evaluate the model
model.evaluate(x_test, one_hot_test_labels)
71/71 [==============================] - 0s 1ms/step - loss: 0.9535 - accuracy: 0.7801
Out26:
[0.9534968733787537, 0.780053436756134]
The test accuracy is about 78%.
How do we look at the predicted classes? Take the prediction for the first sample as an example:
In 27:
results = model.predict(x_test)
results
71/71 [==============================] - 0s 997us/step
Out27:
array([[5.1504822e-04, 1.0902017e-04, 2.1993063e-04, ..., 1.9025596e-05,
4.6712950e-07, 3.7851787e-05],
[5.1406571e-03, 4.2032253e-02, 1.9307269e-03, ..., 1.3830569e-02,
5.6258432e-04, 3.1604938e-04],
[4.0979325e-03, 7.7002281e-01, 6.7354720e-03, ..., 1.8014901e-03,
4.9561085e-03, 9.9538732e-04],
...,
[6.1581237e-04, 1.1025119e-03, 3.9810984e-04, ..., 2.9050951e-05,
2.4186371e-05, 6.5296721e-05],
[3.6575866e-03, 1.0463378e-02, 3.1981221e-03, ..., 2.7204564e-04,
9.7423712e-05, 1.8902053e-03],
[1.8005458e-03, 7.0240724e-01, 1.8455695e-02, ..., 1.9976693e-04,
5.6885678e-04, 1.8073655e-04]], dtype=float32)
In 28:
predict_one = results[0]
predict_one
Out28:
array([5.1504822e-04, 1.0902017e-04, 2.1993063e-04, 3.9642093e-01,
5.7400799e-01, 1.5043363e-04, 4.0421914e-05, 7.1661170e-06,
3.2984249e-03, 1.5247319e-04, 3.9692928e-05, 3.0673095e-03,
1.3204347e-03, 3.9371965e-04, 2.1458001e-04, 1.5276371e-04,
1.8565950e-03, 1.2035699e-04, 6.1764423e-04, 7.4270181e-03,
3.6794273e-03, 2.7725848e-03, 2.1595537e-05, 7.5044850e-04,
1.5939959e-05, 3.3097478e-04, 9.2904102e-06, 1.1782978e-04,
3.3141983e-05, 2.0210361e-04, 4.8371754e-04, 2.3283543e-04,
5.7479672e-05, 3.8166454e-05, 9.9279227e-05, 3.2270618e-05,
2.8716330e-04, 3.2858396e-05, 1.2131617e-05, 3.3482770e-04,
9.0265028e-05, 1.7225980e-04, 4.2123888e-06, 1.9025596e-05,
4.6712950e-07, 3.7851787e-05], dtype=float32)
In 29:
len(predict_one) # the length is 46
Out29:
46
In 30:
np.sum(predict_one) # the probabilities sum to 1
Out30:
1.0000002
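The sum is 1 because the last layer applies softmax, which exponentiates the raw scores and divides by their total; the tiny excess in `1.0000002` is most likely float32 rounding. A dependency-free sketch of softmax on made-up scores (this is an illustration, not the Keras implementation):

```python
import math

def softmax(logits):
    """Exponentiate and normalize; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # made-up scores for three classes
print(probs, sum(probs))
```

Larger scores map to larger probabilities, so the ordering of the classes is preserved.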
How do we find the index of the largest probability? Use np.argmax; that index is the predicted class.
In 31:
np.argmax(predict_one)
Out31:
4
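So the predicted class for the first sample is class 4. Its true label, test_labels[0], is 3, so this particular prediction is wrong.
Predictions for the whole test set: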
In 32:
# Using a list comprehension
# Predicted classes
y_predict = [np.argmax(result) for result in results]
y_predict[:20]
Out32:
[4, 10, 1, 4, 13, 3, 3, 3, 3, 3, 1, 4, 1, 3, 1, 11, 4, 3, 19, 3]
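The list comprehension can also be replaced by a single `np.argmax` call with `axis=1`, which takes the argmax of every row at once. A sketch on made-up probabilities (the `toy_results` array is illustrative):

```python
import numpy as np

# Made-up probabilities for 3 samples over 4 classes.
toy_results = np.array([
    [0.1, 0.2, 0.6, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.2, 0.1, 0.5],
])

loop_pred = [int(np.argmax(row)) for row in toy_results]  # row-by-row, as in the post
vector_pred = np.argmax(toy_results, axis=1)               # one call over all rows
print(loop_pred, vector_pred)
```

Both give the same classes; the vectorized form returns a NumPy array instead of a Python list.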
In 33:
test_labels[:20] # ground truth
Out33:
array([ 3, 10, 1, 4, 4, 3, 3, 3, 3, 3, 5, 4, 1, 3, 1, 11, 23,
3, 19, 3], dtype=int64)
In 34:
from sklearn.metrics import classification_report, confusion_matrix, r2_score, recall_score, accuracy_score
In 35:
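# Accuracy and R2
print("Multiclass accuracy: ", accuracy_score(test_labels, y_predict))
print("Multiclass R2: ", r2_score(test_labels, y_predict))
# print("Classification report: \n", classification_report(y_predict, test_labels))
Multiclass accuracy: 0.780053428317008
Multiclass R2: 0.4157152870789089
The prediction accuracy is about 78%. R2 is a regression metric; because the class IDs are nominal rather than ordered quantities, its value here says little about classification quality.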
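Accuracy itself is just the fraction of positions where prediction and label agree. As a quick check, here it is computed by hand on the first 20 test samples shown in Out32 and Out33 (a sketch on that small slice, so the result differs from the full-test-set figure):

```python
# First 20 predictions (Out32) and true labels (Out33), copied from the post.
y_pred_20 = [4, 10, 1, 4, 13, 3, 3, 3, 3, 3, 1, 4, 1, 3, 1, 11, 4, 3, 19, 3]
y_true_20 = [3, 10, 1, 4, 4, 3, 3, 3, 3, 3, 5, 4, 1, 3, 1, 11, 23, 3, 19, 3]

# True counts as 1 and False as 0, so the sum is the number of matches.
correct = sum(p == t for p, t in zip(y_pred_20, y_true_20))
accuracy = correct / len(y_true_20)
print(correct, accuracy)
```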
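Starting from the predictions and the true labels, we now compute the per-class fraction of correct predictions by hand, without calling any metrics module.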
In 36:
import pandas as pd
df = pd.DataFrame({"y_test": test_labels,
                   "y_predict": y_predict
                   })
df.head()
Out36:
| | y_test | y_predict |
|---|---|---|
| 0 | 3 | 4 |
| 1 | 10 | 10 |
| 2 | 1 | 1 |
| 3 | 4 | 4 |
| 4 | 4 | 13 |
In 37:
df["result"] = (df["y_test"] == df["y_predict"]) # True where prediction equals label, otherwise False
df.head()
Out37:
| | y_test | y_predict | result |
|---|---|---|---|
| 0 | 3 | 4 | False |
| 1 | 10 | 10 | True |
| 2 | 1 | 1 | True |
| 3 | 4 | 4 | True |
| 4 | 4 | 13 | False |
Count the correct predictions for each true label: summing treats True as 1 and False as 0.
In 38:
df1 = df.groupby("y_test")["result"].sum()
df1.head(10)
Out38:
y_test
0 7
1 84
2 12
3 744
4 437
5 0
6 12
7 1
8 23
9 16
Name: result, dtype: int64
In 39:
df1.sort_values(ascending=False).head(10)
Out39:
y_test
3 744
4 437
19 96
1 84
16 75
11 68
20 34
10 25
8 23
13 22
Name: result, dtype: int64
Classes 3, 4, and 19 have the most correct predictions. The number of test samples in each true class:
In 40:
df2 = df["y_test"].value_counts().sort_index()
df2.head(10)
Out40:
0 12
1 105
2 20
3 813
4 474
5 5
6 14
7 3
8 38
9 25
Name: y_test, dtype: int64
In 41:
df1.values
Out41:
array([ 7, 84, 12, 744, 437, 0, 12, 1, 23, 16, 25, 68, 0,
22, 0, 1, 75, 2, 12, 96, 34, 18, 0, 3, 4, 20,
4, 1, 1, 0, 6, 2, 6, 1, 5, 0, 2, 0, 0,
0, 0, 1, 0, 3, 4, 0], dtype=int64)
In 42:
# Combine df1 and df2
df3 = pd.DataFrame({"predict": df1.values,  # number of correct predictions per class
                    "true": df2.values})    # number of test samples per true class
df3.head()
Out42:
| | predict | true |
|---|---|---|
| 0 | 7 | 12 |
| 1 | 84 | 105 |
| 2 | 12 | 20 |
| 3 | 744 | 813 |
| 4 | 437 | 474 |
In 43:
df3["precision"] = df3["predict"] / df3["true"]
df3.head(10)
Out43:
| | predict | true | precision |
|---|---|---|---|
| 0 | 7 | 12 | 0.583333 |
| 1 | 84 | 105 | 0.800000 |
| 2 | 12 | 20 | 0.600000 |
| 3 | 744 | 813 | 0.915129 |
| 4 | 437 | 474 | 0.921941 |
| 5 | 0 | 5 | 0.000000 |
| 6 | 12 | 14 | 0.857143 |
| 7 | 1 | 3 | 0.333333 |
| 8 | 23 | 38 | 0.605263 |
| 9 | 16 | 25 | 0.640000 |
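These values can be compared with the precision column of the classification report, and they match up to rounding. Note, however, that the ratio computed here (correct predictions divided by the number of samples with that true label) is per-class recall. It lines up with the report's precision column only because the call below passes (y_predict, test_labels), the reverse of scikit-learn's (y_true, y_pred) order, which swaps the precision and recall columns and makes support count predicted rather than true samples.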
In 44:
print("Classification report: \n", classification_report(y_predict, test_labels))
# Output (partial)
Classification report:
precision recall f1-score support
0 0.58 0.88 0.70 8
1 0.80 0.60 0.68 141
2 0.60 0.86 0.71 14
3 0.92 0.94 0.93 788
4 0.92 0.75 0.82 586
5 0.00 0.00 0.00 0
6 0.86 0.80 0.83 15
7 0.33 1.00 0.50 1
8 0.61 0.70 0.65 33
9 0.64 0.84 0.73 19
10 0.83 0.81 0.82 31
11 0.82 0.51 0.63 133
12 0.00 0.00 0.00 2
13 0.59 0.61 0.60 36
14 0.00 0.00 0.00 0
15 0.11 0.50 0.18 2
16 0.76 0.72 0.74 104
17 0.17 1.00 0.29 2
18 0.60 0.60 0.60 20
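To keep precision and recall apart, it can help to compute both by hand. The ratio in df3 divides correct predictions by the number of samples with that true label, which is the usual definition of recall; precision divides by the number of samples predicted as that class instead. A sketch on the first 20 test samples from Out32 and Out33 (the `recall` and `precision` helpers are hypothetical names, and they assume the class occurs at least once in the relevant list):

```python
from collections import Counter

# First 20 true labels (Out33) and predictions (Out32), copied from the post.
y_true_20 = [3, 10, 1, 4, 4, 3, 3, 3, 3, 3, 5, 4, 1, 3, 1, 11, 23, 3, 19, 3]
y_pred_20 = [4, 10, 1, 4, 13, 3, 3, 3, 3, 3, 1, 4, 1, 3, 1, 11, 4, 3, 19, 3]

hits = Counter(t for t, p in zip(y_true_20, y_pred_20) if t == p)  # correct per class
true_counts = Counter(y_true_20)   # samples per true class
pred_counts = Counter(y_pred_20)   # samples per predicted class

def recall(c):
    """Correct predictions of class c / samples whose true label is c."""
    return hits[c] / true_counts[c]

def precision(c):
    """Correct predictions of class c / samples predicted as c."""
    return hits[c] / pred_counts[c]

print(recall(3), precision(3))
print(recall(4), precision(4))
```

On this slice, class 4 is predicted four times but is correct only twice, so its precision (0.5) is lower than its recall (2/3).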
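Original-content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
In case of infringement, please contact cloudcommunity@tencent.com for removal.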