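After slacking off for several days, I finally caught up on my information security coursework over the last two days. When the association algorithm stopped working, I decided to tackle the problem with supervised learning instead. I first chose knn for classification, then stumbled on svm while reading further. In testing, svm's accuracy came out about 0.1 percentage points higher than the best knn, so I went with svm in the end. Below is a brief introduction to both supervised learning methods.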
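I. A brief look at the theory
First, a word on supervised learning. The difference between supervised and unsupervised learning is whether the training data carries labels: supervised learning needs labeled data, unsupervised learning does not. In this project we do network intrusion detection using the KDD Cup 99 dataset (1). Each record has 42 dimensions: the first 41 are feature values describing network behavior, and the last one is the type, i.e. the label.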
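knn, i.e. k-NearestNeighbor, is the k nearest neighbors problem. To keep things simple, assume each record has two feature values, x and y, and one label, the color of the point. Plot all the data in a Cartesian plane, as with the red and blue points in Figure 1.1 below. The red and blue points together form the training set, and the green point is the test point. "Nearest" means exactly what it looks like: close by. k is how many of the closest points to pick. With k=3 the chosen points are the three inside the solid circle; with k=5 they are the five inside the dashed circle. The predicted class of the test point is whichever class is most common among the k chosen points. In the figure, k=3 predicts red (2 red, 1 blue), while k=5 predicts blue (3 blue, 2 red). So the choice of k directly affects the accuracy of the prediction.
Figure 1.1 knn example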
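The voting rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `knn_predict` and the coordinates are made up, chosen so that k=3 and k=5 disagree in the same way as in Figure 1.1.

```python
from collections import Counter

import numpy as np


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(train_points - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most common one.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Made-up coordinates (not taken from the figure).
train_points = np.array([
    [1.0, 0.0], [0.0, 1.2],    # two red points close to the query
    [-1.5, 0.0],               # one blue point slightly farther away
    [0.0, -2.0], [2.2, 0.0],   # two more blue points
    [5.0, 5.0], [-5.0, 5.0],   # far-away red points
])
train_labels = np.array(["red", "red", "blue", "blue", "blue", "red", "red"])
query = np.array([0.0, 0.0])

print(knn_predict(train_points, train_labels, query, k=3))  # red  (2 red, 1 blue)
print(knn_predict(train_points, train_labels, query, k=5))  # blue (3 blue, 2 red)
```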
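svm, i.e. Support Vector Machine. The Chinese translation, 支持向量机, is just as literal and sounds baffling at first, so let me break it down. "Machine" means the model is a machine, and since its job is classification you can think of it as a classifying machine; "support vector" comes a little later. Again I'll use two dimensions for simplicity: the samples are training points with two attributes x and y and a color label, as in Figure 1.2 below. What does svm do? It looks for a line that separates the points of the two classes; a test point is then assigned to whichever side of the line it falls on. The catch is that many lines satisfy this requirement, for example the black line and the gray line in the figure. So which one is optimal? This is where support vectors come in: they are the points in each class that lie closest to the dividing line. The optimal line is the one that keeps these closest points as far from it as possible, because the wider the gap between the line and the nearest points of each class, the more room there is for new points to land on the correct side, which makes the classification more robust.
Figure 1.2 svm example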
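To make the idea of support vectors and the gap around the dividing line concrete, here is a small sketch using sklearn's `SVC` with a linear kernel. The six points are made up, and the large `C` value is my own choice to approximate a line with no misclassified training points; for a linear SVM the width of the gap equals 2 / ||w||, where w is the normal vector of the line.

```python
import numpy as np
from sklearn import svm

# Made-up, linearly separable 2-D points: class 0 on the left, class 1 on the right.
points = np.array([[0.0, 0.0], [0.0, 2.0], [1.0, 1.0],
                   [3.0, 1.0], [4.0, 0.0], [4.0, 2.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel with a very large C behaves almost like a hard-margin SVM.
clf = svm.SVC(kernel="linear", C=1e6)
clf.fit(points, labels)

w = clf.coef_[0]                         # normal vector of the dividing line
b = clf.intercept_[0]                    # offset of the dividing line
margin_width = 2.0 / np.linalg.norm(w)   # gap between the two classes

print("support vectors:", clf.support_vectors_)
print("margin width:", margin_width)
print("prediction for (0.5, 1.0):", clf.predict([[0.5, 1.0]])[0])
```

With these points the closest pair across the two classes is (1, 1) and (3, 1), so those should come out as the support vectors, with the dividing line halfway between them.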
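Everything above assumes the data is linearly separable. What about data that is not? For that, svm brings in the kernel. I haven't studied kernels thoroughly, so here is my rough understanding: the kernel bends the plane into a curved surface, in effect mapping the points into a higher-dimensional space where samples that were not linearly separable become separable. A flat plane then cuts through that surface and serves as the dividing surface; finding it is a convex quadratic optimization problem with constraints. To classify a test point, map it the same way and check which side of the plane it lands on.
There is a video (2) that visualizes svm kernels; watching it was a real lightbulb moment for me.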
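The "bend the plane into a surface" idea can be tried out on a toy example. The sketch below is my own construction, not something from the dataset: two rings of points that no straight line can separate, a hand-made third coordinate z = x^2 + y^2 that lifts them to different heights, and sklearn's built-in RBF kernel for comparison.

```python
import numpy as np
from sklearn import svm

# Made-up data: class 0 on a circle of radius 1, class 1 on a circle of radius 3.
angles = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
inner = np.column_stack([np.cos(angles), np.sin(angles)])
outer = 3.0 * inner
points_2d = np.vstack([inner, outer])
labels = np.array([0] * 100 + [1] * 100)

# In the plane, no straight line separates the two rings.
flat = svm.SVC(kernel="linear").fit(points_2d, labels)
flat_accuracy = flat.score(points_2d, labels)

# Add a third coordinate z = x^2 + y^2: the plane becomes a bowl-shaped surface,
# and the two rings now sit at different heights (z = 1 and z = 9).
z = (points_2d ** 2).sum(axis=1, keepdims=True)
points_3d = np.hstack([points_2d, z])
lifted = svm.SVC(kernel="linear").fit(points_3d, labels)
lifted_accuracy = lifted.score(points_3d, labels)

# The RBF kernel gets a similar effect without building the extra column by hand.
rbf = svm.SVC(kernel="rbf").fit(points_2d, labels)
rbf_accuracy = rbf.score(points_2d, labels)

print("linear, 2-D:", flat_accuracy)
print("linear, lifted to 3-D:", lifted_accuracy)
print("rbf, 2-D:", rbf_accuracy)
```

The accuracies here are measured on the training points themselves, which is enough to show separability but says nothing about generalization.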
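Higher-dimensional svm works the same way as the 2-D case: keep raising the dimension, find a hyperplane, and make one cut. I hear there are more advanced techniques too, but my knowledge is too shallow to follow them, so I won't ramble about them.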
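II. Practice
This project mainly uses two libraries, numpy and sklearn. numpy holds the training set, the label set, the test set, and the prediction results. sklearn provides a large number of machine learning models; being lazy, I didn't want to write my own, so I just called them directly.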
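As a rough sketch of how the arrays might be laid out, the snippet below splits a 42-column array into 41 feature columns and one label column, then into training and test rows. The random numbers, the integer labels, and the 4/2 row split are all placeholders of mine; the real KDD Cup 99 records also contain text columns that would need encoding first, which is not shown here.

```python
import numpy as np

# Stand-in for the parsed dataset: 6 records, 41 numeric features plus 1 label column.
# (Random numbers, not real KDD Cup 99 records.)
rng = np.random.default_rng(0)
features = rng.random((6, 41))
label_column = np.array([0, 1, 0, 1, 1, 0]).reshape(-1, 1)
records = np.hstack([features, label_column])

# Split columns: the first 41 are features, the last one is the label.
data = records[:, :41]
label = records[:, 41].astype(int)

# Split rows: first 4 records for training, remaining 2 for testing (arbitrary choice).
trainData, testData = data[:4], data[4:]
trainLabel, testLabel = label[:4], label[4:]

print(trainData.shape, trainLabel.shape, testData.shape, testLabel.shape)
```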
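Let's start with knn. First build a knn; the argument sets the value of k (the default is 5). Then feed in the raw materials, i.e. fit the training data and training labels, and the knn is ready.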
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(trainData, trainLabel)
After that, call predict on the test data to get the predictions.
predict = knn.predict(testData)
You can also get the probability of each class.
predict_proba = knn.predict_proba(testData)
And that's the whole knn process: build the model, load the training set, predict.
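For anyone who wants to run the knn steps end to end without the dataset, here is a self-contained sketch. The six training points and two test points are made up; only the sklearn calls are the same as in the walkthrough.

```python
import numpy as np
from sklearn import neighbors

# Made-up stand-ins for trainData, trainLabel and testData.
trainData = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                      [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
trainLabel = np.array([0, 0, 0, 1, 1, 1])
testData = np.array([[0.5, 0.5], [5.5, 5.5]])

# Build the classifier and fit it on the training data and labels.
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
knn.fit(trainData, trainLabel)

# One predicted label per test row.
predict = knn.predict(testData)
# One row per test point, one column per class, in the order given by knn.classes_.
predict_proba = knn.predict_proba(testData)

print(predict)
print(knn.classes_)
print(predict_proba)
```

For knn, the probability of a class is simply the share of the k neighbors that belong to it (with the default uniform weights).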
Next up is svm, which is even simpler.
from sklearn import svm
clf = svm.SVC()
clf.fit(trainData, trainLabel)
Predict on testData.
predict = clf.predict(testData)
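The svm steps can be run the same way on made-up data. In this sketch the classifier is stored in a variable called `clf` so that the name of the imported `svm` module is not overwritten; the points, the labels, and `testLabel` are placeholders.

```python
import numpy as np
from sklearn import svm

# Made-up stand-ins for trainData, trainLabel, testData and testLabel.
trainData = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                      [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
trainLabel = np.array([0, 0, 0, 1, 1, 1])
testData = np.array([[0.5, 0.5], [5.5, 5.5]])
testLabel = np.array([0, 1])

# Build the classifier with default settings (RBF kernel) and fit it.
clf = svm.SVC()
clf.fit(trainData, trainLabel)

# Predict, then compute the share of correct predictions.
predict = clf.predict(testData)
accuracy = clf.score(testData, testLabel)

print(predict)
print(accuracy)
```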
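III. Easter egg (a.k.a. flooding the screen)
As I said at the start, I originally set out to learn knn and only later stumbled on svm, so I tested that too. I tried every k from 1 to 49 for knn (range(50), heh) and was pleasantly surprised to find that svm predicts slightly more accurately than even the best knn. The test results are as follows: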
1:
knn: Total:494021 Correct: 433614 Error: 60407 Accuracy: 0.8777238214569826
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
2:
knn: Total:494021 Correct: 431819 Error: 62202 Accuracy: 0.8740903726764652
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
3:
knn: Total:494021 Correct: 433837 Error: 60184 Accuracy: 0.8781752192720552
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
4:
knn: Total:494021 Correct: 433838 Error: 60183 Accuracy: 0.878177243477504
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
5:
knn: Total:494021 Correct: 434368 Error: 59653 Accuracy: 0.8792500723653448
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
6:
knn: Total:494021 Correct: 436099 Error: 57922 Accuracy: 0.8827539719971418
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
7:
knn: Total:494021 Correct: 436564 Error: 57457 Accuracy: 0.8836952275308134
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
8:
knn: Total:494021 Correct: 436603 Error: 57418 Accuracy: 0.883774171543315
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
9:
knn: Total:494021 Correct: 436921 Error: 57100 Accuracy: 0.8844178688760195
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
10:
knn: Total:494021 Correct: 436910 Error: 57111 Accuracy: 0.8843956026160831
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
11:
knn: Total:494021 Correct: 437011 Error: 57010 Accuracy: 0.8846000473664075
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
12:
knn: Total:494021 Correct: 436955 Error: 57066 Accuracy: 0.8844866918612772
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
13:
knn: Total:494021 Correct: 436978 Error: 57043 Accuracy: 0.8845332485865985
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
14:
knn: Total:494021 Correct: 437036 Error: 56985 Accuracy: 0.8846506525026264
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
15:
knn: Total:494021 Correct: 437080 Error: 56941 Accuracy: 0.8847397175423717
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
16:
knn: Total:494021 Correct: 436660 Error: 57361 Accuracy: 0.883889551253894
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
17:
knn: Total:494021 Correct: 436829 Error: 57192 Accuracy: 0.8842316419747339
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
18:
knn: Total:494021 Correct: 436522 Error: 57499 Accuracy: 0.8836102109019657
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
19:
knn: Total:494021 Correct: 436659 Error: 57362 Accuracy: 0.8838875270484453
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
20:
knn: Total:494021 Correct: 436492 Error: 57529 Accuracy: 0.883549484738503
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
21:
knn: Total:494021 Correct: 436868 Error: 57153 Accuracy: 0.8843105859872353
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
22:
knn: Total:494021 Correct: 436497 Error: 57524 Accuracy: 0.8835596057657468
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
23:
knn: Total:494021 Correct: 436452 Error: 57569 Accuracy: 0.8834685165205528
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
24:
knn: Total:494021 Correct: 436264 Error: 57757 Accuracy: 0.8830879658961865
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
25:
knn: Total:494021 Correct: 436323 Error: 57698 Accuracy: 0.8832073940176632
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
26:
knn: Total:494021 Correct: 436219 Error: 57802 Accuracy: 0.8829968766509926
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
27:
knn: Total:494021 Correct: 436259 Error: 57762 Accuracy: 0.8830778448689428
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
28:
knn: Total:494021 Correct: 436098 Error: 57923 Accuracy: 0.8827519477916931
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
29:
knn: Total:494021 Correct: 436127 Error: 57894 Accuracy: 0.882810649749707
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
30:
knn: Total:494021 Correct: 435959 Error: 58062 Accuracy: 0.8824705832343159
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
31:
knn: Total:494021 Correct: 436223 Error: 57798 Accuracy: 0.8830049734727876
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
32:
knn: Total:494021 Correct: 436117 Error: 57904 Accuracy: 0.8827904076952194
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
33:
knn: Total:494021 Correct: 436320 Error: 57701 Accuracy: 0.8832013214013169
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
34:
knn: Total:494021 Correct: 436410 Error: 57611 Accuracy: 0.8833834998917051
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
35:
knn: Total:494021 Correct: 436500 Error: 57521 Accuracy: 0.8835656783820931
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
36:
knn: Total:494021 Correct: 436557 Error: 57464 Accuracy: 0.8836810580926722
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
37:
knn: Total:494021 Correct: 436564 Error: 57457 Accuracy: 0.8836952275308134
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
38:
knn: Total:494021 Correct: 435528 Error: 58493 Accuracy: 0.881598150685902
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
39:
knn: Total:494021 Correct: 435561 Error: 58460 Accuracy: 0.881664949465711
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
40:
knn: Total:494021 Correct: 154708 Error: 339313 Accuracy: 0.31316077656617836
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
41:
knn: Total:494021 Correct: 154468 Error: 339553 Accuracy: 0.3126749672584769
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
42:
knn: Total:494021 Correct: 154426 Error: 339595 Accuracy: 0.31258995062962913
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
43:
knn: Total:494021 Correct: 150380 Error: 343641 Accuracy: 0.3044000153839614
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
44:
knn: Total:494021 Correct: 132313 Error: 361708 Accuracy: 0.26782869554128264
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
45:
knn: Total:494021 Correct: 124802 Error: 369219 Accuracy: 0.25262488841567465
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
46:
knn: Total:494021 Correct: 124646 Error: 369375 Accuracy: 0.25230911236566866
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
47:
knn: Total:494021 Correct: 116557 Error: 377464 Accuracy: 0.23593531449067956
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
48:
knn: Total:494021 Correct: 102720 Error: 391301 Accuracy: 0.20792638369623964
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
49:
knn: Total:494021 Correct: 102005 Error: 392016 Accuracy: 0.20647907680037894
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657
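So knn falls and svm wins: when two meet on a narrow road, the more accurate one prevails.
A note:
This is a beginner's study log. Please point out anything that is wrong, and go easy on me if it's not to your taste.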
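For reference, a sweep like the one behind the log above could look roughly like this. It is a sketch, not the script I actually ran: the data is randomly generated, `report` is a hypothetical helper that prints in the same layout as the log, and the loop starts at 1 because 0 is not a valid number of neighbors.

```python
import numpy as np
from sklearn import neighbors, svm


def report(name, predicted, expected):
    """Print totals in the same layout as the log and return the accuracy."""
    total = len(expected)
    correct = int(np.sum(predicted == expected))
    accuracy = correct / total
    print(f"{name}: Total:{total} Correct: {correct} "
          f"Error: {total - correct} Accuracy: {accuracy}")
    return accuracy


# Randomly generated stand-in data: two 2-D clusters (not KDD Cup 99 records).
rng = np.random.default_rng(0)
trainData = np.vstack([rng.normal(0.0, 1.0, (60, 2)), rng.normal(3.0, 1.0, (60, 2))])
trainLabel = np.array([0] * 60 + [1] * 60)
testData = np.vstack([rng.normal(0.0, 1.0, (30, 2)), rng.normal(3.0, 1.0, (30, 2))])
testLabel = np.array([0] * 30 + [1] * 30)

# The svm result does not depend on k, so fit and predict only once.
clf = svm.SVC()
clf.fit(trainData, trainLabel)
svm_predict = clf.predict(testData)

knn_accuracy = {}
for k in range(1, 50):  # k = 1 .. 49
    knn = neighbors.KNeighborsClassifier(n_neighbors=k)
    knn.fit(trainData, trainLabel)
    print(f"{k}:")
    knn_accuracy[k] = report("knn", knn.predict(testData), testLabel)
    report("svm", svm_predict, testLabel)

# The k with the highest knn accuracy on this stand-in data.
best_k = max(knn_accuracy, key=knn_accuracy.get)
print("best k:", best_k)
```

Fitting the svm once outside the loop is why the svm line is identical in every round of the log.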
(1) KDD Cup 99 dataset: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
(2) kernel video link:
https://v.qq.com/x/page/k05170ntgzc.html