Python Machine Learning API Introduction 25: Advanced Series, Linear Regression SVR

Source: 企鹅号 (Tencent Penguin account) - 中原说教育

LinearSVR implements linear support vector regression. It is built on liblinear, and its function prototype is:

sklearn.svm.LinearSVR(epsilon=0.0, loss='epsilon_insensitive', dual=True, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1.0, verbose=0, random_state=None, max_iter=1000)

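Parameters:

C: a float, the penalty (regularization) parameter.

loss: a string that selects the loss function. With 'epsilon_insensitive' the loss is the epsilon-insensitive loss L (the standard SVR loss); with 'squared_epsilon_insensitive' it is the square of L.

epsilon: a float, the epsilon parameter used in the loss function.

dual: a boolean. If True, the dual problem is solved; if False, the primal problem is solved. When n_samples > n_features, False is usually preferred.

tol: a float, the tolerance used as the stopping criterion for the iterations.

fit_intercept: a boolean. If True, the intercept (the constant term in the decision function) is computed; otherwise the intercept is ignored.

intercept_scaling: a float. When it is provided, each instance x becomes the vector [x, intercept_scaling], which amounts to appending an artificial feature that has the same constant value for every instance. The intercept then equals intercept_scaling * w_s, where w_s is the weight of the artificial feature. The artificial feature also takes part in the penalty term.

verbose: an integer that controls whether verbose output is enabled.

random_state: an integer, a RandomState instance, or None. An integer is used as the seed of the random number generator; a RandomState instance is used as the generator itself; None means the default random number generator is used.

max_iter: an integer, the maximum number of iterations.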
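To make the two loss options concrete, here is a minimal NumPy sketch of the epsilon-insensitive loss and its squared variant. The function names and the sample values are illustrative and are not part of scikit-learn or the original article.

```python
import numpy as np


def epsilon_insensitive(y_true, y_pred, epsilon=0.0):
    """Standard SVR loss: residuals smaller than epsilon cost nothing."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)


def squared_epsilon_insensitive(y_true, y_pred, epsilon=0.0):
    """Squared variant: the epsilon-insensitive loss, squared."""
    return epsilon_insensitive(y_true, y_pred, epsilon) ** 2


# Illustrative values: absolute residuals are 0.2, 0.0 and 2.0
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.2, 2.0, 5.0])

l1 = epsilon_insensitive(y_true, y_pred, epsilon=0.5)
l2 = squared_epsilon_insensitive(y_true, y_pred, epsilon=0.5)
print(l1)  # only the third sample lies outside the tube
print(l2)
```

With epsilon=0.5, the first two samples fall inside the tube and contribute zero loss; only the third sample is penalized, and the squared variant penalizes it more heavily.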

Attributes:

coef_: an array holding the weight of each feature.

intercept_: an array holding the intercept, i.e. the constant term in the decision function.
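As a quick illustration of what these two attributes mean, the sketch below rebuilds the predictions of a fitted model by hand from coef_ and intercept_. The choice of dataset, loss, and max_iter is an assumption made for this example only.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.svm import LinearSVR

X, y = load_diabetes(return_X_y=True)

# dual=True is passed explicitly so the example does not depend on the
# version-specific default of this parameter.
model = LinearSVR(loss="squared_epsilon_insensitive", dual=True,
                  random_state=0, max_iter=10000)
model.fit(X, y)

# The decision function of a linear model is w . x + b
manual = X @ model.coef_ + model.intercept_

print("coef_ shape:", model.coef_.shape)
print("intercept_ shape:", model.intercept_.shape)
print("manual prediction matches predict():",
      np.allclose(manual, model.predict(X)))
```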

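Methods:

fit(X, y): trains the model.

predict(X): predicts with the trained model and returns the predicted values.

score(X, y[, sample_weight]): returns the score of the predictions on (X, y). For a regressor such as LinearSVR this is the coefficient of determination R^2 rather than a classification accuracy. The score never exceeds 1 but can be negative when the predictions are very poor; the closer it is to 1, the better the predictions.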
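The following sketch shows why the score can be negative: it recomputes R^2 by hand and compares it with score() and sklearn.metrics.r2_score. The split parameters mirror the ones used later in the article; everything else is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVR

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LinearSVR(loss="squared_epsilon_insensitive", dual=True,
                  random_state=0, max_iter=10000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("manual R^2 :", r2_manual)
print("score()    :", model.score(X_test, y_test))
print("r2_score() :", r2_score(y_test, y_pred))
```

A model that always predicts the mean of y_test gets R^2 = 0, so any model whose squared error is larger than that baseline ends up with a negative score.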

Code example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.datasets import load_diabetes, load_iris
# sklearn.cross_validation was removed from scikit-learn;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split


# Loader for the diabetes dataset.
# The copy bundled with scikit-learn has 442 samples with 10 features each.
# Every feature is a float between -0.2 and 0.2; the targets range from 25 to 346.
def load_data_diabetes():
    diabetes = load_diabetes()
    return train_test_split(diabetes.data, diabetes.target,
                            test_size=0.3, random_state=0)


# Loader for the iris dataset bundled with scikit-learn (stratified split)
def load_data_iris():
    iris = load_iris()
    x_train = iris.data
    y_train = iris.target
    return train_test_split(x_train, y_train,
                            test_size=0.3, random_state=0, stratify=y_train)

def test_linear_SVR(*data):
    # Fit LinearSVR with default parameters and report the test score
    x_train, x_test, y_train, y_test = data
    regr = svm.LinearSVR()
    regr.fit(x_train, y_train)
    print("SVR coefficients: %s, intercept: %s" % (regr.coef_, regr.intercept_))
    print("SVR test score: {:.3f}".format(regr.score(x_test, y_test)))


x_train, x_test, y_train, y_test = load_data_diabetes()
test_linear_SVR(x_train, x_test, y_train, y_test)

def test_linearSVR_loss(*data):
    # Compare the two loss functions supported by LinearSVR
    x_train, x_test, y_train, y_test = data
    losses = ['epsilon_insensitive', 'squared_epsilon_insensitive']
    for loss in losses:
        regr = svm.LinearSVR(loss=loss)
        regr.fit(x_train, y_train)
        print("SVR loss: %s" % loss)
        print("SVR coefficients: %s, intercept: %s" % (regr.coef_, regr.intercept_))
        print("SVR test score: {:.3f}".format(regr.score(x_test, y_test)))


x_train, x_test, y_train, y_test = load_data_diabetes()
test_linearSVR_loss(x_train, x_test, y_train, y_test)

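Running the code gives the following results:

Figure: prediction results of linear regression SVR on the diabetes dataset

The output shows that, with default settings, linear support vector regression predicts the diabetes dataset very poorly: the test score is negative. Once the loss function is taken into account (the squared_epsilon_insensitive option), test performance improves and reaches 0.397.

Next, let's look at how the value of epsilon affects predictive performance: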

def test_LinearSVR_epsilon(*data):
    # Scan epsilon on a log scale and record train/test scores
    x_train, x_test, y_train, y_test = data
    epsilons = np.logspace(-2, 2)
    train_scores = []
    test_scores = []
    for epsilon in epsilons:
        regr = svm.LinearSVR(epsilon=epsilon, loss='squared_epsilon_insensitive')
        regr.fit(x_train, y_train)
        train_scores.append(regr.score(x_train, y_train))
        test_scores.append(regr.score(x_test, y_test))

    # Plot
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(epsilons, train_scores, label="Training scores", marker='o')
    ax.plot(epsilons, test_scores, label="Test scores", marker='+')
    ax.set_xlabel("epsilon")
    ax.set_xscale("log")
    ax.set_ylabel("score value")
    ax.set_ylim(0, 1.05)
    ax.legend(loc="best", framealpha=0.5)
    ax.set_title("LinearSVR_epsilon")
    plt.show()


x_train, x_test, y_train, y_test = load_data_diabetes()
test_LinearSVR_epsilon(x_train, x_test, y_train, y_test)

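Running the code above gives the following result:

Figure: effect of the epsilon parameter on the predictive performance of linear regression SVR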
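One way to interpret the role of epsilon is to count how many training samples fall inside the epsilon tube, where they contribute no loss at all. The sketch below is not part of the original article; the three epsilon values and the max_iter setting are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVR

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

epsilons = [0.01, 1.0, 100.0]  # illustrative values, chosen for this sketch
inside_fractions = []
for eps in epsilons:
    model = LinearSVR(epsilon=eps, loss="squared_epsilon_insensitive",
                      dual=True, random_state=0, max_iter=10000)
    model.fit(X_train, y_train)
    # Samples whose absolute residual is at most epsilon incur zero loss
    residuals = np.abs(y_train - model.predict(X_train))
    inside = float(np.mean(residuals <= eps))
    inside_fractions.append(inside)
    print("epsilon={}: {:.1%} of training samples inside the tube, "
          "test score {:.3f}".format(eps, inside, model.score(X_test, y_test)))
```

When epsilon is large relative to the target range (25 to 346 here), many samples sit inside the tube and stop influencing the fit, which is one plausible reason for scores to degrade at large epsilon.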

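The results show that the predictive performance of linear regression SVR tends to decline as epsilon increases. Next, let's look at how the penalty coefficient C affects test performance. The corresponding function is: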

def test_LinearSVR_C(*data):
    # Scan the penalty coefficient C on a log scale and record train/test scores
    x_train, x_test, y_train, y_test = data
    Cs = np.logspace(-1, 2)
    train_scores = []
    test_scores = []
    for C in Cs:
        regr = svm.LinearSVR(epsilon=0.1, loss='squared_epsilon_insensitive', C=C)
        regr.fit(x_train, y_train)
        train_scores.append(regr.score(x_train, y_train))
        test_scores.append(regr.score(x_test, y_test))

    # Plot
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(Cs, train_scores, label="Training scores", marker='o')
    ax.plot(Cs, test_scores, label="Test scores", marker='+')
    ax.set_xlabel("C")
    ax.set_ylabel("score value")
    ax.set_ylim(0, 1.05)
    ax.set_xscale("log")
    ax.legend(loc="best", framealpha=0.5)
    ax.set_title("LinearSVR_C")
    plt.show()


x_train, x_test, y_train, y_test = load_data_diabetes()
test_LinearSVR_C(x_train, x_test, y_train, y_test)

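The result is as follows:

Figure: effect of the value of C on the predictive performance of linear regression SVR

The results show that predictive performance rises as C increases. C weighs the importance of the samples whose error exceeds epsilon (the regression counterpart of misclassified points): the larger C is, the more those samples matter.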
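A complementary view of C is its effect on the learned weights: a small C lets the penalty term dominate and shrinks the coefficients, while a large C lets the data-fitting term dominate. The sketch below is an illustration under assumed settings (the four C values and max_iter are picked for this example) and is not taken from the original article.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVR

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

Cs = [0.1, 1.0, 10.0, 100.0]  # illustrative values, chosen for this sketch
coef_norms = []
for C in Cs:
    model = LinearSVR(C=C, epsilon=0.1, loss="squared_epsilon_insensitive",
                      dual=True, random_state=0, max_iter=10000)
    model.fit(X_train, y_train)
    # The L2 norm of the weights shows how strongly the penalty shrinks them
    norm = float(np.linalg.norm(model.coef_))
    coef_norms.append(norm)
    print("C={}: ||coef_|| = {:.2f}, test score {:.3f}".format(
        C, norm, model.score(X_test, y_test)))
```

Because the diabetes features are scaled to very small values while the targets are large, the model needs sizeable weights to fit the data, which a small C discourages.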

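Published: 2020-01-14 03:59:52
Original link: https://kuaibao.qq.com/s/20200114A0AXII00?refer=cp_1026
The Tencent Cloud Developer Community (腾讯云开发者社区) is one of the distribution channels for Tencent Content Open Platform accounts (企鹅号). This content is republished under the Tencent Content Open Platform Service Agreement.
In case of infringement, contact cloudcommunity@tencent.com for removal.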
