The previous sections described the GBDT algorithm. As we saw there, GBDT is an ensemble model that uses regression trees as its base learners and can be applied to both classification and regression. Because GBDT involves a great deal of CART decision-tree machinery, we will not implement the algorithm flow from scratch; instead, this section applies GBDT to concrete data directly using Python's scikit-learn package.
The data is the red-wine dataset used in the earlier decision-tree section. As before, the quality labels are treated as categories (continuous values were removed and small classes were merged), so the task is framed as a multi-class classification problem.
As usual, we start by loading the data and running the same checks and preprocessing as before (not repeated here). The resulting data looks as follows:
import pandas as pd
import seaborn as sns

wine_df = pd.read_csv('./winequality-red.csv', delimiter=';', encoding='utf-8')
columns_name = list(wine_df.columns)
for name in columns_name:
    # Cap values outside 1.5 * IQR at the whisker limits
    q1, q2, q3 = wine_df[name].quantile([0.25, 0.5, 0.75])
    IQR = q3 - q1
    lower_cap = q1 - 1.5 * IQR
    upper_cap = q3 + 1.5 * IQR
    wine_df[name] = wine_df[name].apply(lambda x: upper_cap if x > upper_cap else (lower_cap if x < lower_cap else x))
sns.countplot(x='quality', data=wine_df)
wine_df.describe()
Next, import the packages needed for GBDT:
# This is a classification task, so we use GradientBoostingClassifier; for regression use GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import mean_squared_error
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
import matplotlib.pyplot as plt
import numpy as np
Then, as usual, split the data into a training set and a test set:
trainX, testX, trainY, testY = train_test_split(wine_df.drop(columns=['quality']), wine_df['quality'], test_size=0.3, random_state=22)
Next, build the model:
model = GradientBoostingClassifier()
The model exposes many parameters for tuning; the main ones are introduced below.
First come the boosting-framework parameters, which are the same for GradientBoostingRegressor and GradientBoostingClassifier. The most important are: n_estimators, the number of boosting iterations (trees); learning_rate, the shrinkage applied to each tree's contribution (it trades off against n_estimators); and subsample, the fraction of samples drawn to fit each tree (values below 1 give stochastic gradient boosting).
Then come the parameters of the weak learner, a CART regression tree. These were covered in the decision-tree implementation, so only the most important ones are recapped here: max_depth, the maximum depth of each tree; min_samples_split, the minimum number of samples required to split an internal node; min_samples_leaf, the minimum number of samples required at a leaf node; and max_features, the number of features considered when searching for the best split.
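As a quick illustration of how these knobs appear in code, here is a sketch that spells out the framework and tree parameters in one constructor call. The values shown are the library defaults (except n_estimators, stated explicitly), not values tuned for the wine data, and the dataset is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic multi-class data standing in for the wine dataset
X, y = make_classification(n_samples=300, n_features=11, n_informative=6,
                           n_classes=3, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=100,     # framework: number of boosting iterations (trees)
    learning_rate=0.1,    # framework: shrinkage applied to each tree
    subsample=1.0,        # framework: fraction of rows sampled per tree
    max_depth=3,          # tree: maximum depth of each CART regressor
    min_samples_split=2,  # tree: min samples needed to split a node
    min_samples_leaf=1,   # tree: min samples required at a leaf
    max_features=None,    # tree: features considered per split (None = all)
)
model.fit(X, y)
print(model.score(X, y))
```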
Those are the main model parameters. We first train on the samples with all defaults:
model.fit(trainX, trainY)
print("Score on the training set: %s" % model.score(trainX, trainY))
pred_prob = model.predict_proba(trainX)
print('AUC:', metrics.roc_auc_score(np.array(trainY), pred_prob, multi_class='ovo'))
Score on the training set: 0.8817106460418562
AUC: 0.9757763363472337
The training-set AUC looks decent, but the accuracy score is not high, so we try tuning the training parameters, starting with the number of iterations and the learning rate jointly:
param_test1 = {'n_estimators': range(10, 501, 10), 'learning_rate': np.linspace(0.1, 1, 10)}
gsearch = GridSearchCV(estimator=GradientBoostingClassifier(learning_rate=1,
min_samples_split=2,
min_samples_leaf=1,
max_depth=3,
max_features=None,
subsample=0.8,
), param_grid=param_test1, cv=5)
gsearch.fit(trainX, trainY)
means = gsearch.cv_results_['mean_test_score']
params = gsearch.cv_results_['params']
for i in range(len(means)):
print(params[i], means[i])
print(gsearch.best_params_)
print(gsearch.best_score_)
# {'learning_rate': 0.2, 'n_estimators': 100}
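One convenience worth noting: after the search finishes, `gsearch.best_estimator_` already holds a model refit with the winning combination, so it does not have to be re-created by hand. A minimal sketch on synthetic data (the tiny grid here is only for illustration; the text uses a much finer one):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_classes=3, n_informative=5, random_state=0)

# Small illustrative grid over the two framework parameters
param_grid = {'n_estimators': [50, 100], 'learning_rate': [0.1, 0.2]}
gs = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid, cv=3)
gs.fit(X, y)

best = gs.best_estimator_  # already refit on the full training data
print(gs.best_params_, gs.best_score_)
```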
The best values found are n_estimators=100 and learning_rate=0.2. Fix them, put them back into the model, and evaluate again:
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.2,subsample=0.8)
model.fit(trainX, trainY)
print("Score on the training set: %s" % model.score(trainX, trainY))
pred_prob = model.predict_proba(trainX)
print('AUC:', metrics.roc_auc_score(np.array(trainY), pred_prob, multi_class='ovo'))
Score on the training set: 0.9663330300272975
AUC: 0.9977791940084874
The fit is now quite good. Continuing with the tuning, we move on to the weak-learner parameters max_depth and min_samples_split:
param_test1 = {'max_depth': range(1, 6, 1), 'min_samples_split': range(1, 101, 10)}
gsearch2 = GridSearchCV(estimator=GradientBoostingClassifier(n_estimators=100,
learning_rate=0.2,
max_features=None,
min_samples_leaf=1,
subsample=0.8,
), param_grid=param_test1, cv=5)
gsearch2.fit(trainX, trainY)
means = gsearch2.cv_results_['mean_test_score']
params = gsearch2.cv_results_['params']
for i in range(len(means)):
print(params[i], means[i])
print(gsearch2.best_params_)
print(gsearch2.best_score_)
The best maximum tree depth is 5. Because min_samples_split interacts with the minimum number of samples per leaf, it cannot be fixed yet; tune it jointly with min_samples_leaf:
param_test1 = {'min_samples_leaf': range(1, 101, 10), 'min_samples_split': range(1, 101, 10)}
gsearch3 = GridSearchCV(estimator=GradientBoostingClassifier(n_estimators=100,
learning_rate=0.2,
max_features=None,
max_depth=5,
subsample=0.8,
), param_grid=param_test1, cv=5)
gsearch3.fit(trainX, trainY)
means = gsearch3.cv_results_['mean_test_score']
params = gsearch3.cv_results_['params']
for i in range(len(means)):
print(params[i], means[i])
print(gsearch3.best_params_)
print(gsearch3.best_score_)
# {'min_samples_leaf': 21, 'min_samples_split': 41}
The search yields min_samples_split = 41 and min_samples_leaf = 21. Put these parameters back into the model:
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.2, max_depth=5, min_samples_leaf=21, min_samples_split=41, subsample=0.8)
model.fit(trainX, trainY)
print("Score on the training set: %s" % model.score(trainX, trainY))
pred_prob = model.predict_proba(trainX)
print('AUC:', metrics.roc_auc_score(np.array(trainY), pred_prob, multi_class='ovo'))
Score on the training set: 1.0
AUC: 1.0
The training set is now fit perfectly. To validate the model, however, we need to split off part of the held-out data as a validation set:
validX, tX, validY, tY = train_test_split(testX, testY, test_size=0.2)
Then evaluate the model on the validation set:
print("Score on the validation set: %s" % metrics.accuracy_score(validY, model.predict(validX)))
pred_prob = model.predict_proba(validX)
print('AUC test:', metrics.roc_auc_score(np.array(validY), pred_prob, multi_class='ovo'))
Score on the validation set: 0.726790450928382
AUC test: 0.8413890948027345
The model performs much worse on the validation set, so there is some overfitting. We keep tuning: next, adjust max_features to improve the model's generalization:
param_test1 = {'max_features': range(3, 12, 1)}
gsearch4 = GridSearchCV(estimator=GradientBoostingClassifier(n_estimators=100,
learning_rate=0.2,
min_samples_leaf=21,
min_samples_split=41,
max_depth=5,
subsample=0.8,
), param_grid=param_test1, cv=5)
gsearch4.fit(trainX, trainY)
means = gsearch4.cv_results_['mean_test_score']
params = gsearch4.cv_results_['params']
for i in range(len(means)):
print(params[i], means[i])
print(gsearch4.best_params_)
print(gsearch4.best_score_)
# {'max_features': 5}
Next, tune subsample further:
param_test1 = {'subsample': np.linspace(0.1, 1, 10)}
gsearch5 = GridSearchCV(estimator=GradientBoostingClassifier(n_estimators=100,
learning_rate=0.2,
min_samples_leaf=21,
min_samples_split=41,
max_depth=5,
max_features=5
), param_grid=param_test1, cv=5)
gsearch5.fit(trainX, trainY)
means = gsearch5.cv_results_['mean_test_score']
params = gsearch5.cv_results_['params']
for i in range(len(means)):
print(params[i], means[i])
print(gsearch5.best_params_)
print(gsearch5.best_score_)
# {'subsample': 0.7}
At this point the main parameters have all been tuned; put them back into the model:
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.2, max_depth=5, min_samples_leaf=21, min_samples_split=41, max_features=5, subsample=0.7)
model.fit(trainX, trainY)
print("Score on the training set: %s" % model.score(trainX, trainY))
pred_prob = model.predict_proba(trainX)
print('AUC:', metrics.roc_auc_score(np.array(trainY), pred_prob, multi_class='ovo'))
Score on the training set: 0.9990900818926297
AUC: 0.9999992641648271
The training score drops slightly: improving the model's generalization increases its bias. Now validate the model on the validation set:
print("Score on the validation set: %s" % metrics.accuracy_score(validY, model.predict(validX)))
pred_prob = model.predict_proba(validX)
print('AUC test:', metrics.roc_auc_score(np.array(validY), pred_prob, multi_class='ovo'))
Score on the validation set: 0.7161803713527851
AUC test: 0.8429467644071055
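Before changing the iteration budget, it helps to see where overfitting actually sets in. `staged_predict` yields the model's predictions after every boosting stage, so validation accuracy can be tracked per iteration. A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_classes=3, n_informative=6, random_state=0)
Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingClassifier(n_estimators=150, learning_rate=0.1, random_state=0)
model.fit(Xtr, ytr)

# Validation accuracy after each boosting stage (one entry per tree)
valid_acc = [accuracy_score(yva, pred) for pred in model.staged_predict(Xva)]
best_stage = int(np.argmax(valid_acc)) + 1
print('best number of trees:', best_stage, 'validation accuracy:', max(valid_acc))
```

The curve typically rises, plateaus, and may then decline; the plateau point suggests how many trees are actually worth keeping.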
Finally, double the number of iterations and halve the learning rate:
model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=5, min_samples_leaf=21, min_samples_split=41, max_features=5, subsample=0.7)
model.fit(trainX, trainY)
print("Score on the training set: %s" % model.score(trainX, trainY))
pred_prob = model.predict_proba(trainX)
print('AUC:', metrics.roc_auc_score(np.array(trainY), pred_prob, multi_class='ovo'))
# validX, tX, validY, tY = train_test_split(testX, testY, test_size=0.2)
print("Score on the validation set: %s" % metrics.accuracy_score(validY, model.predict(validX)))
pred_prob = model.predict_proba(validX)
print('AUC test:', metrics.roc_auc_score(np.array(validY), pred_prob, multi_class='ovo'))
Score on the training set: 0.9990900818926297 AUC: 1.0
Score on the validation set: 0.7427055702917772 AUC test: 0.851199242237048
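As a final sanity check on what the tuned model learned, `feature_importances_` summarizes how much each feature contributes to the splits across all trees. A sketch on synthetic data (the placeholder names stand in for the wine columns):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=11, n_informative=5, random_state=0)
feature_names = ['feature_%d' % i for i in range(X.shape[1])]  # placeholders for the wine columns

model = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances are normalized to sum to 1; sort to see the most influential first
order = np.argsort(model.feature_importances_)[::-1]
for idx in order[:5]:
    print(feature_names[idx], round(model.feature_importances_[idx], 3))
```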
This article is reposted. If there is any infringement, please contact cloudcommunity@tencent.com for removal.