使用Scikit-Learn的HalvingGridSearchCV进行更快的超参数调优

deephub

发布于 2021-07-01 10:58:05

7670

发布于 2021-07-01 10:58:05

文章被收录于专栏：DeepHub IMBA

如果你是Scikit-Learn的粉丝，那么0.24.0版本你一定会喜欢。里面新特性包括model_selection模块中的两个实验性超参数优化器类:HalvingGridSearchCV和HalvingRandomSearchCV。

和它们的近亲GridSearchCV和RandomizedSearchCV一样，它们使用交叉验证来寻找最佳超参数。然而，他们的连续二分搜索策略并不是独立搜索超参数集候选项，而是“开始用少量资源评估所有候选项，并使用越来越多的资源迭代地选择最佳候选项。”默认资源是样本的数量，但用户可以将其设置为任何正整数模型参数，如梯度增强轮。因此，减半方法具有在更短的时间内找到好的超参数的潜力。

我通读了Scikit-Learn的“Comparison between grid search and successive halving”示例并进行了测试，但是由于总共花费了11秒的时间，因此我仍然不清楚使用减半与穷举方法对实际操作的影响。因此，我决定建立一个实验来回答以下问题：

HalvingGridSearchCV与GridSearchCV相比要快多少？
HalvingGridSearchCV是否仍选择与GridSearchCV相同的超参数集？

我将运行并比较3个搜索：

GridSearchCV
使用默认的“ n_samples”资源进行HalvingGridSearchCV
使用CatBoost的“ n_estimators”作为资源的HalvingGridSearchCV

升级Scikit-Learn

第一步是将Scikit的版本升级到0.24.0，并确保可以导入正确的版本。

 # !! pip install scikit-learn --upgrade
 import sklearn
 print(sklearn.__version__)
 0.24.0

加载数据集

我使用Kaggle的爱荷华州艾姆斯房价数据集进行了测试。它具有1,460个观测值和79个特征。因变量是房屋的SalePrice。

 import numpy as np  
 import pandas as pd  
 
 DEP_VAR = 'SalePrice'
 train_df = pd.read_csv('../kaggle/input/house-prices-advanced-regression-techniques/train.csv')\
            .set_index('Id')
             
 y_train = train_df.pop(DEP_VAR)

创建处理管道和模型

我还编写了一个名为pipeline_ames.py的脚本。它实例化包含某些功能转换和CatBoostRegressor的管道。我在下面绘制了它的视觉表示。

 from sklearn import set_config        
 from sklearn.utils import estimator_html_repr   
 from IPython.core.display import display, HTML     
 
 from pipeline_ames import pipe
 set_config(display='diagram')
 display(HTML(estimator_html_repr(pipe)))

实验测试

grid_search_paramsdictionary包含在3个搜索中使用的控制参数。我对param_grid进行了3倍交叉验证，该验证包含4个CatBoost超参数，每个参数具有3个值。结果以均方根对数误差（RMSLE）进行测量。

 from sklearn.metrics import mean_squared_log_error, make_scorer
 
 np.random.seed(123) # set a global seed
 pd.set_option("display.precision", 4)
 
 rmsle = lambda y_true, y_pred:\     
     np.sqrt(mean_squared_log_error(y_true, y_pred))
 scorer = make_scorer(rmsle, greater_is_better=False)
 
 param_grid = {"model__max_depth": [5, 6, 7],
               'model__learning_rate': [.01, 0.03, .06],
               'model__subsample': [.7, .8, .9],
               'model__colsample_bylevel': [.8, .9, 1]}
 
 grid_search_params = dict(estimator=pipe,
                           param_grid=param_grid,
                           scoring=scorer,
                           cv=3,
                           n_jobs=-1,
                           verbose=2)

GridSearchCV

基线详尽的网格搜索花费了将近33分钟才能对我们的81位候选人进行3倍交叉验证。我们将看看HalvingGridSearchCV进程是否可以在更短的时间内找到相同的超参数。

 %%time
 from sklearn.model_selection import GridSearchCV
 
 full_results = GridSearchCV(**grid_search_params)\
                .fit(train_df, y_train)
 
 pd.DataFrame(full_results.best_params_, index=[0])\
     .assign(RMSLE=-full_results.best_score_)
 
 Fitting 3 folds for each of 81 candidates, totalling 243 fits
 Wall time: 32min 53s

使用n_samples的HalvingGridSearchCV

在第一个减半网格搜索中，我对资源使用了默认的“ n_samples”，并将min_resources设置为使用总资源的1/4，即365个样本。我没有使用默认的min_resources计算22个样本，因为它产生了可怕的结果。

对于两个减半的搜索，我使用Factor=2。此参数确定在连续迭代中使用的n_candidates和n_resources，并间接确定在搜索中利用的迭代总数。

该Factor的倒数决定了保留的n个候选对象的比例-在这种情况下为一半。所有其他候选人都将被丢弃。因此，正如您在下面的日志中看到的那样，我的搜索中的3次迭代有81、41和21个候选对象。

Factor与上一次迭代的n_resources的乘积确定n_resources。我的3次迭代搜索使用了365、730和1460个样本。

迭代的总数由n_resources可以增加多少倍而又不超过max_resources来确定。如果希望最终迭代使用所有资源，则需要将min_resources和Factor设置为max_resources的因数。

 %%time
 
 from sklearn.experimental import enable_halving_search_cv  
 from sklearn.model_selection import HalvingGridSearchCV
 
 FACTOR = 2
 MAX_RESOURCE_DIVISOR = 4
 
 n_samples = len(train_df)
 halving_results_n_samples =\
     HalvingGridSearchCV(resource='n_samples',
                         min_resources=n_samples//\
                         MAX_RESOURCE_DIVISOR,
                         factor=FACTOR,
                         **grid_search_params
                         )\
                         .fit(train_df, y_train)
                         
 n_iterations: 3
 n_required_iterations: 7
 n_possible_iterations: 3
 min_resources_: 365
 max_resources_: 1460
 aggressive_elimination: False
 factor: 2
 ----------
 iter: 0
 n_candidates: 81
 n_resources: 365
 Fitting 3 folds for each of 81 candidates, totalling 243 fits
 ----------
 iter: 1
 n_candidates: 41
 n_resources: 730
 Fitting 3 folds for each of 41 candidates, totalling 123 fits
 ----------
 iter: 2
 n_candidates: 21
 n_resources: 1460
 Fitting 3 folds for each of 21 candidates, totalling 63 fits
 Wall time: 34min 46s

此首次减半搜索未产生良好结果。实际上，它比详尽的搜索花费了更长的时间。使用我的compare_cv_best_params函数，我们看到它仅找到第九个最佳超参数集。

 from compare_functions import *
 
 compare_cv_best_params(full_results, *[halving_results_n_samples])\
  .style.applymap(lambda cell: ‘background: pink’ if cell == 9 else)

使用n_estimators的HalvingGridSearchCV

在第二个减半搜索中，我使用CatBoost的n_estimators作为资源，并设置了第一次迭代的min_resources以使用其中的四分之一，同时将Factor设置为2。

 %%time
 halving_results_n_estimators =\
     HalvingGridSearchCV(resource='model__n_estimators',                         
                          max_resources=1000,
                          min_resources=1000 // MAX_RESOURCE_DIVISOR,
                          factor=FACTOR,
                         **grid_search_params
                         )\
                         .fit(train_df, y_train)
 
 n_iterations: 3
 n_required_iterations: 7
 n_possible_iterations: 3
 min_resources_: 250
 max_resources_: 1000
 aggressive_elimination: False
 factor: 2
 ----------
 iter: 0
 n_candidates: 81
 n_resources: 250
 Fitting 3 folds for each of 81 candidates, totalling 243 fits
 ----------
 iter: 1
 n_candidates: 41
 n_resources: 500
 Fitting 3 folds for each of 41 candidates, totalling 123 fits
 ----------
 iter: 2
 n_candidates: 21
 n_resources: 1000
 Fitting 3 folds for each of 21 candidates, totalling 63 fits
 Wall time: 22min 59s

这种减半的搜索产生了我们希望看到的结果。它是在10分钟前完成的，因此比详尽的网格搜索快30％。重要的是，它还找到了最佳的超参数集。

 compare_cv_best_params(full_results, *[halving_results_n_samples, 
                                        halving_results_n_estimators])\
     .style.apply(lambda row: \
      row.apply(lambda col: \
      'background: lightgreen' if row.name == 2 else ''), \
      axis=1)

总结

我的HalvingGridSearchCV实验的结果好坏参半。使用默认的“ n_samples”资源会产生缓慢且次优的结果。如果您不使用大量样本，限制样本可能不会节省您的任何时间。

但是，使用CatBoost的n_estimators作为资源可以在更短的时间内产生最佳结果。这以我自己的经验进行跟踪，手动调整了梯度提升超参数。通常，我可以从验证日志中很快看出，是否值得在更多回合中增加超参数集。

作者：Kyle Gilde

原文地址：https://github.com/kylegilde/Kaggle-Notebooks/blob/master/Faster-Hyperparameter-Tuning-with-Scikit-Learns-HalvingGridSearchCV/faster-hyperparameter-tuning-with-scikit-learn-s-h.ipynb

deephub翻译组

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2021-05-28，如有侵权请联系 cloudcommunity@tencent.com 删除

编程算法