Introduction to CatBoost, the Ace Model of the iFLYTEK Ad Anti-Fraud Competition

Author: MeteoAI
Published: 2019-09-25
A while back, the MeteoAI team took part in the iFLYTEK Mobile Ad Anti-Fraud Algorithm Challenge[1] and finished 14th out of 1,428 teams in the second round. It was the first competition we worked through seriously from start to finish. Landing in the top 1% is respectable, but it was still disappointing to miss the prize zone (top ten) by a hair. The whole run was quite a roller coaster: our best ranking was 11th, just one step short of the leading pack. Still, we learned a great deal along the way.

What is a meteorology crew doing in an ad-fraud competition???

First of all, everyone's models are roughly the same; in this competition, almost every team used CatBoost. What ultimately separates teams is data-mining skill, plus the occasional flash of insight and a bit of luck. Thorough EDA and careful feature engineering are usually what decide these data competitions. So make a point of building up your abilities in data analysis, data mining, feature engineering, and business understanding; knowing only model.fit() and model.predict() is not enough, because truly anyone can learn those.

See the end of this post for the code.

Well... today we will stick to the model.fit() / model.predict() side of CatBoost, the killer model of this competition, precisely because anyone can pick it up. As for the feature-engineering and data-mining tricks, frankly we have not fully figured them out ourselves, so we won't pretend otherwise. In the iFLYTEK competition most features were categorical, and CatBoost excels at handling categorical features; it clearly outperformed the usual choices such as XGBoost and LightGBM.

Anyone who has done machine learning with sklearn knows that categorical features must be preprocessed first, e.g. with label encoding or one-hot encoding, because sklearn estimators cannot consume raw categorical features and will raise an error.

CatBoost[2], open-sourced by the Russian company Yandex, can handle categorical features directly and performs remarkably well on many public datasets. As the name suggests (CatBoost = Category and Boosting), its strength lies in its treatment of categorical features[3]. Its results are also more robust: you can get very good results without laborious parameter tuning; see [4] for tuning advice.

From the CatBoost documentation: "Attention. Do not use one-hot encoding during preprocessing. This affects both the training speed and the resulting quality."
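As a quick taste of this difference, here is a minimal sketch (on a tiny made-up dataset, not from the original post) of feeding raw string categories straight into CatBoost via the cat_features argument, with no encoding step:

import pandas as pd
from catboost import CatBoostClassifier

# A tiny made-up dataset: one raw string categorical column, one numeric column.
df = pd.DataFrame({
    'ad_channel': ['ios', 'android', 'ios', 'web', 'android', 'web'],
    'clicks': [3, 10, 1, 7, 2, 9],
    'is_fraud': [0, 1, 0, 1, 0, 1],
})

model = CatBoostClassifier(iterations=10, verbose=False)
# Just tell CatBoost which columns are categorical; no LabelEncoder / one-hot needed.
model.fit(df[['ad_channel', 'clicks']], df['is_fraud'], cat_features=['ad_channel'])
print(model.predict(df[['ad_channel', 'clicks']]))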

1. Install

First, install the required tools:

# with pip
pip install catboost
# or with conda
conda install -c conda-forge catboost

# install the Jupyter notebook widget extension, used for interactive plotting
pip install ipywidgets
jupyter nbextension enable --py widgetsnbextension

2. Preprocessing

Pool

Pool is CatBoost's own container for organizing data. Numpy arrays and DataFrames also work, but Pool is recommended: it is more efficient in both memory and speed.

The signature of Pool[5]:

class Pool(data, 
           label=None,
           cat_features=None,
           column_description=None,
           pairs=None,
           delimiter='\t',
           has_header=False,
           weight=None, 
           group_id=None,
           group_weight=None,
           subgroup_id=None,
           pairs_weight=None,
           baseline=None,
           feature_names=None,
           thread_count=-1)
For example:
from catboost import CatBoostClassifier, Pool

train_data = Pool(data=[[1, 4, 5, 6],
                        [4, 5, 6, 7],
                        [30, 40, 50, 60]],
                  label=[1, 1, -1],
                  weight=[0.1, 0.2, 0.3])
train_data 
# <catboost.core.Pool at 0x1a22af06d0>

model = CatBoostClassifier(iterations=10)
model.fit(train_data)
preds_class = model.predict(train_data)

FeaturesData

A Pool can be constructed in several ways; building it from FeaturesData[6] is the better one.

class FeaturesData(num_feature_data=None,
                   cat_feature_data=None,
                   num_feature_names=None,
                   cat_feature_names=None)

CatBoostClassifier[7] with FeaturesData[8]:

import numpy as np
from catboost import CatBoostClassifier, FeaturesData
# Initialize data. With FeaturesData the numeric/categorical split is given
# explicitly, so no separate cat_features index list is needed.
train_data = FeaturesData(
    num_feature_data=np.array([[1, 4, 5, 6], [4, 5, 6, 7], [30, 40, 50, 60]], dtype=np.float32),
    cat_feature_data=np.array([["a", "b"], ["a", "b"], ["c", "d"]], dtype=object)
)
train_labels = [1,1,-1]
test_data = FeaturesData(
    num_feature_data=np.array([[2, 4, 6, 8], [1, 4, 50, 60]], dtype=np.float32),
    cat_feature_data=np.array([["a", "b"], ["a", "d"]], dtype=object))

# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=2, learning_rate=1, depth=2, loss_function='Logloss')
# Fit model
model.fit(train_data, train_labels)
# Get predicted classes
preds_class = model.predict(test_data)
# Get predicted probabilities for each class
preds_proba = model.predict_proba(test_data)
# Get predicted RawFormulaVal
preds_raw = model.predict(test_data, prediction_type='RawFormulaVal')

CatBoostClassifier[9] with Pool[10] and FeaturesData[11]:

import numpy as np
from catboost import CatBoostClassifier, FeaturesData, Pool
# Initialize data
train_data = Pool(
    data=FeaturesData(
        num_feature_data=np.array([[1, 4, 5, 6], 
                                   [4, 5, 6, 7], 
                                   [30, 40, 50, 60]], 
                                   dtype=np.float32),
        cat_feature_data=np.array([["a", "b"], 
                                   ["a", "b"], 
                                   ["c", "d"]], 
                                   dtype=object)
    ),
    label=[1, 1, -1]
)
test_data = Pool(
    data=FeaturesData(
        num_feature_data=np.array([[2, 4, 6, 8], 
                                   [1, 4, 50, 60]], 
                                   dtype=np.float32),
        cat_feature_data=np.array([["a", "b"], 
                                   ["a", "d"]], 
                                   dtype=object)
    )
)
# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations = 2, 
                           learning_rate = 1,
                           depth = 2, 
                           loss_function = 'Logloss')
# Fit model
model.fit(train_data)
# Get predicted classes
preds_class = model.predict(test_data)
# Get predicted probabilities for each class
preds_proba = model.predict_proba(test_data)
# Get predicted RawFormulaVal
preds_raw = model.predict(test_data, prediction_type='RawFormulaVal')

3. Case

The demo below uses the Titanic dataset that ships with catboost.

Libraries and data preparation

First import the necessary libraries and prepare the data. Feature engineering, the most important part in practice, is skipped here since this is only a demo:

from catboost.datasets import titanic
import numpy as np
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier, Pool, cv
from sklearn.metrics import accuracy_score

# load the data
train_df, test_df = titanic()

# inspect the missing values:
null_value_stats = train_df.isnull().sum(axis=0)
null_value_stats[null_value_stats != 0]

# fill the missing values:
train_df.fillna(-999, inplace=True)
test_df.fillna(-999, inplace=True)

# split features and label
X = train_df.drop('Survived', axis=1)
y = train_df.Survived

# train test split
X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.75, random_state=42)
X_test = test_df

# indices of categorical features
# (np.float was removed from recent NumPy; the builtin float behaves the same here)
categorical_features_indices = np.where(X.dtypes != float)[0]

Training the model

The default parameters of catboost already provide a very strong baseline, so starting from the defaults is a sensible first step.

model = CatBoostClassifier(
    custom_metric=['Accuracy'],
    random_seed=666,
    logging_level='Silent'
)
# custom_metric <==> custom_loss

model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_validation, y_validation),
    logging_level='Verbose',  # you can comment this for no text output
    plot=True
);

# OUTPUT:
"""
...
...
...
bestTest = 0.3792389991
bestIteration = 342

Shrink model to first 343 iterations.
"""

Making predictions with the model

predictions = model.predict(X_test)
predictions_probs = model.predict_proba(X_test)
print(predictions[:10])
print(predictions_probs[:10])
# OUTPUT:
"""
[0. 0. 0. 0. 1. 0. 1. 0. 1. 0.]
[[0.90866781 0.09133219]
 [0.63668717 0.36331283]
 [0.95333247 0.04666753]
 [0.91051481 0.08948519]
 [0.28010084 0.71989916]
 [0.94618962 0.05381038]
 [0.35536101 0.64463899]
 [0.81843278 0.18156722]
 [0.32829247 0.67170753]
 [0.92653732 0.07346268]]
"""

Selecting the best model (use_best_model)

When training, it is best to keep use_best_model at its default of True: the final model is then shrunk to the best iteration (which you can read back via model.tree_count_). If use_best_model is set to False, model.tree_count_ simply equals iterations. For example:

# for the data preparation, see the "Libraries and data preparation" section above
params = {
    'iterations': 500,
    'learning_rate': 0.1,
    'eval_metric': 'Accuracy',
    'random_seed': 666,
    'logging_level': 'Silent',
    'use_best_model': False
}
# train
train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)
# validation
validate_pool = Pool(X_validation, y_validation, cat_features=categorical_features_indices)

# train with 'use_best_model': False
model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=validate_pool)

# train with 'use_best_model': True
best_model_params = params.copy()
best_model_params.update({'use_best_model': True})
best_model = CatBoostClassifier(**best_model_params)
best_model.fit(train_pool, eval_set=validate_pool);

# show result
print('Simple model validation accuracy: {:.4}, and the number of trees: {}'.format(
    accuracy_score(y_validation, model.predict(X_validation)), model.tree_count_))
print('')
print('Best model validation accuracy: {:.4}, and the number of trees: {}'.format(
    accuracy_score(y_validation, best_model.predict(X_validation)),best_model.tree_count_))

Using early stopping to prevent overfitting and save training time

Early stopping is a standard way to keep a model from overfitting, and it can also cut training time substantially.

params.update({'iterations':1000})
params
# OUTPUT:
"""
{'iterations': 1000,
 'learning_rate': 0.1,
 'eval_metric': 'Accuracy',
 'random_seed': 42,
 'logging_level': 'Silent',
 'use_best_model': False}
"""
%%time
model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=validate_pool)
"""
CPU times: user 2min 11s, sys: 52.1 s, total: 3min 3s
Wall time: 27.8 s
"""
%%time
earlystop_model_1 = CatBoostClassifier(**params)
earlystop_model_1.fit(train_pool, eval_set=validate_pool, early_stopping_rounds=200, verbose=20)
"""
CPU times: user 46.6 s, sys: 15.6 s, total: 1min 2s
Wall time: 9.2 s
"""
%%time
earlystop_params = params.copy()
earlystop_params.update({
    'od_type': 'Iter',
    'od_wait': 200,
    'logging_level': 'Verbose'    
})
earlystop_model_2 = CatBoostClassifier(**earlystop_params)
earlystop_model_2.fit(train_pool, eval_set=validate_pool);
"""
CPU times: user 49.6 s, sys: 19.9 s, total: 1min 9s
Wall time: 10.3 s
"""

You can also set the early_stopping_rounds parameter directly:

early_stopping_rounds: Set the overfitting detector type to 'Iter' ( 'od_type': 'Iter') and stop the training after the specified number of iterations since the iteration with the optimal metric value.

earlystop_params = params.copy()
earlystop_params.update({
    'early_stopping_rounds': 200,
    'logging_level': 'Verbose'    
})

Results:

print('Simple model tree count: {}'.format(model.tree_count_))
print('Simple model validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, model.predict(X_validation))
))
print('')
print('Early-stopped model 1 tree count: {}'.format(earlystop_model_1.tree_count_))
print('Early-stopped model 1 validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, earlystop_model_1.predict(X_validation))
))
print('')
print('Early-stopped model 2 tree count: {}'.format(earlystop_model_2.tree_count_))
print('Early-stopped model 2 validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, earlystop_model_2.predict(X_validation))
))

"""
Simple model tree count: 1000
Simple model validation accuracy: 0.8206

Early-stopped model 1 tree count: 393
Early-stopped model 1 validation accuracy: 0.8296

Early-stopped model 2 tree count: 393
Early-stopped model 2 validation accuracy: 0.8296
"""

As the numbers show, early stopping trains faster, effectively avoids overfitting, and here even yields a more accurate model.

Feature Importance

Display the feature importances:

model = CatBoostClassifier(iterations=50, random_seed=42, logging_level='Silent').fit(train_pool)

feature_importances = model.get_feature_importance(train_pool)
feature_names = X_train.columns
for score, name in sorted(zip(feature_importances, feature_names), reverse=True):
    print('{}: {}'.format(name, score))

"""
Sex: 48.21061102095765
Pclass: 17.045040317206695
Age: 7.611166250335819
Parch: 5.220861205417323
SibSp: 5.16579933751564
Embarked: 4.968165121183137
Fare: 4.858908301370388
Cabin: 4.140024994004162
Ticket: 2.7794234520091585
PassengerId: 0.0
Name: 0.0
"""
# pass prettified=True for richer, nicely formatted output
importances = model.get_feature_importance(prettified=True)
print(importances)

Wrap this in helper functions for a nicer display:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=2)
%matplotlib inline

def func_plot_importance(df_imp):

    sns.set(font_scale=1)
    fig = plt.figure(figsize=(3, 3), dpi=100)
    ax = sns.barplot(
        x="Importance", y="Features", data=df_imp, label="Total", color="b")
    ax.tick_params(labelcolor='k', labelsize='10', width=3)
    plt.show()

def display_importance(model_out, columns, printing=True, plotting=True):
    importances = model_out.feature_importances_
    indices = np.argsort(importances)[::-1]
    importance_list = []
    for f in range(len(columns)):
        importance_list.append((columns[indices[f]], importances[indices[f]]))
        if printing:
            print("%2d) %-*s %f" % (f + 1, 30, columns[indices[f]],
                                    importances[indices[f]]))
    if plotting:
        df_imp = pd.DataFrame(
            importance_list, columns=['Features', 'Importance'])
        func_plot_importance(df_imp)


display_importance(model_out=model, columns=X_train.columns)

Cross Validation[12]

cv(pool=None, 
   params=None, 
   dtrain=None, 
   iterations=None, 
   num_boost_round=None,
   fold_count=3, 
   nfold=None,
   inverted=False,
   partition_random_seed=0,
   seed=None, 
   shuffle=True, 
   logging_level=None, 
   stratified=None,
   as_pandas=True,
   metric_period=None,
   verbose=None,
   verbose_eval=None,
   plot=False,
   early_stopping_rounds=None,
   folds=None)

The data must first be wrapped in a Pool before running cross-validation.

cv_params = model.get_params()
cv_params.update({
    'loss_function': 'Logloss'
})
cv_data = cv(
    Pool(X, y, cat_features=categorical_features_indices),
    cv_params,
    plot=True
)

print('Best validation accuracy score: {:.3f}±{:.3f} on step {}'.format(
    np.max(cv_data['test-Accuracy-mean']),
    cv_data['test-Accuracy-std'][np.argmax(cv_data['test-Accuracy-mean'])],
    np.argmax(cv_data['test-Accuracy-mean'])))
# Best validation accuracy score: 0.833±0.007 on step 286
best_value = np.min(np.array(cv_data['test-Logloss-mean']))
best_iter_idx = np.argmin(np.array(cv_data['test-Logloss-mean']))

print('Best validation Logloss score, not stratified: {:.4f}±{:.4f} on step {}'.format(
    best_value,
    cv_data['test-Logloss-std'][best_iter_idx],
    best_iter_idx+1))

Note: iteration = index + 1 (the rows of the cv result are 0-indexed).

Validating on a single holdout split easily under- or over-estimates the model's prediction error; cross-validation is the better approach.
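As a small illustration (the fold count and patience below are our own choices, not from the original post), the cv signature above also supports stratified folds and early stopping, which suits imbalanced fraud-style labels:

# A minimal sketch: stratified 5-fold CV with early stopping.
cv_data = cv(
    Pool(X, y, cat_features=categorical_features_indices),
    {'loss_function': 'Logloss', 'eval_metric': 'Accuracy',
     'iterations': 500, 'random_seed': 666, 'logging_level': 'Silent'},
    fold_count=5,
    stratified=True,
    early_stopping_rounds=50
)
print(cv_data[['test-Accuracy-mean', 'test-Accuracy-std']].tail())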

Using Baseline

Baselines let you continue training on top of a previously trained model.

params = {'iterations': 200,
          'learning_rate': 0.1,
          'eval_metric': 'Accuracy',
          'random_seed': 42,
          'logging_level': 'Verbose',
          'use_best_model': False}

current_params = params.copy()
current_params.update({
    'iterations': 10
})
model = CatBoostClassifier(**current_params).fit(X_train, y_train, categorical_features_indices)
# Get baseline (only with prediction_type='RawFormulaVal')
baseline = model.predict(X_train, prediction_type='RawFormulaVal')
# Fit new model
model.fit(X_train, y_train, categorical_features_indices, baseline=baseline);

Snapshot

Snapshots let training resume after an interruption, or continue from an earlier state. If a run is going to take a long time, enabling snapshots protects you from losing all progress when the machine or server reboots or fails mid-training.

params_with_snapshot = params.copy()
params_with_snapshot.update({
    'iterations': 5,
    'learning_rate': 0.5,
    'logging_level': 'Verbose'
})
model = CatBoostClassifier(**params_with_snapshot).fit(train_pool, eval_set=validate_pool, save_snapshot=True)

params_with_snapshot.update({
    'iterations': 10,
    'learning_rate': 0.1,
})
model = CatBoostClassifier(**params_with_snapshot).fit(train_pool, eval_set=validate_pool, save_snapshot=True)

Intermediate training files are saved under the catboost_info/ directory by default; change this with the train_dir parameter.

#!rm 'catboost_info/snapshot.bkp'
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=40,
    random_seed=43
)
model.fit(
    train_pool,
    eval_set=validate_pool,
    save_snapshot=True,
    snapshot_file='snapshot.bkp',
    logging_level='Verbose'
)

DIY Loss and Metric Functions

Be careful to distinguish the following parameters:

(1) loss_function, alias: objective

The objective function the model actually optimizes during training.

(2) custom_metric, alias: custom_loss

Metrics reported during training. They serve only as a reference on training progress, not as the optimization target.

(3) eval_metric

Used to detect overfitting and to select the best model. (loss_function and eval_metric need not be identical; for instance, train with Logloss but use AUC to pick the best model / best iteration count.)

model = CatBoostClassifier(
    iterations=500,
    loss_function= 'Logloss',
    custom_metric=['Accuracy','AUC'],
    eval_metric='F1',
    random_seed=666
)

# custom_metric <==> custom_loss
# reported for reference only, not the optimization target

model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_validation, y_validation),
    verbose=50,
    plot=True
);

Testing different eval_metric settings:

# custom_metric=['Accuracy','AUC'], eval_metric='F1',
model.best_iteration_, model.best_score_, model.tree_count_
"""
(219,
 {'learn': {'Accuracy': 0.9491017964071856,
   'Logloss': 0.1747009677350333,
   'F1': 0.9294605809128631},
  'validation': {'Accuracy': 0.8385650224215246,
   'Logloss': 0.39249638575985446,
   'F1': 0.7906976744186046,
   'AUC': 0.9018111688747275}},
 220)
"""

# custom_metric=['Accuracy','AUC'], eval_metric='Logloss',
model.best_iteration_, model.best_score_, model.tree_count_    
"""
(152,
 {'learn': {'Accuracy': 0.9491017964071856, 'Logloss': 0.1747009677350333},
  'validation': {'Accuracy': 0.8385650224215246,
   'Logloss': 0.39249638575985446,
   'AUC': 0.9018111688747275}},
 153)
"""

# custom_metric=['Accuracy','AUC'], eval_metric='Accuracy',
model.best_iteration_, model.best_score_, model.tree_count_    
"""
(219,
 {'learn': {'Accuracy': 0.9491017964071856, 'Logloss': 0.1747009677350333},
  'validation': {'Accuracy': 0.8385650224215246,
   'Logloss': 0.39249638575985446,
   'AUC': 0.9018111688747275}},
 220)
"""

1. User Defined Objective Function[13]

class LoglossObjective(object):
    def calc_ders_range(self, approxes, targets, weights):
        """
        approxes, targets, weights are indexed containers of floats
        (containers which have only __len__ and __getitem__ defined).
        weights parameter can be None.

        To understand what these parameters mean, assume that there is
        a subset of your dataset that is currently being processed.
        approxes contains current predictions for this subset,
        targets contains target values you provided with the dataset.

        This function should return a list of pairs (der1, der2), where
        der1 is the first derivative of the loss function with respect
        to the predicted value, and der2 is the second derivative.

        In our case, logloss is defined by the following formula:
        target * log(sigmoid(approx)) + (1 - target) * log(1 - sigmoid(approx))
        where sigmoid(x) = 1 / (1 + e^(-x)).
        """
        assert len(approxes) == len(targets)
        if weights is not None:
            assert len(weights) == len(approxes)
        result = []
        for index in range(len(targets)):
            e = np.exp(approxes[index])
            p = e / (1 + e)
            der1 = (1 - p) if targets[index] > 0.0 else -p
            der2 = -p * (1 - p)
            if weights is not None:
                der1 *= weights[index]
                der2 *= weights[index]
            result.append((der1, der2))
        return result

model = CatBoostClassifier(
    iterations=10,
    random_seed=42, 
    loss_function=LoglossObjective(), 
    eval_metric="Logloss"
)
# Fit model
model.fit(train_pool)
# Only prediction_type='RawFormulaVal' is allowed with custom `loss_function`
preds_raw = model.predict(X_test, prediction_type='RawFormulaVal')

2. User Defined Metric Function[14]

class LoglossMetric(object):
    def get_final_error(self, error, weight):
        return error / (weight + 1e-38)

    def is_max_optimal(self):
        return False

    def evaluate(self, approxes, target, weight):
        """        
        approxes is a list of indexed containers
        (containers with only __len__ and __getitem__ defined),
        one container per approx dimension.
        Each container contains floats.
        weight is a one dimensional indexed container.
        target is a one dimensional indexed container of floats.

        weight parameter can be None.
        Returns pair (error, weights sum)
        """
        assert len(approxes) == 1
        assert len(target) == len(approxes[0])
        approx = approxes[0]
        error_sum = 0.0
        weight_sum = 0.0
        for i in range(len(approx)):
            w = 1.0 if weight is None else weight[i]
            weight_sum += w
            error_sum += -w * (target[i] * approx[i] - np.log(1 + np.exp(approx[i])))

        return error_sum, weight_sum

model = CatBoostClassifier(
    iterations=10,
    random_seed=42, 
    loss_function="Logloss",
    eval_metric=LoglossMetric()
)
# Fit model
model.fit(train_pool)
# loss_function here is the built-in 'Logloss', so any prediction_type is allowed
preds_raw = model.predict(X_test, prediction_type='RawFormulaVal')

Evaluating a trained model on new data (eval_metrics)

CatBoost provides an eval_metrics method that computes chosen metrics at every boosting iteration of an already trained model, with optional visualization. It is useful for evaluating a trained model on a new dataset.

model = CatBoostClassifier(iterations=50, random_seed=42, logging_level='Silent').fit(train_pool)
eval_metrics = model.eval_metrics(validate_pool, ['AUC','F1','Logloss'], plot=True)
# returns a dict with the keys 'AUC', 'F1' and 'Logloss'

Comparing learning curves under different parameter settings

from catboost import MetricVisualizer

model1 = CatBoostClassifier(iterations=100, depth=5, train_dir='model_depth_5/', logging_level='Silent')
model1.fit(train_pool, eval_set=validate_pool)

model2 = CatBoostClassifier(iterations=100, depth=8, train_dir='model_depth_8/', logging_level='Silent')
model2.fit(train_pool, eval_set=validate_pool);

widget = MetricVisualizer(['model_depth_5', 'model_depth_8'])
widget.start()

Saving and loading models

Save the model to a binary file:

model = CatBoostClassifier(iterations=10, random_seed=42, logging_level='Silent').fit(train_pool)
model.save_model('catboost_model.dump')
model = CatBoostClassifier()
model.load_model('catboost_model.dump');

print(model.get_params())
print(model.random_seed_)
print(model.learning_rate_)
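Beyond the default binary format, save_model also takes a format argument; the sketch below is a hedged illustration, since support for these formats depends on your catboost version:

# Alternative export formats; availability depends on the catboost version.
model.save_model('catboost_model.json', format='json')   # human-readable dump
model.save_model('catboost_model.onnx', format='onnx')   # for ONNX runtimes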

Model analysis and interpretation

SHAP
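The original post leaves this section as a pointer, so here is only a hedged sketch of two common routes to SHAP values for a trained CatBoost model (it assumes the shap package is installed; train_pool and X_train come from the sections above):

import shap  # pip install shap

# Route 1: CatBoost computes SHAP values natively.
# Shape: (n_samples, n_features + 1); the last column is the expected value (bias).
shap_values_cb = model.get_feature_importance(train_pool, type='ShapValues')

# Route 2: the shap package's TreeExplainer understands CatBoost models directly.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(train_pool)
shap.summary_plot(shap_values, X_train)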

Parameter tuning

Cross-validation and learning curves give us the best iterations (number of boosting steps), but a few other important parameters deserve extra tuning, notably l2_leaf_reg and learning_rate; see the official docs[15] for more parameters. Below is a tuning demo using hyperopt:

import hyperopt
from catboost import CatBoostClassifier, Pool, cv

def hyperopt_objective(params):

    model = CatBoostClassifier(
        l2_leaf_reg=int(params['l2_leaf_reg']),
        learning_rate=params['learning_rate'],
        iterations=100,
        eval_metric='Accuracy',
        loss_function= 'Logloss',
        random_seed=42,
        logging_level='Silent'
    )

    cv_data = cv(
        Pool(X, y, cat_features=categorical_features_indices),
        model.get_params()
    )
    best_accuracy = np.max(cv_data['test-Accuracy-mean'])

    return 1 - best_accuracy # as hyperopt minimises
from numpy.random import RandomState

params_space = {
    'l2_leaf_reg': hyperopt.hp.qloguniform('l2_leaf_reg', 0, 2, 1),
    'learning_rate': hyperopt.hp.uniform('learning_rate', 1e-3, 5e-1),
}

trials = hyperopt.Trials()

best = hyperopt.fmin(
    hyperopt_objective,
    space=params_space,
    algo=hyperopt.tpe.suggest,
    max_evals=10,
    trials=trials,
    rstate=RandomState(123)
)

print(best)

"""
100%|██████████| 10/10 [01:02<00:00,  6.69s/it, best loss: 0.1728395061728395]
{'l2_leaf_reg': 3.0, 'learning_rate': 0.36395429572850696}
"""
model = CatBoostClassifier(
    l2_leaf_reg=int(best['l2_leaf_reg']),
    learning_rate=best['learning_rate'],
    iterations=100,
    eval_metric='Accuracy',
    loss_function= 'Logloss',
    random_seed=42,
    logging_level='Silent'
)
cv_data = cv(Pool(X, y, cat_features=categorical_features_indices), model.get_params())

print('Precise validation accuracy score: {}'.format(np.max(cv_data['test-Accuracy-mean'])))
print(f"Best iteration: {int(np.argmax(cv_data['test-Accuracy-mean'])+1)}")

"""
Precise validation accuracy score: 0.8271604938271605
Best iteration: 49
"""

Notes on some commonly used parameters; for the rest, consult the official Python Training Parameters documentation[16]:

1. iterations + learning_rate

By default CatBoost runs 1000 iterations, and learning_rate is chosen automatically from the dataset and the iterations setting. If you reduce iterations, increase learning_rate accordingly so that training still converges.

If training has not converged, consider raising learning_rate; if the model overfits, lower it.

2. boosting_type

The default is Ordered, which gives good results and is recommended for small datasets, but it is slower than the Plain mode.

3. bootstrap_type[17]

4. one_hot_max_size

When converting categorical features, any feature with at most one_hot_max_size distinct values is one-hot encoded; the remaining categorical features are encoded with category statistics. One-hot encoding is usually the faster route, while computing statistics costs more time, so raising this parameter can speed up training.

5. rsm, alias: colsample_bylevel, float in (0, 1]

The fraction of features considered at each split selection. With a few hundred features or more this parameter is very effective: it speeds up training considerably while preserving quality. With few features it is not worth using.

Suppose you have many features and set rsm=0.1: you typically need about 20% more iterations for the model to converge, but each iteration runs roughly 10x faster.

6. max_ctr_complexity

The maximum number of categorical features that may be combined. CatBoost combines categorical features greedily, which is very time-consuming. Set max_ctr_complexity=1 to disable combinations, or max_ctr_complexity=2 to allow only pairwise combinations.

7. depth

Tree depth. In most cases the best value lies between 4 and 10; tuning within 6 to 10 is a good start.

8. l2_leaf_reg

The L2 regularization coefficient; try a range of values.

9. random_strength

Helps prevent overfitting. When scoring candidate splits, which is normally deterministic, CatBoost adds a random term drawn from a distribution with mean 0 and variance 1 * random_strength (the variance shrinks over iterations); this injected randomness counters overfitting.

10. bagging_temperature: [0, inf)

Only effective when bootstrap_type[18] is Bayesian; it controls the Bayesian bootstrap. At 1, weights are sampled from an exponential distribution; at 0, all weights equal 1. The larger the value, the more aggressive the bootstrap.

11. has_time

Set this if the dataset is a time series and the order of samples matters. During Transforming categorical features to numerical features[19] and Choosing the tree structure[20], the data then keeps its original order, or is ordered by a Timestamp column if one is declared in the input data, instead of being randomly permuted.

12. grow_policy: one of [SymmetricTree, Depthwise, Lossguide]

How the decision trees are grown; the default is level-wise symmetric trees.

min_data_in_leaf, alias: min_child_samples: supported with Depthwise and Lossguide.

max_leaves, alias: num_leaves: supported with Lossguide.

In a GPU environment you can set task_type="GPU".

border_count, alias: max_bin: the number of splits for numeric features, defaulting to 254 on CPU and 128 on GPU. On CPU it barely affects training speed; on GPU it affects speed significantly. Set it to 254 for better quality, or lower it for more speed.
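Tying the last few knobs together, a minimal GPU configuration might look like the sketch below (assuming a CUDA-capable GPU and a catboost build with GPU support; the concrete values are illustrative, not from the original post):

# Illustrative GPU setup: quality-oriented border_count, leaf-wise growth.
gpu_model = CatBoostClassifier(
    task_type='GPU',
    border_count=254,          # maximum quality; lower it for faster GPU training
    grow_policy='Lossguide',   # leaf-wise growth; allows max_leaves
    max_leaves=31,
    iterations=500,
    random_seed=666
)
gpu_model.fit(train_pool, eval_set=validate_pool, verbose=100)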

For example:

A faster model

from catboost import CatBoostClassifier

fast_model = CatBoostClassifier(
    random_seed=63,
    iterations=150,
    learning_rate=0.01,
    boosting_type='Plain',
    bootstrap_type='Bernoulli',
    subsample=0.5,
    one_hot_max_size=20,
    rsm=0.5,
    leaf_estimation_iterations=5,
    max_ctr_complexity=1,
    border_count=32)

fast_model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    logging_level='Silent',
    plot=True
)

A more accurate model

tuned_model = CatBoostClassifier(
    random_seed=63,
    iterations=1000,
    learning_rate=0.03,
    l2_leaf_reg=3,
    bagging_temperature=1,
    random_strength=1,
    one_hot_max_size=2,
    leaf_estimation_method='Newton',
    depth=6
)
tuned_model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    logging_level='Silent',
    eval_set=(X_validation, y_validation),
    plot=True
)

Code for this post:

https://github.com/zhangqibot/python_data_basic/tree/master/machine_learning/catboost


REFERENCE

[1] iFLYTEK Mobile Ad Anti-Fraud Algorithm Challenge: http://challenge.xfyun.cn/2019/gamedetail?type=detail/mobileAD
[2] CatBoost: https://catboost.yandex/
[3] Handling of categorical features: https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html
[4] Parameter tuning: https://catboost.ai/docs/concepts/parameter-tuning.html
[5, 10] Pool: https://catboost.ai/docs/concepts/python-reference_pool.html
[6, 8, 11] FeaturesData: https://catboost.ai/docs/concepts/python-features-data__desc.html
[7, 9] CatBoostClassifier: https://catboost.ai/docs/concepts/python-reference_catboostclassifier.html#python-reference_catboostclassifier
[12] Cross Validation: https://catboost.ai/docs/concepts/python-reference_cv.html
[13] User Defined Objective Function: https://catboost.ai/docs/concepts/python-usages-examples.html#custom-objective-function
[14] User Defined Metric Function: https://catboost.ai/docs/concepts/python-usages-examples.html#custom-loss-function-eval-metric
[15] Official parameter reference: https://catboost.ai/docs/concepts/python-reference_parameters-list.html#python-reference_parameters-list
[16] Python Training Parameters: https://catboost.ai/docs/concepts/python-reference_parameters-list.html
[17, 18] bootstrap_type: https://catboost.ai/docs/concepts/algorithm-main-stages_bootstrap-options.html
[19] Transforming categorical features to numerical features: https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html#algorithm-main-stages_cat-to-numberic
[20] Choosing the tree structure: https://catboost.ai/docs/concepts/algorithm-main-stages_choose-tree-structure.html#algorithm-main-stages_choose-tree-structure
[21] CatBoost on GitHub: https://github.com/catboost/catboost
[22] CatBoost paper: https://arxiv.org/pdf/1706.09516.pdf
[23] CatBoost algorithm details: https://catboost.ai/docs/concepts/algorithm-main-stages.html
