炼丹笔记 · Practical Notes
Edited by: DOTA
Most feature engineering methods you will come across are aimed at numerical features. The Target Encoding introduced in this article is for categorical features. Like one-hot or label encoding, it turns categories into numbers, but unlike those two methods it also uses the target to build the encoding, which is why we call it a supervised feature engineering method.
Target Encoding
Target encoding is any encoding that replaces a feature's categories with numbers derived from the target. It is sometimes called mean encoding and, when applied to a binary target, bin counting. (Other names you may run into include likelihood encoding, impact encoding, and leave-one-out encoding.)
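As a concrete illustration (column names here are made up, not from the article), replacing each category by the mean of the target within that category takes two lines of pandas:

```python
import pandas as pd

# Toy frame; "city" and "target" are illustrative names.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B"],
    "target": [1, 0, 1, 1, 0],
})

# Mean target per category, then map it back onto each row.
means = df.groupby("city")["target"].mean()
df["city_te"] = df["city"].map(means)
# A rows get 0.5, B rows get 2/3
```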
Every method has its drawbacks, and for target encoding the main ones are rare and unknown categories: a category with only a few rows gets an encoding that is mostly noise (inviting overfitting and target leakage), and a category never seen during fitting gets no encoding at all. Because of these drawbacks, smoothing is usually added, blending the in-category mean with the overall mean:
encoding = weight * in_category + (1 - weight) * overall
weight = n / (n + m)
Here in_category is the target mean inside the category, overall is the global target mean, n is the number of rows in which the category appears, and m is a smoothing factor: the larger m is, the harder rare categories are pulled toward the overall mean.
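A minimal sketch of this smoothing in pandas, assuming a hand-picked smoothing factor m = 2 and illustrative column names:

```python
import pandas as pd

# encoding = weight * in_category + (1 - weight) * overall, weight = n / (n + m)
m = 2.0  # smoothing factor (assumed value; tune per dataset)
df = pd.DataFrame({"cat": ["A", "A", "A", "A", "B"],
                   "y":   [1, 1, 0, 1, 1]})

overall = df["y"].mean()
agg = df.groupby("cat")["y"].agg(["mean", "count"])
weight = agg["count"] / (agg["count"] + m)
encoding = weight * agg["mean"] + (1 - weight) * overall
df["cat_te"] = df["cat"].map(encoding)
```

The rare category B (one row) is pulled strongly toward the overall mean, while the frequent category A stays close to its own mean.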
Having dwelt on the drawbacks and their fixes, what are the method's strengths? Target encoding is especially useful for high-cardinality features, where one-hot encoding would create a huge number of columns and label encoding would impose an arbitrary order, and for domain-motivated categorical features that score poorly with a model in their raw form.
Beta Target Encoding
In kaggle竞赛宝典 there is an article,《Kaggle Master分享编码神技-Beta Target Encoding》, that gives a good introduction to Beta Target Encoding. The scheme comes from the 14th-place solution to Kaggle's Avito Demand Prediction Challenge, and from the code the author open-sourced we can see that it differs from traditional Target Encoding.
From the author's comparison, Beta Target Encoding gives a large improvement over modeling directly with LightGBM.
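Before the code, here is the idea worked out on a single category, with made-up numbers: treat a binary target as Bernoulli, pad every category up to N_min pseudo-observations drawn at the prior (global) mean, and read statistics off the resulting posterior Beta(alpha, beta):

```python
# Illustrative numbers, not from the Avito solution.
prior_mean = 0.3   # global positive rate
N_min = 10         # minimum effective number of observations
N, n = 4, 3        # rows in this category / positives among them

N_prior = max(N_min - N, 0)                   # 6 pseudo-observations
alpha = prior_mean * N_prior + n              # 1.8 + 3 = 4.8
beta = (1 - prior_mean) * N_prior + (N - n)   # 4.2 + 1 = 5.2
post_mean = alpha / (alpha + beta)            # 0.48, vs raw mean 3/4 = 0.75
```

The rarer the category (small N), the more pseudo-observations it receives and the closer its encoding sits to the prior mean, which is exactly the smoothing effect discussed above.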
01
Show me code
import numpy as np
import pandas as pd


class BetaTargetEncoder(object):

    def __init__(self, group):
        self.group = group
        self.stats = None
        self.whoami = "DOTA"

    # get counts from df
    def fit(self, df, target_col):
        # prior mean of the target
        self.prior_mean = np.mean(df[target_col])
        stats = df[[target_col, self.group]].groupby(self.group)
        # per-group sum ('n') and count ('N')
        stats = stats.agg(['sum', 'count'])[target_col]
        stats.rename(columns={'sum': 'n', 'count': 'N'}, inplace=True)
        stats.reset_index(level=0, inplace=True)
        self.stats = stats

    # extract posterior statistics
    def transform(self, df, stat_type, N_min=1):
        df_stats = pd.merge(df[[self.group]], self.stats, how='left')
        n = df_stats['n'].copy()
        N = df_stats['N'].copy()

        # fill in missing groups: n = prior_mean with N = 1 reproduces the prior
        nan_indexs = np.isnan(n)
        n[nan_indexs] = self.prior_mean
        N[nan_indexs] = 1.0

        # prior parameters: pad each group up to N_min pseudo-observations
        N_prior = np.maximum(N_min - N, 0)
        alpha_prior = self.prior_mean * N_prior
        beta_prior = (1 - self.prior_mean) * N_prior

        # posterior parameters
        alpha = alpha_prior + n
        beta = beta_prior + N - n

        # calculate statistics of the posterior Beta distribution
        if stat_type == 'mean':
            num = alpha
            dem = alpha + beta
        elif stat_type == 'mode':
            num = alpha - 1
            dem = alpha + beta - 2
        elif stat_type == 'median':
            num = alpha - 1/3
            dem = alpha + beta - 2/3
        elif stat_type == 'var':
            num = alpha * beta
            dem = (alpha + beta)**2 * (alpha + beta + 1)
        elif stat_type == 'skewness':
            num = 2 * (beta - alpha) * np.sqrt(alpha + beta + 1)
            dem = (alpha + beta + 2) * np.sqrt(alpha * beta)
        elif stat_type == 'kurtosis':
            # excess kurtosis; the factor 6 multiplies the whole bracket
            num = 6 * ((alpha - beta)**2 * (alpha + beta + 1)
                       - alpha * beta * (alpha + beta + 2))
            dem = alpha * beta * (alpha + beta + 2) * (alpha + beta + 3)
        else:
            raise ValueError('unknown stat_type: {}'.format(stat_type))

        # replace missing values with the median of the computed statistic
        value = num / dem
        value[np.isnan(value)] = np.nanmedian(value)
        return value
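As a quick sanity check (assuming SciPy is available, which the article does not require), the closed-form 'mean', 'var', and 'skewness' branches above agree with SciPy's Beta distribution:

```python
import numpy as np
from scipy import stats

a, b = 4.8, 5.2  # example posterior parameters

# mean = a / (a + b)
assert np.isclose(a / (a + b), stats.beta.mean(a, b))
# var = ab / ((a + b)^2 (a + b + 1))
assert np.isclose(a * b / ((a + b) ** 2 * (a + b + 1)), stats.beta.var(a, b))
# skewness = 2 (b - a) sqrt(a + b + 1) / ((a + b + 2) sqrt(ab))
assert np.isclose(2 * (b - a) * np.sqrt(a + b + 1)
                  / ((a + b + 2) * np.sqrt(a * b)),
                  stats.beta.stats(a, b, moments='s'))
```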
K-Fold Target Encoding
Building on Target Encoding, K-Fold target encoding starts from mean target encoding, in which each categorical value is replaced by the mean of the target over the rows carrying that value; the K-fold twist is that each row's encoding is computed only from the other folds, so no row's own target leaks into its encoding.
01
Show me code
import numpy as np
import pandas as pd
from sklearn import base
from sklearn.model_selection import KFold


class KFoldTargetEncoderTrain(base.BaseEstimator, base.TransformerMixin):

    def __init__(self, colnames, targetName,
                 n_fold=5, verbosity=True,
                 discardOriginal_col=False):
        self.colnames = colnames
        self.targetName = targetName
        self.n_fold = n_fold
        self.verbosity = verbosity
        self.discardOriginal_col = discardOriginal_col
        self.whoami = "DOTA"

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(self.targetName, str)
        assert isinstance(self.colnames, str)
        assert self.colnames in X.columns
        assert self.targetName in X.columns

        mean_of_target = X[self.targetName].mean()
        # random_state is only meaningful with shuffle=True (newer scikit-learn
        # raises an error if it is passed alongside shuffle=False)
        kf = KFold(n_splits=self.n_fold, shuffle=True, random_state=2019)

        col_mean_name = self.colnames + '_Kfold_Target_Enc'
        X[col_mean_name] = np.nan
        for tr_ind, val_ind in kf.split(X):
            X_tr, X_val = X.iloc[tr_ind], X.iloc[val_ind]
            # out-of-fold encoding: map validation rows to category means
            # computed on the training folds only
            X.loc[X.index[val_ind], col_mean_name] = X_val[self.colnames].map(
                X_tr.groupby(self.colnames)[self.targetName].mean())
        # categories unseen in the training folds fall back to the global mean
        X[col_mean_name].fillna(mean_of_target, inplace=True)

        if self.verbosity:
            encoded_feature = X[col_mean_name].values
            print('Correlation between the new feature, {} and, {} is {}.'
                  .format(col_mean_name, self.targetName,
                          np.corrcoef(X[self.targetName].values,
                                      encoded_feature)[0][1]))
        if self.discardOriginal_col:
            # drop the original categorical column, not the target
            X = X.drop(self.colnames, axis=1)
        return X
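The same out-of-fold idea as the class above, stripped down to a few lines with illustrative data and column names:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({"cat": ["A", "A", "A", "B", "B", "B"],
                   "y":   [1, 0, 1, 0, 0, 1]})
prior = df["y"].mean()

df["cat_te"] = np.nan
kf = KFold(n_splits=3, shuffle=True, random_state=0)
for tr_idx, val_idx in kf.split(df):
    # category means from the other folds only: no row sees its own target
    fold_means = df.iloc[tr_idx].groupby("cat")["y"].mean()
    df.loc[df.index[val_idx], "cat_te"] = (
        df.iloc[val_idx]["cat"].map(fold_means).values)
# categories unseen in the training folds fall back to the global mean
df["cat_te"] = df["cat_te"].fillna(prior)
```

Note that for the test set one would instead map each category to its mean over the whole training set, since the test rows' targets are unknown anyway.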