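I want to use one-hot encoding for my simple model. However, no matter how I set it up, it seems to trigger an error. First, even though I have sklearn version 1.0.2, the one-hot encoder is not converting strings to floats. The problem now is that the values in my training data differ in number from those in my test data: the training set has only two distinct values, while the test set has three. How do I fix this? The exact error is "The truth value of a Series is ambiguous". With a different approach, I get another error telling me to reshape the data.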
import lightgbm as lgbm
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
X = [[ 'apple',5],['banana',1],['apple',6],['banana',2]]
X=pd.DataFrame(X).to_numpy()
test = [[ 'pineapple',0],['banana',1],['apple',7],['banana,2']]
y = [1,0,1,0]
y=pd.DataFrame(y).to_numpy()
labels = ['apples','bananas','pineapple']
ohc = OneHotEncoder(categories=labels)
pp = ColumnTransformer(
transformers=[('ohc', ohc, [0])]
,remainder = 'passthrough')
model=lgbm.LGBMClassifier()
mymodel = Pipeline(steps = [('preprocessor', pp),
('model', model)
])
params = {'model__learning_rate':[0.1]
,'model__n_estimators':[2]}
lgbm_gs=GridSearchCV(
estimator = mymodel, param_grid=params, n_jobs = -1,
cv=2, scoring='accuracy'
,verbose=-1)
lgbm_gs.fit(X,y)
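Posted on 2022-01-01 11:19:30
The issue is most likely that you are passing categories as a flat list, rather than as a list of array-likes (for example, a list of lists) as the documentation specifies. The following adjustment should fix it.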
import lightgbm as lgbm
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
X = [['apple',5],['banana',1],['apple',6],['banana',2]]
X = pd.DataFrame(X).to_numpy()
test = [['pineapple',0],['banana',1],['apple',7],['banana',2]]
y = [1,0,1,0]
y = pd.DataFrame(y).to_numpy()
labels = [['apple', 'banana', 'pineapple']] # note that you were also misspelling the categories ('apples' --> 'apple'; 'bananas' --> 'banana')
ohc = OneHotEncoder(categories=labels)
pp = ColumnTransformer(transformers=[('ohc', ohc, [0])], remainder='passthrough')
model=lgbm.LGBMClassifier()
mymodel = Pipeline(steps = [('preprocessor', pp),
('model', model)])
params = {'model__learning_rate':[0.1], 'model__n_estimators':[2]}
lgbm_gs=GridSearchCV(
estimator = mymodel, param_grid=params, n_jobs = -1,
cv=2, scoring='accuracy', verbose=-1)
lgbm_gs.fit(X, y.ravel())
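As a further note, when the test data may contain categories that were not seen in the training set, keep in mind the recommendation from the user guide.
If there is a possibility that the training data might have missing categorical features, it can often be better to specify handle_unknown='ignore' instead of setting the categories manually as above. When handle_unknown='ignore' is specified and unknown categories are encountered during transform, no error will be raised but the resulting one-hot encoded columns for this feature will be all zeros (handle_unknown='ignore' is only supported for one-hot encoding).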
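The behaviour described in that recommendation can be checked with a small sketch. It reuses the fruit names from the question; the variable names train, test and encoded are chosen here purely for illustration, and the encoder is used on its own rather than inside the pipeline.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Single categorical column, shaped (n_samples, 1) as the encoder expects.
train = np.array([['apple'], ['banana'], ['apple'], ['banana']], dtype=object)
test = np.array([['pineapple'], ['banana'], ['apple'], ['banana']], dtype=object)

# Categories are learned from the training data only: 'apple' and 'banana'.
ohc = OneHotEncoder(handle_unknown='ignore')
ohc.fit(train)

# 'pineapple' was never seen during fit, so its row is encoded as all zeros
# instead of raising an error.
encoded = ohc.transform(test).toarray()
print(encoded)
```

The first row of the printed array, which corresponds to 'pineapple', contains only zeros, while the remaining rows each have a single 1 in the 'apple' or 'banana' column.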
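Finally, you can see that the attribute categories_ (which holds the categories of each feature determined during fitting) is also a list of arrays (a single array here, since you are one-hot encoding only one column).
Example with categories='auto':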
ohc = OneHotEncoder(handle_unknown='ignore')
ohc.fit(X[:, 0].reshape(-1, 1)).categories_
# Output: [array(['apple', 'banana'], dtype=object)]
Example with custom categories:
ohc = OneHotEncoder(categories=labels)
ohc.fit(X[:, 0].reshape(-1, 1)).categories_
# Output: [array(['apple', 'banana', 'pineapple'], dtype=object)]
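For contrast, here is a minimal sketch of what happens when the category names are passed as a flat list for a single column. It assumes the encoder is fitted directly on one column, outside the pipeline; the names flat_labels, column and raised are hypothetical. As far as I can tell, the encoder expects one entry in categories per feature, so three entries for one column are rejected with a ValueError at fit time.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column with four samples.
column = np.array([['apple'], ['banana'], ['apple'], ['banana']], dtype=object)

# Flat list: read as three per-feature entries, although there is one feature.
flat_labels = ['apple', 'banana', 'pineapple']

raised = False
try:
    OneHotEncoder(categories=flat_labels).fit(column)
except ValueError as exc:
    raised = True
    print('flat list rejected:', exc)

# Nested list: one inner list of categories for the single feature.
ohc = OneHotEncoder(categories=[flat_labels]).fit(column)
print(ohc.categories_)
```

Wrapping the names in an outer list gives the encoder exactly one category list per column, which is the shape used in the corrected code above.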
https://stackoverflow.com/questions/70549571