Q: Sklearn Chi2 for feature selection

Stack Overflow user

Asked on 2018-08-05 15:39:53

1 answer · 12.5K views · 0 followers · 12 votes

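I am learning about chi2 for feature selection and came across code like the example below.

However, my understanding of chi2 was that a higher score means the feature is more independent of the target (and therefore less useful to the model), so we would be interested in the features with the lowest scores. Yet scikit-learn's SelectKBest returns the features with the highest chi2 scores. Is my understanding of the chi2 test incorrect? Or does the chi2 score in sklearn produce something other than the chi2 statistic?

To show what I mean, see the code below (mostly copied from the code I came across, except for the end).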

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import pandas as pd
import numpy as np

# Load iris data
iris = load_iris()

# Create features and target
X = iris.data
y = iris.target

# Convert to categorical data by converting data to integers
X = X.astype(int)

# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
chi2_selector.fit(X, y)

# Look at scores returned from the selector for each feature
chi2_scores = pd.DataFrame(list(zip(iris.feature_names, chi2_selector.scores_, chi2_selector.pvalues_)), columns=['ftr', 'score', 'pval'])
chi2_scores

# you can see that the kbest returned from SelectKBest 
#+ were the two features with the _highest_ score
kbest = np.asarray(iris.feature_names)[chi2_selector.get_support()]
kbest

scikit-learn

feature-selection

chi-squared

python

machine-learning

1 Answer

Stack Overflow user

Accepted answer

Posted on 2018-08-05 19:08:35

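Your understanding is reversed.

The null hypothesis of the chi2 test is that "the two categorical variables are independent". A higher chi2 statistic is therefore stronger evidence against independence: it indicates that the feature and the target are dependent, which makes the feature more useful for classification.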
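The direction of the score can be checked on a small synthetic example. This is only a sketch: the two features below (`dependent` and `independent`) are hypothetical and are not part of the original question. One is fully determined by the class label, the other is random noise.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)

# Three balanced classes, 100 samples each
y = np.repeat([0, 1, 2], 100)

# Hypothetical features (assumptions for illustration only)
dependent = 3 * y                               # fully determined by the class
independent = rng.integers(0, 7, size=y.size)   # non-negative noise, unrelated to y

X = np.column_stack([dependent, independent])
scores, pvalues = chi2(X, y)

# The class-dependent feature should get the larger score and smaller p-value
print("scores :", scores)
print("pvalues:", pvalues)
```

If the reasoning above holds, the first score is far larger than the second, and its p-value is far smaller.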

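SelectKBest gives you the best two (k=2) features, ranked by the highest chi2 values. So you should use the features it returns, rather than looking for the "other" features with the lowest scores.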
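As a quick sanity check, the selected columns can be compared with the indices of the two largest scores. The sketch below reuses the integer-cast iris data from the question; the variable names `top_k_by_score` and `selected` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X = iris.data.astype(int)   # same integer cast as in the question
y = iris.target

selector = SelectKBest(chi2, k=2).fit(X, y)

# Indices of the two largest chi2 scores, computed by hand
top_k_by_score = set(np.argsort(selector.scores_)[-2:])

# Indices of the columns SelectKBest actually kept
selected = set(selector.get_support(indices=True))

print(top_k_by_score == selected)
```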

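Taking the chi2 statistics from chi2_selector.scores_ and the best features from chi2_selector.get_support() is correct. It gives you 'petal length (cm)' and 'petal width (cm)' as the top two features according to the chi2 test of independence. I hope this clarifies the algorithm.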
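On the question of whether sklearn's score is something other than the chi2 statistic: the sketch below recomputes the statistic by hand and compares it with `sklearn.feature_selection.chi2`. It assumes, as I understand the implementation, that the observed counts are the per-class sums of each feature and the expected counts are the class frequency times the feature total. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

iris = load_iris()
X = iris.data.astype(int).astype(float)   # integer-cast features, as in the question
y = iris.target

# One-hot encode the target: shape (n_samples, n_classes)
classes = np.unique(y)
Y = (y[:, None] == classes[None, :]).astype(float)

observed = Y.T @ X                               # per-class sum of each feature
class_prob = Y.mean(axis=0)                      # fraction of samples in each class
feature_total = X.sum(axis=0)                    # total of each feature over all samples
expected = np.outer(class_prob, feature_total)   # sums expected under independence

# Pearson chi2 statistic per feature, with n_classes - 1 degrees of freedom
manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
manual_pvalues = chi2_dist.sf(manual_scores, df=len(classes) - 1)

sk_scores, sk_pvalues = chi2(X, y)
print(np.allclose(manual_scores, sk_scores))
print(np.allclose(manual_pvalues, sk_pvalues))
```

If both comparisons print True, the scores are the ordinary chi2 statistic under that assumption, so a large value means a large gap between observed and expected sums, and therefore stronger dependence on the target.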

Votes: 17

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/51695769
