There are several ways to handle imbalanced datasets in machine learning; this post walks through the main ones.
How well resampling works depends on the ratio of positive to negative samples, the characteristics of the data, and the classifier being used. For a class with few samples, the simplest fix is to enlarge it by randomly duplicating its samples, but this easily leads to overfitting.
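A minimal sketch of random over-sampling with the imbalanced-learn toolbox (Lemaître et al., cited below); the dataset and its roughly 9:1 class ratio are made up for illustration:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Synthetic binary dataset with roughly a 9:1 class imbalance.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))  # majority class 0 has ~900 samples, minority class 1 ~100

# Randomly duplicate minority samples until both classes are the same size.
ros = RandomOverSampler(random_state=0)
X_res, y_res = ros.fit_resample(X, y)
print(Counter(y_res))  # both classes now equally sized
```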
Another option for the minority class is to synthesize new samples, as in the Synthetic Minority Over-sampling Technique (SMOTE): for a minority sample, a new sample is generated by interpolating between it and one of its k nearest minority-class neighbors. SMOTE cannot handle nominal features, however; SMOTE-NC (Synthetic Minority Over-sampling Technique Nominal Continuous) and SMOTE-N (Synthetic Minority Over-sampling Technique Nominal) were designed to fill that gap. One can also over-sample only the samples near the decision boundary, as in borderline-SMOTE1 and borderline-SMOTE2.
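imbalanced-learn implements all of these variants. A minimal sketch; the parameter values and the choice of categorical columns are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, SMOTENC, BorderlineSMOTE

# Same made-up 9:1 synthetic dataset as before.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Plain SMOTE: interpolate between each minority sample and one of its
# k nearest minority-class neighbors (k_neighbors=5 is the default).
X_sm, y_sm = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)

# SMOTE-NC handles mixed data; you must say which columns are nominal.
# Columns 0 and 3 are a hypothetical example, so the sampler is only
# instantiated here, not fit on the all-continuous toy data above.
smote_nc = SMOTENC(categorical_features=[0, 3], random_state=0)

# Borderline-SMOTE: only over-sample minority points near the class
# boundary; kind is 'borderline-1' or 'borderline-2'.
X_b, y_b = BorderlineSMOTE(kind='borderline-1', random_state=0).fit_resample(X, y)
```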
For a class with many samples, the usual approach is random majority under-sampling (RUS): randomly discard part of the majority class. Clearly this can throw away important information. One-sided selection (OSS) mitigates that by discarding only redundant samples and noisy points.
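Both are available in imbalanced-learn; a minimal sketch on the same made-up dataset:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler, OneSidedSelection

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# RUS: randomly drop majority samples until the classes are balanced.
X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_rus))

# OSS: remove redundant majority samples and borderline noise
# (via Tomek links) instead of dropping samples at random.
X_oss, y_oss = OneSidedSelection(random_state=0).fit_resample(X, y)
print(Counter(y_oss))
```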
Of course, the two can be combined: under-sample the majority class and, at the same time, over-sample the minority class.
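One way to chain the two steps is imbalanced-learn's Pipeline; the intermediate ratios below (minority up to 50% of the majority, then majority down to parity) are arbitrary choices for illustration:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Over-sample the minority to half the majority size, then
# under-sample the majority down to the new minority size.
resample = Pipeline([
    ('over', SMOTE(sampling_strategy=0.5, random_state=0)),
    ('under', RandomUnderSampler(sampling_strategy=1.0, random_state=0)),
])
X_res, y_res = resample.fit_resample(X, y)
print(Counter(y_res))  # both classes end up around 450 samples
```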
The methods above tackle class imbalance at the sample level. It can also be tackled through the loss function. With decision trees, for example, one can adjust the decision threshold, fold misclassification costs into the choice of split attribute during tree construction, or fold them into the pruning strategy. The split attribute is then chosen not by minimizing an impurity measure such as cross-entropy, but by minimizing the total misclassification cost, i.e., the sum of the errors weighted by the cost of each kind of error.
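scikit-learn does not expose a fully cost-sensitive split criterion, but class weights and threshold moving approximate the two ideas; the 10:1 cost and the 0.3 threshold below are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Cost-sensitive splits: class_weight scales each class's contribution
# to the impurity computation, so errors on the minority class (1)
# count 10x as much when split attributes are chosen.
tree = DecisionTreeClassifier(class_weight={0: 1, 1: 10}, random_state=0)
tree.fit(X_tr, y_tr)

# Threshold moving: lower the default 0.5 cutoff so that borderline
# cases are more readily labeled as the costly minority class.
proba = tree.predict_proba(X_te)[:, 1]
y_pred = (proba >= 0.3).astype(int)
```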
Another algorithm is BABoost, a variant of AdaBoost. AdaBoost increases the weights of all misclassified samples by the same factor, even though the misclassification error differs across classes and is usually larger for the minority class. BABoost instead gives misclassified samples larger weight increases, especially those from the minority class.
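A schematic sketch of that idea on top of a standard AdaBoost loop; the extra `minority_boost` factor is a made-up knob standing in for BABoost's class-aware re-weighting, not the paper's exact update rule:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def baboost_sketch(X, y, n_rounds=10, minority_boost=2.0):
    n = len(y)
    minority = y == np.argmin(np.bincount(y))  # mask of minority-class samples
    w = np.full(n, 1.0 / n)                    # uniform initial weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        err = np.clip(w @ miss, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # standard AdaBoost step size
        # Standard AdaBoost re-weighting: up-weight mistakes, down-weight hits...
        w *= np.exp(np.where(miss, alpha, -alpha))
        # ...plus an extra boost for misclassified minority samples.
        w[miss & minority] *= minority_boost
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
learners, alphas = baboost_sketch(X, y)
```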
References
Chawla, Nitesh V., Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. "SMOTE: synthetic minority over-sampling technique." Journal of Artificial Intelligence Research 16 (2002): 321-357.
Cieslak, David A., and Nitesh V. Chawla. "Learning decision trees for unbalanced data." In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 241-256. Springer, Berlin, Heidelberg, 2008.
Liu, Wei, Sanjay Chawla, David A. Cieslak, and Nitesh V. Chawla. "A robust decision tree algorithm for imbalanced data sets." In Proceedings of the 2010 SIAM International Conference on Data Mining, pp. 766-777. Society for Industrial and Applied Mathematics, 2010.
Cieslak, David A., T. Ryan Hoens, Nitesh V. Chawla, and W. Philip Kegelmeyer. "Hellinger distance decision trees are robust and skew-insensitive." Data Mining and Knowledge Discovery 24, no. 1 (2012): 136-158.
Miller, Benjamin A., Malina Kirn, Joseph R. Zipkin, and Jeremy Vila. Classifier Performance Estimation with Unbalanced, Partially Labeled Data. MIT Lincoln Laboratory Lexington United States, 2017.
Braytee, Ali. "Robust classification of high dimensional unbalanced single and multi-label datasets." PhD diss., 2018.
Tang, Yuchun, Yan-Qing Zhang, Nitesh V. Chawla, and Sven Krasser. "SVMs modeling for highly imbalanced classification." IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39, no. 1 (2009): 281-288.
Mani, Inderjeet, and I. Zhang. "kNN approach to unbalanced data distributions: a case study involving information extraction." In Proceedings of workshop on learning from imbalanced datasets, vol. 126. 2003.
Chawla, Nitesh V. "Data mining for imbalanced datasets: An overview." In Data mining and knowledge discovery handbook, pp. 875-886. Springer, Boston, MA, 2009.
Batuwita, Rukshan, and Vasile Palade. "FSVM-CIL: fuzzy support vector machines for class imbalance learning." IEEE Transactions on Fuzzy Systems 18, no. 3 (2010): 558-571.
Lemaître, Guillaume, Fernando Nogueira, and Christos K. Aridas. "Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning." Journal of Machine Learning Research 18, no. 17 (2017): 1-5.
He, Haibo, and Edwardo A. Garcia. "Learning from imbalanced data." IEEE Transactions on Knowledge and Data Engineering 21, no. 9 (2009): 1263-1284.