A Weighted ML-kNN Algorithm

Time: 2022-08-26 01:38:33

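Abstract: ML-kNN adapts the traditional kNN algorithm to multi-label problems by means of Bayesian probability, but this probability-based approach tends to misjudge labels with low coverage. This paper therefore proposes a weighted ML-kNN algorithm that converts the distances between a sample and its neighbors into weights in order to reduce such misjudgments. Comparative experiments on three benchmark datasets, evaluated with seven metrics, show that the weighted ML-kNN algorithm outperforms ML-kNN overall.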

Keywords: multi-label learning; ML-kNN; distance weighting; weighted ML-kNN

CLC number: TP18    Document code: A    Article ID: 1009-3044(2012)04-0816-03

A Novel Weighted Multi-label kNN Algorithm

WANG Chun-yan

(Department of Computer Science and Technology, Tongji University, Shanghai 201804, China)

Abstract: ML-kNN modifies kNN by incorporating Bayesian probability to solve multi-label problems. However, because it relies on probability statistics, ML-kNN tends not to assign labels with low occurrence frequency to samples. We therefore propose a novel weighted ML-kNN algorithm that takes into account the distances between a sample and its neighbors. We evaluate its performance on three benchmark datasets with seven metrics. The experimental results show that the weighted ML-kNN algorithm performs better than ML-kNN on the whole.

Key words: Multi-label Learning; ML-kNN; Distance weight; Weighted ML-kNN

2 The Weighted ML-kNN Algorithm

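For each label in a multi-label dataset, the samples that carry that label form a cluster. Samples that belong to the same cluster are usually distributed fairly compactly, whereas samples that belong to different clusters are comparatively scattered. First, the average distance between any two samples in the same cluster is taken as the density of that cluster. Within the local region formed by the unknown sample and its k neighbors, and for each label, neighbors whose distance to the unknown sample is close to the cluster density receive a larger weight; conversely, neighbors whose distance to the unknown sample is far from the cluster density receive a smaller weight. Second, the influence that neighbors without a given label have on the label distribution of the unknown sample is taken into account as well, so that the classification function of the weighted ML-kNN is: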
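The classification function itself does not appear in this text, so the following is only a rough sketch of the weighting idea described above, not the author's formula. It assumes Euclidean distance and a hypothetical weight of the form 1 / (1 + |d - density|), which is largest when a neighbor's distance d matches the cluster density; all function names are illustrative.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_density(samples):
    """Average pairwise distance among the samples that carry one label."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 0.0  # a cluster with fewer than two samples has no pairwise distance
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)


def neighbor_weight(distance, density):
    """Hypothetical weight: 1.0 when distance equals density, smaller as they diverge."""
    return 1.0 / (1.0 + abs(distance - density))


def weighted_label_scores(query, neighbors, label_densities):
    """Sum the weights of neighbors with and without each label.

    neighbors: list of (feature_vector, set_of_labels) for the k nearest samples.
    label_densities: mapping from label to its cluster density.
    Returns {label: (weight_with_label, weight_without_label)}.
    """
    scores = {}
    for label, density in label_densities.items():
        with_label = 0.0
        without_label = 0.0
        for features, labels in neighbors:
            w = neighbor_weight(euclidean(query, features), density)
            if label in labels:
                with_label += w
            else:
                without_label += w
        scores[label] = (with_label, without_label)
    return scores
```

How the two weighted sums are combined with the Bayesian prior and posterior estimates of ML-kNN is left open here, since that step depends on the classification function that is not shown.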

3 Experiments and Analysis

3.1 Experimental Setup

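Table 2 lists the details of the three benchmark datasets. Label cardinality is the average number of labels per sample, and label density is the label cardinality divided by the total number of labels.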
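The two statistics defined above can be computed as in the short sketch below. The function names and the toy label sets are made up for illustration and do not correspond to any of the three benchmark datasets.

```python
def label_cardinality(label_sets):
    """Average number of labels per sample."""
    return sum(len(labels) for labels in label_sets) / len(label_sets)


def label_density(label_sets, num_labels):
    """Label cardinality divided by the total number of labels."""
    return label_cardinality(label_sets) / num_labels


# Toy data: three samples drawn from a label space of four labels.
toy = [{"a", "b"}, {"a"}, {"b", "c", "d"}]
cardinality = label_cardinality(toy)  # (2 + 1 + 3) / 3 = 2.0
density = label_density(toy, 4)       # 2.0 / 4 = 0.5
```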

Table 2 Benchmark datasets

4 Conclusion

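This paper studies the ML-kNN algorithm in depth. To address the tendency of ML-kNN to misjudge labels when the data are unevenly distributed, it proposes a weighted ML-kNN algorithm that introduces distance weights. The experimental results show that the weighted ML-kNN algorithm considerably improves multi-label learning, and multi-label ranking in particular. Because the improvement in multi-label classification is not significant, the author will continue to investigate it in future work.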
