
Time: 2022-08-25 10:19:14




CLC number: TP391; TP18


Abstract: Some algorithms in pattern recognition and machine learning can only handle discrete attribute values, whereas many real-world data sets consist of continuous values. An unsupervised method was proposed to address this discretization problem. First, the K-means method was employed to partition the data set into multiple subgroups to acquire label information, and then a supervised discretization algorithm was applied to the partitioned data set. Repeating this process yielded multiple discretization results, which were then integrated with an ensemble technique. Finally, the smallest subintervals were merged after the priority dimensions and adjacent intervals had been determined according to the neighbor relationships of the data; the number of subintervals was estimated automatically by preserving the correlation, so that the intrinsic structure of the data set was maintained. Experimental results obtained with categorical clustering algorithms such as spectral clustering demonstrate the feasibility and effectiveness of the proposed method. For example, its clustering accuracy is about 33% higher on average than that of four other methods. The resulting discrete data can be used by data mining algorithms such as the ID3 decision tree algorithm.

Key words: unsupervised discretization; ensemble learning; categorical data; similarity; spectral clustering






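For unlabeled data, discretization must be unsupervised. Among the algorithms described by Dougherty et al. [8], the simplest are equal-width and equal-frequency discretization. Both are easy to implement, but both ignore the distribution of the data, so the interval boundaries they produce are not representative. K-means discretization uses Euclidean distance as the basis for partitioning numeric values into intervals, which lacks theoretical justification; moreover, it relies on the user to specify the number of intervals and cannot determine that number automatically. Correlation-preserving discretization takes the correlation between attributes into account and discretizes through dimensionality reduction with Principal Component Analysis (PCA), but it performs poorly on high-dimensional data that is not linearly separable. The unsupervised discretization method based on a mixture probabilistic model [9] partitions the value range of a numeric attribute into several subintervals and then uses the Bayesian information criterion to find the number of subintervals and the partition automatically; however, the discretization time may differ considerably from one attribute to another.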
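The difference between the two simplest unsupervised schemes can be illustrated with a minimal NumPy sketch. The helper names (`equal_width_edges`, `equal_frequency_edges`, `assign_bins`) and the toy data are assumptions made for illustration only; they do not come from Dougherty et al. [8] or from the method proposed here.

```python
import numpy as np


def equal_width_edges(values, k):
    """Return k + 1 edges that split [min, max] into k intervals of equal width."""
    lo, hi = float(np.min(values)), float(np.max(values))
    return np.linspace(lo, hi, k + 1)


def equal_frequency_edges(values, k):
    """Return k + 1 edges placed at quantiles, so each interval holds roughly the same count."""
    return np.quantile(values, np.linspace(0.0, 1.0, k + 1))


def assign_bins(values, edges):
    """Map each value to an interval index in 0..k-1 using only the interior edges."""
    return np.digitize(values, edges[1:-1], right=False)


# Hypothetical toy attribute with one outlier.
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

ew = assign_bins(values, equal_width_edges(values, 2))
ef = assign_bins(values, equal_frequency_edges(values, 2))

print(ew.tolist())  # [0, 0, 0, 0, 1]
print(ef.tolist())  # [0, 0, 1, 1, 1]
```

With a single outlier, equal-width binning places four of the five values in one interval, while equal-frequency binning balances the counts but cuts at the median regardless of where the data actually cluster. Neither scheme looks at the shape of the distribution, which is the weakness noted above.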

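Among supervised algorithms, the most widely used at present are the following. The Class-Attribute Interdependence Maximization (CAIM) algorithm [10] considers the correlation between the class and the attribute and selects suitable cut points by maximizing their interdependence; it preserves the intrinsic structure of the data well, but it may overfit the number of intervals to the number of classes. The later discretization algorithm based on the Class-Attribute Contingency Coefficient (CACC) [11] instead uses the contingency coefficient between the class and the attribute as its discretization criterion.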
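As a rough illustration of the criterion that CAIM maximizes, the sketch below evaluates it on a quanta matrix, where entry (i, r) counts the samples of class i that fall into interval r. For n intervals the criterion is the average over intervals of max_r² / M_{+r}, with max_r the largest class count in interval r and M_{+r} the total count in that interval. The function name `caim_score` and the example matrices are assumptions; the greedy cut-point search of the full algorithm [10] is not shown.

```python
import numpy as np


def caim_score(quanta):
    """CAIM criterion for a quanta matrix (rows: classes, columns: intervals)."""
    quanta = np.asarray(quanta, dtype=float)
    col_totals = quanta.sum(axis=0)  # M_{+r}: samples in interval r
    col_max = quanta.max(axis=0)     # max_r: dominant class count in interval r
    nonempty = col_totals > 0        # skip empty intervals to avoid division by zero
    n_intervals = quanta.shape[1]
    return float(np.sum(col_max[nonempty] ** 2 / col_totals[nonempty]) / n_intervals)


# Hypothetical example: two classes, two intervals.
pure = [[4, 0],
        [0, 4]]   # each interval contains a single class
mixed = [[2, 2],
         [2, 2]]  # classes are evenly mixed in both intervals

print(caim_score(pure))   # 4.0
print(caim_score(mixed))  # 1.0
```

The score is higher when each interval is dominated by one class, which is why maximizing it tends to pull the number of intervals toward the number of classes.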






[1]SRIKANT R, AGRAWAL R. Mining quantitative association rules in large relational tables [C]// Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data. New York: ACM Press, 1996: 1-12.

[2]CATLETT J. On changing continuous attributes into ordered discrete attributes [C]// Proceedings of the European Working Session on Learning on Machine Learning, LNCS 482. Berlin: Springer, 1991: 164-178.

[3]MEHTA S, PARTHASARATHY S, YANG H. Toward unsupervised correlation preserving discretization [J]. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(9): 1174-1185.

[4]KERBER R. ChiMerge: discretization of numeric attributes [C]// Proceedings of the Tenth National Conference on Artificial Intelligence. Menlo Park: AAAI Press, 1992: 123-128.

[5]LIU H, SETIONO R. Chi2: feature selection and discretization of numeric attributes [C]// Proceedings of the Seventh IEEE International Conference on Tools with Artificial Intelligence. Washington, DC: IEEE Computer Society, 1995: 388-391.

[6]YANG Y, WEBB G I. Discretization for naive-Bayes learning: managing discretization bias and variance [J]. Machine Learning, 2009, 74(1): 39-74.

[7]RUIZ F J, ANGULO C, AGELL N. IDD: a supervised interval distance-based method for discretization [J]. IEEE Transactions on Knowledge and Data Engineering, 2008, 20(9): 1230-1238.

[8]DOUGHERTY J, KOHAVI R, SAHAMI M. Supervised and unsupervised discretization of continuous features [C]// Proceedings of the Twelfth International Conference on Machine Learning. San Francisco: Morgan Kaufmann Publishers, 1995: 194-202.

[9]LI G. An unsupervised discretization algorithm based on mixture probabilistic model [J]. Chinese Journal of Computers, 2002, 25(2): 158-164.(李刚.基于混合概率模型的无监督离散化算法[J].计算机学报,2002,25(2):158-164.)

[10]KURGAN L A, CIOS K J. CAIM discretization algorithm [J]. IEEE Transactions on Knowledge and Data Engineering, 2004, 16(2): 145-153.

[11]TSAI C J, LEE C I, YANG W. A discretization algorithm based on class-attribute contingency coefficient [J]. Information Sciences, 2008, 178(3): 714-731.

[12]SCHMIDBERGER G, FRANK E. Unsupervised discretization using tree-based density estimation [C]// PKDD 2005: Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, LNCS 3721. Berlin: Springer, 2005: 240-251.

[13]BIBA M, ESPOSITO F, FERILLI S, et al. Unsupervised discretization using kernel density estimation [C]// Proceedings of the 20th International Joint Conference on Artificial Intelligence. San Francisco: Morgan Kaufmann Publishers, 2007: 697-701.

[14]BORIAH S, CHANDOLA V, KUMAR V. Similarity measures for categorical data: a comparative evaluation [C]// Proceedings of the 8th SIAM International Conference on Data Mining. Philadelphia: SIAM, 2008: 243-254.

[15]ZHANG S, WONG H S, SHEN Y. Generalized adjusted rand indices for cluster ensembles [J]. Pattern Recognition, 2012, 45(6): 2214-2226.

[16]NG A Y, JORDAN M I, WEISS Y. On spectral clustering: analysis and an algorithm [C]// Advances in Neural Information Processing Systems. Cambridge: MIT Press, 2002: 849-856.

[17]THEODORIDIS S, KOUTROUMBAS K. Pattern recognition [M]. Waltham: Academic Press, 2003.
