有效的不确定数据概率频繁项集挖掘算法

时间:2022-10-16 09:28:00

有效的不确定数据概率频繁项集挖掘算法

摘要:针对已有概率频繁项集挖掘算法采用模式增长的方式构建树时产生大量树节点,导致内存空间占用较大以及发现概率频繁项集效率低等问题,提出了改进的不确定数据频繁模式增长(PUFPGrowth)算法。该算法通过逐条读取不确定事务数据库中数据,构造类似频繁模式树(FPTree)的紧凑树结构,同时更新项头表中保存所有尾节点相同项集的期望值的动态数组。当所有事务数据插入到改进的不确定数据频繁模式树(PUFPTree)中以后,通过遍历数组得到所有的概率频繁项集。最后通过实验结果和理论分析表明:PUFPGrowth算法可以有效地发现概率频繁项集;与不确定数据频繁模式增长(UFGrowth)算法和压缩的不确定频繁模式挖掘(CUFPMine)算法相比,提出的PUFPGrowth算法能够提高不确定数据概率频繁项集挖掘的效率,并且减少了内存空间的使用。

关键词:数据挖掘;不确定数据;可能世界模型;概率频繁项集;频繁模式

中图分类号: TP301.6 文献标志码:A

英文摘要

Abstract:When using the way of pattern growth to construct tree structure, the exiting algorithms for mining probabilistic frequent itemsets suffer many problems, such as generating large number of tree nodes, occupying large memory space and having low efficiency. In order to solve these problems, a Progressive Uncertain Frequent Pattern Growth algorithm named PUFPGrowth was proposed. By the way of reading data in the uncertain database tuple by tuple, the proposed algorithm constructed tree structure as compact as Frequent Pattern Tree (FPTree) and updated dynamic array of expected value whose header table saved the same itemsets. When all transactions were inserted into the Progressive Uncertain Frequent Pattern tree (PUFPTree), all the probabilistic frequent itemsets could be mined by traversing the dynamic array. The experimental results and theoretical analysis show that PUFPGrowth algorithm can find the probabilistic frequent itemsets effectively. Compared with the Uncertain Frequent pattern Growth (UFGrowth) algorithm and Compressed Uncertain FrequentPattern Mine (CUFPMine) algorithm, the proposed PUFPGrowth algorithm can improve mining efficiency of probabilistic frequent itemsets on uncertain dataset and reduce memory usage to a certain degree.

英文关键词

Key words:data mining; uncertain data; possible world model; probabilistic frequent itemset; frequent pattern

0 引言

随着网络技术的快速发展,网络的实际应用中会产生许多不确定性数据,例如传感器采集的数据[1]、通过全球定位系统 (Global Positioning System,GPS)定位获取的地理位置信息[2]、网上商城的商品浏览信息等。相对于确定数据,不确定性数据通常是采用概率数值来表示,具有数据不确定性。这给数据挖掘工作带来了一定的挑战,不确定数据挖掘的研究也得到了人们的广泛关注。

近年来,针对不确定数据的挖掘已有多种方法。主要有关于不确定数据分类的朴素贝叶斯分类 (Uncertain data naive Bayes classification,UnBayes)方法[3]、不确定数据聚类的K均值聚类 (Uncertain data Kmeans,UKmeans)方法等[2,4]。根据采用模型的不同,可以把不确定数据频繁项集挖掘分为基于期望支持数模型[5-7]和基于项集支持数概率模型[8]两种。基于期望支持数模型是把期望支持数大于用户定义的最小期望阈值的项集视作频繁项集,但是这类算法并没有考虑项集支持数的分布情况,可能会导致某些概率频繁项集丢失;基于项集支持数概率模型是把支持数概率不小于用户定义的支持数概率的项集视为概率频繁项集,这类算法更好地考虑到了项集支持数的概率分布情况,因此具有较好的准确性。Chui等[9]提出的补充中英文全称原文中没有给出完整的英文格式,文献中直接给出的就是U-Apriori,但是能够确定U代表的意思,中文意思是无法给出的,故按照要求可以更改为:Chui等[9]提出的(Uncertain Apriori Algorithm, U-Apriori)算法采用逐层搜索的方式,UApriori(Uncertain Apriori Algorithm)算法采用逐层搜索的方式,类似于先验算法,该算法是由K频繁项集产生K+1频繁项集。然而这种算法需要多次扫描数据库并且会产生大量的候选项集,导致概率频繁项集的挖掘效率比较低。Leung等[5]设计的不确定数据频繁模式增长 (Uncertain Frequent Pattern Growth,UFGrowth)算法借鉴频繁模式增长补充中文全称 (Frequent Pattern Growth,FPGrowth)算法[10]的思想来构造树,但产生的树结构不如频繁模式树(Frequent Pattern Tree,FPTree)结构紧凑,因此会导致对内存的消耗比较大。在UFGrowth算法的基础上,Leung等[6-7]先后提出了基于上界的不确定数据频繁模式增长(Capped Uncertain Frequent pattern Growth,CUFGrowth)算法、基于前缀上界的不确定数据频繁模式增长(Prefixcapped Uncertain Frequent pattern Growth,PUFGrowth)算法,通过压缩树结构减少内存空间占用,但这类算法仍然需要通过探索、合并的方式发现概率频繁项集,导致发现概率频繁项集耗时较多。

[5]LEUNG C KS, MATEO M A F, BRAJCZUK D A. A treebased approach for frequent pattern mining from uncertain data [C]// PAKDD 2008: Proceedings of the 12th PacificAsia Conference on Advances in Knowledge Discovery and Data Mining, LNCS 5012. Berlin: Springer, 2008: 653-661.

[6]LEUNG C KS, TANBEER S K. Fast treebased mining of frequent itemsets from uncertain data[C]// DASFAA 2012: Proceedings of the 17th International Conference on Database Systems for Advanced Applications, LNCS 7238. Berlin: Springer, 2012: 272-287.

[7]LEUNG C KS, TANBEER S K. PUFtree: a compact tree structure for frequent pattern mining of uncertain data [C]// PAKDD 2013: Proceedings of the 17th PacificAsia Conference on Advances in Knowledge Discovery and Data Mining, LNCS 7818. Berlin: Springer, 2013: 13-25.

[8]BERNECKER T, KRIEGEL HP, RENZ M, et al. Probabilistic frequent itemset mining in uncertain databases [C]// Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2009: 119-128.

[9]CHUI CK, KAO B, HUNG E. Mining frequent itemsets from uncertain data [C]// PAKDD 2007: Proceedings of the 11th Pacific-

Asia conference on Advances in Knowledge Discovery and Data Mining, LNCS 4426. Berlin: Springer, 2007: 47-58.

[10]HAN J, PEI J, YIN Y, et al. Mining frequent patterns without candidate generation: A frequentpattern tree approach [J]. Data Mining and Knowledge Discovery, 2004, 8(1): 53-87.

[11]LIN CW, HONG TP. A new mining approach for uncertain databases using CUFP trees [J]. Expert Systems with Applications, 2012, 39(4): 4084-4093.

[12]ZHOU A, JIN C, WANG G, et al. A survey on the management of uncertain data[J]. Chinese Journal of Computers, 2009, 32(1): 1-16.(周傲英,金澈清,王国仁,等.不确定性数据管理技术研究综述[J].计算机学报,2009,32(1):1-16.)

[13]AGGARWAL C C, YU P S. A survey of uncertain data algorithms and applications [J]. IEEE Transactions on Knowledge and Data Engineering, 2009, 21(5): 609-623.

[14]WANG L, CHEUNG D W L, CHENG R, et al. Efficient mining of frequent item sets on large uncertain databases [J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(12): 2170-2183.

[15]BERNECKER T, CHENG R, CHEUNG D W, et al. Modelbased probabilistic frequent itemset mining [J]. Knowledge and Information Systems, 2013, 37(1): 181-217.

上一篇:基于不均衡样本重构的加权在线贯序极限学习机 下一篇:分布式在线交替方向乘子法