A popular way to improve the speed and scalability of association rule mining is to run the algorithm on a random sample instead of the entire database. However, how to effectively define and efficiently estimate the error in the algorithm's output, and how to determine the required sample size, remain open research problems. In this paper, an effective and efficient algorithm based on PAC (probably approximately correct) learning theory is given to measure and estimate the sample error.
Association rule mining is one of the core tasks of data mining. Because of the inherent complexity of the task, it usually requires several full scans of the database, and the frequent patterns may undergo a combinatorial explosion. As a result, drawing a sample from the original large-scale dataset and searching it for approximate rules of interest to the user has become one of the simple, effective, and practical ways to improve the efficiency and scalability of the algorithm.
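The sample-size question raised above can be made concrete with a standard concentration bound. The sketch below is not the PAC-based algorithm of the paper, whose details are not given here; it only assumes Hoeffding's inequality applied to a single itemset, where `epsilon` (allowed absolute error in the estimated support) and `delta` (allowed failure probability) are illustrative parameter names.

```python
import math


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Smallest n such that 2 * exp(-2 * n * epsilon**2) <= delta.

    With n uniformly sampled transactions, the sampled support of ONE fixed
    itemset then deviates from its true support by more than epsilon with
    probability at most delta (Hoeffding's inequality). This is an
    illustrative assumption, not the bound derived in the paper.
    """
    if not (0.0 < epsilon < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # Support error of at most 0.01 with 95% confidence.
    print(hoeffding_sample_size(0.01, 0.05))
```

Note that the bound does not depend on the database size, which is why sampling pays off on very large databases. It covers one itemset only; guaranteeing the error for many itemsets at once would need a larger sample, for example through a union bound.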
Then a new adaptive, online, fast sampling strategy, multi-scaling sampling, inspired by MRA (multi-resolution analysis) and the Shannon sampling theorem, is presented for quickly obtaining acceptably approximate association rules at an appropriate sample size. Both theoretical analysis and empirical study show that the sampling strategy achieves a very good speed-accuracy trade-off.
However, a sampling strategy must strike a good balance between the efficiency of the algorithm and the accuracy of the results. "How to determine a suitable sample size such that association rule mining run on the sample meets the accuracy requirement" (the sampling complexity) has become the key hard problem of this approach.
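The idea of adapting the sample size on the fly can be illustrated with a generic progressive sampling loop. This is a simplified stand-in, not the multi-scaling sampling strategy of the paper, whose details are not given here. The function names, the doubling schedule, the brute-force itemset counter, and the stopping rule (stop when the set of frequent itemsets changes by at most `tolerance` between two consecutive sample sizes) are all assumptions made for illustration.

```python
import random
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force search for itemsets of up to max_size items whose
    relative support in `transactions` is at least min_support."""
    counts = {}
    for transaction in transactions:
        items = sorted(set(transaction))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] = counts.get(combo, 0) + 1
    n = len(transactions)
    return {itemset for itemset, c in counts.items() if c / n >= min_support}


def progressive_sample_mine(database, min_support, start=100, growth=2,
                            tolerance=0.05, seed=0):
    """Grow a random sample geometrically until the mined result stabilizes.

    Hypothetical stopping rule: stop when the fraction of itemsets that
    differ between two consecutive rounds is at most `tolerance`, or when
    the sample has grown to the whole database.
    Returns (frequent itemsets, sample size used).
    """
    rng = random.Random(seed)
    shuffled = list(database)
    rng.shuffle(shuffled)  # a prefix of a shuffled list is a random sample

    size = min(start, len(shuffled))
    previous = None
    while True:
        current = frequent_itemsets(shuffled[:size], min_support)
        if previous is not None:
            union = previous | current
            changed = len(previous ^ current) / len(union) if union else 0.0
            if changed <= tolerance:
                return current, size
        if size == len(shuffled):
            return current, size
        previous = current
        size = min(size * growth, len(shuffled))


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic database: each of five items appears independently in a transaction.
    db = [[item for item in "abcde" if rng.random() < 0.4] for _ in range(20000)]
    itemsets, used = progressive_sample_mine(db, min_support=0.1)
    print(len(itemsets), "frequent itemsets found using", used, "transactions")
```

A stability check of this kind is only a heuristic and gives no formal accuracy guarantee; it shows the speed-accuracy trade-off in that a looser `tolerance` stops earlier on a smaller sample.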