An improvement of FP-Growth association rule mining algorithm based on adjacency table

The FP-Growth algorithm is an association rule mining algorithm based on the frequent pattern tree (FP-Tree), which does not need to generate a large number of candidate sets. However, constructing the FP-Tree requires two scans of the original transaction database, and frequent itemsets are generated by recursively mining the FP-Tree. In addition, the algorithm cannot work effectively when the dataset is dense. To address the large memory usage and low time efficiency of this algorithm, this paper proposes an improved algorithm based on an adjacency table, which is stored in a hash table to considerably reduce lookup time. The experimental results show that the improved algorithm performs well, especially for mining frequent itemsets in dense datasets.


Introduction
Data mining is the process of obtaining potentially useful knowledge from data [1]. As an important part of data mining, association rule mining reflects the intrinsic relationships between complex itemsets [2]. Agrawal et al. [3,4] proposed the Boolean association rule problem and the corresponding Apriori algorithm. Considering the disadvantages of the Apriori algorithm, J. Han et al. proposed the FP-Growth algorithm, which uses the FP-Tree to generate frequent itemsets [5]. It compresses the transaction itemsets into an FP-Tree, which stores the association information of the itemsets, and generates frequent itemsets from that tree [6]. Although the algorithm requires only two database scans and does not need to generate candidate sets [7], it has to create an FP-Tree that contains all the itemsets, which requires a lot of memory. If the database contains so many frequent itemsets that the mapping information of all the items in the FP-Tree cannot be loaded into memory, the algorithm will not be effective [8]. Besides, scanning the transaction database twice also lowers the performance of the algorithm.
This paper proposes an improved FP-Growth algorithm based on an adjacency table, which draws on ideas from graphs. After scanning the itemsets in the transaction database, we adopt a storage method combining the adjacency table with a hash table, which removes itemsets whose support is below the minimum support as early as possible and avoids generating all nonempty subsets of the long maximal frequent itemsets. The algorithm makes full use of the established adjacency table and needs to scan the original transaction database only once. It has the advantages of fast running speed, small memory consumption and low complexity.
The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 proposes the improvement of the FP-Growth algorithm based on the adjacency table and describes the mining process of frequent itemsets. Section 4 analyses the time performance of the FP-Growth algorithm and the improved one. Section 5 reports experiments that compare the performance of the FP-Growth algorithm with the improved one on various itemsets. The last section presents our conclusions and future work.

Related works
Association rule mining is an important data analysis method and data mining technology [9]. Agrawal et al. proposed the Apriori algorithm, but it iterates over subsets of the data and uses the candidate itemsets produced in earlier passes to generate frequent itemsets in later ones, which makes the algorithm inefficient and difficult to apply to the mining of massive data [10,11]. In response to the disadvantages of the Apriori algorithm, J. Han proposed the FP-Growth algorithm to generate frequent itemsets [5]. It compresses the transaction itemsets into an FP-Tree, which stores the association information of the itemsets, and generates frequent itemsets from that tree [12].
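The level-wise candidate generation that makes Apriori costly can be sketched as follows. This is a minimal illustration and not the implementation from [3,4]: the function name `apriori`, the toy transactions, and the use of an absolute `min_support` count are assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets.

    Hypothetical helper for illustration; min_support is an absolute count.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items with one pass over the data.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k:
                # Prune step: every (k-1)-subset must itself be frequent.
                if all(frozenset(sub) in current
                       for sub in combinations(union, k - 1)):
                    candidates.add(union)

        # Each level needs another full pass over the transactions.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Toy data, assumed for this example only.
toy_transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "c"],
    ["a", "b", "c"],
]
result = apriori(toy_transactions, min_support=3)
```

On this toy data the three single items and the three pairs are frequent, while the candidate {a, b, c} is generated and counted at level 3 only to be rejected. Each level also requires a further pass over the transactions, which is the cost the text attributes to Apriori.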
The algorithm introduces a data structure with three parts. The first part is the header table, which records the frequency of occurrence of all items and is sorted in descending order of frequency. The second is the FP-Tree, which maps the original itemsets into a tree in memory and maintains the association information between itemsets. The third is the node lists: each frequent item in the header table is the head of a node list that points to the positions of that item in the FP-Tree [13]. Although the algorithm requires only two database scans and does not need to generate candidate itemsets [14], it has to create an FP-Tree that contains all the itemsets. If the database contains so many frequent itemsets that the mapping information of all the itemsets in the FP-Tree cannot be loaded into memory, the algorithm will not be effective [15]. Besides, scanning the database twice also makes the algorithm inefficient. The DMFIA algorithm is an improvement based on the FP-Growth algorithm that reduces the number of database scans, but it still adopts the FP-Tree storage structure and traversal method, which has to search many layers and generates a lot of candidate itemsets at each layer, leading to low efficiency [16].
This paper proposes an improved FP-Growth algorithm. After scanning the itemsets in the data, we adopt a storage method combining the adjacency table with a hash table, which quickly removes itemsets whose support is below the minimum support. The algorithm makes full use of the established adjacency table and needs to scan the original database only once. It has the advantages of fast running speed, small memory consumption and low complexity.
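The three-part structure of FP-Growth (header table, FP-Tree, node lists) and its two database scans can be sketched as below. This is an illustrative sketch under stated assumptions, not the code of [5]: the names `FPNode`, `build_fp_tree` and `node_list_total`, the toy transactions, and the alphabetical tie-break between equally frequent items are all hypothetical choices.

```python
from collections import Counter


class FPNode:
    """One node of the FP-Tree: an item, its count, and tree/list links."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}   # item -> FPNode
        self.next = None     # next node carrying the same item (node list)


def build_fp_tree(transactions, min_support):
    # Scan 1: count item frequencies and keep only frequent items.
    freq = Counter(item for t in transactions for item in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_support}

    # Header table: item -> [count, head of node list], descending frequency.
    order = sorted(freq, key=lambda i: (-freq[i], i))
    header = {i: [freq[i], None] for i in order}
    rank = {i: r for r, i in enumerate(order)}

    # Scan 2: insert each transaction, sorted by header-table order.
    root = FPNode(None, None)
    for t in transactions:
        items = sorted((i for i in set(t) if i in freq), key=rank.__getitem__)
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                # Prepend the new node to the node list of this item.
                child.next = header[item][1]
                header[item][1] = child
            child.count += 1
            node = child
    return root, header


def node_list_total(header, item):
    """Follow the node list of an item and sum the counts along it."""
    total, node = 0, header[item][1]
    while node is not None:
        total += node.count
        node = node.next
    return total


# Toy data, assumed for this example only.
toy_transactions = [
    ["a", "b"],
    ["b", "c", "d"],
    ["a", "b", "d"],
    ["a", "b", "c"],
    ["b", "c"],
]
root, header = build_fp_tree(toy_transactions, min_support=2)
```

In this toy example every transaction contains `b`, the most frequent item, so all paths share one child of the root. The per-transaction sort in the second scan is the step that the adjacency-table approach aims to avoid.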

Improvement of FP-Growth algorithm based on adjacency table
The FP-Growth algorithm scans the database shown in Table 1 twice; Figure 1 shows how the transaction database is converted into the FP-Tree. However, for large-scale datasets, the algorithm has shortcomings in memory use and computation that make it inefficient [17].

Generation of adjacency table
Taking the database in Table 1 as an example, the items in each itemset can be considered related to each other and form a complete graph. Each time the same two items are associated, the weight of the edge between them is incremented by one. The final weight of an edge is the association frequency of the two items. After the first scan of the database, the resulting association relationship graph is shown in Figure 2.
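The construction described above can be sketched in a few lines, using Python dictionaries as the hash tables that hold the adjacency table. Table 1 is not reproduced in this section, so the transactions below are toy data; the function names `build_adjacency_table` and `prune` and the absolute `min_support` count are likewise assumptions for illustration, not the paper's implementation.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency_table(transactions):
    """Single scan: adj[u][v] is how many transactions contain both u and v."""
    adj = defaultdict(lambda: defaultdict(int))
    for t in transactions:
        # Items of one transaction form a complete graph; bump every edge once.
        for u, v in combinations(sorted(set(t)), 2):
            adj[u][v] += 1
            adj[v][u] += 1
    return adj


def prune(adj, min_support):
    """Drop edges whose weight is below the minimum support count."""
    pruned = {}
    for u, neighbours in adj.items():
        kept = {v: w for v, w in neighbours.items() if w >= min_support}
        if kept:
            pruned[u] = kept
    return pruned


# Toy data, assumed for this example only.
toy_transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["b", "c", "d"],
    ["a", "b", "d"],
]
adj = build_adjacency_table(toy_transactions)
frequent_pairs = prune(adj, min_support=2)
```

With these transactions and a minimum support count of 2, the edge between `a` and `b` survives with weight 3, while the edge between `a` and `c` (weight 1) is removed. Edge weights are exact support counts only for 2-itemsets; how longer frequent itemsets are mined from the table is outside the scope of this sketch.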

Time complexity analysis
This paper compares the time performance of the FP-Growth algorithm with that of the improved algorithm based on the adjacency table. The following symbols are used in the performance analysis.
$n$: the number of transactions in the entire database;
$n_1$: the number of itemsets in the FP-Tree corresponding to the original database, that is, the number of leaf nodes in the FP-Tree;
$n_2$: the average number of items per itemset $I$;
$tr_i$: time to read transaction $i$ from the original database;
$tl_i$: time for FP-Growth to count the frequency of each item in the database;
$gl_i$: time for the improved FP-Growth to count the frequency of each item in the database;
$ts_i$: time to sort each itemset by header-table order;
$tin_i$: time to insert each item into the FP-Tree;
$gin_i$: time to insert each item into the adjacency table;
$tf_i$: time to get frequent itemsets from the FP-Tree;
$gf_i$: time to get frequent itemsets from the adjacency table;
$T_{fp}$: time to eventually find all frequent itemsets from the original database using the FP-Growth algorithm;
$T_g$: time to find all frequent itemsets from the original database using the improved FP-Growth algorithm.
When the FP-Tree is close to a binary tree, the time complexity of FP-Growth is lowest. Formula (1) above is then approximately equal to Formula (2).
$$T_g = \sum_{i=1}^{n} (tr_i + gin_i + gf_i + gl_i)$$
Since this paper uses hash tables to implement the adjacency table, the maximum time cost of the improved FP-Growth algorithm is approximately equal to Formula (4) below.
When constructing an FP-Tree, each itemset must be sorted in header-table order. Besides, counting the frequency of each item requires traversing the header table.
In contrast, when constructing the adjacency table, the itemsets do not have to be sorted; the hash table is simply traversed, so Formula (5) and Formula (6) can be written as follows:
$$\sum_{i=1}^{n} (tr_i + tl_i + ts_i) > \sum_{i=1}^{n} (tr_i + gl_i) \tag{5}$$
$$\sum_{i=1}^{n} \log_2 i = \log_2 n! \tag{6}$$
Formula (7) can be deduced by Stirling's approximation. Since the average number of items $n_2$ of each transaction set $I$ is less than the number of leaf nodes $n_1$ in the FP-Tree, the final result is obtained from Formulas (4), (5), (6), and (7).
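As a side check on Formula (6), the identity and the kind of estimate that Stirling's approximation yields can be verified numerically. The sketch below uses the standard leading terms $\log_2 n! \approx n \log_2 n - n \log_2 e$; this form is an assumption, since the paper's Formula (7) is not reproduced here and may be stated differently. The function names are hypothetical.

```python
import math


def log2_factorial_exact(n):
    """Left-hand side of Formula (6): sum of log2(i) for i = 1..n."""
    return sum(math.log2(i) for i in range(1, n + 1))


def log2_factorial_stirling(n):
    """Leading terms of Stirling's approximation, expressed in base 2."""
    return n * math.log2(n) - n * math.log2(math.e)


n = 1000
exact = log2_factorial_exact(n)
approx = log2_factorial_stirling(n)
relative_error = abs(exact - approx) / exact
```

For $n = 1000$ the two values differ by less than 0.1 percent, which illustrates why the sum in Formula (6) grows on the order of $n \log_2 n$.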

Experimental results
To study the performance of the algorithm, this paper compares the FP-Growth algorithm with the improved one on a sparse dataset and a dense dataset under the same experimental environment. The sparse dataset averages 10 items per transaction. In Figure 3, the minimum support count for each transaction database is 100; in Figure 4, the number of transactions is 500,000. The dense dataset averages 23 items per transaction. In Figure 5, the minimum support count for each transaction database is 500; in Figure 6, the number of transactions is 30,000. From the experimental results, it can be concluded that the improved algorithm mines frequent itemsets markedly more efficiently than the original FP-Growth algorithm. The effect is especially prominent when dealing with massive frequent itemsets. The main reason is that the improved algorithm needs to scan the transaction database only once and does not have to sort many itemsets by support frequency. Furthermore, the fast lookup of the hash table also saves time, even in the case of massive frequent itemsets.

Conclusions and future scope
After studying how the FP-Growth algorithm mines association rules, this paper proposes an improved FP-Growth algorithm based on an adjacency table, which improves the performance of the algorithm in three ways. First, the improved algorithm scans the transaction database only once, which greatly reduces I/O operations. Second, it does not require the establishment of a header table or a large number of sort operations. Finally, when mining frequent itemsets, the improved algorithm uses the hash table for fast lookup and does not need recursive mining. Together, these considerably reduce the algorithm's time and memory consumption. Especially when dealing with dense transaction items, the improved algorithm shows high performance and is expected to have great application value. Future work will refine the FP-Growth algorithm in combination with applications and study its parallelization.

Fig. 1. The transaction database is converted into the FP-Tree.

Fig. 2. The formed association relationship diagram and the generated adjacency table.

Fig. 3. Comparison of effects on different numbers of sparse transaction items.

Fig. 4. Comparison of different support counting for sparse transaction items.

Fig. 5. Comparison of effects on different numbers of dense transaction items.

Fig. 6. Comparison of different support counting for dense transaction items.