A scalable association rule learning heuristic for large datasets

Many algorithms have been proposed to solve the association rule learning problem. However, most of them scale poorly because of high time complexity or memory usage, especially when the dataset is large and the minimum support (minsup) is set low. This paper introduces a heuristic approach based on divide-and-conquer that may exponentially reduce both the time complexity and the memory usage while obtaining approximate results that are close to the exact results. Comparative experiments show that the proposed heuristic approach can achieve significant speedup over existing algorithms.


Introduction
The association rule learning problem has played a significant role in data mining for the past few decades. Association rules are widely used in many fields, including market basket analysis [1] and bioinformatics [2]. However, the problem is NP-hard, which makes it challenging to find the results within a reasonable period of time.
The invention of the Apriori Algorithm [3] made this problem computationally feasible for most computers on regular-sized datasets. Since then, researchers have continued to develop more scalable algorithms. Among others, FP-Growth [4] and Eclat [5] are two algorithms developed that improve the scalability of the Apriori algorithm.
The increasing popularity of the Internet in recent decades has made big data available to many research institutions and companies. These datasets are so large that traditional algorithms may not be able to handle them efficiently. We consider "big data" to be datasets that, at least, are too large to fit into the memory and take a long time (hours or even days) for traditional algorithms to process. The term big data is thus relative to the machine: a dataset considered to be big data on a PC may be a small dataset on a powerful high-performance computer (or computer cluster). This poses a challenge to the association rule learning problem as well. Most of the previously designed algorithms, including the Apriori algorithm, the FP-Growth algorithm, and the Eclat algorithm, suffer from the problem of scalability for big data. These algorithms take an unacceptable amount of time to terminate (as discussed in the "Experiments and results" section). In addition, the FP-Tree of the FP-Growth algorithm and the TID list of the Eclat algorithm may not fit in the memory.
This paper introduces an approach that makes it possible to mine association rules and frequent itemsets for large datasets. The approach, called the Scalable Association Rule Learning (SARL) heuristic, follows the divide-and-conquer paradigm: it vertically divides a dataset into almost equal partitions using a graph representation and the k-way graph partitioning algorithm [6]. The total time complexity of the SARL heuristic, including the overhead of partitioning a dataset, is up to $2^{d}$ times faster than that of the Apriori algorithm, where d is the number of unique items in the dataset. The memory usage is also lower than that of the current algorithms. Because of the speedup, our heuristic may be applied to real-time data analysis that can benefit many scientific [7] and military applications [8,9].
The rest of the paper is organized as follows. In "Related work" section, we survey existing association rule learning algorithms and graph partitioning algorithms. In "Our solution" section, we present the SARL heuristic with examples, formal descriptions, theorems, and proofs. The experiments and results are presented in "Experiments and results" section, followed by conclusions and future work.

Contributions
The contributions of this paper include the introduction of item association graphs, which provide an efficient estimate of potential frequent itemsets, and the use of the MLkP algorithm to divide the items into partitions while minimizing the loss of information.
The novelty of this paper lies in two parts of our solution (discussed later). First, we propose a vertical (item-wise) partitioning of datasets, whereas most divide-and-conquer algorithms focus on horizontal (transaction-wise) partitioning. Second, transforming the frequent two-itemsets into a graph representation and applying the efficient MLkP algorithm to it is a novel and efficient approach to solving the association rule learning problem.

Related work
Association rule learning/frequent itemset mining has been an active research area. Among others, three approaches are considered the most popular and possibly the most efficient: the Apriori algorithm, the FP-Growth algorithm, and the Eclat algorithm.

The Apriori algorithm
The Apriori algorithm [3], introduced by Agrawal and Srikant, was the first efficient association rule learning algorithm. It incorporates various techniques to speed up the process as well as to reduce the use of memory. For example, the candidate generation and pruning steps can significantly reduce the number of possible candidates at each level.
One of the most important mechanisms in the Apriori algorithm is the use of the hash tree data structure. It uses this data structure in the candidate support counting phase to reduce the time complexity from O(kmn) to O(kmT + n), where k is the average size of the candidate itemset, m represents the number of candidates, n represents the number of items in the whole dataset, and T is the number of transactions.
The major advantage of the Apriori algorithm comes from its memory usage, because only the frequent (k − 1)-itemsets, $L_{k-1}$, and the candidates in level k, $C_k$, need to be stored in the memory. It generates the minimum number of candidates based on the $L_{k-1} \times L_{k-1}$ join (described in [3]) and the pruning method, and it stores them in the compact hash tree structure. If a large dataset and a low minsup setting would cause the candidates to fill up the memory, the Apriori algorithm does not generate all the candidates at once and overload the memory. Instead, it generates as many candidates as the memory can hold.
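The join and prune steps described above can be sketched as follows. This is a minimal illustration that assumes itemsets are represented as sorted tuples; the function name `apriori_gen` and the toy input are ours and are not taken from [3].

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Build level-k candidates from the frequent (k-1)-itemsets.

    Join step: merge two itemsets that share their first k-2 items.
    Prune step: drop a candidate if any (k-1)-subset is not frequent.
    """
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    prev_set = set(prev)
    candidates = set()
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                # Apriori principle: every subset of a frequent itemset is frequent.
                if all(sub in prev_set for sub in combinations(cand, len(a))):
                    candidates.add(cand)
    return candidates

# Hypothetical frequent two-itemsets.
l2 = [(1, 2), (1, 3), (2, 3), (2, 4)]
c3 = apriori_gen(l2)  # (2, 3, 4) is pruned because (3, 4) is not frequent
```

Only the surviving candidates need support counting, which is why the pruning step matters when minsup is low.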

The FP-growth algorithm
The Frequent Pattern Growth algorithm was proposed by Han et al. in 2000 [4]. It uses a tree-like structure (called Frequent Pattern Tree) instead of the candidate generation method used in the Apriori algorithm to find the frequent itemsets. The candidate generation method finds the candidates of the frequent itemsets before reducing them to the actual frequent itemsets through support counting.
The algorithm first scans a dataset and finds the frequent one itemsets. Then, a frequent pattern tree is constructed by scanning the dataset again. The items are added to the tree in the order of their support. Once the tree is completed, the tree is traversed from the bottom, and a conditional FP-Tree is generated. Finally, the algorithm generates the frequent itemsets from the conditional FP-Tree.
The FP-Growth algorithm is more scalable than the Apriori algorithm in most cases since it makes fewer passes and does not require candidate generation. However, it suffers from memory limitations since the FP-Tree is fairly complex and may not fit in the memory. Traversing the complex FP-Tree may also be time-consuming if the tree is not compact enough.
The Eclat algorithm has the advantage of scanning the dataset only once. However, when the dataset is large and the minsup is set to a low value, the TID list associated with each itemset may become very long. In fact, the results can be larger than the original dataset; therefore, they may not fit into the memory.
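To make the TID-list idea concrete, the following sketch builds a vertical representation in a single scan and obtains supports by intersecting TID sets. It is an illustration under our own assumptions (set-based TID lists, illustrative function names and toy data), not the implementation of [5].

```python
from collections import defaultdict

def build_tid_lists(transactions):
    """One scan of the dataset: map each item to the set of transaction IDs containing it."""
    tids = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tids[item].add(tid)
    return dict(tids)

def support_count(itemset, tid_lists):
    """Support count of an itemset = size of the intersection of its items' TID sets."""
    sets = [tid_lists.get(i, set()) for i in itemset]
    return len(set.intersection(*sets)) if sets else 0

# Hypothetical toy dataset.
transactions = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 3, 4}]
tid_lists = build_tid_lists(transactions)
```

Every intermediate itemset keeps its own TID set, which is where the memory pressure comes from when many itemsets are frequent.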

Other association rule learning algorithms
There are three categories of association rule mining/frequent itemset mining algorithms [10]: Apriori-based algorithms, tree-based algorithms, and pattern growth algorithms. The Apriori algorithm, the Eclat algorithm, and the FP-Growth algorithm are the most popular algorithms for the three categories, respectively.
In the Apriori-based category, the AprioriTID algorithm, proposed by Agrawal and Srikant in [3], is similar to Apriori, except that it generates $\overline{C}_k$ and mines the frequent itemsets from there instead of from the dataset. The Apriori Hybrid algorithm [3] is a combination of the Apriori algorithm and the AprioriTID algorithm. The DHP (direct hashing and pruning) algorithm [11] uses a hash function to distribute the itemsets into buckets. If a bucket has support lower than the minsup, the bucket is discarded. The MR-Apriori [12] and HP-Apriori [13] algorithms are distributed versions of the Apriori algorithm that enable its parallel execution. MR-Apriori uses the MapReduce model on the Hadoop platform.
The tree-based algorithms, represented by the Eclat algorithm, find the frequent itemset by constructing a lexicographic tree. The AIS algorithm [6] and the SETM algorithm [14] are the two earliest association rule mining algorithms in this category. Reference [3] shows that the Apriori algorithm beats them in running time. The Tree-Projection algorithm [15] counts the supports of the frequent itemsets and uses the nodes of a lexicographic tree as the representation of these support numbers. The TM algorithm [16] maps the TID of each transaction to transaction intervals before performing intersections between these intervals.
Lastly, the algorithms in the pattern growth category focus on frequent patterns. The P-Mine algorithm [17] is a parallel computing algorithm that utilizes the VLDBMine data structure to store the dataset and speed up the distribution of data, while the LP-Growth algorithm [18] makes use of an array-based linear prefix tree to improve the memory efficiency. The Can-Mining algorithm [19] finds the frequent itemsets from a canonical-order tree, which speeds up the tree traversal process when the number of frequent itemsets is low. Finally, the EXTRACT algorithm [20] uses the theory of Galois lattice to derive association rules.
The algorithms discussed above, unfortunately, have scalability problems. The Apriori-based algorithms, represented by the Apriori algorithm, have to go through the expensive candidate generation and support counting process. This causes a disadvantage in running time. The tree-based and the pattern-growth type algorithms often suffer from excessive usage of memory. For example, the FP-Growth algorithm could build a complex FP-Tree which does not fit into the memory.
We show the scalability problems of the Apriori algorithm and the FP-Growth algorithm in the experiments part of this paper. Both algorithms take too long to finish for most of the tested datasets. The need for faster frequent itemset mining is urgent given the vast amount of data available today. Companies and institutions have allocated many resources to data mining, and they need a solution that saves both time and resources. In addition, real-time data analysis plays an important role in government [21], scientific [7], and military [8,9] applications. The experiments part of this paper shows that the current algorithms, represented by the Apriori algorithm and the FP-Growth algorithm, are not fast enough to complete real-time data analysis. The scalability problems of most existing association rule mining algorithms have also been addressed in [22], which focuses on parallel computing of association rules, whereas this paper presents a scalable algorithm that is also suitable for a single machine.

Graph partitioning algorithms
One of the key steps in the SARL heuristic that we will introduce shortly is to partition the IAG (item association graph, "Our solution" section, part 7) into k balanced partitions. An efficient graph partitioning algorithm is crucial since the balanced graph partitioning problem is NP-complete [23]. We have implemented three algorithms and compared them for the partitioning costs and running times. They are the recursive version of the Kernighan-Lin Algorithm [24], the Multilevel k-way Partitioning Algorithm (MLkP) [25], and the recursive version of the Spectral Partitioning Algorithm [26]. Other graph partitioning algorithms include the Tabu search-based MAGP algorithm [27] and the flow-based KaFFPa algorithm [28].
The Kernighan-Lin algorithm swaps the nodes assigned to both partitions and finds the largest decrease in the total cut size. The Multilevel k-way Partitioning algorithm (MLkP) uses coarsening-partitioning-uncoarsening/refining steps to shrink a graph into a much smaller graph. After partitioning, the graph is rebuilt to restore the original graph. A single global priority queue is used for all types of moves. The Spectral Partitioning Algorithm finds splitting of the values such that the vertices in a graph can be partitioned with respect to the evaluation of the Fiedler vector.
We conducted experiments to compare the three algorithms, using the datasets provided by Christopher Walshaw at the University of Greenwich [29]. The datasets are as large as possible while still allowing the partitioning algorithms to finish in a reasonable time on the tested machine. We also run experiments on complete graphs with 30 and 300 nodes. Each dataset is tested in four rounds, with the number of partitions (k) being 2, 4, 8, and 16.
Table 1 Results of the experiment that compare MLkP, Kernighan-Lin, and Spectral Partitioning algorithms
In Table 1, the running times are highlighted in the red box. We can tell from the average running time (the last row) that the MLkP algorithm has the highest speed in general. It is 560 times faster than the spectral partitioning algorithm and even faster relative to the recursive Kernighan-Lin algorithm. The spectral partitioning algorithm has, in general, the best partition quality. It is 1.3 times better than MLkP and much better than the recursive Kernighan-Lin algorithm. The recursive Kernighan-Lin algorithm takes too long to complete all five datasets. It also shows serious scalability issues for complete graphs.
Considering the MLkP algorithm has the best overall performance, we choose to use this algorithm for graph partitioning in our algorithm.

Definitions
Below are some definitions that we will use in our algorithm:
1. k-itemset: an itemset with k items.
2. Support: the occurrence of an item in the dataset.
3. Minsup: the minimum requirement of support. The user usually provides this. Itemsets with support < minsup are eliminated.
4. Confidence: the indication of robustness of a rule in terms of percentage.
5. Minconf: the minimum requirement of confidence. The user usually provides this. Rules with confidence < minconf are eliminated.
6. Item-Association Graph: a graph structure that stores the frequent associations between pairs of items.
7. Balanced K-way Graph Partitioning Problem: Divide the nodes of a graph into k parts such that each part has almost the same number of nodes while minimizing the number of edges/sum of edge weights cut off.
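The support and confidence definitions above can be illustrated with a short sketch. It assumes relative support (the fraction of transactions containing the itemset), which matches how a minsup such as 0.1 is used in this paper; the function names and the toy dataset are illustrative only.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    base = support(antecedent, transactions)
    return both / base if base > 0 else 0.0

# Hypothetical toy dataset of four transactions.
transactions = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 3, 4}]
sup_12 = support({1, 2}, transactions)        # {1, 2} occurs in 2 of 4 transactions
conf_1_2 = confidence({1}, {2}, transactions)  # support({1, 2}) / support({1})
```

An itemset is kept when its support is at least minsup, and a rule is kept when its confidence is at least minconf.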

A scalable heuristic algorithm: the SARL heuristic
The following is an outline of our scalable heuristic:
Step 1: Find frequent one and two itemsets using the Apriori algorithm (when minsup is high) or the direct generation method (when minsup is low).
Step 2: Construct the item association graph (IAG) from the result of step 1.
Step 3: Partition the IAG using the multilevel k-way partitioning algorithm (MLkP).
Step 4: Partition the dataset according to the result of step 3.

Fig. 1 An item association graph
Step 5: Call the modified Apriori algorithm or the FP-Growth algorithm to mine frequent itemsets on each transaction partition.
Step 6: Find the union of the results found from each partition.
Step 7: Generate association rules by running the Apriori-ap-genrules on the frequent itemsets found from step 6.

An example
Suppose the dataset shown in Table 2 is given, minsup is set to 0.1 (or 10%, or 7 * 0.1 ≈ 1 occurrence), and minconf is set to 0.7 (or 70%).
First, we use the Apriori algorithm to find the frequent two-itemsets. As an intermediate step, the Apriori algorithm finds the frequent one-itemsets first (shown in Table 3). The frequent two-itemsets are found afterward (shown in Table 4).
Next, we transform the above frequent two-itemsets into an item association graph (IAG), shown in Fig. 1. To construct the graph, we first take the itemset {1, 2} with support 3. For this, we create node 1 and node 2 corresponding to the two items in the itemset. The edge between node 1 and node 2 has weight 3, representing the support of the itemset. The process is repeated for every frequent two-itemset found in the previous step.
Next, we use the multilevel k-way partitioning algorithm (MLkP) to partition the IAG. In this case, the number of nodes is small, so we only bisect the graph by setting k = 2. The result is shown in Figs. 2 and 3.
The MLkP algorithm divides the IAG into two equal or almost equal sets in linear time while minimizing the sum of the weights of the edges that are cut off.
Next, we partition the dataset according to the partitions of the IAG, as shown in Tables 5 and 6. Each transaction partition has all the items from the corresponding IAG partition. However, since the algorithm has already found all frequent one- and two-itemsets, a transaction is not added to a transaction partition if the transaction has fewer than three items. For example, T000: {1, 2} is not added to transaction partition 1, since it only has two items. Some items in the original dataset may not appear in any of the transaction partitions, because the infrequent one/two-itemsets are dropped in the IAG. This simplifies the subsequent computations. In this example, however, all the items are kept in the IAG because the IAG is a relatively dense graph.
The next step is to pick the best algorithm and use it to find the frequent k-itemsets with k > 2. For this example, we choose the modified Apriori algorithm because it is faster for mining small datasets, as it avoids the process of finding the one- and two-itemsets again. The results from partition 1 are shown in Table 7. Since the modified Apriori algorithm starts with three-itemsets, there are no additional frequent itemsets in the first partition. Table 8 shows the results found in transaction partition 2.
The final results of frequent itemsets (shown in Table 9) are simply the union of Tables 3, 4, 7, and 8. After running the Apriori-ap-genrules algorithm, the association rules can be found in Table 10.
All frequent itemsets generated by the SARL heuristic are sound, meaning each frequent itemset generated indeed is correct, and the support number is accurate. However, it is possible that some frequent itemsets cannot be found by the SARL heuristic, as will be discussed shortly. In this example, the SARL heuristic loses one frequent itemset {1, 4, 5} and two related rules generated from {1, 4, 5}.
Formal description of the SARL heuristic
SARL(dataset, minsup, minconf, k, threshold):

Finding frequent 2-itemsets using the Apriori algorithm or the direct_gen algorithm
The first step of the SARL heuristic is to find the frequent 2-itemsets efficiently.
Although the Apriori algorithm has scalability issues for very large datasets, it provides a fast and convenient feature to extract intermediate results and a tolerable speed for the first two passes.
The Apriori algorithm finds the frequent itemsets $L_k$ for each k, and each $L_k$ is stored separately. We run the Apriori algorithm until it finds $L_2$, the frequent two-itemsets. It first finds the frequent one-itemsets by traversing the dataset and counting the occurrences of each unique item. If the number of occurrences of an item is less than the minsup provided by the user, that item is eliminated from the list of frequent one-itemsets. The frequent two-itemsets are discovered based on the frequent one-itemsets. The algorithm generates $C_2$, the candidate set for the frequent two-itemsets, using the $L_1 \times L_1$ join. This method generates a minimum number of candidates from the frequent one-itemsets so that we have fewer candidates to consider in the support counting phase. The Apriori algorithm also predicts and eliminates some infrequent itemsets before support counting by applying the Apriori principle in the pruning step. If an item of a candidate in $C_2$ is not in $L_1$, the item is infrequent, so all the two-itemsets that include this item are dropped. We modify the Apriori algorithm so that it terminates after $L_2$ is found.
Another method to find the frequent one- and two-itemsets is direct counting and generation. The algorithm to find frequent one-itemsets is the same as in the Apriori algorithm. To find frequent two-itemsets, we simply enumerate all two-item pairs in each transaction and count their occurrences. The advantage of this algorithm is that it does not require candidate generation from $L_1$ and avoids much unnecessary membership testing during support counting. However, this method is not efficient on large datasets since it does not use pruning and keeps all two-itemsets.
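The direct counting and generation method can be sketched in a few lines. The signature below (an absolute `min_count` threshold and dictionary results) is our assumption for illustration; only the name direct_gen comes from the text.

```python
from collections import Counter
from itertools import combinations

def direct_gen(transactions, min_count):
    """Count all items and all item pairs in one pass, then filter by min_count."""
    ones, pairs = Counter(), Counter()
    for t in transactions:
        items = sorted(set(t))
        ones.update(items)
        # No candidate generation: every pair in the transaction is counted.
        pairs.update(combinations(items, 2))
    l1 = {i: c for i, c in ones.items() if c >= min_count}
    l2 = {p: c for p, c in pairs.items() if c >= min_count}
    return l1, l2

# Hypothetical toy dataset.
transactions = [[1, 2, 3], [1, 2], [2, 3], [1, 3, 4]]
l1, l2 = direct_gen(transactions, min_count=2)
```

Note that `pairs` holds every pair seen in the data before filtering, which is the memory cost mentioned above.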
In the SARL heuristic, we ask the user for a threshold of the dataset size. If the dataset is larger than the threshold, the SARL heuristic will use the modified Apriori algorithm. Otherwise, it will use the direct_gen algorithm to compute the frequent one and two itemsets.

Construction of the item association graph
The item association graph G is constructed based on the two-itemsets generated by the Apriori algorithm. G is an undirected, weighted graph. A node $V_i$ is created for each unique item i in the set of two-itemsets T, with the maximum item number being n.
An edge $E_{ij}$ in graph G is formed for each itemset {i, j} in T, and the weight of each edge $E_{ij}$ is equal to the support of itemset {i, j} in T.
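A minimal sketch of this construction, assuming the frequent two-itemsets are given as a dictionary from item pairs to support counts, is shown below. The adjacency-dictionary layout and the name `build_iag` are illustrative choices, and apart from {1, 2} with support 3 the example values are hypothetical.

```python
from collections import defaultdict

def build_iag(frequent_pairs):
    """Undirected weighted graph: graph[i][j] == graph[j][i] == support of {i, j}."""
    graph = defaultdict(dict)
    for (i, j), sup in frequent_pairs.items():
        graph[i][j] = sup
        graph[j][i] = sup
    return dict(graph)

# {1, 2} with support 3 follows the worked example; the other pair is made up.
iag = build_iag({(1, 2): 3, (1, 3): 2})
```

Items that appear in no frequent two-itemset never become nodes, so they are dropped from all later steps.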

Partition the IAG using the multilevel k-way partitioning algorithm (MLkP)
The Multilevel k-way partitioning (MLkP) algorithm [25] is an efficient graph partitioning algorithm. The time complexity is O(E), where E is the number of edges in the graph, and the maximum load imbalance is limited to 3%.
The general idea of MLkP is to shrink (coarsen) the original graph into a smaller graph, then partition the smaller graph using an improved version of the KL/FM algorithm. Lastly, it restores (uncoarsens) the partitioned graph to a larger, partitioned graph.
METIS is a software package developed by Karypis at the University of Minnesota [30]. It includes an implementation of the MLkP algorithm that takes a graph as the input and outputs groups of nodes separated after the partition.
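Whatever partitioner is used, its output can be checked against the two objectives of balanced k-way partitioning: the weight of the cut edges and the load imbalance. The sketch below does not call METIS; it only evaluates a given node-to-part assignment, and the helper names, the graph layout (adjacency dictionaries), and the toy graph are our assumptions.

```python
def cut_weight(graph, part_of):
    """Sum of the weights of edges whose endpoints lie in different parts."""
    total = 0
    for u, nbrs in graph.items():
        for v, w in nbrs.items():
            if u < v and part_of[u] != part_of[v]:  # count each undirected edge once
                total += w
    return total

def imbalance(part_of, k):
    """Largest part size relative to the ideal size n/k, minus 1 (0.03 means 3%)."""
    sizes = [0] * k
    for p in part_of.values():
        sizes[p] += 1
    return max(sizes) / (len(part_of) / k) - 1.0

# Hypothetical IAG with four items.
graph = {1: {2: 3, 3: 1}, 2: {1: 3, 4: 1}, 3: {1: 1, 4: 2}, 4: {2: 1, 3: 2}}
good = {1: 0, 2: 0, 3: 1, 4: 1}  # keeps the heavy edges (1,2) and (3,4) intact
bad = {1: 0, 3: 0, 2: 1, 4: 1}   # cuts the heavy edges instead
```

In the SARL setting, a lower cut weight means that fewer and lighter item associations are separated, which is why the cut weight is the quantity to minimize.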

Transaction partitioning
Based on the results of the MLkP algorithm, which divides the items into groups $P_1, P_2, \ldots, P_m$, we can partition the transactions into the same number of groups, where each group $D_i$ contains only the items in partition $P_i$. For a transaction to be included in $D_i$, it must have all the items from partition $P_i$. If a transaction includes more items than the items from partition $P_i$, only the items in $P_i$ that are included in the transaction are added to $D_i$. That is, only a part of the transaction is added to $D_i$. As a result, each transaction in a transaction partition must be a subset of the corresponding transaction in the original dataset. If a transaction has fewer than three items, the transaction is not added. This is because we have already mined the one- and two-itemsets and are only interested in itemsets that have three or more items. This optimization helps to reduce the size of the transaction partitions.
In the above, $D_i$ is transaction partition i, $T_j$ is the transactions to be added to partition i, $S_j$ is the jth transaction in the original dataset, $P_i$ is item partition i, and D is the original dataset.
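The partitioning rule can be sketched as a projection of each transaction onto each item partition. This sketch reads the rule as follows, which is our assumption: a transaction $S_j$ contributes $T_j = S_j \cap P_i$ to $D_i$, and $T_j$ is kept only if it has at least three items. The function name and toy data are illustrative.

```python
def partition_transactions(transactions, item_partitions, min_len=3):
    """Project every transaction onto every item partition; keep projections of size >= min_len."""
    parts = [set(p) for p in item_partitions]
    result = [[] for _ in parts]
    for s in transactions:
        s = set(s)
        for idx, p in enumerate(parts):
            projected = s & p  # only the items of this partition survive
            if len(projected) >= min_len:
                result[idx].append(sorted(projected))
    return result

# Hypothetical dataset and item partitions.
dataset = [[1, 2], [1, 2, 3, 4, 5], [1, 2, 3], [4, 5, 6, 7]]
d1, d2 = partition_transactions(dataset, [{1, 2, 3}, {4, 5, 6, 7}])
```

The second transaction contributes {1, 2, 3} to the first partition, while its remainder {4, 5} is too short to be added to the second.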
Since the number of unique items in each partition is less than or equal to $\frac{\text{number of nodes in IAG}}{k}$ rather than $\frac{\text{total number of unique items}}{k}$, the size of each partition should be small compared to the original dataset. In rare cases, if the size of a transaction partition is greater than the memory size, the SARL heuristic can partition the IAG and the transactions again with k incremented by 1. This guarantees that each partition fits into the memory.

Selecting an algorithm on transaction partitions
One of the benefits that come with our solution is that the association rule learning on each transaction partition can be optimized by using an algorithm that best fits the partition.
During the association rule learning on the partitioned datasets, we have three candidates that are considered efficient: the Apriori algorithm, the FP-Growth algorithm, and the Eclat algorithm.
Since the modified Apriori algorithm has already computed the one-itemsets and two-itemsets during the preparation phase, the candidate generation feature of the Apriori algorithm is handy in this case. We modify the Apriori algorithm to skip the stages that find the frequent one/two-itemsets and to start with the frequent three-itemsets from the transaction partitions. This modification is particularly helpful when the minsup is set to a high value so that the expected number of itemsets is limited after the two-itemsets are found.
We can estimate the expected number of itemsets from the average transaction length of each transaction partition. A higher average transaction length indicates a higher possibility of the presence of a long "tail" in the result. Results with long tails have itemsets with considerable maximum lengths, while results with short tails only contain itemsets with small maximum lengths. A dataset with an expected long tail means the association rule learning algorithm does not terminate soon after the two itemsets are found.
The average transaction length provides a fast and straightforward reference for selecting the best algorithm for each transaction partition. If the average transaction length is low, the Apriori algorithm can be the right choice, as the modified Apriori algorithm continues from the two-itemsets that the preparation phase has already calculated. If the average transaction length is high, we can take advantage of the scalability of the FP-Growth algorithm. We omit the Eclat algorithm because neither the FP-Growth algorithm nor the Eclat algorithm has the advantage of the modified Apriori algorithm, namely that it can start from the two-itemsets. In addition, studies [31] show that the Eclat algorithm is slightly less scalable than the FP-Growth algorithm.
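The selection rule can be sketched as a simple threshold test on the average transaction length. The cutoff value below is purely hypothetical, since no specific value is given here, and the function names and returned labels are illustrative.

```python
def average_transaction_length(partition):
    """Mean number of items per transaction in a transaction partition."""
    return sum(len(t) for t in partition) / len(partition) if partition else 0.0

def choose_algorithm(partition, length_cutoff=10.0):
    """Pick FP-Growth for long transactions (long expected tail), else modified Apriori.

    length_cutoff is a hypothetical tuning parameter, not a value from the paper.
    """
    if average_transaction_length(partition) > length_cutoff:
        return "fp_growth"
    return "modified_apriori"

short_partition = [[1, 2, 3], [1, 2, 3, 4]]
long_partition = [list(range(20)), list(range(5, 30))]
```

Because the decision is made per partition, different partitions of the same dataset may be mined by different algorithms.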
Next, the selected algorithm is used to find the local frequent itemsets from the given transaction partition. After the algorithm terminates, a simple union is performed on the frequent itemsets found from each partition. Finally, Apriori-ap-genrules is used to derive the rules from the frequent itemsets. This step is relatively simple.
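The union and rule derivation steps can be sketched as follows. For brevity, this sketch enumerates every antecedent of every frequent itemset rather than implementing the level-wise ap-genrules procedure of [3], so it illustrates the output of that step and not its algorithm. Itemsets are assumed to be sorted tuples mapped to support counts; all names and values are illustrative.

```python
from itertools import combinations

def generate_rules(supports, minconf):
    """Return (antecedent, consequent, confidence) for every rule meeting minconf."""
    rules = []
    for itemset, sup in supports.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                base = supports.get(antecedent)
                if base and sup / base >= minconf:
                    consequent = tuple(i for i in itemset if i not in antecedent)
                    rules.append((antecedent, consequent, sup / base))
    return rules

# Union of the global one/two-itemsets and the itemsets found in each partition.
global_12 = {(1,): 3, (2,): 3, (3,): 3, (1, 2): 2, (1, 3): 2, (2, 3): 2}
partition_results = [{(1, 2, 3): 2}, {}]
all_frequent = dict(global_12)
for found in partition_results:
    all_frequent.update(found)

rules = generate_rules(all_frequent, minconf=0.7)
```

The union is a plain dictionary merge because the item partitions are disjoint, so no itemset can be reported by two partitions.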

Time complexity and space complexity
The theoretical time and space complexity of the Apriori algorithm is $O(2^d)$, where d is the number of unique items in the dataset.

Time complexity
The theoretical time complexity of the SARL heuristic consists of the complexity of several parts:
2-itemsets generation: Finding frequent 2-itemsets requires finding 1-itemsets first. This step is simply $O(n)$, as the algorithm traverses the dataset once. Next, the candidate generation for 2-itemsets takes $O(d^2)$, where d is the number of unique items in the dataset. Finally, the support checking requires $O(n + d^2 T)$, where T is the number of transactions in the dataset. Therefore, the time complexity of this step is $O(d^2 T + n)$.
IAG construction: Since each edge in the IAG represents a frequent two-itemset, and the maximum number of two-itemsets is $\frac{d(d-1)}{2}$, the maximum number of edges in the IAG is also $\frac{d(d-1)}{2}$. Therefore, constructing the IAG takes $O\left(d + \frac{d(d-1)}{2}\right)$ or $O(d^2)$.

IAG partition: The time complexity of the IAG partition process is equal to the time complexity of the MLkP algorithm, which is $O(E)$ or $O(d^2)$.
Transaction partition: The dataset is traversed once to assign items into different partitions. Hence the time complexity is $O(n)$.
Running a selected algorithm: The algorithm selection requires the calculation of the average transaction width of each transaction partition. The time complexity of this is $O(kn)$, where k is the number of partitions.
If the modified Apriori algorithm is selected, the theoretical time complexity for each partition is $O(2^{1.03d/k})$, where the coefficient 1.03 comes from the 3% maximum imbalance of the partitions caused by the MLkP algorithm. The total running time for all the partitions is therefore $O(k \cdot 2^{1.03d/k})$, which is exponentially faster than the $O(2^d)$ of the Apriori algorithm. The exponential speedup comes from the smaller number of unique items in each transaction partition. The algorithm that is chosen to mine frequent itemsets from the transaction partitions only needs to consider a portion of all the items for each partition.
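To get a feel for the size of this gap, the ratio between the worst-case bound $2^d$ and the per-partition bound $2^{1.03d/k}$ summed over k partitions can be evaluated on a logarithmic scale. The sketch below is plain arithmetic on these theoretical bounds; the values d = 100 and k = 2 are hypothetical, and the result says nothing about measured running times.

```python
import math

def log2_speedup(d, k, max_imbalance=0.03):
    """log2 of 2^d / (k * 2^((1 + max_imbalance) * d / k))."""
    return d - (math.log2(k) + (1 + max_imbalance) * d / k)

# Hypothetical setting: 100 unique items split into 2 partitions.
gain = log2_speedup(100, 2)  # 100 - (1 + 51.5) = 47.5, i.e. a bound ratio of about 2^47.5
```

The overheads of the earlier steps, $O(d^2 T + n)$ and $O(kn)$, are polynomial and therefore do not change the exponential character of this ratio.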

Space complexity
Like time complexity, the space complexity of the SARL heuristic consists of the complexity of several parts:
2-itemsets generation: Finding the frequent two-itemsets requires finding the one-itemsets first. This step is $O(d)$, where d is the number of unique items in the dataset, as we need to keep at most d items in the memory. Next, the candidate generation step for the 2-itemsets takes $O(d^2)$ space for at most $\frac{d(d-1)}{2}$ frequent 2-itemsets as candidates. Finally, the support checking requires another $O(d^2)$ space to store the support numbers. Hence, this step requires $O(d^2)$ space.
IAG construction: Since each edge in the IAG represents a frequent two-itemset, and the maximum number of two-itemsets is $\frac{d(d-1)}{2}$, the maximum number of edges in the IAG is also $\frac{d(d-1)}{2}$. Therefore, storing the IAG takes $O(d^2)$ space. This $d^2$ space occupation only occurs when every unique item in the dataset forms a frequent two-itemset with every other unique item in the dataset. In most cases, the actual space required to store the IAG is smaller than the memory size.
In rare cases, if the IAG cannot fit into the memory, then the Apriori algorithm and the FP-Growth algorithm must have memory issues, too. For the Apriori algorithm, all frequent two-itemsets must be stored in the memory to generate the candidates in the next level, and the size of the frequent two-itemsets is similar to that of the IAG. For the FP-Growth algorithm, the FP-Tree must be stored in the memory. The space complexity of the FP-Tree is also $O(d^2)$; however, all unique items need to be stored in the tree, while only the unique items in the frequent two-itemsets need to be stored in the IAG. Therefore, the IAG requires less space than the FP-Tree.
IAG partition: The space complexity of the IAG partition is equal to the space complexity of the MLkP algorithm, which is $O(E)$ or $O(d^2)$.
Transaction partition: The dataset is traversed once to assign items into different partitions. We can assume each partition can fit into the memory. Therefore, the space complexity is $O\left(\frac{n}{k}\right)$.
Selecting and running the selected algorithm: The algorithm selection requires the calculation of the average transaction width of each transaction partition. The space complexity of this is $O(k) = O(1)$, where k is the number of partitions.
If the modified Apriori algorithm is selected, the theoretical space complexity for each partition is $O(2^{1.03d/k})$, an exponential reduction in space compared to the $O(2^d)$ of the Apriori algorithm. The exponential reduction of space usage comes from the smaller number of unique items in each transaction partition. If the modified Apriori algorithm is chosen to mine frequent itemsets from the transaction partitions, it only generates a smaller number of candidates for each transaction partition, since it does not consider items in other partitions.

Error bound
The SARL heuristic sacrifices some precision to obtain the speedup. However, every frequent itemset found by the algorithm is correct, and the support associated with each frequent itemset is also correct. The heuristic may miss some trivial frequent itemsets, i.e., itemsets with low support. During the IAG partition phase, the MLkP algorithm makes cuts on the IAG to minimize the sum of the weights of the edges that are cut off. This helps to prevent edges with large weights from being cut off, while some trivial, small-weight (low-support) edges may be lost.
In the most (extreme) case, when every transaction has all the items and minsup is set to 0, we can calculate the error bound. In this case, the IAG is a complete graph, and the fraction of the edges cut off by the MLkP algorithm is n * n− n k E = (k−1)n k(n−1) . When n is very large, the fraction is approximately k−1 k . In this case, we can set k as low as 2 to still maintain 50% coverage for the frequent three or more itemsets. The calculation of frequent one and two itemsets is always accurate because they are calculated using the Apriori algorithm or the direct-generate algorithm.
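The worst-case fraction can be checked numerically. The sketch below counts the cut edges of a complete graph whose nodes are split into $k$ equally sized parts and compares the result with the closed form $(k-1)n / (k(n-1))$. It assumes that $n$ is divisible by $k$; the function names are hypothetical.

```python
from itertools import combinations


def cut_fraction_complete_graph(n, k):
    """Fraction of the edges of a complete graph on n nodes that are cut
    when the nodes are split into k equally sized parts (n divisible by k)."""
    if n % k != 0:
        raise ValueError("this sketch assumes that n is divisible by k")
    part_of = [node // (n // k) for node in range(n)]  # contiguous blocks
    total = cut = 0
    for u, v in combinations(range(n), 2):
        total += 1
        if part_of[u] != part_of[v]:
            cut += 1
    return cut / total


def closed_form(n, k):
    """The closed-form fraction (k - 1) n / (k (n - 1))."""
    return (k - 1) * n / (k * (n - 1))


for n, k in [(12, 3), (100, 2), (100, 4)]:
    print(n, k, cut_fraction_complete_graph(n, k), closed_form(n, k))
```

For $k = 2$ the fraction approaches 1/2 from above as $n$ grows, which matches the 50% coverage figure for the worst case.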
The error rate should be significantly lower in more practical cases. However, it is difficult to estimate such an error rate because it is affected by many factors, such as the closeness of groups of items (i.e., does an item appear with only a small number of other items?), the choice of minsup, and the maximum length of the frequent itemsets. We can make a rough estimate by introducing a parameter $P_{out}$, the ratio of the edges cut off in the IAG: $P_{out} = E_{cut} / E_{total}$. This parameter is determined by the characteristics of the dataset, the choice of minsup, and the number of partitions we choose. $P_{out}$ is also a rough estimate of the error rate for the frequent itemsets with two or more items. Let $P_m$ be the ratio of the frequent itemsets with two or more items to all frequent itemsets; then the total error bound can be computed as

$$P_m = \frac{\#\,\text{frequent 2+ itemsets}}{\#\,\text{total frequent itemsets}} \tag{5}$$

$$Error_{total} = P_m \cdot P_{out} \tag{6}$$

Initial selection of number of partitions, k
The selection of k determines the speed and accuracy of the SARL heuristic. A larger k usually means higher speed and lower accuracy, and vice versa. Depending on the size of the dataset and the application, k = 2, 3, or 4 are balanced choices. In rare cases, the heuristic will increase the value of k if any transaction partition cannot fit into the memory under the current setting of k.

Benefits of having datasets fit into the memory
According to "Our solution" section, Part 8, the transaction partitions are guaranteed to be small enough to fit into the memory. Therefore, any operations performed on these in-memory datasets should be faster than before. For example, the Apriori algorithm makes as many passes over the dataset as the maximum length of the frequent itemsets. Each of these passes requires reading the dataset from the disk. With our solution, the SARL heuristic makes at most two passes over the dataset. The first pass generates the frequent one- and two-itemsets, and in the second pass, the algorithm brings a fraction of the dataset into the memory. All further passes are made directly in the memory, resulting in a speedup.
We do not analyze the communication cost between the main memory and the hard disk quantitatively in this paper. Due to the nature of our divide-and-conquer approach, we do not implement any additional swapping mechanism, so each partition is brought into the memory only once. Therefore, this cost should be no larger than the cost of the Apriori algorithm.

Theorem 1 (Soundness) All frequent itemsets and association rules generated by the SARL heuristic are correct.
Proof Assume the SARL heuristic generates an incorrect frequent itemset. We can assume the correctness of the Apriori algorithm and the FP-Growth algorithm. Therefore, there must be an error in transaction partitioning. There are two possible types of error in transaction partitioning:
(Possibility 1) The support of some itemsets is higher or lower than it should be.
(Possibility 2) Some transactions include additional items or lose some items.
Assume the first possibility is true. We divide the dataset vertically (item-wise) during the transaction partitioning phase. Since every item in the original dataset $D$ that belongs to $P_i$ must be added to $D_i$, all unique items in a transaction partition must appear in the same number of transactions as in the original dataset. Hence, the support of each itemset should be the same as in the original dataset. This conflicts with the first possibility: the support of some itemsets is higher or lower than it should be.
Assume the second possibility is true. During the transaction partitioning phase, each transaction in the original dataset may be assigned to a single transaction partition, or it may be split into disjoint parts. Therefore, each transaction in a transaction partition must be a subset of the corresponding transaction in the original dataset, and this process cannot add any new items to any transaction. If some items were lost during the transaction partitioning phase, the results might have incorrect supports. However, we know that the union of the unique items in all transaction partitions is equal to the set of unique items of the frequent two-itemsets, since the IAG partitioning cuts off some edges of the IAG but not the nodes. According to the Apriori principle, a three-itemset can be frequent only if all its two-item subsets are frequent. This means that the unique items of the frequent itemsets with three or more items must be a subset of the unique items of the frequent two-itemsets. Hence, we have

$$\forall n \ge 3,\quad I_n \subseteq I_2 = \bigcup_{j=1}^{m} P_j \tag{7}$$

where $I_n$ is the set of unique items of the frequent n-itemsets, $P_j$ is the set of unique items of transaction partition $j$, and $m$ is the number of transaction partitions. Therefore, all items needed by the frequent three (or higher) itemsets are present in the transaction partitions. Hence, we find a contradiction between our algorithm and the second possibility.
In summary, since both possibilities are proved to be false, the SARL heuristic is sound. □

Theorem 2 Computing the frequent two-itemsets is considered relatively trivial compared to computing the frequent itemsets with three or more items.
Proof If the computation of the frequent two-itemsets takes more than half of the total computation time, we may say that computing the frequent two-itemsets is not trivial.
Characterizing the distribution of frequent itemsets is relatively difficult due to the challenges in modeling the data. We develop a mathematical model to simulate the characteristics of any dataset. The relationships among all the frequent itemsets can be depicted using an itemset lattice diagram. Figure 4 shows the case when every itemset has a support greater than minsup. However, in most cases, each layer will have some itemsets removed for one of two reasons: the anti-monotone property of the Apriori principle or the lack of support (i.e., support < minsup). To model the former, we apply the anti-monotone property to the itemset lattice. The anti-monotone property is as follows:

$$\forall X, Y \in J:\ (X \subseteq Y) \Rightarrow f(Y) \le f(X)$$

where $J = 2^I$ is the power set of the set of items $I$; a measure $f$ that satisfies this condition is anti-monotone. Applying this property to the lattice, with support as the measure, we obtain the following interpretation: if an itemset is infrequent, then all of its supersets must also be infrequent.
For example, in Fig. 5, if {1, 3} is infrequent, then {1, 2, 3}, {1, 3, 4}, and {1, 2, 3, 4} are all infrequent. To model this property, we can imagine that each infrequent itemset in a layer causes some supersets in the next layer to be infrequent. The first infrequent itemset results in $n - k + 1$ infrequent itemsets in the next layer, where $n$ is the number of unique items in the dataset, and $k$ is the current layer number, i.e., the number of items in each itemset of the current layer. We know that each layer has $C_n^k$ itemsets if none of them is infrequent. Then the next layer will have $C_n^{k+1}$ itemsets in total. Since $n - k + 1$ is the number of infrequent itemsets that one infrequent itemset causes in the next layer, we can estimate $I_k$, the number of itemsets in layer $k + 1$ that are infrequent because of the Apriori principle. The number of remaining frequent itemsets in layer $k$, considering the above estimate of the influence of the Apriori principle, is $C_n^k - I_{k-1}$. Let $p$ be the probability that an itemset is frequent, given that its parent is frequent. We can then obtain the final estimated number of frequent itemsets for layer $k$.
For n = 200 and p = 0.8, 0.6, 0.4, 0.2, 0.1, the estimated numbers of itemsets with two, three, and more items are shown in Table 11. For n = 2000 and p = 0.8, 0.6, 0.4, 0.2, 0.1, the estimated numbers are shown in Table 12. The above model and examples show that the number of two-itemsets is, on average, less than 10% of the number of itemsets with three or more items. This means that less than 10% of the total computation is spent on the two-itemsets. Thus, our algorithm speeds up the costly part, the part that mines itemsets with three or more items. □

Proof Assume that $f < 100\% - 3\%$, i.e., $f < 97\%$, and that a transaction partition has $d_i \ge d/k$ unique items. According to our algorithm, since $d_i \ge d/k$, a partition in the IAG must have at least $d/k$ nodes.
As we assumed earlier, the maximum imbalance rate for the MLkP algorithm is set to 3%, so the number of nodes $n$ in the IAG satisfies $\frac{d}{k} \cdot 0.97 \cdot k \le n \le \frac{d}{k} \cdot 1.03 \cdot k$, or $0.97d \le n \le 1.03d$. Since $n$ cannot be more than the total number of unique items, $0.97d \le n \le d$. However, we know that $f < 97\%$, or $f \cdot d < 0.97d$, and that $n \le f \cdot d$, since some frequent one-itemsets may not appear in any frequent two-itemset. Hence $n \le f \cdot d < 0.97d$, i.e., $n < 0.97d$. This contradicts $0.97d \le n \le d$. Therefore, the assumption $d_i \ge d/k$ is false, and the reverse, $d_i < d/k$, must be true. □

Experiments and results
We design and conduct experiments on both small and large datasets to demonstrate the scalability of our algorithm. The experiments are performed on a computer with the following settings: The datasets [32] we use include Bible [33], T10I4D100K [32], and T40I10D100K [32]. The details of each dataset will be discussed later.

The Bible dataset
The Bible dataset has the following metrics:
This is a small to medium-sized dataset. The experiments are repeated for minsup values of 50%, 40%, 30%, 20%, and 10%. The time limit for each experiment is set to 800 s. The results are shown in Table 13.
According to Fig. 6, the two-partition, Apriori-based SARL heuristic scales the best for this dataset regardless of the minsup value. It is 2 to 2.5 times faster than the Apriori algorithm. The FP-Growth algorithm reaches the 800-s time limit for all test cases. A possible explanation is that the number of unique items in this dataset is large, so the FP-Tree cannot fit into the memory. As a result, the FP-Growth algorithm does not perform well here. The other three settings of the SARL heuristic also outperform the Apriori algorithm. Compared with the FP-Growth algorithm and the Apriori algorithm, the SARL heuristic is more scalable for all tested values of minsup.
As we proved earlier, all the frequent itemsets found by the SARL heuristic are accurate, with the correct support. This is important because we need the accurate support to calculate the confidence of the rules. The SARL heuristic may miss some frequent itemsets with a lower support. Here, we calculate the accuracy as

$$accuracy = \frac{\text{number of frequent itemsets found by SARL}}{\text{number of frequent itemsets found by Apriori}}$$

The accuracy of the SARL heuristic drops on the Bible dataset when the value of minsup is low. From Table 14 and Fig. 7, both settings of the four-partition SARL heuristic achieve 100% accuracy for minsup values from 50% down to 30%. This is because the MLkP algorithm is able to find a perfect or almost perfect cut on the IAG, so that there are no inter-partition frequent itemsets in this range. When 100% accuracy is achieved, the SARL heuristic discovers not just the frequent one- and two-itemsets, but also the frequent itemsets with three or more items. As for the two-partition SARL heuristic settings, the accuracy starts at 73.91% at 50% minsup and drops to 39.8% at 10% minsup.
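The accuracy measure used above can be written as a short helper. This is only an illustrative sketch: the function name and the toy itemsets are hypothetical and do not correspond to any entry of Table 14. The subset check reflects the soundness property (Theorem 1), under which every itemset reported by the heuristic must also appear in the exact result.

```python
def accuracy(found_by_sarl, found_by_apriori):
    """Ratio of frequent itemsets found by SARL to those found by Apriori."""
    sarl = {frozenset(s) for s in found_by_sarl}
    exact = {frozenset(s) for s in found_by_apriori}
    # Soundness: the heuristic may miss itemsets but must not invent any.
    if not sarl <= exact:
        raise ValueError("SARL reported an itemset missing from the exact result")
    return len(sarl) / len(exact)


# Toy data (hypothetical): SARL misses one itemset that spans two partitions.
exact_itemsets = [{"a"}, {"b"}, {"c"}, {"a", "b"}, {"a", "b", "c"}]
sarl_itemsets = [{"a"}, {"b"}, {"c"}, {"a", "b"}]
toy_accuracy = accuracy(sarl_itemsets, exact_itemsets)
print(toy_accuracy)
```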

The T10I4D100K dataset
The second dataset we have tested is T10I4D100K. It has the following statistics:
The algorithms are tested on T10I4D100K for minsup values of 10%, 4%, 1%, 0.7%, and 0.4%. This dataset has a medium size (for this environment), so the time limit is set to 300 s for each experiment. Table 15 and Fig. 8 show the results for T10I4D100K.
From Fig. 8, the Apriori algorithm has an average performance for the initial minsup values of 10% and 4%. However, it quickly reaches the maximum running time after that and is unable to finish the task in time for all subsequent settings of minsup. The FP-Growth algorithm performs better. It is the fastest for the higher minsup values of 10% and 4%, but its running time jumps to almost 300 s for 1% and 0.7%, before it times out at 0.4%. All settings of the SARL heuristic outperform the Apriori and the FP-Growth algorithms for middle and low settings of minsup. The SARL heuristic is 8.6 to 13.8 times faster than the FP-Growth algorithm at minsup = 1% and 0.7%. The SARL heuristic is slightly slower at a high minsup of 10%; at a minsup of 4%, it ties with the Apriori algorithm but is slightly slower than the FP-Growth algorithm.
The accuracy of the SARL heuristic is high on the T10I4D100K dataset. As shown in Table 16 and Fig. 9, all four settings of the SARL heuristic achieve 100% accuracy for minsup values from 10% to 0.4%. This is because, for a high minsup, the number of frequent itemsets with three or more items in this dataset is small compared with the number of frequent two-itemsets, and the mining of the frequent one- and two-itemsets is accurate. For low minsup values, the MLkP algorithm successfully finds a perfect or almost perfect cut on the IAG, so the results are accurate.

The T40I10D100K dataset
The dataset T40I10D100K has the following statistics:
This relatively large dataset was tested on minsup values of 20%, 10%, 7%, and 4%. The maximum running time was set to 300 s for each experiment. The results of the experiments (shown in Table 17 and Fig. 10) show an obvious distinction between the scalability of the different algorithms. All settings of the SARL heuristic demonstrate high scalability. Almost all settings of the SARL heuristic have a stable running time throughout the entire range of minsup. Surprisingly, the Apriori algorithm performs better than the FP-Growth algorithm for minsup values between 20% and 7%. However, it is still unable to terminate within the time limit for minsup = 4%. Lastly, the FP-Growth algorithm does not scale well on this dataset. It fails to terminate within the given time for minsup values of both 7% and 4%.
The accuracy of the SARL heuristic on the T40I10D100K dataset is the same as on the T10I4D100K dataset. Table 18 and Fig. 11 show that the SARL heuristic has 100% accuracy, for reasons similar to those given above in the analysis of the T10I4D100K experiment results.

Conclusions and future work
In this paper, we have proposed a scalable, highly parallelizable association rule mining heuristic (the SARL heuristic). The contributions include the use of the divide-and-conquer method to speed up complex computations, the use of an item association graph that provides an efficient estimation of potential frequent itemsets, and the use of the MLkP algorithm to divide the items into partitions while minimizing the loss of information. We have shown the scalability of the SARL heuristic through a series of experiments. The results indicate that the SARL heuristic has better scalability, with high accuracy, than both the Apriori and the FP-Growth algorithms in most cases.
As discussed, the proposed heuristic is limited by the space requirement that the memory should be large enough to accommodate the IAG (proportional to $d^2$, where $d$ is the number of unique items in the transactions), which we think may be a reasonable assumption in practice.
In the future, we plan to extend our work with the following tasks:
• Develop a parallel version of the SARL heuristic and its implementation. The transaction partitions can be considered as independent datasets, and we can easily run the modified Apriori algorithm or the FP-Growth algorithm on each of the transaction partitions in parallel and then merge the results (frequent itemsets with three or more items) together with the frequent one- and two-itemsets to obtain the total frequent itemsets. Each parallel processor does not need to communicate with others during the computation, since all the information needed is already included in the local dataset. This would result in maximum utilization of each processor.
• Study how different characteristics of the datasets influence the performance of the SARL heuristic. Although we know that the SARL heuristic has excellent performance for most datasets, the exact speed and accuracy of the SARL heuristic are still unpredictable. We think that by applying some statistical measurements to the dataset, it is possible to roughly estimate the accuracy and speed of the SARL heuristic. This will help the user to determine whether using the SARL heuristic is beneficial enough compared to other accurate algorithms.
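The parallel scheme outlined in the first task might look roughly like the following sketch. It is not an existing implementation: `mine_partition` is a brute-force stand-in for the modified Apriori or FP-Growth miner, all names and the toy data are hypothetical, and a thread pool is used only to keep the example self-contained (a process pool would be the more natural choice for CPU-bound mining).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def mine_partition(job):
    """Brute-force stand-in for the per-partition miner.

    Returns {itemset: support_count} for frequent itemsets of size >= 3.
    """
    transactions, minsup_count = job
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(3, len(items) + 1):
        for candidate in combinations(items, size):
            candidate = frozenset(candidate)
            count = sum(1 for t in transactions if candidate <= t)
            if count >= minsup_count:
                frequent[candidate] = count
    return frequent


def mine_in_parallel(transaction_partitions, minsup_count, small_itemsets):
    """Mine each partition independently and merge with the 1- and 2-itemsets."""
    merged = dict(small_itemsets)
    jobs = [(part, minsup_count) for part in transaction_partitions]
    with ThreadPoolExecutor() as pool:
        # The workers share no state, so no communication is needed.
        for result in pool.map(mine_partition, jobs):
            merged.update(result)
    return merged


# Toy input (hypothetical): two transaction partitions and one precomputed
# frequent one-itemset standing in for the frequent one- and two-itemsets.
partitions = [
    [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}],
    [{"x", "y", "z"}, {"x", "y", "z"}, {"x", "z"}],
]
small_itemsets = {frozenset({"a"}): 3}
all_frequent = mine_in_parallel(partitions, 2, small_itemsets)
print(len(all_frequent))
```

Because the item sets of the partitions are disjoint, the per-partition results never collide, so merging reduces to a dictionary update.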