A scalable association rule learning heuristic for large datasets

Li, Haosong; Sheu, Phillip C.-Y.

doi:10.1186/s40537-021-00473-3

Research
Open access
Published: 09 June 2021

A scalable association rule learning heuristic for large datasets

Journal of Big Data volume 8, Article number: 86 (2021) Cite this article

7027 Accesses
15 Citations
3 Altmetric
Metrics details

Abstract

Many algorithms have proposed to solve the association rule learning problem. However, most of these algorithms suffer from the problem of scalability either because of tremendous time complexity or memory usage, especially when the dataset is large and the minimum support (minsup) is set to a lower number. This paper introduces a heuristic approach based on divide-and-conquer which may exponentially reduce both the time complexity and memory usage to obtain approximate results that are close to the accurate results. It is shown from comparative experiments that the proposed heuristic approach can achieve significant speedup over existing algorithms.

Introduction

The association rule learning problem has played a significant role in data mining for the past few decades. Association rules are widely used in many fields, including market basket analysis [1] and bioinformatics [2]. However, the problem has an NP-hard nature, meaning it is challenging to find the results within a reasonable period of time.

The invention of the Apriori Algorithm [3] made this problem computationally feasible for most computers on regular-sized datasets. Since then, researchers have continued to develop more scalable algorithms. Among others, FP-Growth [4] and Eclat [5] are two algorithms developed that improve the scalability of the Apriori algorithm.

The increasing popularity of the Internet in recent decades has made big data available to many research institutions and companies. Their sizes are so large that traditional algorithms may not be able to handle them efficiently. We consider “big data” to be datasets that, at least, are too large to fit into the memory and take a long time (hours or even days) for traditional algorithms to process. The term big data is thus relevant to the machine. A dataset considered to be big data on a PC may be a small dataset on a powerful high-performance computer (or computer cluster). This imposes a challenge to the association rule learning problem as well. Most of the previously designed algorithms, including the Apriori algorithm, the FP-Growth algorithm, and the Eclat algorithm, suffer from the problem of scalability for big data. Still, these algorithms take an unacceptable amount of time to terminate (will be discussed in "Experiments and results" section). In addition, the FP-Tree of the FP-Growth algorithm, and the TID list of the Eclat algorithm may not fit in the memory.

This paper introduces an approach that makes it possible to mine association rules and frequent itemsets for large datasets. The approach, called the Scalable Association Rule Learning (SARL) heuristic, follows the divide-and-conquer paradigm and it vertically divides a dataset into almost equivalent partitions using a graph representation and the k-way graph partitioning algorithm [6]. The total time complexity of the SARL heuristic, including the overhead of partitioning a dataset, is up to 2^d faster than that of the Apriori algorithm, where d is the number of unique items in the dataset. The memory usage is also lower than those of the current algorithms. Because of the speedup, our heuristic may be applied to real-time data analysis that can benefit many scientific [7] and military applications [8, 9].

The rest of the paper is organized as follows. In "Related work" section, we survey existing association rule learning algorithms and graph partitioning algorithms. In "Our solution" section, we present the SARL heuristic with examples, formal descriptions, theorems, and proofs. The experiments and results are presented in "Experiments and results" section, followed by conclusions and future work.

Contributions

The contributions of this paper include the provision of association graphs that represent an efficient estimation of potential frequent itemsets and the use of the MLkP algorithm to divide the items into partitions while minimizing the loss of information.

The novelty of this paper lies in two parts of our solutions (discussed later). Firstly, we propose the verticle(item-wise) partition of datasets while most divide and conquer algorithms focus on horizontal (transaction-wise) divide and conquer methods. Secondly, the transformation of frequent two-itemsets into graph representation while applying the efficient MLkP algorithm is a novel and efficient approach in solving the association rule learning problem.

Related work

Association rule learning/frequent itemset mining has been an active research area. Among others, three approaches are considered the most popular and possibly the most efficient: the Apriori algorithm, the FP-Growth algorithm, and the Eclat algorithm.

The Apriori algorithm

The Apriori algorithm [3], introduced by Agrawal and Srikant, was the first efficient association rule learning algorithm. It incorporates various techniques to speed up the process as well as to reduce the use of memory. For example, the L_k-1 × L_k-1 method used in the candidate generation process can reduce the number of candidates generated, and the pruning process can significantly reduce the number of possible candidates at each level.

One of the most important mechanisms in the Apriori algorithm is the use of the hash tree data structure. It uses this data structure in the candidate support counting phase to reduce the time complexity from O(kmn) to O(kmT + n), where k is the average size of the candidate itemset, m represents the number of candidates, n represents the number of items in the whole dataset, and T is the number of transactions.

The major advantage of the Apriori algorithm comes from its memory usage because only the k − 1 frequent itemsets, L_k − ₁, and the candidates in level k, C_k, need to be stored in the memory. It generates the minimum number of candidates based on the ${L}_{k-1}\times {L}_{k-1}$ (described in [3]) and the pruning method, and it stores them in the compact hash tree structure. In case the candidates fill up the memory from the dataset and a low minsup setting, the Apriori algorithm does not generate all the candidates to overload the memory. Instead, it generates as many candidates as the memory can hold.

The FP-growth algorithm

The Frequent Pattern Growth algorithm was proposed by Han et al. in 2000 [4]. It uses a tree-like structure (called Frequent Pattern Tree) instead of the candidate generation method used in the Apriori algorithm to find the frequent itemsets. The candidate generation method finds the candidates of the frequent itemsets before reducing them to the actual frequent itemsets through support counting.

The algorithm first scans a dataset and finds the frequent one itemsets. Then, a frequent pattern tree is constructed by scanning the dataset again. The items are added to the tree in the order of their support. Once the tree is completed, the tree is traversed from the bottom, and a conditional FP-Tree is generated. Finally, the algorithm generates the frequent itemsets from the conditional FP-Tree.

The FP-Growth algorithm is more scalable than the Apriori algorithm in most cases since it makes fewer passes and does not require candidate generation. However, it suffers from memory limitations since the FP-Tree is fairly complex and may not fit in the memory. Traversing the complexed FP-Tree may also be time-expensive if the tree is not compact enough.

The Eclat algorithm

Different from the Apriori algorithm and the FP-Growth algorithm that work on horizontal datasets (e.g., T001: {1, 3} T002:{1, 4}), the Eclat (Equivalence Class Clustering and bottom-up Lattice Traversal) algorithm [5] uses a vertical dataset (e.g. Item1: {T001, T002}, Item3: {T001}, Item4:{T002}). The Eclat algorithm only scans the dataset once. It finds the frequent itemsets by taking the intersections of the transaction sets.

The Eclat algorithm takes advantage of scanning the dataset only once. However, when the dataset is large, and the minsup is set to a low value, the TID associated with each itemset may become very long. In fact, the results can be larger than the original dataset; therefore, they may not fit into the memory.

Other association rule learning algorithms

There are three categories of association rule mining/frequent itemset mining algorithms [10]: Apriori-based algorithms, tree-based algorithms, and pattern growth algorithms. The Apriori algorithm, the Eclat algorithm, and the FP-Growth algorithm are the most popular algorithms for the three categories, respectively.

In the Apriori-based algorithm category, proposed by Agrawal and Srikant in [3] the AprioriTID algorithm is similar to Apriori, except that it generates C_k-bar and it mines the frequent itemsets from there instead of the dataset. The Apriori Hybrid algorithm [3] is a combination of the Apriori algorithm and the AprioriTID algorithm. The DHP (direct hashing and pruning) algorithm [11] uses a hash function to distribute the itemsets into buckets. If a bucket has the support lower than the minsup, then the bucket is discarded. The MR-Apriori [12] and HP-Apriori [13] algorithms are distributed versions of the Apriori algorithm. The MR-Apriori uses the MapReduce model on the Hadoop platform. They enable parallel execution of the Apriori algorithm.

The tree-based algorithms, represented by the Eclat algorithm, find the frequent itemset by constructing a lexicographic tree. The AIS algorithm [6] and the SETM algorithm [14] are the two earliest association rule mining algorithms in this category. Reference [3] shows that the Apriori algorithm beats them in running time. The TreeProjection algorithm [15] counts the supports of the frequent itemsets and uses the nodes of a lexicographic tree as the representation of these support numbers. The TM algorithm [16] maps the TID of each transaction to transaction intervals before performing intersections between these intervals.

Lastly, the algorithms in the pattern growth category focus on frequent patterns. The P-Mine algorithm [17] is a parallel computing algorithm that utilizes the VLDBMine data structure to store the dataset and speed up the distribution of data, while the LP-Growth algorithm [18] makes use of an array-based linear prefix tree to improve the memory efficiency. The Can-Mining algorithm [19] finds the frequent itemsets from a canonical-order tree, which speeds up the tree traversal process when the number of frequent itemsets is low. Finally, the EXTRACT algorithm [20] uses the theory of Galois lattice to derive association rules.

The algorithms discussed above, unfortunately, have scalability problems. The Apriori-based algorithms, represented by the Apriori algorithm, have to go through the expensive candidate generation and support counting process. This causes a disadvantage in running time. The tree-based and the pattern-growth type algorithms often suffer from excessive usage of memory. For example, the FP-Growth algorithm could build a complex FP-Tree which does not fit into the memory.

We show the scalability problems of the Apriori algorithm and the FP-Growth algorithm in the experiment part of this paper. Both of the algorithms take too long to finish for most of the tested datasets. The need for faster, frequent itemset mining is urgent due to the vastly available data today. Companies and institutions have allocated many resources in data mining, and they need a time-saving, resource-saving solution. In addition, real-time data analysis plays an important role in government [21], scientific [7], and military [8, 9] applications. The experiments part of this paper shows that the current algorithms represented by the Apriori algorithm and the FP-Growth algorithm are not fast enough to complete real-time data analysis. The scalability problems of most existing association rule mining algorithms have also been addressed in [22] that is focused on paralleled computing of association rules whereas this paper presents a scalable algorithm that is suitable for a single machine also.

Graph partitioning algorithms

One of the key steps in the SARL heuristic that we will introduce shortly is to partition the IAG (item association graph, "Our solution" section, part 7) into k balanced partitions. An efficient graph partitioning algorithm is crucial since the balanced graph partitioning problem is NP-complete [23]. We have implemented three algorithms and compared them for the partitioning costs and running times. They are the recursive version of the Kernighan-Lin Algorithm [24], the Multilevel k-way Partitioning Algorithm (MLkP) [25], and the recursive version of the Spectral Partitioning Algorithm [26]. Other graph partitioning algorithms include the Tabu search-based MAGP algorithm [27] and the flow-based KaFFPa algorithm [28].

The Kernighan-Lin algorithm swaps the nodes assigned to both partitions and finds the largest decrease in the total cut size. The Multilevel k-way Partitioning algorithm (MLkP) uses coarsening-partitioning-uncoarsening/refining steps to shrink a graph into a much smaller graph. After partitioning, the graph is rebuilt to restore the original graph. A single global priority queue is used for all types of moves. The Spectral Partitioning Algorithm finds splitting of the values such that the vertices in a graph can be partitioned with respect to the evaluation of the Fiedler vector.

Experiments are conducted by us to compare the three algorithms. The datasets provided by Christopher Walshaw at the University of Greenwich [29] are used. The datasets are as large as possible while the partitioning algorithms can finish in a reasonable time on the tested machine. We also run experiments on complete graphs with 30 and 300 nodes. Each dataset is tested four rounds with the number of partitions (k) being 2, 4, 8, and 16.

As shown in Table 1, the running times are highlighted in the red box. We can tell from average running time(the last row) that the MLkP algorithm has the highest speed in general. It is 560 times faster than the spectral partitioning algorithm and even faster than the recursive Kernighan-Lin algorithm. The spectral partitioning algorithm has, in general, the best partition quality. It is 1.3 times better than MLkP and much better than the recursive Kernighan-Lin algorithm. The recursive Kernighan-Lin algorithm takes too long to complete all five datasets. It also shows serious scalability issues for complete graphs.

Table 1 Results of the experiment that compare MLkP, Kernighan-Lin, and Spectral Partitioning algorithms

A scalable association rule learning heuristic for large datasets

Abstract

Introduction

Contributions

Related work

The Apriori algorithm

The FP-growth algorithm

The Eclat algorithm

Other association rule learning algorithms

Graph partitioning algorithms

Our solution

Definitions

A scalable heuristic algorithm—SARL-heuristic

An example

Formal description of the SARL heuristic

Finding frequent 2 itemsets using the Apriori algorithm or dirct_gen algorithm

Construction of the item association graph

Partition the IAG using the multilevel k-way partitioning algorithm (MLkP)

Transaction partitioning

Selecting an algorithm on transaction partitions

Time complexity and space complexity

Time complexity

Space complexity

Error bound

Initial selection of number of partitions, k

Benefits of having datasets fit into the memory

Theorems and proofs

Theorem 1

Proof

Theorem 2

Proof

Theorem 3

Proof

Experiments and results

The bible dataset

The T10I4D100K dataset

The T40I10D100K dataset

Conclusions and future work

Availability of data and materials

Abbreviations

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Authors' information

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords