A scalable association rule learning and recommendation algorithm for large-scale microarray datasets

Association rule learning algorithms have been applied to microarray datasets to find association rules among genes. With the development of microarray technology, larger datasets have been generated recently that challenge the current association rule learning algorithms. Specifically, the large number of items per transaction significantly increases the running time and memory consumption of such tasks. In this paper, we propose the Scalable Association Rule Learning (SARL) heuristic that efficiently learns gene-disease association rules and gene–gene association rules from large-scale microarray datasets. The rules are ranked based on their importance. Our experiments show the SARL algorithm outperforms the Apriori algorithm by one to three orders of magnitude.

Learning (SARL) algorithm that focuses on the learning speed and the importance of rules derived.
As more microarray datasets are generated every day, investigators seeking potential associations between genes and between genes and diseases need a tool to find candidate rules across multiple datasets quickly. SARL is such a tool that provides scalable association rule learning and rule ranking. After having a general idea of candidate rules, investigators may choose to run a more time-costly algorithm that precisely calculate the rules on a few selected datasets. Therefore, by quickly reducing the scope of datasets and giving a general idea to the investigator, our algorithm can reduce the total time needed to find a target rule and increases the success rate.

Contributions
There are three main contributions in this paper: 1. The SARL heuristic with the divide and conquer methodology can increase the efficiency and scalability of association rule mining but still maintain reasonable accuracy. 2. The rule ranking algorithm calculates the importance and ranks the rules, so the investigator does not have to search through millions of rules. 3. We consider gene-disease associations. Each important association rule between genes and disease can be identified and highlighted in the result. The Apriori algorithm [1], introduced by Agrawal and Srikant, was the first efficient association rule learning algorithm. It incorporates various techniques to speed up the process as well as to reduce the use of memory. For example, the L k-1 × L k-1 method used in the candidate generation process can reduce the number of candidates generated, and the pruning process can significantly reduce the number of possible candidates at each level.
One of the most important mechanisms in the Apriori algorithm is the use of the hash tree data structure. It uses this data structure in the candidate support counting phase to reduce the time complexity from O(kmn) to O(kmT + n), where k is the average size of the candidate itemset, m represents the number of candidates, n represents the number of items in the whole dataset, and T is the number of transactions.
The major advantage of the Apriori algorithm comes from its memory usage because only the k-1 frequent itemsets, L k-1 , and the candidates in level k, C k , need to be stored in the memory. It generates the minimum number of candidates based on the L k−1 × L k−1 (described in [1]) and the pruning method, and it stores them in the compact hash tree structure. In case the candidates fill up the memory from the dataset and a low minsup setting, the Apriori algorithm does not generate all the candidates to overload the memory. Instead, it generates as many candidates as the memory can hold.

The FP-Growth Algorithm
The Frequent Pattern Growth algorithm was proposed by Han et al. in 2000 [2]. It uses a tree-like structure (called Frequent Pattern Tree) instead of the candidate generation method used in the Apriori algorithm to find the frequent itemsets. The candidate generation method finds the candidates of the frequent itemsets before reducing them to the actual frequent itemsets through support counting. The algorithm first scans a dataset and finds the frequent one itemsets. Then, a frequent pattern tree is constructed by scanning the dataset again. The items are added to the tree in the order of their support. Once the tree is completed, the tree is traversed from the bottom, and a conditional FP-Tree is generated. Finally, the algorithm generates the frequent itemsets from the conditional FP-Tree. The FP-Growth algorithm is more scalable than the Apriori algorithm in most cases since it makes fewer passes and does not require candidate generation. However, it suffers from memory limitations since the FP-Tree is relatively complex and may not fit in the memory. Traversing the complexed FP-Tree may also be time-expensive if the tree is not compact enough.

Graph Partitioning Algorithms
One of the key steps in the SARL heuristic that we will introduce shortly is to partition the IAG (item association graph, section 7) into k balanced partitions. An efficient graph partitioning algorithm is crucial since the balanced graph partitioning problem is NP-complete [3]. We have implemented three algorithms and compared them for the partitioning costs and running times. They are the recursive version of the Kernighan-Lin Algorithm [4], the Multilevel k-way Partitioning Algorithm (MLkP) [5], and the recursive version of the Spectral Partitioning Algorithm [6].
Other graph partitioning algorithms include the Tabu search based MAGP algorithm [7] and the flow-based KaFFPa algorithm [8]. The Kernighan-Lin algorithm swaps the nodes assigned to both partitions and finds the largest decrease in the total cut size. The Multilevel k-way Partitioning algorithm (MLkP) uses coarsening-partitioning-uncoarsening/refining steps to shrink a graph into a much smaller graph. After partitioning, the graph is rebuilt to restore the original graph. A single global priority queue is used for all types of moves. The Spectral Partitioning Algorithm finds splitting of the values such that the vertices in a graph can be partitioned with respect to the evaluation of the Fiedler vector.
Experiments are conducted by us to compare the three algorithms. The datasets provided by Christopher Walshaw at the University of Greenwich [9] are used. They are chosen because they are, on the one hand, large enough for us to study scalability and, on the other hand, manageable by our machine.
The datasets are desired to be as large as possible while the partitioning algorithms can finish in a reasonable time on the tested machine. We also run experiments on complete graphs with 30 and 300 nodes. Each dataset is tested four rounds with the number of partitions (k) being 2, 4, 8, and 16.
As shown in Fig. 1, the MLkP algorithm has the highest speed in general. It is 560 times faster than the spectral partitioning algorithm and even faster than the recursive Kernighan-Lin algorithm. The spectral partitioning algorithm has, in general, the best partition quality. It is 1.3 times better than MLkP and much better than the recursive Kernighan-Lin algorithm. The recursive Kernighan-Lin algorithm takes too long to complete all five datasets. It also shows serious scalability issues for complete graphs.
Considering that the MLkP algorithm has the best overall performance, we choose to use this algorithm for graph partitioning in our algorithm.

1) Applications Related to Association Rule Learning Algorithms on Microarray Data
Both Apriori and FP-Growth algorithms have been applied to microarray datasets [10], according to [10]. The main challenge for applying an existing association rule algorithm to a microarray dataset is the large number of items per transaction. Almost all microarray datasets have significantly more genes than assays. The existing approaches transpose the microarray dataset into transactional datasets. After transposing the dataset, we have significantly more columns than rows. This greatly increases the complexity of the existing association rule algorithms because they are designed for datasets with more rows than columns. With the existing algorithms, it can be shown that the FP-Growth algorithm performs slightly (around 10%) better than the Apriori algorithm. However, the author points out that a more scalable algorithm is needed to overcome the time and space complexities.
There are variations of association rule learning algorithms on microarray datasets. The FARMER algorithm [11] finds interesting rule groups instead of individual rules. The algorithm is efficient for finding some association rules between genes and labels.
Huang et al. propose a ternary discretization approach [12] that converts each gene expression level to one of the three levels: under-expressed, normal, and overexpressed. Compared to traditional binary classification methods, the ternary discretization approach captures the overall gene expression distribution to prevent serious information loss.
In summary, the existing variations of a traditional approach such as Apriori or FP-Growth algorithms have issues related to scalability or coverage. The algorithms and heuristics reported in this paper tolerate"" certain accuracy for better scalability so the investigator may navigate a dataset iteratively (We call it "iterative investigation" in this paper) to converge in the search process quickly.

2) Other Data Mining Algorithms on Microarray Data
In addition to association rule learning algorithms, other data mining algorithms that address different problems have also been applied to microarray data. Many researchers have studied classification problems on microarray data. The most popular application is classifying diseases based on gene expression levels [13]. Many algorithms have been applied to solve classification problems. The most studied algorithms include the Bayesian network [14], Support Vector Machine [15], and k-Nearest Neighbor [16]. These problems usually take gene expression levels as the input (features) and predict the disease(s) associated with an assay. It can also be used to classify tumors based on gene expression levels.

Dataset reduction
It may not be very useful for a large dataset with hundreds of thousands of genes to find rules that cover all the genes. Since we may be only interested in over-expressed and under-expressed genes and all diseases, normally expressed genes could be eliminated from our dataset. We can also adjust the threshold of over-expressed and underexpressed genes to classify fewer genes as over/under-expressed ones.
A common approach is the following method that converts the gene expression levels into log-scale values [17].
First, we arbitrarily pick a reference assay and calculate the relative expression levels based on the reference assay. Assuming the absolute gene expression levels of the reference assay is E r1 , E r2 . . . E rn , we can calculate the relative gene expression levels for another assay A as: . . E An are absolute gene expression levels for assay A. We can use the above method to calculate the relative gene expression levels for all other assays.
Next, the relative gene expression levels are used to find the log-scale gene expression levels. For each assay A, the log-scale gene expression levels are calculated as: In the end, a user-defined threshold h is used to filter out some normally expressed expression levels. A lower h value means more gene expression levels are kept, and the computation time is longer. This step can dramatically reduce the size of the dataset while keeping valuable information.

Converting into transactional datasets
Microarray datasets are matrices of data. Each row of a matrix represents a gene, while each column represents an assay. However, to perform association rule learning, we need to convert microarray datasets into transactional datasets. Each row is an assay in a transactional dataset, and each "transaction" has a different number of genes.
Our algorithm transposes the matrix that we obtain from the earlier steps. Next, each log-scale gene expression level is converted into a ternary item [12]. If the level exceeds the positive threshold, an item +G replaces the corresponding expression level where G is the gene number. Likewise, if the log-scale expression level is less than the negative threshold, it will be replaced by −G.
For example, if we have an assay that has genes G1, G2, and G3 with log-scale expression levels {-100, 10, 300}, respectively. Assuming the thresholds are -50 and + 50, we convert the expression levels into {-G1, + G3}. G2 is not included in the above transaction because its expression level is not significant.

Extracting disease information
We introduce disease information to the transactional dataset so that our association rule learning algorithm can derive gene-disease association rules. The prior approaches do not address disease information. To find association rules that involve genes and diseases, we need to convert the disease information associated with each assay into an item.
First, the disease information is extracted from the sample information. A disease, in this case, can be a specific disease or "normal." If the disease information is provided, our algorithm will copy the disease name as an item name to the corresponding assay (transaction). Therefore, for each transaction, there are one or more gene items and a disease item.
For example, an assay is labeled as "Tumor" in the original dataset and calculated in the above steps to have items {−G1, + G3}. The algorithm will add "Tumor" to the transaction to have {−G1, + G3, "Tumor"}.

Calculating gene importance
An important aspect of association rule ranking is evaluating the importance of each gene. The following is our approach for calculating gene importance, of which the importance of a gene can be viewed as the average degree of over/under expression in the dataset. We also want to consider +G and −G individually since they are considered different items in the transformed dataset. We define the gene importance for gene g, E g , as below: In the above, K j is the gene expression level of gene j, K g is the gene expression level for gene g. The first part, m j=1 K j m , calculates the average expression level of all genes, and the second part, t i=1 K g t , finds the average expression level of gene g. The difference between the two is the deviation for gene g. If the deviation is high, the expression level of a gene is outstanding, and we can say it is important.
For example, if the average gene expression level of all the genes is 20, and we calculate that gene G1 has an average expression level of 100, while G2 has an average expression level of 2. Then E g for G1 and G2 are 80 and 18, respectively. Therefore, G1 should be ranked above G2.

Definitions
The following definitions are used in this paper: 1) K-itemset: an itemset with k items 2) Support: number of occurrences of an itemset in the dataset 3) Minsup: the minimum requirement of support. The user usually provides this. Itemsets with support < minsup are eliminated. 4) Confidence: the indication of robustness of a rule in terms of percentage.
Confidence(XY) = support(X ∪ Y )/support(X) 5) Minconf: the minimum requirement of confidence. The user usually provides this.
Rules with confidence < minconf are eliminated. 6) Item-Association Graph: a graph structure that stores the frequent associations between pairs of items. 7) Balanced K-way Graph Partitioning Problem: Divide the nodes of a graph into k parts such that each part has almost the same number of nodes while minimizing the number of edges/sum of the edge weights that are cut off. 8) E g : importance of gene g. 9) I r : Importance of rule r.

A scalable heuristic algorithm-SARL-heuristic
The following is an outline of our scalable heuristic [18].
Step 1: Find frequent one and two itemsets using the Apriori algorithm (when minsup is high) or the direct generation method (when minsup is low).
Step 2: Construct the item association graph (IAG) from the result of step 1.
Step 3: Partition the IAG using the multilevel k-way partitioning algorithm (MLkP).
Step 4: Partition the dataset according to the result of step 3.
Step 5: Call the modified Apriori algorithm or the FP-Growth algorithm to mine frequent itemsets on each transaction partition.
Step 6: Find the union of the results found from each partition.
Step 7: Generate association rules by running the Apriori-ap-genrules on the frequent itemsets found from step 6.

An example
Suppose the microarray dataset in Table 1 is given, and minsup is set to 0.2 (or 20%, or 8 * 0.2 ≈ 2 occurrences), and minconf is set to 0.7 (or 70%). We select assay 8 as the reference assay then calculate the relative expression levels. The results are shown in Table 2.
Next, we calculate the log-scale gene expression levels by taking log base 2: log(x, 2) where x is the relative expression level. The results are shown in Table 3.   Then we normalize the expression levels by applying a threshold to the log-scale expression levels. Here, we choose the threshold to be 1, meaning all log-scale expression levels that are above 1 or below −1 are set to 1 and −1, respectively. Levels between −1 and 1 are set to 0. The results are shown in Table 4.
Next, we transpose the matrix to prepare it for the transactional dataset. Each row is now an assay and each column is a gene. The results are shown in Table 5.
Finally, we convert the transposed matrix to a transactional dataset, shown in Table 6, each expression level that equals −1 or 1 is transformed into an item. In Table 6, the items of each transaction include the genes that are over-expressed (denoted by a + symbol) and genes that are under-expressed (denoted by a-symbol). For example, a transaction with TID T000 is an assay that contains three significantly (over or under) expressed genes, gene 1 (under-expressed), gene 2 (over-expressed), and gene 5 (over-expressed). Now, we use the Apriori algorithm to find the frequent two itemsets. As an intermediate step, the Apriori algorithm finds the frequent one-itemset first (shown in Table 7): The frequent two-itemsets are found afterward (shown in Table 8):  Table 5 Transposed matrix Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Next, we transform the above frequent two-itemsets into an item association graph (IAG), shown in Fig. 2: To construct the graph, we first take the itemset {-1, + 2} with support 3. For this, we create node −1 and node + 2 corresponding to the two items in the itemset. The edge between node −1 and node + 2 has weight 3, representing the support of the itemset. The process is repeated for every frequent two-itemset found in the previous step.
Subsequently, we use the multilevel k-way partitioning algorithm (MLkP) to partition the IAG. In this case, the number of nodes is small, so we only bisect the graph by setting k = 2. The result is shown in Figs. 3 and 4.
The MLkP algorithm divides the IAG into two equal or almost equal sets in linear time while the sum of the weights of edges that are cut off is the minimum.
Next, we partition the dataset according to the partitions of the IAG, as shown in Tables 9 and 10. Each transaction partition has all the items from the corresponding IAG partition. However, since the algorithm has already found all the frequent one and two itemsets, a transaction is not added to a transaction partition if the transaction has less than three items. For example, T000: {−1, + 2} is not added to the transaction partition 1, since it only has two items. Some items in the original dataset may not appear in any of the transaction partitions, because the infrequent one/two-itemsets are dropped in the IAG. This simplifies the subsequent computations. In this example, however, all the items are kept in the IAG because the IAG is a relatively dense graph. Tables 11 and 12 show the transaction partitions: The next step is to pick the best algorithm and use it to find the frequent k-itemsets with k > 2. For this example, we choose the modified Apriori algorithm because it is faster for mining small datasets as it avoids the process of finding the one and two-itemsets again. The results from partition 1 are shown in Table 11: Since the modified Apriori algorithm starts with three-itemsets, there are no additional frequent itemsets in the first partition. Table 12 shows the results found in transaction partition 2: The final results (shown in Table 13) of frequent itemsets are simply the union of  Tables 7, 8, and 11: After running the Apriori-ap-genrules algorithm, the association rules can be found in Table 14.
All the frequent itemsets generated by the SARL heuristic are sound, meaning each frequent itemset generated indeed is correct, and the support number is accurate. However, it is possible that some frequent itemsets cannot be found by the SARL heuristic, as

The SARL (scalable association rule learning) heuristic
We introduced the SARL heuristic in our previous paper [18]. SARL is a highly scalable heuristic algorithm for association rule learning problems on horizontal transactional datasets. In this paper, a modified version of SARL serves as the core of our algorithm. A summary of the SARL heuristic is shown below. A more detailed and formal description, including the pseudo-code, can be found in the paper that introduces SARL.
The Apriori algorithm or the direct counting and generation algorithm is used to generate frequent one and two itemsets, depending on the size of the dataset. Apriori is faster on very large datasets, where the direct counting and generation algorithm is faster on small and medium-sized datasets. SARL then builds the item association graph (IAG) based on the frequent two itemsets. Each frequent two itemset is converted into an edge on the IAG, and each item in the itemset is converted into a node. Then, the MLkP algorithm is used to partition the IAG into k subgraphs. The dataset is partitioned based on the subgraphs. Each partition of the dataset should contain all the items (nodes) of a subgraph of the IAG across all the transactions of the datasets. During this process, some transactions may end up undivided, and all the possible frequent itemsets related to these transactions will be preserved in later stages. Next, the Apriori algorithm or the FP-Growth algorithm is selected based on an analysis of the dataset to ensure the most efficient execution. Finally, SARL calls the selected algorithm on each dataset partition to complete the computation. If the Apriori algorithm is selected, SARL will call the modified Apriori that starts from the frequent three itemsets computation to avoid any redundant work.
The SARL heuristic divides the dataset into k partitions. The size of each partition should be smaller than 1 k × size of the dataset because the dataset is partitioned according to IAG, and the number of items (nodes) in the IAG should be smaller than the number of unique items in the dataset. A more detailed explanation can be found in the Transaction Partitioning section of the Appendix section. This indicates that each dataset partition can always fit into the memory. All later steps of the SARL heuristic significantly benefit from processing the dataset in the memory rather than on the disk.
We also conducted a thorough time and space complexity analysis and an error-bound analysis that can be found in the Appendix section of this paper.

Ranking of association rules
Considering the nature of microarray datasets, the number of unique items (genes) is usually large. This leads to a tremendous number of association rules. Therefore, it is necessary to rank the association rules by their importance so that the results can be easily used. The goal of this study aims to help scientists explore and validate new association rules more efficiently.
To achieve the goal, we introduce the following measurement of the importance of rule r: x y: In the above, I r is the importance of the rule, L r is the lift of the rule r , n is the number of unique genes included in rule r, E gr is the RMS deviation of the expression level of gene g that is included in rule r, K j is the gene expression level of gene j, where j represents all other genes; K g is the gene expression level for gene g, and B r is the bias applied to this rule. The bias should be positive if a disease is in the rule.
The intuition here is to emphasize three factors that are related to the importance of an association rule. The first is the lift of a rule [19]. A higher lift indicates the rule has a higher response compared to the other rules. If the lift value is large, then the antecedent and the consequent of the rule are more dependent on each other, and this further indicates a high significance of the rule. The second factor is the average significance of each gene included in the rule. If all or most of the genes are significant, then the rule is likely to be more important. When we convert the microarray dataset into a transactional dataset, the absolute gene expression levels are converted into relative expression levels, and some information related to the absolute levels is missing. Here, we reconsider the influence of the absolute gene expression levels and calculate the average significance of a gene-based on it. E g , the deviation of the average absolute gene expression level, is calculated by taking the difference of the average absolute gene expression levels of all genes and the average absolute gene expression level of gene g. The significance of a rule contributed by its genes is then calculated by taking the average of each E g that is included in rule r.
For example, assuming the rule G1 → G2 is an association rule found by the SARL heuristic. Genes 1, 2, and 3 have expression levels of 10, 8, and 5, respectively. The rule has confidence of 0.7 and support of 10. Then we can calculate L r = conf (r) sup(y) = 0.07 , Since the rule does not involve a disease, bias is set to 0. Hence, I r = L r × Egr n + B r = 0.07 × 3.8+2.5 We incorporate the three most important measurements in the ranking of the rules. The lift measurement generally addresses the ranking of the significance of each rule, the average gene significance traces back to the microarray dataset and considers the importance of each gene, and the bias B r highlights the rules that involve disease information.

Experiments and results
We have designed and conducted experiments on small and large microarray datasets to demonstrate the scalability and accuracy of our algorithm. The experiments are based on the following configuration: All three datasets are downloaded from ArrayExpress [20]. We test the SARL algorithm on each of the datasets with various minsup configurations. Here, minsup refers to the minimum number of occurrences rather than the percentage of that.

E-MTAB-9030-microRNA profiling of muscular dystrophies dataset
The dataset has the following metrics:  Table 15.
According to Fig. 5, the SARL heuristic runs faster than the Apriori algorithm on all minsup configurations. We can see the running time becomes larger as the minsup goes down. For the test case where minsup is 2, the SARL algorithm performs 26 times faster than the Apriori algorithm. Figure 6 shows the accuracy of the SARL heuristic is between 0.62 and 0.67. The accuracy is calculated based on the 100 most important frequent itemsets. The results are 62% to 67% accurate based on the association rules derived by the Apriori algorithm with the 100 most important frequent itemsets. It seems the accuracy may be acceptable considering the purpose of this research and the speedup, i.e., to have a computational tool that can more quickly derive the important associations among genes for iterative investigation.

Molecular characterisation of TP53 mutated squamous cell carcinomas of the lung identifies BIRC5 as a putative target for therapy
The dataset has the following metrics: This dataset is larger than the previous one. Traditionally, finding association rules with the Apriori algorithm on the full dataset will take an extremely long time.
The result of the experiment is shown in Table 16. According to Fig. 7, the SARL heuristic performs similarly comparing to the Apriori algorithm on a minsup range between 10 to 3, and both algorithms can finish within 2 s. This is because the SARL heuristic has a small overhead, and the size of the processed data is very small on these minsup configurations. However, when it comes to minsup = 2, the SARL heuristic outperforms the Apriori algorithm by a large margin. The SARL heuristic finished the task in less than 45 s comparing to 279 s for Apriori. Figure 8 shows that the SARL heuristic has 100% accuracy across all minsup configurations. We believe the SARL heuristic performs better than the Apriori algorithm overall on this dataset because it achieves the same goal with a fraction of time.

E-MTAB-6703-A microarray meta-dataset of breast cancer
The dataset has the following metrics: This dataset is about ten times larger than the second dataset. However, based on the purpose of this research, we believe the total number of rules generated from the previous dataset is already overwhelmingly large. Therefore, we also reduced this dataset based on the method mentioned in this paper to speed up the calculation.
The experiment results are shown in Table 17. From Fig. 9, we can find a similar performance result as the previous datasets, but the SARL performs better for the larger dataset. The SARL heuristic is 700 times faster than the Apriori algorithm on minsup = 2. More surprisingly, according to Fig. 10, SARL is accurate on all minsup configurations.

Discussion and conclusions
In this paper, we proposed a new algorithm for association rule learning specifically designed for microarray datasets. The SARL heuristic algorithm utilizes the ternary discretization method, divide and conquer paradigm, graph theory, and graph partitioning algorithm to significantly speed up the association rule learning process compared to traditional algorithms. The algorithm also shows space efficiency. The rule ranking algorithm based on the importance saves time for researchers by showing the most important rules first. The rules found and ranked by the SARL heuristic cover both inter-genes   rules and gene-disease rules. We compared our algorithm with Apriori, the most commonly used association rule learning algorithm, through a series of experiments. The results show that our algorithm has a significant speedup while still maintains high accuracy. Some potential drawbacks of our algorithm include: 1. There is a small probability that some non-trivial rules are lost in the dataset partitioning stage. 2. For small datasets or high support, the performance of our algorithm may be similar or slightly higher than the Apriori algorithm due to the overhead.
In the future, we plan to extend our work with the following tasks: • Develop a parallel version of the SARL heuristic and its implementation. The transaction partitions can be considered as independent datasets, and we can easily run the modified Apriori algorithm or FP-Growth algorithm on each of the transaction partition in parallel and then merge the results (frequent three or higher itemsets) together along with the frequent one and two itemsets to obtain the total frequent itemsets. Each parallel processor does not need to communicate with others during the computation since all the information needed is already included in the local dataset. This would result in maximum utilization of each processor. • A better algorithm may be used to predict the proper thresholds of the ternary discretization. The current ternary discretization is based on empirical methods and may need several tests to find the best thresholds that reduce the dataset to a smaller size while keeping enough information. A statistical analysis of the dataset may help to decide the boundaries. Furthermore, we should consider incorporating deep neural network approach in this process to predict the best threshold. • Incremental Learning on Multiple Datasets: Nowadays, new microarray datasets are added to databases around the world on a daily basis. Among them, some of them focus on the same sets of genes, and others may have overlapping gene components. This brings an interesting question: can we learn association rules across multiple microarray datasets to get a larger number of more convincing rules? The answer is yes. It is possible and quite useful to learn association rules from multiple datasets. In fact, a prominent advantage of the SARL algorithm is the ability to do incremental learning across multiple datasets. Assume we already examined and ran the SARL algorithm to learn association rules on datasets A, which includes genes G1, G2, and G3. Now, a dataset B is added with genes G1, G2, G3, and G4. We can extend the association rules from dataset A on G1, G2, and G3 in dataset B. G4 is removed from B since we cannot associate G4 to dataset A. Firstly, we compare the reference conditions (assays) between datasets A and B and find a coefficient for each gene expression level: A1 through A3 are expression levels of reference condition in dataset A, B1 through B3 are expression levels of reference condition in dataset B. c1, c2, and c3 are coefficients we want to find.
Next, all expression levels in dataset B are divided by the corresponding coefficient: where E i is gene expression level with gene number i , and c i is the coefficient found in the previous step for gene i . Now, expression levels in dataset B are adjusted for the differences in experimental conditions, we then are ready to run the SARL algorithm on dataset B. The following is an example of combining two datasets: According to Tables 18 and 19, assuming Assay2 is selected to be the reference condition in both datasets. We may calculate c1, c2, and c3 as: c1 = 7/2 = 3.5 c2 = 2/5 = 0.4 c3 = 1/7 = 0.14 After dividing all the expression levels in dataset B by the corresponding coefficient we have a combined dataset (shown in Table 20):  not require candidate generation from L1, and avoids many unnecessary membership testing during support counting. However, this method is not efficient on large datasets since it does not use pruning and saves all two itemsets.
In the SARL heuristic, we ask the user for a threshold of the dataset size. If the dataset is larger than the threshold, the SARL heuristic will use the modified Apriori algorithm. Otherwise, it will use the direct_gen algorithm to compute the frequent one and two itemsets.

Construction of the item association graph
The item association graph G is constructed based on the two itemsets generated by the Apriori algorithm. G is an undirected, weighted graph. A node Vi is created for each unique item i in the two itemsets T with the maximum item number being n.
The edges E in graph G are formed for each itemset in T: The weight of each edge E ij is equal to the support of itemset {i, j} in T:

Partition the IAG using the multilevel k-way partitioning algorithm (MLkP)
The Multilevel k-way partitioning (MLkP) algorithm [17] is an efficient graph partitioning algorithm. The time complexity is O(E), where E is the number of edges in the graph.
The general idea of MLkP is to shrink (coarsen) the original graph into a smaller graph, then partition the smaller graph using an improved version of the KL/FM algorithm. Lastly, it restores (uncoarsen) the partitioned graph to a larger, partitioned graph.
METIS is a software developed by Karypis at the University of Minnesota [18]. It includes an implementation of the MLkP algorithm that takes a graph as the input and outputs groups of nodes separated after the partition.

Transaction partitioning
Based on the results of the MLkP algorithm that divide the items into groups P 1 , P 2 ,… ,P m , we can partition the transactions into the same number of groups, where each group D i contains only the items in partition P i . This guarantees that each partition fits into the memory. Refer to paper [18] for more details.

Selecting an algorithm on transaction partitions
One of the benefits that come with our solution is that the association rule learning on each transaction partition can be optimized by using an algorithm that best fits the partition.
Since the modified Apriori algorithm has already computed the one itemsets and two itemsets during the preparation phase, the candidate generation feature of the Apriori algorithm is handy in this case. We modify the Apriori algorithm to skip the frequent one/two itemsets finding stages and start with the frequent three itemsets from the transaction partitions. This modification is particularly helpful when minsup is set to a high value so that the expected number of itemsets is limited after the two itemsets are found.
The average transaction length provides a fast and straightforward reference for selecting the best algorithm for each transaction partition. The SARL heuristic choose between the modified Apriori algorithm and FP-Growth algorithm to complete the computation of frequent itemsets.

Time complexity and space complexity
The theoretical time and space complexity of the Apriori algorithm is O(2 d ) where d is the number of unique items in the dataset.

Time complexity
If the modified Apriori algorithm is selected, the theoretical time complexity for each partition is O(2 1.03d/k ) where the coefficient 1.03 comes from the 3% maximum imbalance of the partitions caused by the MLkP algorithm. The total running time for all the partitions is O k * 2

Space complexity
If the modified Apriori algorithm is selected, the theoretical space complexity for each partition is O 2 ) space comparing to the Apriori algorithm. The exponential reduction of space usage comes from the smaller number of unique items in each transaction partition. If the modified Apriori is chosen to mine frequent itemsets from the transaction partitions, it only generates a smaller number of candidates for each transaction partition, since it does not consider items in other partitions.

Error bound
The SARL heuristic sacrifices some precision to obtain the speed up. However, every frequent itemset found by the algorithm is correct, and the support associated with each frequent itemset is also correct. The heuristic may miss some trivial frequent itemsets, i.e., the itemsets with low support. During the IAG partition phase, the MLkP algorithm makes cuts on the IAG to minimize the sum of the weights of the edges that are cut off. This feature helps to prevent large weights from cut off, while some trivial, small-weight (support) edges may be lost.
We can make a rough estimation by introducing a parameter P out , the ratio of the edges cut off in the IAG. P out = E cut E total . This parameter is determined by the characteristics of a dataset, the minsup choice, and the number of partitions we choose. P out is also a rough estimation of the error rate for the frequent two or more itemsets. Assume the ratio of the frequent two or more itemsets found is P m , P m = #frequent2+itemsets #totalfrequentitemsets , then the total error bound can be computed as Error total = P m * P out .

Benefits of having datasets fit into the memory
The transaction partitions are guaranteed to be small enough to fit into the memory. Therefore, any operations performed on these in-memory datasets should be faster than before. For example, the Apriori algorithm makes the number of passes on the dataset equal to the maximum length of frequent itemsets. Each of these passes requires reading the dataset from the disk. With our solution, the SARL heuristic makes at most two passes to the dataset. The first pass is to generate the frequent one and two itemsets, and in the second pass, the algorithm brings a fraction of the dataset into the memory. All further passes are made directly in the memory, resulting in speedup.
Abbreviations K-itemset: An itemset with k items; Minsup: The minimum requirement of support. The user usually provides this. Itemsets with support < minsup are eliminated.; Minconf: The minimum requirement of confidence. The user usually provides this. Rules with confidence < minconf are eliminated.; IAG: Item-Association Graph; MLkP: Multilevel k-way Partitioning algorithm; SARL: Scalable Association Rule Learning; E g : Importance of gene g.; I r : Importance of rule r.