 Research
 Open Access
 Published:
A scalable association rule learning and recommendation algorithm for large-scale microarray datasets
Journal of Big Data volume 9, Article number: 35 (2022)
Abstract
Association rule learning algorithms have been applied to microarray datasets to find association rules among genes. With the development of microarray technology, larger datasets have been generated recently that challenge the current association rule learning algorithms. Specifically, the large number of items per transaction significantly increases the running time and memory consumption of such tasks. In this paper, we propose the Scalable Association Rule Learning (SARL) heuristic that efficiently learns gene–disease association rules and gene–gene association rules from large-scale microarray datasets. The rules are ranked based on their importance. Our experiments show the SARL algorithm outperforms the Apriori algorithm by one to three orders of magnitude.
Introduction
Microarray technology has been widely used in bioinformatics. It efficiently measures gene expression levels for a large number of genes, so a huge amount of data can be generated from microarray experiments. Microarray datasets, after being converted to transactional datasets, usually have a large number of columns (genes) and a small number of rows (assays). Since the worst-case time complexity of any exact association rule learning algorithm is \(O({2}^{d})\), where d is the number of unique items (genes, in this case), such a large number of genes poses a huge challenge for all existing association rule learning algorithms. For a large microarray dataset, it is impractical to apply these algorithms to find all association rules (Fig. 1).
Researchers who derive association rules from microarray datasets are unlikely to need every association rule. Furthermore, a large number of genes results in an even larger number of association rules. Main memory, which is limited compared to disk space, can easily be exhausted when running a current association rule learning algorithm. With these observations in mind, we propose the Scalable Association Rule Learning (SARL) algorithm, which focuses on learning speed and the importance of the derived rules.
As more microarray datasets are generated every day, investigators seeking potential associations between genes, and between genes and diseases, need a tool to find candidate rules across multiple datasets quickly. SARL is such a tool: it provides scalable association rule learning and rule ranking. After getting a general idea of the candidate rules, investigators may choose to run a more time-costly algorithm that precisely calculates the rules on a few selected datasets. Therefore, by quickly reducing the scope of the datasets and giving the investigator a general idea, our algorithm can reduce the total time needed to find a target rule and increase the success rate.
Contributions
There are three main contributions in this paper:

1. The SARL heuristic with the divide and conquer methodology can increase the efficiency and scalability of association rule mining but still maintain reasonable accuracy.

2. The rule ranking algorithm calculates the importance and ranks the rules, so the investigator does not have to search through millions of rules.

3. We consider gene–disease associations. Each important association rule between genes and disease can be identified and highlighted in the result.
Related work
Several association rule learning algorithms have previously been applied to microarray datasets. The fundamental ones include the Apriori algorithm and the FP-Growth algorithm. Researchers have also created other algorithms or heuristics that find association rules with an approximation methodology.

1.
The Apriori Algorithm
The Apriori algorithm [1], introduced by Agrawal and Srikant, was the first efficient association rule learning algorithm. It incorporates various techniques to speed up the process as well as to reduce the use of memory. For example, the \({L}_{k-1} \times {L}_{k-1}\) method used in the candidate generation process can reduce the number of candidates generated, and the pruning process can significantly reduce the number of possible candidates at each level.
One of the most important mechanisms in the Apriori algorithm is the use of the hash tree data structure. It uses this data structure in the candidate support counting phase to reduce the time complexity from O(kmn) to O(kmT + n), where k is the average size of the candidate itemset, m represents the number of candidates, n represents the number of items in the whole dataset, and T is the number of transactions.
The major advantage of the Apriori algorithm comes from its memory usage because only the frequent (k−1)-itemsets, \({L}_{k-1}\), and the candidates at level k, \({C}_{k}\), need to be stored in memory. It generates the minimum number of candidates based on the \({L}_{k-1}\times {L}_{k-1}\) method (described in [1]) and the pruning method, and it stores them in the compact hash tree structure. If a large dataset and a low minsup setting would produce more candidates than the memory can hold, the Apriori algorithm does not generate all the candidates and overload the memory. Instead, it generates as many candidates as the memory can hold.
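The join and prune steps described above can be sketched in a few lines. This is a minimal illustration under our own naming (`apriori_gen` and the toy itemsets are hypothetical), not the paper's implementation:

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Candidate generation: the L_{k-1} x L_{k-1} join followed by pruning.

    prev_frequent: set of frozensets, each a frequent (k-1)-itemset.
    Returns the candidate k-itemsets whose (k-1)-subsets are all frequent.
    """
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    candidates = set()
    for a, b in combinations(prev, 2):
        # Join step: merge two (k-1)-itemsets sharing their first k-2 items.
        if a[:-1] == b[:-1]:
            cand = frozenset(a) | frozenset(b)
            # Prune step: every (k-1)-subset of the candidate must be frequent.
            if all(frozenset(sub) in prev_frequent
                   for sub in combinations(sorted(cand), len(cand) - 1)):
                candidates.add(cand)
    return candidates
```

For example, joining {A, B} and {A, C} yields the candidate {A, B, C}, which survives pruning only if {B, C} is also frequent.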

2.
The FP-Growth Algorithm
The Frequent Pattern Growth algorithm was proposed by Han et al. in 2000 [2]. It uses a tree-like structure (called the Frequent Pattern Tree) instead of the candidate generation method used in the Apriori algorithm to find the frequent itemsets. The candidate generation method finds candidates for the frequent itemsets before reducing them to the actual frequent itemsets through support counting.
The algorithm first scans the dataset and finds the frequent one-itemsets. Then, a frequent pattern tree is constructed by scanning the dataset again. The items are added to the tree in the order of their support. Once the tree is complete, it is traversed from the bottom, and a conditional FP-tree is generated. Finally, the algorithm generates the frequent itemsets from the conditional FP-tree.
The FP-Growth algorithm is more scalable than the Apriori algorithm in most cases since it makes fewer passes and does not require candidate generation. However, it suffers from memory limitations since the FP-tree is relatively complex and may not fit in memory. Traversing a complex FP-tree may also be time-expensive if the tree is not compact enough.

3.
Graph Partitioning Algorithms
One of the key steps in the SARL heuristic that we will introduce shortly is to partition the IAG (item association graph, section 7) into k balanced partitions. An efficient graph partitioning algorithm is crucial since the balanced graph partitioning problem is NP-complete [3]. We have implemented three algorithms and compared their partitioning costs and running times: the recursive version of the Kernighan–Lin algorithm [4], the Multilevel k-way Partitioning algorithm (MLkP) [5], and the recursive version of the Spectral Partitioning algorithm [6]. Other graph partitioning algorithms include the Tabu-search-based MAGP algorithm [7] and the flow-based KaFFPa algorithm [8].
The Kernighan–Lin algorithm swaps the nodes assigned to the two partitions and finds the largest decrease in the total cut size. The Multilevel k-way Partitioning algorithm (MLkP) uses coarsening, partitioning, and uncoarsening/refinement steps to shrink a graph into a much smaller graph; after partitioning, the graph is rebuilt to restore the original graph, and a single global priority queue is used for all types of moves. The Spectral Partitioning algorithm partitions the vertices of a graph by finding a splitting value over the entries of the Fiedler vector.
We conducted experiments to compare the three algorithms, using the datasets provided by Christopher Walshaw at the University of Greenwich [9]. They are chosen because they are, on the one hand, large enough for us to study scalability and, on the other hand, manageable by our machine.
The datasets were chosen to be as large as possible while still allowing the partitioning algorithms to finish in a reasonable time on the tested machine. We also ran experiments on complete graphs with 30 and 300 nodes. Each dataset is tested in four rounds with the number of partitions (k) set to 2, 4, 8, and 16.
As shown in Fig. 1, the MLkP algorithm has the highest speed in general. It is 5–60 times faster than the Spectral Partitioning algorithm and even faster relative to the recursive Kernighan–Lin algorithm. The Spectral Partitioning algorithm has, in general, the best partition quality: about 1.3 times better than MLkP and much better than the recursive Kernighan–Lin algorithm. The recursive Kernighan–Lin algorithm took too long to complete on all five datasets and also shows serious scalability issues on complete graphs.
Considering that the MLkP algorithm has the best overall performance, we choose to use this algorithm for graph partitioning in our algorithm.
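To make the objective concrete, the balanced bisection problem (k = 2) can be stated as an exhaustive search over equal-size splits. The function names and the toy graph below are our own; brute force is only feasible for tiny graphs, which is exactly why heuristics such as MLkP are needed:

```python
from itertools import combinations

def cut_weight(edges, part):
    """Sum of the weights of edges with exactly one endpoint in `part`."""
    return sum(w for u, v, w in edges if (u in part) != (v in part))

def best_bisection(nodes, edges):
    """Exhaustive balanced bisection: minimise the cut weight over all
    equal-size splits of the node set (exponential in |nodes|)."""
    nodes = sorted(nodes)
    best_cut, best_part = float('inf'), None
    # Fix nodes[0] on one side so each split is enumerated only once.
    for side in combinations(nodes[1:], len(nodes) // 2):
        part = set(side)
        c = cut_weight(edges, part)
        if c < best_cut:
            best_cut, best_part = c, part
    return best_cut, best_part
```

On a four-node graph with heavy edges a–b and c–d and a light edge a–c, the optimal bisection separates {a, b} from {c, d} and cuts only the light edge.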

1)
Applications Related to Association Rule Learning Algorithms on Microarray Data
According to [10], both the Apriori and FP-Growth algorithms have been applied to microarray datasets. The main challenge in applying an existing association rule algorithm to a microarray dataset is the large number of items per transaction. Almost all microarray datasets have significantly more genes than assays. The existing approaches transpose the microarray dataset into a transactional dataset, which then has significantly more columns than rows. This greatly increases the complexity for the existing association rule algorithms because they are designed for datasets with more rows than columns. Among the existing algorithms, the FP-Growth algorithm performs slightly (around 10%) better than the Apriori algorithm. However, the author points out that a more scalable algorithm is needed to overcome the time and space complexities.
There are variations of association rule learning algorithms on microarray datasets. The FARMER algorithm [11] finds interesting rule groups instead of individual rules. The algorithm is efficient for finding some association rules between genes and labels.
Huang et al. propose a ternary discretization approach [12] that converts each gene expression level to one of the three levels: underexpressed, normal, and overexpressed. Compared to traditional binary classification methods, the ternary discretization approach captures the overall gene expression distribution to prevent serious information loss.
In summary, the existing variations of a traditional approach such as the Apriori or FP-Growth algorithms have issues related to scalability or coverage. The algorithms and heuristics reported in this paper trade a certain amount of accuracy for better scalability, so the investigator may navigate a dataset iteratively (we call this "iterative investigation" in this paper) to converge in the search process quickly.

2)
Other Data Mining Algorithms on Microarray Data
In addition to association rule learning algorithms, other data mining algorithms that address different problems have also been applied to microarray data.
Many researchers have studied classification problems on microarray data. The most popular application is classifying diseases based on gene expression levels [13]. Many algorithms have been applied to solve classification problems. The most studied algorithms include the Bayesian network [14], the Support Vector Machine [15], and k-Nearest Neighbor [16].
These methods usually take gene expression levels as the input (features) and predict the disease(s) associated with an assay. They can also be used to classify tumors based on gene expression levels.
Our solution
Data preprocessing
Dataset reduction
For a large dataset with hundreds of thousands of genes, finding rules that cover all the genes may not be very useful. Since we may only be interested in overexpressed and underexpressed genes and in all diseases, normally expressed genes can be eliminated from the dataset. We can also adjust the thresholds for overexpressed and underexpressed genes so that fewer genes are classified as over-/underexpressed.
A common approach converts the gene expression levels into log-scale values, as follows [17].
First, we arbitrarily pick a reference assay and calculate the relative expression levels based on it. Assuming the absolute gene expression levels of the reference assay are \({E}_{r1}, {E}_{r2}\dots {E}_{rn}\), we can calculate the relative gene expression levels for another assay A as \({R}_{A1}, {R}_{A2}\dots {R}_{An}=\frac{{E}_{r1}}{{E}_{A1}}, \frac{{E}_{r2}}{{E}_{A2}}\dots \frac{{E}_{rn}}{{E}_{An}}\), where \({E}_{A1}, {E}_{A2}\dots {E}_{An}\) are the absolute gene expression levels for assay A. The same method gives the relative gene expression levels for all other assays.
Next, the relative gene expression levels are used to find the log-scale gene expression levels. For each assay A, the log-scale gene expression levels are calculated as: \({L}_{A1}, {L}_{A2}\dots {L}_{An}={\mathrm{log}}_{2}{R}_{A1}, {\mathrm{log}}_{2}{R}_{A2}\dots {\mathrm{log}}_{2}{R}_{An}\)
In the end, a user-defined threshold \(h\) is used to filter out normally expressed expression levels. A lower \(h\) value means more gene expression levels are kept and the computation time is longer. This step can dramatically reduce the size of the dataset while keeping valuable information.
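The reduction steps above can be sketched as follows, keeping the paper's convention of dividing the reference level by the assay level. The function names and toy numbers are our own:

```python
import math

def to_log_scale(assays, ref):
    """Relative expression levels w.r.t. a reference assay, then log base 2.

    assays: {assay_name: [absolute expression levels]}; ref: reference assay.
    """
    ref_levels = assays[ref]
    return {name: [math.log2(r / e) for e, r in zip(levels, ref_levels)]
            for name, levels in assays.items()}

def filter_normal(log_levels, h):
    """Drop normally expressed levels: keep (gene_index, value) pairs whose
    log-scale magnitude exceeds the user-defined threshold h."""
    return {name: [(i, v) for i, v in enumerate(levels) if abs(v) > h]
            for name, levels in log_levels.items()}
```

A lower `h` keeps more (gene, level) pairs, matching the trade-off described above.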
Converting into transactional datasets
Microarray datasets are matrices of data. Each row of a matrix represents a gene, while each column represents an assay. However, to perform association rule learning, we need to convert microarray datasets into transactional datasets. Each row is an assay in a transactional dataset, and each "transaction" has a different number of genes.
Our algorithm transposes the matrix obtained from the earlier steps. Next, each log-scale gene expression level is converted into a ternary item [12]. If the level exceeds the positive threshold, an item \(+G\) replaces the corresponding expression level, where G is the gene number. Likewise, if the log-scale expression level is less than the negative threshold, it is replaced by \(-G\).
For example, suppose an assay has genes G1, G2, and G3 with log-scale expression levels {−100, 10, 300}, respectively. Assuming the thresholds are −50 and +50, we convert the expression levels into {−G1, +G3}. G2 is not included in the above transaction because its expression level is not significant.
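A minimal version of this ternary discretisation step (our own function name; thresholds as in the example above) could look like:

```python
def to_items(log_levels, neg=-50, pos=50):
    """Ternary discretisation: levels above `pos` become overexpressed items
    (+Gn), levels below `neg` become underexpressed items (-Gn), and
    everything in between is dropped as normally expressed."""
    items = []
    for gene_no, level in enumerate(log_levels, start=1):
        if level > pos:
            items.append(f"+G{gene_no}")
        elif level < neg:
            items.append(f"-G{gene_no}")
    return items
```

Applied to the example levels, `to_items([-100, 10, 300])` yields the transaction `['-G1', '+G3']`.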
Extracting disease information
We introduce disease information into the transactional dataset so that our association rule learning algorithm can derive gene–disease association rules. The prior approaches do not address disease information. To find association rules that involve genes and diseases, we need to convert the disease information associated with each assay into an item.
First, the disease information is extracted from the sample information. A disease, in this case, can be a specific disease or "normal." If the disease information is provided, our algorithm will copy the disease name as an item name to the corresponding assay (transaction). Therefore, for each transaction, there are one or more gene items and a disease item.
For example, suppose an assay is labeled as "Tumor" in the original dataset and is calculated in the above steps to have items {−G1, +G3}. The algorithm will add "Tumor" to the transaction to obtain {−G1, +G3, "Tumor"}.
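Appending the disease label is then a one-line step; a sketch with a hypothetical helper name:

```python
def make_transaction(gene_items, disease_label):
    """A transaction consists of the assay's significant gene items plus a
    single disease item copied from the sample annotation ('normal' when no
    disease is recorded)."""
    return list(gene_items) + [disease_label]
```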
Calculating gene importance
An important aspect of association rule ranking is evaluating the importance of each gene. The following is our approach for calculating gene importance, in which the importance of a gene can be viewed as the average degree of over/under expression in the dataset. We also treat \(+G\) and \(-G\) individually since they are different items in the transformed dataset. We define the gene importance for gene g, \({E}_{g}\), as: \({E}_{g}=\left|\frac{\sum_{j=1}^{m}{K}_{j}}{m}-\frac{\sum_{i=1}^{t}{K}_{g}}{t}\right|\)
In the above, \({K}_{j}\) is the gene expression level of gene j, \({K}_{g}\) is the gene expression level for gene g. The first part, \(\frac{\sum_{j= 1}^{m}{K}_{j}}{m}\), calculates the average expression level of all genes, and the second part, \(\frac{\sum_{i=1}^{t}{K}_{g}}{t}\), finds the average expression level of gene g. The difference between the two is the deviation for gene g. If the deviation is high, the expression level of a gene is outstanding, and we can say it is important.
For example, suppose the average gene expression level of all the genes is 20, gene G1 has an average expression level of 100, and G2 has an average expression level of 2. Then \({E}_{g}\) for G1 and G2 is 80 and 18, respectively. Therefore, G1 should be ranked above G2.
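As a sketch (our own function name and toy numbers), this deviation-based gene importance can be computed directly from the absolute expression levels:

```python
def gene_importance(expr, gene):
    """E_g: absolute difference between the average expression level of all
    genes and the average expression level of gene g.

    expr: {gene_name: [absolute expression levels across assays]}.
    """
    all_levels = [v for levels in expr.values() for v in levels]
    overall_avg = sum(all_levels) / len(all_levels)
    gene_avg = sum(expr[gene]) / len(expr[gene])
    return abs(gene_avg - overall_avg)
```

A gene whose average level deviates more from the overall average receives a higher importance and thus a higher rank.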
The SARL heuristic
Definitions
The following definitions are used in this paper:

1)
K-itemset: an itemset with k items

2)
Support: number of occurrences of an itemset in the dataset

3)
Minsup: the minimum requirement of support. The user usually provides this. Itemsets with support < minsup are eliminated.

4)
Confidence: the indication of robustness of a rule in terms of percentage. Confidence(\(X\to Y\)) = support(\(X\cup Y\))/support(\(X\))

5)
Minconf: the minimum requirement of confidence. The user usually provides this. Rules with confidence < minconf are eliminated.

6)
Item-Association Graph: a graph structure that stores the frequent associations between pairs of items.

7)
Balanced K-way Graph Partitioning Problem: divide the nodes of a graph into k parts such that each part has almost the same number of nodes while minimizing the number of edges (or the sum of the weights of the edges) that are cut.

8)
\({E}_{g}\): importance of gene g.

9)
\({I}_{r}\): importance of rule r.
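The support and confidence definitions above translate directly into code; a minimal sketch with hypothetical function names:

```python
def support(itemset, transactions):
    """Support: the number of transactions that contain every item of the
    itemset (an occurrence count, not a percentage)."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t))

def confidence(X, Y, transactions):
    """Confidence(X -> Y) = support(X u Y) / support(X)."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)
```

Itemsets or rules falling below the user-supplied minsup or minconf are eliminated, as defined above.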
A scalable heuristic algorithm—SARL heuristic
The following is an outline of our scalable heuristic [18].

Step 1: Find the frequent one- and two-itemsets using the Apriori algorithm (when minsup is high) or the direct generation method (when minsup is low).

Step 2: Construct the item association graph (IAG) from the result of step 1.

Step 3: Partition the IAG using the multilevel k-way partitioning algorithm (MLkP).

Step 4: Partition the dataset according to the result of step 3.

Step 5: Call the modified Apriori algorithm or the FP-Growth algorithm to mine frequent itemsets on each transaction partition.

Step 6: Find the union of the results found from each partition.

Step 7: Generate association rules by running the Apriori ap-genrules procedure on the frequent itemsets found in step 6.
An example
Suppose the microarray dataset in Table 1 is given, minsup is set to 0.2 (or 20%, i.e., \(8\times 0.2\approx 2\) occurrences), and minconf is set to 0.7 (or 70%). We select assay 8 as the reference assay and then calculate the relative expression levels. The results are shown in Table 2.
Next, we calculate the log-scale gene expression levels by taking the log base 2, \({\mathrm{log}}_{2}(x)\), where x is the relative expression level. The results are shown in Table 3.
Then we normalize the expression levels by applying a threshold to the logscale expression levels. Here, we choose the threshold to be 1, meaning all logscale expression levels that are above 1 or below −1 are set to 1 and −1, respectively. Levels between −1 and 1 are set to 0. The results are shown in Table 4.
Next, we transpose the matrix to prepare it for the transactional dataset. Each row is now an assay and each column is a gene. The results are shown in Table 5.
Finally, we convert the transposed matrix into a transactional dataset, shown in Table 6: each expression level that equals −1 or 1 is transformed into an item. In Table 6, the items of each transaction include the genes that are overexpressed (denoted by a + symbol) and the genes that are underexpressed (denoted by a − symbol). For example, the transaction with TID T000 is an assay that contains three significantly (over- or under-) expressed genes: gene 1 (underexpressed), gene 2 (overexpressed), and gene 5 (overexpressed).
Now, we use the Apriori algorithm to find the frequent two-itemsets. As an intermediate step, the Apriori algorithm finds the frequent one-itemsets first (shown in Table 7):
The frequent two-itemsets are found afterward (shown in Table 8):
Next, we transform the above frequent two-itemsets into an item association graph (IAG), shown in Fig. 2:
To construct the graph, we first take the itemset {−1, +2} with support 3. For this, we create node −1 and node +2 corresponding to the two items in the itemset. The edge between node −1 and node +2 has weight 3, representing the support of the itemset. The process is repeated for every frequent two-itemset found in the previous step.
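This construction can be sketched as follows (our own function name); each frequent two-itemset contributes one weighted edge:

```python
def build_iag(frequent_two_itemsets):
    """Item-association graph: one node per item and, for every frequent
    two-itemset, one edge whose weight is the itemset's support.

    frequent_two_itemsets: {frozenset({item_a, item_b}): support}.
    Returns (nodes, edges) with edges keyed by sorted item pairs.
    """
    nodes, edges = set(), {}
    for itemset, sup in frequent_two_itemsets.items():
        u, v = sorted(itemset)
        nodes.update((u, v))
        edges[(u, v)] = sup
    return nodes, edges
```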
Subsequently, we use the multilevel k-way partitioning algorithm (MLkP) to partition the IAG. In this case, the number of nodes is small, so we only bisect the graph by setting k = 2. The result is shown in Figs. 3 and 4.
The MLkP algorithm divides the IAG into two equal or almost equal sets in near-linear time while keeping the sum of the weights of the cut edges small.
Next, we partition the dataset according to the partitions of the IAG, as shown in Tables 9 and 10. Each transaction partition has all the items from the corresponding IAG partition. However, since the algorithm has already found all the frequent one- and two-itemsets, a transaction is not added to a transaction partition if the transaction has fewer than three items. For example, T000: {−1, +2} is not added to transaction partition 1, since it only has two items. Some items in the original dataset may not appear in any of the transaction partitions, because the infrequent one-/two-itemsets are dropped from the IAG. This simplifies the subsequent computations. In this example, however, all the items are kept in the IAG because the IAG is a relatively dense graph. Tables 11 and 12 show the transaction partitions:
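A sketch of this projection rule (hypothetical names): each transaction is projected onto each partition's item set, and projections with fewer than three items are dropped because the frequent one- and two-itemsets are already known:

```python
def partition_transactions(transactions, partitions):
    """Project each transaction onto every IAG partition's item set.

    transactions: {tid: [items]}; partitions: list of item sets.
    Projections with fewer than three items are dropped, since all frequent
    one- and two-itemsets have already been found.
    """
    result = [[] for _ in partitions]
    for tid, items in transactions.items():
        for i, part_items in enumerate(partitions):
            projection = [x for x in items if x in part_items]
            if len(projection) >= 3:
                result[i].append((tid, projection))
    return result
```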
The next step is to pick the best algorithm and use it to find the frequent k-itemsets with k > 2. For this example, we choose the modified Apriori algorithm because it is faster at mining small datasets, as it avoids finding the one- and two-itemsets again. The results from partition 1 are shown in Table 11:
Since the modified Apriori algorithm starts with three-itemsets, there are no additional frequent itemsets in the first partition. Table 12 shows the results found in transaction partition 2:
The final results (shown in Table 13) of frequent itemsets are simply the union of Tables 7, 8, and 11:
After running the Apriori ap-genrules procedure, the association rules can be found in Table 14.
All the frequent itemsets generated by the SARL heuristic are sound, meaning each frequent itemset generated is indeed correct and its support count is accurate. However, it is possible that some frequent itemsets cannot be found by the SARL heuristic, as will be discussed shortly. In this example, the SARL heuristic loses one frequent itemset, {−1, −4, −5}, and the two related rules generated from it.
The SARL (scalable association rule learning) heuristic
We introduced the SARL heuristic in our previous paper [18]. SARL is a highly scalable heuristic algorithm for association rule learning problems on horizontal transactional datasets. In this paper, a modified version of SARL serves as the core of our algorithm. A summary of the SARL heuristic is shown below. A more detailed and formal description, including the pseudocode, can be found in the paper that introduces SARL.
The Apriori algorithm or the direct counting and generation algorithm is used to generate the frequent one- and two-itemsets, depending on the size of the dataset. Apriori is faster on very large datasets, while the direct counting and generation algorithm is faster on small and medium-sized datasets. SARL then builds the item association graph (IAG) based on the frequent two-itemsets. Each frequent two-itemset is converted into an edge of the IAG, and each item in the itemset is converted into a node. Then, the MLkP algorithm is used to partition the IAG into k subgraphs, and the dataset is partitioned based on the subgraphs. Each partition of the dataset contains all the items (nodes) of one subgraph of the IAG across all the transactions of the dataset. During this process, some transactions may end up undivided, and all the possible frequent itemsets related to these transactions will be preserved in later stages. Next, the Apriori algorithm or the FP-Growth algorithm is selected based on an analysis of the dataset to ensure the most efficient execution. Finally, SARL calls the selected algorithm on each dataset partition to complete the computation. If the Apriori algorithm is selected, SARL calls the modified Apriori that starts from the frequent three-itemset computation to avoid redundant work.
The SARL heuristic divides the dataset into k partitions. The size of each partition should be smaller than \(\frac{1}{k}\times\) the size of the dataset, because the dataset is partitioned according to the IAG, and the number of items (nodes) in the IAG should be smaller than the number of unique items in the dataset. A more detailed explanation can be found in the Transaction Partitioning section of the Appendix. This indicates that each dataset partition can always fit into memory. All later steps of the SARL heuristic benefit significantly from processing the dataset in memory rather than on disk.
We also conducted a thorough time and space complexity analysis and an errorbound analysis that can be found in the Appendix section of this paper.
Ranking of association rules
Considering the nature of microarray datasets, the number of unique items (genes) is usually large. This leads to a tremendous number of association rules. Therefore, it is necessary to rank the association rules by their importance so that the results can be easily used. This study aims to help scientists explore and validate new association rules more efficiently.
To achieve this goal, we introduce the following measurement of the importance of a rule \(r: X\to Y\): \({I}_{r}={L}_{r}\times \left(\frac{\sum {E}_{gr}}{n}+{B}_{r}\right)\)
where \({L}_{r}=\frac{conf(r)}{\mathrm{sup}(r)}\), and \({E}_{g}= \sqrt{\frac{\sum_{j=1}^{m}{\left({K}_{j}-{K}_{g}\right)}^{2}}{m}}\)
In the above, \({I}_{r}\) is the importance of the rule, \({L}_{r}\) is the lift of the rule \(r\), n is the number of unique genes included in rule r, \({E}_{gr}\) is the RMS deviation of the expression level of gene g that is included in rule r, \({K}_{j}\) is the gene expression level of gene j, where j represents all other genes; \({K}_{g}\) is the gene expression level for gene g, and \({B}_{r}\) is the bias applied to this rule. The bias should be positive if a disease is in the rule.
The intuition here is to emphasize three factors related to the importance of an association rule. The first is the lift of a rule [19]. A higher lift indicates the rule has a higher response compared to the other rules. If the lift value is large, then the antecedent and the consequent of the rule are more dependent on each other, which further indicates a high significance of the rule. The second factor is the average significance of the genes included in the rule. If all or most of the genes are significant, then the rule is likely to be more important. When we convert the microarray dataset into a transactional dataset, the absolute gene expression levels are converted into relative expression levels, and some information related to the absolute levels is lost. Here, we reconsider the influence of the absolute gene expression levels and calculate the average significance of a gene based on it. \({E}_{g}\), the deviation of the absolute gene expression level, is calculated from the differences between the expression level of gene g and those of the other genes. The significance of a rule contributed by its genes is then calculated by averaging \({E}_{g}\) over each gene g included in rule r.
For example, assume the rule \(G1\to G2\) is an association rule found by the SARL heuristic. Genes 1, 2, and 3 have expression levels of 10, 8, and 5, respectively. The rule has a confidence of 0.7 and a support of 10. Then we can calculate \({L}_{r}=\frac{conf(r)}{\mathrm{sup}(r)}=0.07\), \({E}_{1}= \sqrt{\frac{{\left({K}_{2}-{K}_{1}\right)}^{2}+{\left({K}_{3}-{K}_{1}\right)}^{2}}{2}}\approx 3.8\), and \({E}_{2}= \sqrt{\frac{{\left({K}_{3}-{K}_{2}\right)}^{2}+{\left({K}_{1}-{K}_{2}\right)}^{2}}{2}}\approx 2.5\). Since the rule does not involve a disease, the bias is set to 0. Hence, \({I}_{r}={L}_{r}\times \left(\frac{\sum {E}_{gr}}{n}+{B}_{r}\right)=0.07\times \left(\frac{3.8+2.5}{2}+0\right)\approx 0.22\).
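The worked example can be checked programmatically. The sketch below (our own function name) follows the formula above, with the paper's \(L_r = conf(r)/\mathrm{sup}(r)\):

```python
import math

def rule_importance(conf, sup, gene_levels, rule_genes, bias=0.0):
    """I_r = L_r * (average E_g over the genes in the rule + B_r), where
    L_r = conf(r)/sup(r) and E_g is the RMS deviation of gene g's
    expression level from the levels of all other genes."""
    lift = conf / sup

    def e(g):
        others = [k for name, k in gene_levels.items() if name != g]
        return math.sqrt(sum((k - gene_levels[g]) ** 2 for k in others)
                         / len(others))

    avg_e = sum(e(g) for g in rule_genes) / len(rule_genes)
    return lift * (avg_e + bias)
```

For the example rule with confidence 0.7, support 10, and expression levels 10, 8, and 5, this reproduces \(I_r \approx 0.22\).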
We incorporate the three most important measurements in the ranking of the rules. The lift measurement generally addresses the ranking of the significance of each rule, the average gene significance traces back to the microarray dataset and considers the importance of each gene, and the bias \({B}_{r}\) highlights the rules that involve disease information.
Experiments and results
We have designed and conducted experiments on small and large microarray datasets to demonstrate the scalability and accuracy of our algorithm. The experiments are based on the following configuration:

OS: macOS Big Sur

CPU: Apple M1

Memory: 8 GB

Disk: 256 GB, SSD

Programming Language: Python 3.7
All three datasets are downloaded from ArrayExpress [20]. We test the SARL algorithm on each dataset with various minsup configurations. Here, minsup refers to the minimum number of occurrences rather than a percentage.
E-MTAB-9030—microRNA profiling of muscular dystrophies dataset
The dataset has the following metrics:

File size: 4 KB

Number of genes: 29

Number of assays: 15
This is a relatively small dataset. The experiments are done repeatedly for minsup of 5, 4, 3, 2. The results are shown in Table 15.
According to Fig. 5, the SARL heuristic runs faster than the Apriori algorithm on all minsup configurations, and the running time grows as minsup decreases. For the test case where minsup is 2, the SARL algorithm runs 26 times faster than the Apriori algorithm.
Figure 6 shows that the accuracy of the SARL heuristic is between 0.62 and 0.67, i.e., the results are 62% to 67% accurate with respect to the 100 most important frequent itemsets derived by the Apriori algorithm. This accuracy seems acceptable considering the purpose of this research and the speedup, i.e., to provide a computational tool that can more quickly derive the important associations among genes for iterative investigation.
E-MTAB-8615
Molecular characterisation of TP53 mutated squamous cell carcinomas of the lung identifies BIRC5 as a putative target for therapy
The dataset has the following metrics:

File size: 73.4 MB

Number of genes: 58,202

Number of assays: 209
This dataset is larger than the previous one. Traditionally, finding association rules with the Apriori algorithm on the full dataset will take an extremely long time.
The result of the experiment is shown in Table 16.
According to Fig. 7, the SARL heuristic performs similarly to the Apriori algorithm for minsup values between 10 and 3, and both algorithms finish within 2 s. This is because the SARL heuristic has a small overhead, and the size of the processed data is very small at these minsup configurations. However, at minsup = 2, the SARL heuristic outperforms the Apriori algorithm by a large margin, finishing the task in less than 45 s compared to 279 s for Apriori. Figure 8 shows that the SARL heuristic has 100% accuracy across all minsup configurations. We believe the SARL heuristic performs better than the Apriori algorithm overall on this dataset because it achieves the same result in a fraction of the time.
E-MTAB-6703—A microarray meta-dataset of breast cancer
The dataset has the following metrics:

File size: 780.2 MB

Number of genes: 20,546

Number of assays: 2302
This dataset is about ten times larger than the second dataset. However, given the purpose of this research, we believe the total number of rules generated from the previous dataset is already overwhelmingly large. Therefore, we also reduced this dataset using the method described in this paper to speed up the calculation.
The experiment results are shown in Table 17.
Figure 9 shows a performance pattern similar to the previous datasets, with SARL performing even better on this larger dataset: the SARL heuristic is 700 times faster than the Apriori algorithm at minsup = 2. More surprisingly, according to Fig. 10, SARL is accurate on all minsup configurations.
Discussion and conclusions
In this paper, we proposed a new algorithm for association rule learning specifically designed for microarray datasets. The SARL heuristic algorithm utilizes the ternary discretization method, the divide-and-conquer paradigm, graph theory, and a graph partitioning algorithm to significantly speed up the association rule learning process compared to traditional algorithms. The algorithm is also space efficient. The importance-based rule ranking algorithm saves time for researchers by showing the most important rules first. The rules found and ranked by the SARL heuristic cover both gene–gene rules and gene-disease rules. We compared our algorithm with Apriori, the most commonly used association rule learning algorithm, through a series of experiments. The results show that our algorithm achieves a significant speedup while still maintaining high accuracy.
Some potential drawbacks of our algorithm include: 1. There is a small probability that some non-trivial rules are lost in the dataset partitioning stage. 2. For small datasets or high support values, the running time of our algorithm may be similar to, or slightly higher than, that of the Apriori algorithm due to the overhead.
In the future, we plan to extend our work with the following tasks:

Develop a parallel version of the SARL heuristic and its implementation. The transaction partitions can be treated as independent datasets, so we can easily run the modified Apriori algorithm or the FP-Growth algorithm on each transaction partition in parallel and then merge the results (frequent itemsets of size three or more) with the frequent one- and two-itemsets to obtain all frequent itemsets. The parallel processors do not need to communicate with each other during the computation, since all the information needed is already included in each local dataset. This would result in maximum utilization of each processor.

A better algorithm may be used to predict the proper thresholds for the ternary discretization. The current ternary discretization is based on empirical methods and may need several tests to find the best thresholds that reduce the dataset to a smaller size while keeping enough information. A statistical analysis of the dataset may help to decide the boundaries. Furthermore, we could incorporate a deep neural network approach in this process to predict the best thresholds.

Incremental learning on multiple datasets: Nowadays, new microarray datasets are added to databases around the world daily. Some of them focus on the same sets of genes, while others have overlapping gene components. This raises an interesting question: can we learn association rules across multiple microarray datasets to obtain a larger number of more convincing rules? The answer is yes: it is possible and quite useful to learn association rules from multiple datasets. In fact, a prominent advantage of the SARL algorithm is its ability to perform incremental learning across multiple datasets.
Assume we have already run the SARL algorithm to learn association rules on dataset A, which includes genes G1, G2, and G3. Now, a dataset B is added with genes G1, G2, G3, and G4. We can extend the association rules from dataset A over G1, G2, and G3 to dataset B. G4 is removed from B since we cannot associate G4 with dataset A. First, we compare the reference conditions (assays) between datasets A and B and find a coefficient for each gene expression level:

\(c1=\frac{A1}{B1}\)

\(c2=\frac{A2}{B2}\)

\(c3=\frac{A3}{B3}\)
A1 through A3 are the expression levels of the reference condition in dataset A, and B1 through B3 are the expression levels of the reference condition in dataset B. c1, c2, and c3 are the coefficients we want to find.
Next, all expression levels in dataset B are divided by the corresponding coefficient:

\({E}_{i}^{\prime}=\frac{{E}_{i}}{{c}_{i}}\)

where \({E}_{i}\) is the expression level of gene \(i\), and \({c}_{i}\) is the coefficient found in the previous step for gene \(i\). Once the expression levels in dataset B are adjusted for the differences in experimental conditions, we are ready to run the SARL algorithm on dataset B.
The following is an example of combining two datasets:
According to Tables 18 and 19, assuming Assay2 is selected as the reference condition in both datasets, we may calculate c1, c2, and c3 as:
c1 = 7/2 = 3.5
c2 = 2/5 = 0.4
c3 = 1/7 ≈ 0.14
After dividing all the expression levels in dataset B by the corresponding coefficients, we have a combined dataset (shown in Table 20):
The process of learning association rules on datasets A and B combined is simple. We run the SARL algorithm on the normalized dataset B until all support values are found. We can then merge the support values found for dataset B with the support values found for dataset A. After eliminating infrequent itemsets based on the new minsup value, we can generate the association rules.
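As an illustration, the normalization step above can be sketched in Python. The function name and the per-assay dictionary layout are our own hypothetical choices, not the paper's implementation; the reference-condition values come from the worked example, while the assay row for dataset B is invented for demonstration.

```python
from typing import Dict, List

def normalize_dataset_b(ref_a: Dict[str, float], ref_b: Dict[str, float],
                        assays_b: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Adjust dataset B to dataset A's scale: compute c_i = A_i / B_i from the
    shared reference condition, then divide every expression level in B by c_i."""
    coeffs = {g: ref_a[g] / ref_b[g] for g in ref_a if g in ref_b}
    # Genes absent from dataset A (e.g. G4) are dropped, since no
    # coefficient can be derived for them.
    return [{g: level / coeffs[g] for g, level in assay.items() if g in coeffs}
            for assay in assays_b]

# Reference condition (Assay2) values from the worked example:
ref_a = {"G1": 7.0, "G2": 2.0, "G3": 1.0}                     # dataset A
ref_b = {"G1": 2.0, "G2": 5.0, "G3": 7.0, "G4": 3.0}          # dataset B
assays_b = [{"G1": 4.0, "G2": 10.0, "G3": 14.0, "G4": 6.0}]   # hypothetical assay
normalized = normalize_dataset_b(ref_a, ref_b, assays_b)      # G4 is dropped
```

After this normalization, SARL can be run on the adjusted dataset B, and its support counts merged with those previously found for dataset A.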
Availability of data and materials
The datasets analyzed during the current study are available in ArrayExpress [https://www.ebi.ac.uk/arrayexpress/] and The Graph Partitioning Archive [https://chriswalshaw.co.uk/partition].
Abbreviations
 k-itemset:

An itemset with k items
 Minsup:

The minimum requirement of support. The user usually provides this. Itemsets with support < minsup are eliminated.
 Minconf:

The minimum requirement of confidence. The user usually provides this. Rules with confidence < minconf are eliminated.
 IAG:

Item Association Graph
 MLkP:

Multilevel k-way Partitioning algorithm
 SARL:

Scalable Association Rule Learning
 E_g:

Importance of gene g.
 I_r:

Importance of rule r
References
Agrawal R, Srikant R. Fast algorithms for mining association rules. In: Proc. 20th int. conf. very large data bases, VLDB, Vol. 1215; 1994, p. 487–99.
Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. ACM SIGMOD Rec. 2000;29(2):1–12.
Buluç A, Meyerhenke H, Safro I, Sanders P, Schulz C. Recent advances in graph partitioning. In: Algorithm engineering. Cham: Springer; 2016, p. 117–58.
Kernighan BW, Lin S. An efficient heuristic procedure for partitioning graphs. Bell Syst Tech J. 1970;49(2):291–307.
Karypis G, Kumar V. Multilevel k-way partitioning scheme for irregular graphs. J Parallel Distrib Comput. 1998;48(1):96–129.
McSherry F. Spectral partitioning of random graphs. In: Proceedings 42nd IEEE symposium on foundations of computer science. IEEE; 2001, p. 529–37.
Galinier P, Boujbel Z, Fernandes MC. An efficient memetic algorithm for the graph partitioning problem. Ann Oper Res. 2011;191(1):1–22.
Sanders P, Schulz C. Engineering multilevel graph partitioning algorithms. In: European symposium on algorithms. Berlin, Heidelberg: Springer; 2011, p. 469–80.
Walshaw C. The graph partitioning archive; 2020. https://chriswalshaw.co.uk/partition/.
Alagukumar S, Lawrance R. A selective analysis of microarray data using association rule mining. Procedia Comput Sci. 2015;47:3–12.
Cong G, Tung AK, Xu X, Pan F, Yang J. Farmer: finding interesting rule groups in microarray datasets. In: Proceedings of the 2004 ACM SIGMOD international conference on management of data; 2004, p. 143–54.
Huang Z, Li J, Su H, Watts GS, Chen H. Large-scale regulatory network analysis from microarray data: modified Bayesian network learning and association rule mining. Decis Support Syst. 2007;43(4):1207–25.
Dudoit S, Fridlyand J. Introduction to classification in microarray experiments. In: A practical approach to microarray data analysis. Boston: Springer; 2003, p. 132–49.
Zhang BT, Hwang KB. Bayesian network classifiers for gene expression analysis. In: A practical approach to microarray data analysis. Boston: Springer; 2003, p. 150–65.
Mukherjee S. Classifying microarray data using support vector machines. In: A practical approach to microarray data analysis. Boston: Springer; 2003, p. 166–85.
Li L, Weinberg CR. Gene selection and sample classification using a genetic algorithm and k-nearest neighbor method. In: A practical approach to microarray data analysis. Boston: Springer; 2003, p. 216–29.
Quackenbush J. Microarray data normalization and transformation. Nat Genet. 2002;32(4):496–501.
Li H, Sheu PCY. A scalable association rule learning heuristic for large datasets. J Big Data. 2021;8(1):1–32.
McNicholas PD, Murphy TB, O’Regan M. Standardising the lift of an association rule. Comput Stat Data Anal. 2008;52(10):4712–21.
Athar A, et al. ArrayExpress update—from bulk to single-cell expression data. Nucleic Acids Res. 2019. https://doi.org/10.1093/nar/gky964. PMID: 30357387.
Acknowledgements
Not applicable
Funding
Not applicable.
Author information
Affiliations
Contributions
Both HL and PS are major contributors to writing the manuscript and have made substantial contributions to the conception and design of the work. HL created the software used in this paper. Both authors have approved the submitted version, have agreed to be personally accountable for their own contributions, and have agreed to ensure that questions related to the accuracy or integrity of any part of the work, even ones in which the author was not personally involved, are appropriately investigated, resolved, and the resolution documented in the literature. Both authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Appendix
The appendix section is intended to provide a brief overview of the SARL heuristic [18].
Theorems
The following are the theorems we proposed in our SARL heuristic paper [18]. The proofs are available in the same referenced paper.

Theorem 1: Soundness—all frequent itemsets and association rules generated by the SARL heuristic are correct.

Theorem 2: Computing the frequent two itemsets is considered relatively trivial compared to computing the frequent three or more itemsets.

Theorem 3: Consider a value of minsup such that the fraction of frequent one-itemsets over the total number of unique items d, denoted by f, is less than (1 − the maximum imbalance rate), where the maximum imbalance rate is usually set to 3% based on the MLkP algorithm. If the partition by MLkP is k-way, then each partition contains fewer than d/k unique items, where d is the total number of unique items in the original dataset. As a consequence, the complexity of each partition is reduced.
Finding frequent two-itemsets using the Apriori algorithm or direct_gen algorithm
The first step of the SARL heuristic is to find the frequent 2 itemsets efficiently.
Although the Apriori algorithm has scalability issues for very large datasets, it provides a fast and convenient feature to extract intermediate results and a tolerable speed for the first two passes.
Another method to find frequent one- and two-itemsets is through direct counting and generation. The algorithm to find frequent one-itemsets is the same as in the Apriori algorithm. To find frequent two-itemsets, we can simply enumerate all two-item pairs in each transaction and count their occurrences. The advantage of this algorithm is that it does not require candidate generation from L1 and avoids many unnecessary membership tests during support counting. However, this method is not efficient on large datasets, since it does not use pruning and stores all two-itemsets.
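A minimal sketch of this direct counting idea in Python (our own illustrative code, not the paper's implementation; `direct_gen` here mirrors the description above):

```python
from collections import Counter
from itertools import combinations

def direct_gen(transactions, minsup):
    """Count one- and two-itemsets directly, without Apriori-style
    candidate generation: every two-item pair in each transaction
    is enumerated and tallied, then infrequent entries are dropped."""
    c1, c2 = Counter(), Counter()
    for t in transactions:
        items = sorted(set(t))
        c1.update(items)
        c2.update(combinations(items, 2))   # all two-item pairs in t
    l1 = {i: s for i, s in c1.items() if s >= minsup}
    l2 = {p: s for p, s in c2.items() if s >= minsup}
    return l1, l2

l1, l2 = direct_gen([["a", "b", "c"], ["a", "b"], ["b", "c"]], minsup=2)
# l1 == {'a': 2, 'b': 3, 'c': 2}; l2 == {('a', 'b'): 2, ('b', 'c'): 2}
```

Note that `c2` holds every pair seen, frequent or not, which is exactly the memory cost the paragraph above warns about on large datasets.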
In the SARL heuristic, we ask the user for a threshold of the dataset size. If the dataset is larger than the threshold, the SARL heuristic will use the modified Apriori algorithm. Otherwise, it will use the direct_gen algorithm to compute the frequent one and two itemsets.
Construction of the item association graph
The item association graph G is constructed from the frequent two-itemsets T generated by the Apriori algorithm. G is an undirected, weighted graph. A node \({V}_{i}\) is created for each unique item i in T, with the maximum item number being n.
An edge \({E}_{ij}\) is added to G for each itemset {i, j} in T, and the weight of \({E}_{ij}\) is set to the support of itemset {i, j} in T.
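The construction can be sketched as follows. The plain-dictionary graph representation is our own choice for illustration; the paper does not prescribe a data structure:

```python
def build_iag(frequent_two_itemsets):
    """Build the item association graph: one node per unique item,
    one undirected edge per frequent 2-itemset, weighted by that
    itemset's support."""
    nodes, edges = set(), {}
    for (i, j), support in frequent_two_itemsets.items():
        nodes.update((i, j))
        edges[frozenset((i, j))] = support   # undirected: {i, j} == {j, i}
    return nodes, edges

# Hypothetical frequent two-itemsets with their supports:
nodes, edges = build_iag({("g1", "g2"): 5, ("g2", "g3"): 3})
```

The resulting weighted graph is what the MLkP partitioner consumes in the next step.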
Partition the IAG using the multilevel kway partitioning algorithm (MLkP)
The Multilevel kway partitioning (MLkP) algorithm [17] is an efficient graph partitioning algorithm. The time complexity is O(E), where E is the number of edges in the graph.
The general idea of MLkP is to shrink (coarsen) the original graph into a smaller graph, then partition the smaller graph using an improved version of the KL/FM algorithm. Lastly, it restores (uncoarsen) the partitioned graph to a larger, partitioned graph.
METIS is a software package developed by Karypis at the University of Minnesota [18]. It includes an implementation of the MLkP algorithm that takes a graph as input and outputs the groups of nodes produced by the partition.
Transaction partitioning
Based on the results of the MLkP algorithm, which divides the items into groups P_{1}, P_{2},…,P_{m}, we can partition the transactions into the same number of groups, where each group \({D}_{i}\) contains only the items in partition \({P}_{i}\). This guarantees that each partition fits into memory. Refer to paper [18] for more details.
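A minimal sketch of this projection step, assuming the item groups produced by MLkP are given as sets (our own illustrative code):

```python
def partition_transactions(transactions, item_groups):
    """Project each transaction onto every item partition P_i, producing
    m smaller datasets D_i that each contain only P_i's items."""
    parts = [[] for _ in item_groups]
    for t in transactions:
        for idx, group in enumerate(item_groups):
            projected = [item for item in t if item in group]
            if projected:                 # skip empty projections
                parts[idx].append(projected)
    return parts

# Two item groups from a hypothetical 2-way partition:
parts = partition_transactions([["a", "b", "x"], ["x", "y"]],
                               [{"a", "b"}, {"x", "y"}])
# parts[0] == [["a", "b"]]; parts[1] == [["x"], ["x", "y"]]
```

Each `parts[i]` can then be mined independently, which is what makes the per-partition algorithm choice and the parallel extension possible.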
Selecting an algorithm on transaction partitions
One of the benefits that come with our solution is that the association rule learning on each transaction partition can be optimized by using an algorithm that best fits the partition.
Since the modified Apriori algorithm has already computed the one itemsets and two itemsets during the preparation phase, the candidate generation feature of the Apriori algorithm is handy in this case. We modify the Apriori algorithm to skip the frequent one/two itemsets finding stages and start with the frequent three itemsets from the transaction partitions. This modification is particularly helpful when minsup is set to a high value so that the expected number of itemsets is limited after the two itemsets are found.
The average transaction length provides a fast and straightforward reference for selecting the best algorithm for each transaction partition. The SARL heuristic chooses between the modified Apriori algorithm and the FP-Growth algorithm to complete the computation of frequent itemsets.
Time complexity and space complexity
The theoretical time and space complexity of the Apriori algorithm is \(O({2}^{d})\) where d is the number of unique items in the dataset.
Time complexity
If the modified Apriori algorithm is selected, the theoretical time complexity for each partition is \(O({2}^{\frac{1.03d}{k}})\), where the coefficient 1.03 comes from the 3% maximum imbalance of the partitions caused by the MLkP algorithm. The total running time for all the partitions is \(O\left(k\cdot {2}^{\frac{1.03d}{k}}\right)=O({2}^{\frac{1.03d}{k}})\), and the total time complexity of the SARL algorithm, when the modified Apriori algorithm is selected, is \(O\left({d}^{2}T+n+{d}^{2}+{d}^{2}+n+{2}^{\frac{1.03d}{k}}\right)=O({d}^{2}T+n+{2}^{\frac{1.03d}{k}})\). Assuming \(n\gg d\) and \({2}^{\frac{1.03d}{k}}\gg n\), the time complexity can be simplified to \(O({2}^{\frac{1.03d}{k}})\). Compared with the time complexity of the Apriori algorithm, SARL is \(O\left(\frac{{2}^{d}}{{2}^{\frac{1.03d}{k}}}\right)=O({2}^{\frac{k-1.03}{k}d})\) times faster than the Apriori algorithm. The exponential speedup comes from the smaller number of unique items in each transaction partition: the algorithm chosen to mine frequent itemsets from the transaction partitions only needs to consider a portion of all the items for each partition.
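For intuition, the exponent of the theoretical speedup factor \({2}^{\frac{k-1.03}{k}d}\) can be evaluated with a small helper (a hypothetical function of ours, for illustration only):

```python
def sarl_speedup_exponent(d: int, k: int) -> float:
    """Exponent e in the theoretical speedup factor 2**e, where
    e = (k - 1.03) / k * d, following the big-O analysis above."""
    return (k - 1.03) / k * d

# e.g. d = 100 unique items split into k = 4 partitions:
exponent = sarl_speedup_exponent(100, 4)   # ≈ 74.25
```

Even for modest d and k the exponent is large, which is consistent with the speedups growing with dataset size in the experiments.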
Space complexity
If the modified Apriori algorithm is selected, the theoretical space complexity for each partition is \(O\left({2}^{\frac{1.03d}{k}}\right)\), where the coefficient 1.03 comes from the default 3% maximum imbalance of partitions caused by the MLkP algorithm. The total space complexity for all partitions is therefore \(O\left(k\cdot {2}^{\frac{1.03d}{k}}\right)=O({2}^{\frac{1.03d}{k}})\), and the total space complexity of the SARL heuristic, when the modified Apriori algorithm is selected, is \(O\left((3-1)\cdot {d}^{2}+\frac{n}{k}+{2}^{\frac{1.03d}{k}}\right)=O({d}^{2}+\frac{n}{k}+{2}^{\frac{1.03d}{k}})\). Assuming \(\frac{n}{k}\gg d\) and \({2}^{\frac{1.03d}{k}}\gg \frac{n}{k}\), the space complexity can be simplified to \(O({2}^{\frac{1.03d}{k}})\). Compared with the space complexity of the Apriori algorithm, SARL uses only \(O\left(\frac{{2}^{\frac{1.03d}{k}}}{{2}^{d}}\right)=O\left({2}^{\frac{1.03-k}{k}d}\right)=O(\frac{1}{{2}^{\frac{k-1.03}{k}d}})\) of the space used by the Apriori algorithm. The exponential reduction in space usage comes from the smaller number of unique items in each transaction partition: if the modified Apriori is chosen to mine frequent itemsets from the transaction partitions, it only generates a smaller number of candidates for each partition, since it does not consider items in other partitions.
Error bound
The SARL heuristic sacrifices some precision to obtain the speedup. However, every frequent itemset found by the algorithm is correct, and the support associated with each frequent itemset is also correct. The heuristic may miss some trivial frequent itemsets, i.e., itemsets with low support. During the IAG partition phase, the MLkP algorithm makes cuts on the IAG that minimize the sum of the weights of the cut edges. This helps prevent large-weight edges from being cut, while some trivial, small-weight (low-support) edges may be lost.
We can make a rough estimate by introducing a parameter \({P}_{out}\), the ratio of the edges cut off in the IAG: \({P}_{out}=\frac{{E}_{cut}}{{E}_{total}}\). This parameter is determined by the characteristics of the dataset, the choice of minsup, and the number of partitions we choose. \({P}_{out}\) is also a rough estimate of the error rate for the frequent itemsets of size two or more. Let \({P}_{m}\) be the ratio of frequent itemsets of size two or more, \({P}_{m}=\frac{\#\text{ frequent 2+ itemsets}}{\#\text{ total frequent itemsets}}\); then the total error bound can be computed as \({Error}_{total}={P}_{m}\cdot {P}_{out}\).
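The error bound amounts to a two-factor product, sketched here (a hypothetical helper of ours; in practice the counts would come from the partitioned IAG and the mined itemsets):

```python
def error_bound(e_cut: int, e_total: int, n_freq_2plus: int, n_freq_total: int) -> float:
    """Total error bound = P_m * P_out, where P_out is the fraction of IAG
    edges cut by MLkP and P_m is the fraction of frequent itemsets of
    size >= 2 among all frequent itemsets."""
    p_out = e_cut / e_total
    p_m = n_freq_2plus / n_freq_total
    return p_m * p_out

# Hypothetical numbers: 5% of IAG edges cut, 40% of frequent itemsets of size >= 2
bound = error_bound(5, 100, 40, 100)   # ≈ 0.02
```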
Benefits of having datasets fit into the memory
The transaction partitions are guaranteed to be small enough to fit into memory. Therefore, any operations performed on these in-memory datasets are faster than before. For example, the Apriori algorithm makes a number of passes over the dataset equal to the maximum length of the frequent itemsets, and each of these passes requires reading the dataset from disk. With our solution, the SARL heuristic makes at most two passes over the dataset: the first pass generates the frequent one- and two-itemsets, and in the second pass, the algorithm brings a fraction of the dataset into memory. All further passes are made directly in memory, resulting in a speedup.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, H., Sheu, P.C.-Y. A scalable association rule learning and recommendation algorithm for large-scale microarray datasets. J Big Data 9, 35 (2022). https://doi.org/10.1186/s40537-022-00577-4
Keywords
 Association rule learning
 Microarray dataset
 Frequent itemset mining
 Scalability
 Graph partitioning
 Apriori algorithm