A scalable association rule learning and recommendation algorithm for large-scale microarray datasets

Li, Haosong; Sheu, Phillip C.-Y.

doi:10.1186/s40537-022-00577-4

Research
Open access
Published: 28 March 2022

A scalable association rule learning and recommendation algorithm for large-scale microarray datasets

Haosong Li¹ &
Phillip C.-Y. Sheu¹

Journal of Big Data volume 9, Article number: 35 (2022) Cite this article

2958 Accesses
4 Citations
3 Altmetric
Metrics details

Abstract

Association rule learning algorithms have been applied to microarray datasets to find association rules among genes. With the development of microarray technology, larger datasets have been generated recently that challenge the current association rule learning algorithms. Specifically, the large number of items per transaction significantly increases the running time and memory consumption of such tasks. In this paper, we propose the Scalable Association Rule Learning (SARL) heuristic that efficiently learns gene-disease association rules and gene–gene association rules from large-scale microarray datasets. The rules are ranked based on their importance. Our experiments show the SARL algorithm outperforms the Apriori algorithm by one to three orders of magnitude.

Introduction

Microarray technology has been widely used in bioinformatics. It efficiently measures gene expression levels for a large number of genes. Therefore, a huge amount of data can be generated from microarray datasets. Microarray datasets, after converting to transactional datasets, usually have a large number of columns (genes) and a small number of rows (assays). Since the time complexity of any precise association rule learning algorithm is ${2}^{d}$, where d is the number of unique items (genes, in this case), such a large number of genes causes a huge challenge for all existing association rule learning algorithms. For a large microarray dataset, it is impractical to apply these algorithms to find all association rules (Fig. 1).

Researchers interested in deriving association rules from microarray datasets are most likely not to use every association rule. Furthermore, a large number of genes also results in an even higher number of association rules. The computer memory, which is limited compared to the disk space, can easily be used up to run a current association rule learning algorithm. With these in mind, we propose the Scalable Association Rule Learning (SARL) algorithm that focuses on the learning speed and the importance of rules derived.

As more microarray datasets are generated every day, investigators seeking potential associations between genes and between genes and diseases need a tool to find candidate rules across multiple datasets quickly. SARL is such a tool that provides scalable association rule learning and rule ranking. After having a general idea of candidate rules, investigators may choose to run a more time-costly algorithm that precisely calculate the rules on a few selected datasets. Therefore, by quickly reducing the scope of datasets and giving a general idea to the investigator, our algorithm can reduce the total time needed to find a target rule and increases the success rate.

Contributions

There are three main contributions in this paper:

1. The SARL heuristic with the divide and conquer methodology can increase the efficiency and scalability of association rule mining but still maintain reasonable accuracy.
2. The rule ranking algorithm calculates the importance and ranks the rules, so the investigator does not have to search through millions of rules.
3. We consider gene-disease associations. Each important association rule between genes and disease can be identified and highlighted in the result.

Related work

Several association rule learning algorithms have previously been applied to microarray datasets. The fundamental ones include the Apriori algorithm and the FP-Growth algorithm. Researchers have also created other algorithms or heuristics that find association rules with an approximation methodology.

1.
The Apriori Algorithm

The Apriori algorithm [1], introduced by Agrawal and Srikant, was the first efficient association rule learning algorithm. It incorporates various techniques to speed up the process as well as to reduce the use of memory. For example, the L_k-1 × L_k-1 method used in the candidate generation process can reduce the number of candidates generated, and the pruning process can significantly reduce the number of possible candidates at each level.

One of the most important mechanisms in the Apriori algorithm is the use of the hash tree data structure. It uses this data structure in the candidate support counting phase to reduce the time complexity from O(kmn) to O(kmT + n), where k is the average size of the candidate itemset, m represents the number of candidates, n represents the number of items in the whole dataset, and T is the number of transactions.

The major advantage of the Apriori algorithm comes from its memory usage because only the k-1 frequent itemsets, L_k-1, and the candidates in level k, C_k, need to be stored in the memory. It generates the minimum number of candidates based on the ${L}_{k-1}\times {L}_{k-1}$ (described in [1]) and the pruning method, and it stores them in the compact hash tree structure. In case the candidates fill up the memory from the dataset and a low minsup setting, the Apriori algorithm does not generate all the candidates to overload the memory. Instead, it generates as many candidates as the memory can hold.
2.
The FP-Growth Algorithm

The Frequent Pattern Growth algorithm was proposed by Han et al. in 2000 [2]. It uses a tree-like structure (called Frequent Pattern Tree) instead of the candidate generation method used in the Apriori algorithm to find the frequent itemsets. The candidate generation method finds the candidates of the frequent itemsets before reducing them to the actual frequent itemsets through support counting.

The algorithm first scans a dataset and finds the frequent one itemsets. Then, a frequent pattern tree is constructed by scanning the dataset again. The items are added to the tree in the order of their support. Once the tree is completed, the tree is traversed from the bottom, and a conditional FP-Tree is generated. Finally, the algorithm generates the frequent itemsets from the conditional FP-Tree.

The FP-Growth algorithm is more scalable than the Apriori algorithm in most cases since it makes fewer passes and does not require candidate generation. However, it suffers from memory limitations since the FP-Tree is relatively complex and may not fit in the memory. Traversing the complexed FP-Tree may also be time-expensive if the tree is not compact enough.
3.
Graph Partitioning Algorithms

One of the key steps in the SARL heuristic that we will introduce shortly is to partition the IAG (item association graph, section 7) into k balanced partitions. An efficient graph partitioning algorithm is crucial since the balanced graph partitioning problem is NP-complete [3]. We have implemented three algorithms and compared them for the partitioning costs and running times. They are the recursive version of the Kernighan-Lin Algorithm [4], the Multilevel k-way Partitioning Algorithm (MLkP) [5], and the recursive version of the Spectral Partitioning Algorithm [6]. Other graph partitioning algorithms include the Tabu search based MAGP algorithm [7] and the flow-based KaFFPa algorithm [8].

The Kernighan-Lin algorithm swaps the nodes assigned to both partitions and finds the largest decrease in the total cut size. The Multilevel k-way Partitioning algorithm (MLkP) uses coarsening-partitioning-uncoarsening/refining steps to shrink a graph into a much smaller graph. After partitioning, the graph is rebuilt to restore the original graph. A single global priority queue is used for all types of moves. The Spectral Partitioning Algorithm finds splitting of the values such that the vertices in a graph can be partitioned with respect to the evaluation of the Fiedler vector.

Experiments are conducted by us to compare the three algorithms. The datasets provided by Christopher Walshaw at the University of Greenwich [9] are used. They are chosen because they are, on the one hand, large enough for us to study scalability and, on the other hand, manageable by our machine.

The datasets are desired to be as large as possible while the partitioning algorithms can finish in a reasonable time on the tested machine. We also run experiments on complete graphs with 30 and 300 nodes. Each dataset is tested four rounds with the number of partitions (k) being 2, 4, 8, and 16.

As shown in Fig. 1, the MLkP algorithm has the highest speed in general. It is 560 times faster than the spectral partitioning algorithm and even faster than the recursive Kernighan-Lin algorithm. The spectral partitioning algorithm has, in general, the best partition quality. It is 1.3 times better than MLkP and much better than the recursive Kernighan-Lin algorithm. The recursive Kernighan-Lin algorithm takes too long to complete all five datasets. It also shows serious scalability issues for complete graphs.

Considering that the MLkP algorithm has the best overall performance, we choose to use this algorithm for graph partitioning in our algorithm.

1)
Applications Related to Association Rule Learning Algorithms on Microarray Data

Both Apriori and FP-Growth algorithms have been applied to microarray datasets [10], according to [10]. The main challenge for applying an existing association rule algorithm to a microarray dataset is the large number of items per transaction. Almost all microarray datasets have significantly more genes than assays. The existing approaches transpose the microarray dataset into transactional datasets. After transposing the dataset, we have significantly more columns than rows. This greatly increases the complexity of the existing association rule algorithms because they are designed for datasets with more rows than columns. With the existing algorithms, it can be shown that the FP-Growth algorithm performs slightly (around 10%) better than the Apriori algorithm. However, the author points out that a more scalable algorithm is needed to overcome the time and space complexities.

There are variations of association rule learning algorithms on microarray datasets. The FARMER algorithm [11] finds interesting rule groups instead of individual rules. The algorithm is efficient for finding some association rules between genes and labels.

Huang et al. propose a ternary discretization approach [12] that converts each gene expression level to one of the three levels: under-expressed, normal, and over-expressed. Compared to traditional binary classification methods, the ternary discretization approach captures the overall gene expression distribution to prevent serious information loss.

In summary, the existing variations of a traditional approach such as Apriori or FP-Growth algorithms have issues related to scalability or coverage. The algorithms and heuristics reported in this paper tolerate"" certain accuracy for better scalability so the investigator may navigate a dataset iteratively (We call it "iterative investigation" in this paper) to converge in the search process quickly.
2)
Other Data Mining Algorithms on Microarray Data

In addition to association rule learning algorithms, other data mining algorithms that address different problems have also been applied to microarray data.

Many researchers have studied classification problems on microarray data. The most popular application is classifying diseases based on gene expression levels [13]. Many algorithms have been applied to solve classification problems. The most studied algorithms include the Bayesian network [14], Support Vector Machine [15], and k-Nearest Neighbor [16].

These problems usually take gene expression levels as the input (features) and predict the disease(s) associated with an assay. It can also be used to classify tumors based on gene expression levels.

Our solution

Data preprocessing

Dataset reduction

It may not be very useful for a large dataset with hundreds of thousands of genes to find rules that cover all the genes. Since we may be only interested in over-expressed and under-expressed genes and all diseases, normally expressed genes could be eliminated from our dataset. We can also adjust the threshold of over-expressed and under-expressed genes to classify fewer genes as over/under-expressed ones.

A common approach is the following method that converts the gene expression levels into log-scale values [17].

First, we arbitrarily pick a reference assay and calculate the relative expression levels based on the reference assay. Assuming the absolute gene expression levels of the reference assay is ${E}_{r1}, {E}_{r2}\dots {E}_{rn}$, we can calculate the relative gene expression levels for another assay A as: ${R}_{A1}, {R}_{A2}\dots {R}_{An}=\frac{{E}_{r1}}{{E}_{A1}}, \frac{{E}_{r2}}{{E}_{A2}}\dots \frac{{E}_{rn}}{{E}_{An}}$ where ${E}_{A1}, {E}_{A2}\dots {E}_{An}$ are absolute gene expression levels for assay A. We can use the above method to calculate the relative gene expression levels for all other assays.

Next, the relative gene expression levels are used to find the log-scale gene expression levels. For each assay A, the log-scale gene expression levels are calculated as:

$${L}_{A1}, {L}_{A2}\dots {L}_{An}={\mathrm{log}}_{2}{R}_{A1}, {\mathrm{log}}_{2}{R}_{A2}\dots {\mathrm{log}}_{2}{R}_{An}$$

In the end, a user-defined threshold $h$ is used to filter out some normally expressed expression levels. A lower $h$ value means more gene expression levels are kept, and the computation time is longer. This step can dramatically reduce the size of the dataset while keeping valuable information.

Converting into transactional datasets

Microarray datasets are matrices of data. Each row of a matrix represents a gene, while each column represents an assay. However, to perform association rule learning, we need to convert microarray datasets into transactional datasets. Each row is an assay in a transactional dataset, and each "transaction" has a different number of genes.

Our algorithm transposes the matrix that we obtain from the earlier steps. Next, each log-scale gene expression level is converted into a ternary item [12]. If the level exceeds the positive threshold, an item $+G$ replaces the corresponding expression level where G is the gene number. Likewise, if the log-scale expression level is less than the negative threshold, it will be replaced by $-G$.

For example, if we have an assay that has genes G1, G2, and G3 with log-scale expression levels {-100, 10, 300}, respectively. Assuming the thresholds are -50 and + 50, we convert the expression levels into {-G1, + G3}. G2 is not included in the above transaction because its expression level is not significant.

Extracting disease information

We introduce disease information to the transactional dataset so that our association rule learning algorithm can derive gene-disease association rules. The prior approaches do not address disease information. To find association rules that involve genes and diseases, we need to convert the disease information associated with each assay into an item.

First, the disease information is extracted from the sample information. A disease, in this case, can be a specific disease or "normal." If the disease information is provided, our algorithm will copy the disease name as an item name to the corresponding assay (transaction). Therefore, for each transaction, there are one or more gene items and a disease item.

For example, an assay is labeled as "Tumor" in the original dataset and calculated in the above steps to have items {−G1, + G3}. The algorithm will add "Tumor" to the transaction to have {−G1, + G3, "Tumor"}.

Calculating gene importance

An important aspect of association rule ranking is evaluating the importance of each gene. The following is our approach for calculating gene importance, of which the importance of a gene can be viewed as the average degree of over/under expression in the dataset. We also want to consider $+G$ and $-G$ individually since they are considered different items in the transformed dataset. We define the gene importance for gene g, ${E}_{g}$, as below:

$${E}_{g}=\left|\frac{\sum_{j= 1}^{m}{K}_{j}}{m}-\frac{\sum_{i=1}^{t}{K}_{g}}{t}\right|$$

In the above, ${K}_{j}$ is the gene expression level of gene j, ${K}_{g}$ is the gene expression level for gene g. The first part, $\frac{\sum_{j= 1}^{m}{K}_{j}}{m}$, calculates the average expression level of all genes, and the second part, $\frac{\sum_{i=1}^{t}{K}_{g}}{t}$, finds the average expression level of gene g. The difference between the two is the deviation for gene g. If the deviation is high, the expression level of a gene is outstanding, and we can say it is important.

For example, if the average gene expression level of all the genes is 20, and we calculate that gene G1 has an average expression level of 100, while G2 has an average expression level of 2. Then ${E}_{g}$ for G1 and G2 are 80 and 18, respectively. Therefore, G1 should be ranked above G2.

The SARL-heuristic

Definitions

The following definitions are used in this paper:

1)
K-itemset: an itemset with k items
2)
Support: number of occurrences of an itemset in the dataset
3)
Minsup: the minimum requirement of support. The user usually provides this. Itemsets with support < minsup are eliminated.
4)
Confidence: the indication of robustness of a rule in terms of percentage. Confidence(XY) = support($X\cup Y$)/support(X)
5)
Minconf: the minimum requirement of confidence. The user usually provides this. Rules with confidence < minconf are eliminated.
6)
Item-Association Graph: a graph structure that stores the frequent associations between pairs of items.
7)
Balanced K-way Graph Partitioning Problem: Divide the nodes of a graph into k parts such that each part has almost the same number of nodes while minimizing the number of edges/sum of the edge weights that are cut off.
8)
${E}_{g}$: importance of gene g.
9)
${I}_{r}:\text{Importance of rule r}.$

A scalable heuristic algorithm—SARL-heuristic

The following is an outline of our scalable heuristic [18].

Step 1: Find frequent one and two itemsets using the Apriori algorithm (when minsup is high) or the direct generation method (when minsup is low).
Step 2: Construct the item association graph (IAG) from the result of step 1.
Step 3: Partition the IAG using the multilevel k-way partitioning algorithm (MLkP).
Step 4: Partition the dataset according to the result of step 3.
Step 5: Call the modified Apriori algorithm or the FP-Growth algorithm to mine frequent itemsets on each transaction partition.
Step 6: Find the union of the results found from each partition.
Step 7: Generate association rules by running the Apriori-ap-genrules on the frequent itemsets found from step 6.

An example

Suppose the microarray dataset in Table 1 is given, and minsup is set to 0.2 (or 20%, or $8*0.2\approx 2$ occurrences), and minconf is set to 0.7 (or 70%). We select assay 8 as the reference assay then calculate the relative expression levels. The results are shown in Table 2.

Table 1 Microarray dataset

A scalable association rule learning and recommendation algorithm for large-scale microarray datasets

Abstract

Introduction

Contributions

Related work

Our solution

Data preprocessing

Dataset reduction

Converting into transactional datasets

Extracting disease information

Calculating gene importance

The SARL-heuristic

Definitions

A scalable heuristic algorithm—SARL-heuristic

An example

The SARL (scalable association rule learning) heuristic

Ranking of association rules

Experiments and results

E-MTAB-9030—microRNA profiling of muscular dystrophies dataset

E-MTAB-8615

Molecular characterisation of TP53 mutated squamous cell carcinomas of the lung identifies BIRC5 as a putative target for therapy

E-MTAB-6703—A microarray meta-dataset of breast cancer

Discussion and conclusions

Availability of data and materials

Abbreviations

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Appendix

Appendix

Theorems

Finding frequent 2 itemsets using the apriori algorithm or dirct_gen algorithm

Construction of the item association graph

Partition the IAG using the multilevel k-way partitioning algorithm (MLkP)

Transaction partitioning

Selecting an algorithm on transaction partitions

Time complexity and space complexity

Time complexity

Space complexity

Error bound

Benefits of having datasets fit into the memory

Rights and permissions

About this article

Cite this article

Share this article

Keywords