Kavosh: an effective Map-Reduce-based association rule mining method

Barkhordari, Mohammadhossein; Niamanesh, Mahdi

doi:10.1186/s40537-018-0129-4

Research
Open access
Published: 14 July 2018

Kavosh: an effective Map-Reduce-based association rule mining method

Mohammadhossein Barkhordari¹ &
Mahdi Niamanesh¹

Journal of Big Data volume 5, Article number: 25 (2018) Cite this article

3646 Accesses
12 Citations
Metrics details

Abstract

The immense amount of data generated on a daily basis by various devices and systems necessitates a change in data analysis methods. As an important part of analytics, data mining methods require a paradigm shift to solve problems because the old methods cannot manage massive data. Association rule mining is a data mining algorithm used to solve various domain problems. Because of the immense volume of data, one-node solutions are no longer useful, and it is necessary to solve problems by using a distributed and shared-nothing architecture such as Map-Reduce. However, when association rule mining is transferred to these architectures, new problems appear. The main problems are lack of data locality and iteration support and process skewness. In this paper, a method is proposed that solves these problems. Kavosh converts data into a unified format that helps nodes perform their tasks independently without the need to exchange data with other nodes. In addition, the proposed method compresses input data to facilitate data management. Another advantage is the lack of process skewness because it is possible to allocate a predefined amount of data to each node. Kavosh omits iterations required for finding frequent itemsets by changing the Map-Reduce architecture. The proposed method is implemented using Hadoop, and the results are compared with open-source products in terms of three aspects: execution time, load balancing and data compression. The results show that Kavosh outperforms other methods in these aspects.

Introduction

With the growth of information, traditional analysis methods must be modified because they cannot handle immense amounts of data. Data mining algorithms are analytics that require a paradigm shift for algorithm execution and changes for deployment over nodes. Data mining algorithms must be modified so they can be executed over scalable and distributable environments. However, modifying data mining algorithms for distributed architecture is not easy. One problem with distributed architecture is data locality. This occurs when the required data for processing do not exist on the processor node.

One of the most prominent methods used to solve big data problems over shared-nothing architecture is Map-Reduce. Map-Reduce [1] is used in open-source solutions such as Hadoop and Spark. There are many products in addition to Hadoop and Spark that can be used to solve data mining problems. However, pure Map-Reduce suffers from the data locality problem and does not support iteration. Because nearly all data mining algorithms are iterative in nature, it is necessary to solve the iteration problem in Map-Reduce to solve data mining problems by using Map-Reduce-based methods.

Association rule mining is a data mining algorithm that requires iteration to find frequent itemsets. Methods such as Apriori [2] and FP-Growth [3] solve association rule mining problems. There are two main parameters for association rule mining: confidence and support. A rule in an association rule is defined as x → y (if x, then y), where x and y are itemsets. The support of a rule is defined as the frequency of an itemset in a database. The confidence of a rule is defined as

$$ Confidence\left( {x \to y} \right) = \frac{{{\text{Support}}\left( {{\text{x}} \cap {\text{y}}} \right)}}{{{\text{Support}}\left( {\text{x}} \right)}} $$

In this paper, Kavosh, which is a method for association rule mining, is proposed. This method is designed for a shared-nothing architecture and can properly solve the association rule mining problem in Map-Reduce.

This method solves the data locality problem completely. Thus, the nodes do not require data from other nodes. By changing and unifying the data format, data compression and data load balancing are supported. In addition, iteration is omitted in the proposed method; therefore, it is well-suited for the Map-Reduce architecture.

The remainder of this paper is structured as follows: In “Related works” section, related works are discussed. In the third section, Kavosh is presented. The fourth section describes an evaluation, and the final section presents the conclusions.

Related works

Map-Reduce is used to solve many problems over a shared-nothing architecture. This method is also used to solve association rule mining. There are four main problems with association rule mining on a shared-nothing architecture, namely, lack of support for the following:

Data locality,
Iteration,
Load balancing, and
Data compression.

As mentioned above, the data locality problem is a lack of required data on the processor node, which creates data dependency among nodes. This problem causes network congestion and increases the algorithm execution time. Another problem is lack of iteration support. When intermediate results are created by the Reducers, they must be fed as input to the Mappers for another iteration. This problem also increases the execution time because intermediate data must be written on the disk by the Reducers and read by the Mappers to initiate another iteration. The third problem is the load balancing problem. This occurs when the processes are not allocated to the Mappers fairly. This problem reduces the algorithm speed because all nodes must wait for a busy node after job completion. The last problem is lack of support for data compression. Because of the large amount of data, data volume reduction is necessary for more rapid iterations and intermediate result storage.

In some methods, traditional Apriori methods are used in Map-Reduce. Transactions are allocated to Mappers and frequent k-itemsets are extracted from each Mapper before the results are shuffled through combiners and the final k-itemsets are extracted according to support and confidence thresholds. In Oruganti et al. [4], Kovacs et al. [5], Li et al. [6], Mappers and Reducers are used, but in [7, 8], in addition to Mappers and Reducers, Combiners are used for better shuffling and to address performance issues.

In Lin et al [9], three methods are proposed: Single Pass Counting (SPC), Fixed Passes Combined-counting (FPC) and Dynamic Passes Combined-counting (DPC). In Apriori, each iteration must generally wait until the results of all the Reducers from the previous iteration have been generated.

The FiDoop [10] method uses three Map-Reduce phases to generate frequent itemsets. FiDoop claims to support automatic parallelization, load balancing, data distribution, and fault tolerance. ScaDiBino [11] extracts rules with the maximum length and one target field. This method omits iterations. In Yu et al. [12], the Distributed Parallel Apriori (DPA) algorithm is proposed and metadata are stored in the form of Transaction Identifiers (TIDs). In this method, a single scan of the database is required and a balanced workload among nodes is created.

Various proposed methods use FP-Growth on a shared-nothing architecture. In Li et al. [13], parallel FP-Growth is proposed for independently executing a group of tasks on a node. In Bechini et al. [14], MRAC and MRAC+ are proposed, which are Map-Reduce-based and FP-Growth-based methods, respectively, and FP-Growth was modified to overcome performance issues. In Yang et al. [15], a Hadoop-based method is proposed that uses a distributed DH-TRIE frequent pattern algorithm and tries to solve FP-Growth problems for big data. In Tlili et al. [16], a partitioning method is used to achieve load balancing problems for association rule mining. PARMA [17] creates multiple small random samples of the input data and runs a mining algorithm on them. Because it is implemented on small samples, the algorithm can be run in parallel and independently. The results are aggregated to produce the final results. In Yu and Zhou [18], Tidset-based Parallel FP-tree (TPFP-tree) and Balanced Tidset-based Parallel FP-tree (BTP-tree) are proposed. In the proposed method, a transaction identification set (Tidset) is used to provide direct access to transactions instead of a full database scan. In Moens et al. [19], two methods are proposed: Dist-Eclat and BigFIM. Dist-Eclat has three steps: finding the frequent items, generating frequent itemsets of size k and sub-tree mining. BigFIM also has three steps: generating frequent itemsets of size k, finding potential extensions and sub-tree mining. The Sequence-Growth algorithm is designed according to the concepts of the lexicographical sequence tree and the lazy mining pruning strategy and is implemented in the MapReduce framework for a distributed execution [20]. In Liang et al. [21], an algorithm for lexicographic frequent itemset generation is proposed, which claims to find the maximum information from a database regarding frequent itemsets and their respective frequencies in a single database scan.

Methods

The proposed method uses the Map-Reduce architectural structure for association rule mining. The first step in the proposed method is to convert the input data items to a Kavosh format and the second step is to extract rules with variable lengths. In this paper, the number of fields after the “if” conditions are applied is defined as the length of the rule. In the proposed method, data are converted into Kavosh format, which helps nodes execute their processes independently. This format is a unified format and other input data formats are converted to this format. The proposed unified format helps create <key, value> pairs for use in the Map-Reduce method.

• Converting input data items to Kavosh format

In this section, the Kavosh format is defined

Θ = {θ₁, θ₂, …, θ_n, ϯ}.

The input data table is denoted as Θ, the columns of the table as θ and the table key as ϯ

θ_k= {µ₁, µ₂, ..., µ_m}.

The distinct values of each column are denoted as µ. In this paper, it is assumed that all θ values are discrete and that a continuous θ must be converted to discrete values.

Based on the above definitions, the Kavosh format is defined as follows:

ϝ = {θ₁→ μ₁, θ₂→ μ₂,…, θ_n→ μ_m}

where ϝ is the Kavosh table and each θ_p → µ_q is equal to zero or one. Each input column value is converted to a Boolean column if there are more than two values for the column.

The input data (Θ) are divided into equal segments. For fragmentation, the function ψ is used, which is defined as follows:

Ψ (ϝ, ϯ_start, ϯ_finish, η_k)

where η_k is the Mapper node to which the data ϝ are allocated from key ϯ_start to key ϯ_finish. From the previous conversion steps, a Boolean matrix is created on each Mapper. Each field ϝ can be selected as a target field. If a binary position is assigned to each field ϝ (except target fields), each row can be converted into a decimal number. Therefore, a <Key, Value> pair is created in which the Key is the decimal number of each row (Rule-Key) and the Value is the decimal value of the target field. In the next step, the Kavosh triple is created, which is defined as follows:

<Rule-Key, Rule count, Target field count with the specified value>

According to the generated <Rule-Key, Target fields>, the count of each Rule-Key and target field with the specified value is calculated and the Kavosh triple set is created. To convert Mapper data into the Kavosh triple and Rule-Key aggregation, the function Ϯ is used.

Ϯ(η_k, ϼ_k)

where ϼ_k is the result of computation of the input data. After calculating ϼ_k for each Mapper, all ϼ_k are reduced to ϼ_Reducer by the function £.

£(ϼ₁, ϼ₂,…, ϼ_p, ϼ_Reducer, ϧ)

where ϧ is the Reducer node and p is the number of Mappers.

• Rule extraction

In this section, a Map-Key is added to the Rule-Key (ϻ). The Map-Key is used for rule generation, and its length in binary format is equal to the Rule-Key. Here, ϼ_Reducer is sent to the rule generation nodes. Each rule generation node has one or more Map-Keys. The function ϖ allocates ϼ_Reducer to each rule generation node.

where ι is the Map-Key, q is the number of Map-Keys allocated to a rule generation node, and Ϧ is the rule generation node.

ϖ (ϼ_Reducer, ι₁, ι₂,…, ι_q ,Ϧ_r)

Each Map-Key is concatenated to all Rule-Keys. If the length of the Rule-Key in binary format is equal to L, then there are 2^L − 1 Map-Keys. They start from zero to 2^L − 1 and are concatenated with the Rule-Keys to create various condition combinations of fields ϝ. The concatenation (ϒ) of the Map-Key with the result of the “logical and” (ϰ) of the Map-Key and the Rule-Key is called MHB_Key, and τ is the MHB_Key.

τ = ϒ (ι_i, ϰ (ϻ, ι_i))

There are two important factors for rule extraction in association rule mining: confidence and support. In this paper, Д (τ) is defined as the frequency of a specified MHB_Key (τ) and ζ (τ, σ, ρ) is defined as the frequency of a specific decimal value (ρ) that is equal to the target field (σ) for a specified MHB_Key (τ). In addition, ς is the total number of generated Kavosh triple sets. Thus, the support ( ) is calculated using the following formula:

Moreover, the confidence ( ) is calculated using the following formula:

After and have been calculated, they are compared with the defined thresholds for the association rules and the final results are produced. The rules are extracted using Д.

where α is the support threshold and β is the confidence threshold. Each rule generation node creates its final result separately because the results do not have common rules with other nodes. Figure 1 shows the Kavosh architecture. It consists of three layers: the Mapper layer, the Reducer layer and the rule generation layer.

For clearer illustration, pseudocodes are provided. Table 1 lists the functions used in the code.

Table 1 Pseudocode functions

Kavosh: an effective Map-Reduce-based association rule mining method

Abstract

Introduction

Related works

Methods

• Converting input data items to Kavosh format

• Rule extraction

Evaluation

TPC-DS

Execution time

Compression

Load balancing

Real traffic data of a mobile operator

Compression

Load balancing

Scalability

Conclusion

References

Authors’ contributions

Acknowledgements

Ethics approval and consent to participate and Consent for publication

Competing interests

Availability of data and materials

Funding

Publisher’s Note

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords