Chabok: a Map-Reduce based method to solve data warehouse problems

Barkhordari, Mohammadhossein; Niamanesh, Mahdi

doi:10.1186/s40537-018-0144-5

Research
Open access
Published: 26 October 2018


Mohammadhossein Barkhordari¹ &
Mahdi Niamanesh¹

Journal of Big Data volume 5, Article number: 40 (2018)


Abstract

Currently, immense quantities of data cannot be managed by traditional database management systems. Instead, they must be managed by big data solutions using shared nothing architectures. Data warehouse systems are systems that address very large amounts of information. The most prominent data warehouse model is the star schema, which consists of a fact table and some number of dimension tables. It is necessary to join the facts and dimensions for query executions on the data warehouse. In a shared nothing architecture, not all of the required information is placed on a single node, so it is necessary to retrieve information from other nodes, which causes network congestion and slow query execution. To avoid this problem and achieve maximum parallelism, dimensions can be replicated over nodes if they are not too large. However, if there are dimensions with data volumes greater than the capacity of a node, or dimensions whose combined data volume exceeds node capacity, query execution is confronted with serious problems. In big data problems, the amount of data is immense, and thus replicating it cannot be considered an appropriate method. In this paper, we propose a method called Chabok, which uses two-phased Map-Reduce to solve the data warehouse problem. In this method, aggregation is performed completely on Mappers, and intermediate results are sent to the Reducer. Chabok does not need data replication for join omission. The proposed method was implemented on Hadoop, and TPC-DS queries were executed for benchmarking. In terms of query execution time, Chabok outperformed prominent big data products for data warehousing.

Introduction

Existing information is a valuable asset for many different types of organizations. Storing and analysing information can solve many problems within an organization [1]. The results from data analyses help organizations make correct decisions and provide better services for customers. Thus, high speed storage and retrieval of large volumes of data generated by electronic devices and software systems are critical issues [2,3,4]. Many organizations consider big data solutions because they cannot manage their data with traditional database management systems [5]; therefore, they must seek drastic measures for the design and implementation of new systems according to big data architectures. These organizations must change their architectures from single-node to multi-node platforms. This transformation is not easy and requires a paradigm shift for data placement on different nodes [6,7,8].

Data warehousing is one of the first and most important systems to accomplish these sophisticated changes. Because they contain the historical data of an organization, data warehouses hold large amounts of information in comparison with other systems [9, 10]. Legacy data warehouses used a single node but, due to increasing data volume and the need for high-speed query processing, shared memory and shared disk architectures were adopted. In these architectures, hardware nodes use common memory for data storage. However, this type of architecture cannot solve the query speed problem because using common memory to store data forms a bottleneck. The only remaining architecture is the shared nothing architecture, in which information is divided across various nodes. One of the prominent methods used in shared nothing architectures is Map-Reduce [11]. In Map-Reduce, a map function is executed on each node, and then a reduce function on Reducer nodes collects the intermediate Mapper results and generates the final results.

Map-Reduce is useful for data warehouse problems. Information is allocated to each Mapper and a query is then executed. In the next phase, the Reducer aggregates the results of each Mapper and creates the final results. However, using shared nothing architecture creates a new problem: the absence of data required for processing, or in other words, each node requires other nodes to execute its query. This problem is called a data locality problem, and the need to wait for other node data also causes network congestion.

The main components of a data warehouse are facts, measures and dimensions. “Facts represent atomic information elements in a multi-dimensional database. A fact consists of quantifying values stored in measures and a qualifying context which is determined through (terminal) dimension levels. Each dimension level contains a set of instances or elements” [12]. “A distributive measure is a measure (i.e., function) that can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results in order to arrive at the measure’s value for the original (entire) data set” [13].
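The quoted definition of a distributive measure can be illustrated with a small sketch. The helper names (`partition`, `distributive_sum`, `distributive_max`) and the sample values are assumptions made for this example only; they do not come from the paper.

```python
def partition(values, parts):
    """Split `values` into at most `parts` contiguous chunks (illustrative helper)."""
    size = max(1, -(-len(values) // parts))  # ceiling division, at least 1
    return [values[i:i + size] for i in range(0, len(values), size)]


def distributive_sum(values, parts=3):
    # Compute the measure on each subset, then merge the partial results.
    partials = [sum(chunk) for chunk in partition(values, parts)]
    return sum(partials)


def distributive_max(values, parts=3):
    # MAX is also distributive: the max of partial maxima is the global max.
    partials = [max(chunk) for chunk in partition(values, parts)]
    return max(partials)


amounts = [120, 45, 300, 80, 15, 60, 210]  # made-up measure values
print(distributive_sum(amounts), sum(amounts))
print(distributive_max(amounts), max(amounts))
```

SUM and MAX behave this way because merging partial results gives the same answer as a single pass over all the data. This is the property that allows a measure to be computed independently on each node.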

The star schema data warehouse includes a fact table and some number of dimensions. The fact table has much larger data records than the dimension tables. To fragment and allocate data warehouse information over nodes, different methods have been proposed. Some methods try to accelerate query execution by putting some metadata in each node. These methods improve query execution time but the need to exchange data among nodes remains. Other types of methods try to replicate and collocate data in order to achieve node independence. However, in big data problems, these methods make the volume of big data balloon, which is unacceptable for already immense amounts of data.

In this paper, we propose a method called Chabok that not only solves the data locality problem completely but solves network congestion problems as well. In Chabok, a two-phased Map-Reduce method is used for data warehouse problems with big data. Chabok is used for star-schema data warehouses and can compute distributive measures. This method can also be applied to big dimensions, that is, dimensions whose data volume is greater than the capacity of a node.

Related works

In this section, we investigate related works that try to solve Map-Reduce problems related to the data warehouse. Hadoop++ [14] creates an index called Trojan. This method uses data collocation and co-partitioning to support the join operator, and join execution is done on the Mappers. Hail is another method with a shorter index length than Hadoop++. CoHadoop [15] intentionally collocates data on the nodes. Using this method, related data are placed together, and a data structure called the Locator is added to the HDFS (Hadoop file system). Using this method, a Map-side join without data shuffling is possible. Llama [16] uses a columnar file (CFile) and is implemented on the HDFS. Queries in this method are only extracted from related CFiles, and it is not necessary to scan all files. Osprey [17] fragments table data between nodes, and each fragment is allocated to a node. Queries are divided into sub-queries and executed simultaneously on each node. GridBatch [18] is the same as CoHadoop, but collocation occurs at the file system layer. Arvand [19] is a method that integrates multi-dimensional data sources into a big data analytic structure such as Hadoop. NoAM [20] is an abstract model for NoSQL databases that extracts commonalities of various NoSQL systems. In [21], a method is proposed that transfers legacy data warehouses to Hive [22]. In [23], data from legacy data warehouses are transferred to Hive by a rule-based method. In [24], three physical data warehouse designs were investigated to analyse the impact of attribute distribution among column-families in HBase based on OLAP query performance. The authors conclude that OLAP query performance in HBase can be improved by using a distinct set of attribute distributions among column-families. In [25], three types of transformation are covered. In the first method, dimensions and measures are directly transferred to NoSQL (one table for each fact and dimension). In the second method, one table is transferred, and fact and dimension information are merged in that table. The last method is similar to the second method but with one difference: it uses a column family instead of a simple attribute.

In addition to the columnar format, Cheetah [26] uses compression methods. RCFile [27] uses horizontal and vertical partitioning. First, the data are partitioned horizontally, and each section is partitioned vertically. CIF [28] is a binary columnar method that first divides data horizontally, creates a directory for each partition and then creates a subdirectory for each column. A metadata file keeps directory information. MRShare [29] divides a job into queries and makes it possible to reuse previous execution results if a query must be re-executed. ReStore [30] is a method that stores intermediate results for future calculations. In HadoopDB [31], a DBMS (database management system) is installed on each node. Hadoop manages coordination among nodes. Using this method, it is possible to use DBMS features for local nodes. SAM [32] is a method that creates communication between Mapper nodes to decrease the query execution time. ScaDiPasi [33] uses a unified data format to create a data warehouse on patient information. Clydesdale [34] is a method used for structured data that uses a star schema model to improve query execution. AQUA [35] is a query optimizer that manages intermediate results. AQUA uses a two-phased method to execute queries. In the first phase, queries are divided into groups that can be executed together, and the results of each group are combined in the second phase. YSmart [36] is a method for converting SQL (Structured Query Language) into Map-Reduce jobs in which related data are processed in a job. By using this method, the total number of jobs is decreased. In the Rope [37] method, job optimization is done by gathering statistical data about the job. These data are applied on the same job or similar jobs. BlinkDB [38] is designed for interactive query processing on immense data. The Flink [39] method is designed primarily for stream processing. It generalizes batch processing using the Dataset API. Flink supports various concepts in time-based windows such as event-based processing, time-based processing, and row count-based processing. Aras [40], Atrak [41] and Hengam [42] use data unification and in-memory databases to achieve higher performance on data warehouse query execution.

Some methods use in-memory techniques to decrease the Map-Reduce job execution time. PowerDrill [43], Shark [44], Spark and M3R [45] are prominent methods. PowerDrill is a column-based method. Shark and Spark [46] use in-memory data sets called RDDs. In the case of RDD corruption, each RDD can rebuild itself by using existing data from a previous RDD. The M3R method improves Hadoop performance by omitting portions such as the heartbeat or the Job Tracker.

All these methods attempted to provide data locality, but none can claim to provide data locality completely. Each method tries to support data locality by changing different parts of the Map-Reduce method and improving data retrieval time. The method proposed in this paper has the following advantages in comparison with existing methods:

Complete data locality
Network congestion omission
Data replication and collocation omission.

Methods

Problem definition

In this section, we investigate the proposed method, which uses Map-Reduce to solve the data warehouse problem. As explained in the Related works section, many methods have been proposed to solve the data warehouse problem for big data. However, the biggest problem in all of them is data locality, that is, the absence of the data required for processing on the node that performs the processing. The proposed method solves the data locality problem, which is its main advantage over other prominent methods. It uses data locality to decrease query execution time, decrease network congestion and make full use of node processing power. This method can be used for data warehouses with a star-schema model and distributive measures.

Chabok

In this section, we describe Chabok, our proposed method to solve the data warehouse problem with big data. The proposed method is useful for star-schema data warehouses. It can execute distributive measure functions on the data warehouse. First, data from the star schema must be transferred to the Chabok architecture. The Chabok method uses a two-phase Map-Reduce for distributed data warehouses. Figure 1 depicts the Chabok architecture.

The following flow charts show the data transformation process for the fact table and the dimensions.

In the transformation shown in Fig. 2, fact fragmentation is horizontal. If the nodes are homogeneous, the same number of records is allocated to each FactMapper node. In this paper, FactMapper and DimensionMapper nodes are homogeneous.

In the transformation shown in Fig. 3, dimension information is fragmented across one or more nodes according to its volume. Some dimensions may have small volumes, and thus it is possible to store more than one dimension on a node. This part of the architecture solves the big dimension problem that exists when large dimensions cannot be allocated to one node.
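As a rough illustration of allocating dimensions by volume, the sketch below splits a dimension that exceeds node capacity across several nodes and packs small dimensions together. The first-fit placement rule, the function name `allocate_dimensions`, and the sample volumes are all assumptions for this example; the paper does not specify a placement heuristic here.

```python
def allocate_dimensions(volumes, node_capacity):
    """Assign dimension volumes to nodes.

    volumes: {dimension_name: volume}, in the same unit as node_capacity.
    Returns a list of nodes, each a list of (dimension_name, volume_on_node).
    The first-fit rule is an assumption, not the paper's algorithm.
    """
    nodes = []  # each entry: {"free": remaining capacity, "parts": [...]}
    # Place the largest dimensions first.
    for dim, volume in sorted(volumes.items(), key=lambda kv: -kv[1]):
        remaining = volume
        # A big dimension fills whole nodes until the rest fits on one node.
        while remaining > node_capacity:
            nodes.append({"free": 0, "parts": [(dim, node_capacity)]})
            remaining -= node_capacity
        # Put the remainder on the first node with enough free space.
        for node in nodes:
            if node["free"] >= remaining:
                node["free"] -= remaining
                node["parts"].append((dim, remaining))
                break
        else:
            nodes.append({"free": node_capacity - remaining,
                          "parts": [(dim, remaining)]})
    return [node["parts"] for node in nodes]


# Hypothetical volumes: one big dimension and two small ones.
print(allocate_dimensions({"customer": 250, "terminal": 30, "date": 10}, 100))
```

In this made-up example the large dimension spans three nodes, and the two small dimensions share the last node with its final fragment.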

The proposed method uses two-phased Map-Reduce to solve the data warehouse problem. The first phase operates on Fact table data and the second phase operates on dimension data. The first Map-Reduce executes distributive measure functions on Mapper data, and the results are aggregated on the Reducer node. If conditions on Fact data are required, they are applied on the Mapper. If conditions on the results of distributive measure functions are required, they are applied on the Reducer. To execute an input query on the Chabok architecture nodes, a query language is needed. This intermediate language specifies on which layer each user-defined condition must be applied. We call this query language MHBQL. MHBQL consists of five parts:

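Selected dimensions

$$\{\text{Dimension}_{1}.\text{Attribute}_{1}, \text{Dimension}_{2}.\text{Attribute}_{1}, \ldots, \text{Dimension}_{n}.\text{Attribute}_{m}\}$$

Distributive measures

$$\{[\text{Distributive measure}_{1}, \text{measure}_{1}], [\text{Distributive measure}_{1}, \text{measure}_{2}], \ldots, [\text{Distributive measure}_{n}, \text{measure}_{m}]\}$$

Conditions on dimensions

$$\{[\text{Dimension}_{1}.\text{Attribute}_{1}, \text{operator}, \text{value}], [\text{Dimension}_{2}.\text{Attribute}_{1}, \text{operator}, \text{value}], \ldots, [\text{Dimension}_{n}.\text{Attribute}_{m}, \text{operator}, \text{value}]\}$$

Conditions on measures

$$\{[\text{measure}_{1}, \text{operator}_{1}, \text{value}_{1}], [\text{measure}_{2}, \text{operator}_{2}, \text{value}_{2}], \ldots, [\text{measure}_{n}, \text{operator}_{n}, \text{value}_{n}]\}$$

Conditions on distributive measures

$$\{[\text{Distributive measure}_{1}(\text{measure}_{1}), \text{operator}_{1}, \text{value}_{1}], [\text{Distributive measure}_{2}(\text{measure}_{2}), \text{operator}_{2}, \text{value}_{2}], \ldots, [\text{Distributive measure}_{n}(\text{measure}_{n}), \text{operator}_{n}, \text{value}_{n}]\}$$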

For the And operator, “^” is used. For the Or operator, “|” is used. For priority, “/” and “\” are used.
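One way to picture the five parts of an MHBQL query is as a small record type. The sketch below is an assumption-laden illustration: the class name `MHBQLQuery`, its field names, and the sample dimension and measure names (`Terminal.City`, `amount`) are hypothetical and not taken from the paper. It does not model the concrete MHBQL syntax or the “^”, “|”, “/” and “\” operators.

```python
from dataclasses import dataclass, field


@dataclass
class MHBQLQuery:
    # Part 1: selected dimension attributes, e.g. "Dimension.Attribute".
    selected_dimensions: list
    # Part 2: (distributive measure, measure) pairs, e.g. ("SUM", "amount").
    distributive_measures: list
    # Part 3: conditions on dimensions as (attribute, operator, value).
    dimension_conditions: list = field(default_factory=list)
    # Part 4: conditions on measures as (measure, operator, value).
    measure_conditions: list = field(default_factory=list)
    # Part 5: conditions on distributive measures as
    # (distributive measure, measure, operator, value).
    distributive_conditions: list = field(default_factory=list)

    def fact_mapper_parts(self):
        """Parts 1 to 4, which the text assigns to the FactMappers."""
        return (self.selected_dimensions, self.distributive_measures,
                self.dimension_conditions, self.measure_conditions)

    def fact_reducer_parts(self):
        """Part 5, which the text assigns to the FactReducer."""
        return self.distributive_conditions


# Hypothetical query: total amount per terminal city, for one city,
# counting only transactions above 10, keeping totals above 1000.
query = MHBQLQuery(
    selected_dimensions=["Terminal.City"],
    distributive_measures=[("SUM", "amount")],
    dimension_conditions=[("Terminal.City", "=", "North")],
    measure_conditions=[("amount", ">", 10)],
    distributive_conditions=[("SUM", "amount", ">", 1000)],
)
print(query.fact_mapper_parts())
print(query.fact_reducer_parts())
```

Separating parts 1 to 4 from part 5 mirrors the layering in the text: filtering on raw measures can happen before aggregation, while a condition on an aggregate can only be evaluated once all partial results have been merged.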

The first four parts of MHBQL are used by the FactMappers and the fifth part is used by the FactReducer. The following flow chart shows the query execution process.

The following code shows the Map (FactMapper) function.

The following code shows the Reduce (FactReducer) function.

The second-phase Map-Reduce retrieves dimension data. In the Mapper phase, each key is sent to its related Dimension node, and the requested data from each dimension are retrieved by a join function. The following code shows the DimensionMapper function.

In the DimensionReducer phase, the results from each DimensionMapper are placed together to generate the final results. The following code shows the DimensionReducer function.

Formal definitions

In this section, the notations used in this paper are defined. Ω denotes the set of dimensions, and each member of this set is denoted by ω.

$$\varOmega = \{\upomega_{ 1},\upomega_{ 2}, \ldots ,\upomega_{\text{k}} \}$$

Each dimension has a Key and some attributes, which are shown by ψ_Key and ψ_i, respectively.

$$\upomega_{\text{m }} = \left\{ { \,\uppsi_{Key} , \,\uppsi_{ 1} , \,\uppsi_{ 2} , \, \ldots , \,\uppsi_{\text{n}} } \right\}$$

The measure set is shown by Θ, and each member is shown by θ_i.

$$\varTheta = \left\{ \uptheta_{1}, \uptheta_{2}, \ldots, \uptheta_{\text{p}} \right\}$$

Distributive measures are shown by Ζ, and each member is shown by ζ_i.

$${\rm Z} = \{\upzeta_{1}, \upzeta_{2}, \ldots, \upzeta_{\text{q}}\}$$

A Fact table is defined as including measures (θ_i), dimension keys (ω_i → ψ_Key) and a fact table key (ξ).

ϝ = {θ₁, θ₂, …, θ_r, ω₁ → ψ_Key, ω₂ → ψ_Key, …, ω_s → ψ_Key, ξ}

We define Λ as the operators set, α as a set with numeric and string values and β as a set with numeric values only.

$$\begin{aligned} \varLambda \, = & \, \{ = , \, > , \, < , \, \le , \, \ge , \ne \} \\\upalpha = & \, \left\{ {\text{string and numeric values}} \right\} \\\upbeta = & \, \left\{ {\text{numeric values}} \right\} \\ \end{aligned}$$

In the proposed method, there are two Map-Reduce phases: Fact Map-Reduce and Dimensions Map-Reduce. FactMappers and FactReducer are defined as μ and η, respectively.

$${\text{M }} = \, \left\{ {\upmu_{ 1} , \,\upmu_{ 2} , \ldots , \,\upmu_{\text{u}} ,\upeta} \right\}$$

DimensionMappers and DimensionReducer are defined as δ and γ, respectively.

$${\text{N }} = \, \{\updelta_{ 1} , \,\updelta_{ 2} , \ldots , \,\updelta_{\text{v}} ,\upgamma\}$$

The fragmentation and allocation function is applied to the fact table, which allocates Fact rows to FactMappers (μ) according to the Fact key (ξ).

Ψ(ϝ, ξ_Start, ξ_End, μ_w)

The fragmentation and allocation function is applied to the dimension table, which allocates Dimension rows to DimensionMappers (δ) according to the Dimension key (ψ).

$$\varPhi (\upomega_{\text{x}}, \updelta_{\text{y}}, \uppsi_{StartKey}, \uppsi_{EndKey})$$
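The two allocation functions take a start key, an end key and a target node. The sketch below shows one way such key ranges might be derived for homogeneous nodes. It assumes consecutive integer keys starting at 1 and an even split; the function name `fragment_by_key_range` is hypothetical.

```python
def fragment_by_key_range(num_rows, num_nodes):
    """Return (start_key, end_key, node_index) triples covering keys 1..num_rows.

    Assumes consecutive integer keys and homogeneous nodes, so every node
    receives the same number of rows (plus or minus one).
    """
    base, extra = divmod(num_rows, num_nodes)
    ranges, start = [], 1
    for node in range(num_nodes):
        size = base + (1 if node < extra else 0)
        if size == 0:
            continue  # more nodes than rows: leave this node empty
        ranges.append((start, start + size - 1, node))
        start += size
    return ranges


# Each triple plays the role of (start key, end key, mapper) in the
# allocation functions above.
for start_key, end_key, node in fragment_by_key_range(10, 3):
    print(f"keys {start_key}..{end_key} -> mapper {node}")
```

The same range-based split could be applied to a dimension table with its own key, which is how a dimension too large for one node would end up spread over several DimensionMappers.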

The user input query (Π) includes selected dimensions and distributive measures. The input query can have conditions on dimensions (Ϣ), measures (Γ) and distributive measures (ϛ).

Π(Ω, Ζ(Θ), Ϣ_{(Λ, α)}(Ω), Γ_{(Λ, β)}(Θ), ϛ_{(Λ, β)}(Ζ(Θ)))

FactMapper

The map function of FactMapper uses the first four parts of MHBQL, which are translated into a query that the FactMapper can execute. The map function executes on each FactMapper. Because the input query on all FactMappers is the same, the results of the Mappers have the same format. The FactMapper function is shown as follows.

Ϡ (Ζ(Θ), Ϣ_{(Λ, α)}(Ω), Γ_{(Λ, β)}(Θ), μ_i)

This function has four parameters: the distributive measures, the conditions on dimensions (applied to the dimension fields of the fact table), the conditions on measures, and a final parameter that specifies the FactMapper. Distributive measures are applied to fact measures. Conditions on measures are also applied directly to fact measures. To apply conditions to dimensions, it is necessary to first send the conditions to each related dimension; the returned dimension keys are then applied to the dimension fields of the fact table as conditions.

In the Chabok architecture, the mapping between the dimension fields of the fact table and the dimension tables is stored in the MetaDimension table. This table also records the physical node that stores each dimension's data. The MetaDimension is used to translate an MHBQL query into a query that FactMapper nodes can execute.
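A minimal sketch of a FactMapper-style map step follows, written in plain Python rather than against the Hadoop API. It assumes that fact rows are dictionaries, that dimension conditions have already been resolved into sets of allowed dimension keys, and that only SUM, COUNT, MIN and MAX are needed. The names `fact_map`, `OPS` and `COMBINE` and the sample rows are hypothetical.

```python
import operator

# The comparison operators of the set Λ, written as Python functions.
OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt,
       "<=": operator.le, ">=": operator.ge, "!=": operator.ne}

# How two partial values of a distributive measure are merged.
COMBINE = {"SUM": operator.add, "COUNT": operator.add, "MIN": min, "MAX": max}


def fact_map(rows, group_keys, measures, measure_conds=(), allowed_keys=None):
    """Filter local fact rows and aggregate them per dimension-key tuple.

    rows:          iterable of dicts (the fact fragment on this mapper)
    group_keys:    fact fields holding the dimension keys to group by
    measures:      [(function, measure)], e.g. [("SUM", "amount")]
    measure_conds: [(measure, operator, value)]
    allowed_keys:  {dimension key field: set of keys returned by the dimension}
    """
    allowed_keys = allowed_keys or {}
    medatum = {}
    for row in rows:
        # Conditions on measures are applied directly to the fact row.
        if not all(OPS[op](row[m], v) for m, op, v in measure_conds):
            continue
        # Conditions on dimensions arrive as sets of permitted keys.
        if not all(row[f] in keys for f, keys in allowed_keys.items()):
            continue
        key = tuple(row[k] for k in group_keys)
        values = [1 if fn == "COUNT" else row[m] for fn, m in measures]
        if key in medatum:
            values = [COMBINE[fn](old, new)
                      for (fn, _), old, new in zip(measures, medatum[key], values)]
        medatum[key] = values
    return medatum


rows = [
    {"terminal_key": 1, "amount": 100},
    {"terminal_key": 1, "amount": 50},
    {"terminal_key": 2, "amount": 500},
    {"terminal_key": 3, "amount": 5},
]
print(fact_map(rows, ["terminal_key"], [("SUM", "amount"), ("COUNT", "amount")],
               measure_conds=[("amount", ">", 10)],
               allowed_keys={"terminal_key": {1, 2}}))
```

Because every mapper runs the same query, each returned dictionary has the same shape, so a later reduce step can merge them key by key.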

FactReducer

In this section, the intermediate results that are produced by FactMappers are used as input for a FactReducer. The FactReducer aggregates the first phase intermediate results and produces the second phase intermediate results, which are made of dimension keys and aggregated measures.

To achieve higher computational speed, Chabok stores intermediate results in RAM. We call these distributed, intermediate, in-memory data sets Medatums. Each FactMapper creates a Medatum, and these Medatums are sent to the FactReducer. The FactReducer aggregates the Medatums and creates a new Medatum. The FactReducer function (Ϯ) is shown below.

Ϯ(Ζ(Θ_η), ϛ_{(Λ, β)}(Ζ(Θ_η)), η)

The FactReducer function has three parameters: the distributive measures, the conditions on the resulting distributive measures and the FactReducer node. Distributive measures are applied to the FactMapper Medatums, and the final results are generated. If the fifth part of the input MHBQL contains conditions, they are applied at this stage to the results of the distributive measures.
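The reduce step described above can be sketched as a merge of per-mapper dictionaries followed by a filter. This is an illustration under assumptions: Medatums are modelled as plain dictionaries from dimension-key tuples to lists of partial values, and the names `fact_reduce`, `OPS` and `COMBINE` are hypothetical.

```python
import operator

OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt,
       "<=": operator.le, ">=": operator.ge, "!=": operator.ne}

# Merging rule for each distributive measure (partial counts are summed).
COMBINE = {"SUM": operator.add, "COUNT": operator.add, "MIN": min, "MAX": max}


def fact_reduce(medatums, measures, distributive_conds=()):
    """Merge mapper Medatums, then apply conditions on the aggregated values.

    medatums:           list of {key tuple: [partial values]} from the mappers
    measures:           [(function, measure)] in the order of the value lists
    distributive_conds: [(function, measure, operator, value)] (MHBQL part 5)
    """
    merged = {}
    for medatum in medatums:
        for key, values in medatum.items():
            if key in merged:
                values = [COMBINE[fn](old, new)
                          for (fn, _), old, new in zip(measures, merged[key], values)]
            merged[key] = list(values)
    # Conditions on distributive measures need the fully merged values.
    position = {fm: i for i, fm in enumerate(measures)}
    return {
        key: values for key, values in merged.items()
        if all(OPS[op](values[position[(fn, m)]], v)
               for fn, m, op, v in distributive_conds)
    }


mapper_1 = {(1,): [150, 2], (2,): [500, 1]}
mapper_2 = {(1,): [50, 1], (3,): [20, 1]}
print(fact_reduce([mapper_1, mapper_2],
                  [("SUM", "amount"), ("COUNT", "amount")],
                  distributive_conds=[("SUM", "amount", ">", 100)]))
```

The filter has to run after the merge because a key whose partial sum is small on every mapper may still pass the condition once the partial sums are added.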

DimensionMapper

As mentioned before, the Medatum generated by the FactReducer consists of dimension keys and the results of applying distributive measures to Fact measures. In the DimensionMapper phase, each dimension key is sent to its related DimensionMapper according to the MetaDimension information. Then, according to the first part of MHBQL, the other necessary data are extracted from each dimension. The operation is shown as follows.

⋈((ω_i → ψ_Key)_η, δ₁, δ₂,…, δ_v)

In the proposed method, information retrieval from dimension nodes can be achieved simultaneously. This can reduce data retrieval time for big dimensions that are distributed among multiple nodes.
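The routing and simultaneous retrieval described in this section can be sketched with a thread pool standing in for the separate dimension nodes. Everything concrete here is assumed for illustration: the layout of `META_DIMENSION`, the in-process `NODE_DATA` store, the node names and the attribute values are made up, and a real deployment would issue remote requests instead.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical MetaDimension: dimension -> [(start_key, end_key, node_id)].
META_DIMENSION = {"terminal": [(1, 100, "node_a"), (101, 200, "node_b")]}

# Hypothetical per-node storage: node_id -> {dimension key: attributes}.
NODE_DATA = {
    "node_a": {1: {"city": "North"}, 2: {"city": "South"}},
    "node_b": {150: {"city": "East"}},
}


def route_keys(dimension, keys):
    """Group dimension keys by the node whose key range contains them."""
    routed = {}
    for key in keys:
        for start, end, node in META_DIMENSION[dimension]:
            if start <= key <= end:
                routed.setdefault(node, []).append(key)
                break
    return routed


def dimension_map(node, keys, attributes):
    """Look up the requested attributes for the keys stored on one node."""
    store = NODE_DATA[node]
    return {k: {a: store[k][a] for a in attributes} for k in keys if k in store}


def retrieve(dimension, keys, attributes):
    """Query all relevant dimension nodes concurrently and merge the answers."""
    routed = route_keys(dimension, keys)
    result = {}
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(dimension_map, node, node_keys, attributes)
                   for node, node_keys in routed.items()]
        for future in futures:
            result.update(future.result())
    return result


print(retrieve("terminal", [1, 150, 2], ["city"]))
```

Since each node is only asked for keys in its own range, the lookups are independent and can run at the same time. This is the source of the speed-up for a dimension spread over several nodes.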

DimensionReducer

In this part, the Medatums generated by the DimensionMappers are combined with the FactReducer Medatum on dimension keys to produce the final results. The FactReducer Medatum contains items that are defined in the first and second parts of the input MHBQL. The DimensionReducer operation is shown as follows.

$$(\varTheta_{\upeta} ,\varOmega_{\upeta} ,\gamma )$$
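The final combination step can be sketched as a key-based join between the aggregated fact values and the retrieved dimension attributes. This is a hedged illustration: the function name `dimension_reduce`, the dictionary-based Medatum layout and the sample data are assumptions.

```python
def dimension_reduce(fact_medatum, group_dimensions, dimension_medatums):
    """Join aggregated measures with dimension attributes on dimension keys.

    fact_medatum:       {(key_1, key_2, ...): [aggregated measures]}
    group_dimensions:   dimension names, in the same order as the key tuple
    dimension_medatums: {dimension name: {key: {attribute: value}}}
    """
    rows = []
    for key_tuple, measures in fact_medatum.items():
        row = {}
        for dimension, key in zip(group_dimensions, key_tuple):
            # Replace each dimension key with the attributes it refers to.
            for attribute, value in dimension_medatums[dimension][key].items():
                row[f"{dimension}.{attribute}"] = value
        row["measures"] = list(measures)
        rows.append(row)
    return rows


fact_medatum = {(1,): [200, 3], (150,): [500, 1]}          # made-up values
dimension_medatums = {"terminal": {1: {"city": "North"},
                                   150: {"city": "East"}}}
for row in dimension_reduce(fact_medatum, ["terminal"], dimension_medatums):
    print(row)
```

Only the already aggregated keys are joined here, so in this sketch the join works on far fewer rows than a join against the raw fact table would.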

Replication

To achieve replication in the proposed method, FactMappers and DimensionMappers must be replicated. In other words, if a Replica-Factor of three is required, it is necessary to copy each FactMapper and DimensionMapper node onto two other nodes. In the MetaDimension table, replication nodes for each FactMapper and DimensionMapper are determined. It is necessary to note that this replication is for data backup only and not for performance issues. The Replica-Factor can be set to one if data redundancy is not necessary.
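As a small illustration of the Replica-Factor, the sketch below assigns each node's data to the next nodes in a ring. The ring placement and the function name `replica_placement` are assumptions for this example; the text only states that replica nodes are recorded in the MetaDimension table, not how they are chosen.

```python
def replica_placement(nodes, replica_factor):
    """Map each primary node to the nodes that hold copies of its data.

    With a Replica-Factor of r, each node is copied onto r - 1 other nodes.
    Choosing the next nodes in a ring is an assumption made for this sketch.
    """
    if not 1 <= replica_factor <= len(nodes):
        raise ValueError("replica_factor must be between 1 and the node count")
    return {
        node: [nodes[(i + j) % len(nodes)] for j in range(1, replica_factor)]
        for i, node in enumerate(nodes)
    }


# Replica-Factor 3: every mapper is copied onto two other nodes.
print(replica_placement(["n1", "n2", "n3", "n4"], 3))
# Replica-Factor 1: no redundancy, so every replica list is empty.
print(replica_placement(["n1", "n2", "n3", "n4"], 1))
```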

Case study

For example, consider a bank data warehouse that has a star-schema model and is built on the transactions of an EFT switch (see Footnote 1). The properties of the data warehouse are listed in Table 1.

Table 1 Properties of the data warehouse
