Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems

Costa, Eduarda; Costa, Carlos; Santos, Maribel Yasmina

doi:10.1186/s40537-019-0196-1

Research
Open access
Published: 06 May 2019

Journal of Big Data volume 6, Article number: 34 (2019)

Abstract

Hive has long been one of the industry-leading systems for Data Warehousing in Big Data contexts, mainly organizing data into databases, tables, partitions and buckets, stored on top of an unstructured distributed file system like HDFS. Some studies have examined how to optimize the performance of several storage systems for Big Data Warehousing. However, few of them explore the impact of data organization strategies on query performance when using Hive as the storage technology for implementing Big Data Warehousing systems. Therefore, this paper evaluates the impact of data partitioning and bucketing in Hive-based systems, testing different data organization strategies and verifying the efficiency of those strategies in query performance. The obtained results demonstrate the advantages of implementing Big Data Warehouses based on denormalized models and the potential benefit of using adequate partitioning strategies. Defining the partitions aligned with the attributes that are frequently used in the conditions/filters of the queries can significantly increase the efficiency of the system in terms of response time. In the more intensive workload benchmarked in this paper, overall decreases of about 40% in processing time were verified. The same is not verified with the use of bucketing strategies, which show potential benefits only in very specific scenarios, suggesting a more restricted use of this functionality, namely in the context of bucketing two tables by the join attribute of these tables.

Introduction

One of the fundamental reasons for the notoriety of the Big Data phenomenon is the current extent to which information can be generated and made available [11], mainly due to the constant innovation, transformation, globalization and personalization of the services associated with new business models. Many definitions of the Big Data concept exist, mainly aligned with the consensus that Big Data can be defined as large amounts of data, flowing at different velocities, with varying degrees of complexity, without structure and/or organization, which cannot be processed or analyzed using traditional processes or tools [11, 18, 23, 36].

One of the most popular approaches for managing large-scale datasets in a structured way is by the use of a Data Warehouse (DW), a repository with analytical purposes that is mainly responsible for integrating and storing data coming from operational systems, and that is widely considered as a fundamental enterprise asset to support decision-making. However, data volume is nowadays a major challenge for the DW, taking into consideration its traditional supporting technologies. Moreover, current data types and formats are also a major problem, since they challenge the fundamentals of DW processing, as these cannot be applied to free text, images, videos or sensor data [18]. Due to this current conceptual, technological and organizational context, the design and implementation of Big Data Warehouses (BDWs) is becoming an important area of study [6, 7, 13, 18, 20]. These repositories substantially differ from traditional DWs, since they must be based on new logical models, more flexible than the relational ones, and new technologies with higher levels of performance, scalability and fault-tolerance [14, 23].

Hadoop, an open source ecosystem for reliable, scalable and distributed computing [1], emerged as a solution to address Big Data processing on low-cost platforms, providing the computational resources to handle these large amounts of data [18]. Moreover, Hive, which is built on top of Hadoop, emerged as a system to store, query and manage large data volumes stored in distributed environments. Since its appearance, research in the area of Big Data Warehousing has been intensified, with developments aiming to bring the well-known concepts from relational databases, such as declarative query languages, tables and columns, into the unstructured environment of Hadoop. These characteristics, along with the metastore concept, i.e., the system catalog with the metadata information, contributed to the classification of Hive as a DW repository for Big Data [24]. In this sense, Hive is a distributed DW system that manages the data stored in HDFS (Hadoop Distributed File System) and provides a SQL-like language (HiveQL) for querying the data [3, 26]. For data storage, Hive has four main components for organizing data: databases, tables, partitions and buckets. Partitions and buckets can theoretically improve query performance, as tables are split by the defined partitions and/or buckets, distributing the data into smaller and more manageable parts [27].

This is a recent area of research with little related work on how data should be organized in Hive, or on the impact of that organization on query performance. Several open issues need further exploration by the scientific community, which is why the fundamental research questions of this work are expressed as follows: Are there any significant advantages in using partitions and/or buckets in Hive-based BDWs? Do these organization strategies have any impact on the efficiency of online analytical processing (OLAP) queries? What factors may influence the definition of an appropriate data organization strategy?

Given this context, the main motivation of this work is to verify to what extent the way in which data is modelled and organized influences the query processing time of BDWs. Partitioning and bucketing strategies can be used when building BDWs, but they can be neglected by practitioners or, sometimes, used in an ad hoc manner. The insights from this paper can be used to improve the knowledge base regarding the guidelines for creating partitions and buckets, a topic that we consider to be frequently unknown or subjective for (Big) Data Warehousing practitioners. To address this main concern, this study aims to understand the impact of different data organization strategies on the query processing time of BDWs, extending the preliminary work and results addressed in [10], specifically focusing on the following aspects: (i) the impact of defining partitions and buckets in Hive, either individually or in combination; and (ii) how the data processing workloads are affected regarding query processing time, as the volume of data that needs to be manipulated in a specific query can be significantly reduced with the adoption of an appropriate distribution of the data. As the implementation of BDWs is a significantly recent area of research, almost no guidelines are available regarding the way these repositories can be organized to increase the overall performance of the system. Consequently, after the presentation, evaluation and discussion of the results, this paper summarizes a set of good practices for the modelling and organization of data in Hive-based BDWs.

This paper is structured as follows: the “Related work” section presents the related contributions on this topic. The “Methods/experimental” section describes the technological infrastructure, the dataset and the test scenarios used in this research process. The “Results” section describes the obtained results, highlighting the performed benchmarks and the needed resources, both in terms of processing time and central processing unit (CPU) usage. The “Discussion” section discusses the obtained results and the “Conclusions” section presents the main conclusions, pointing out the usefulness and applicability of the several strategies for organizing BDWs.

Related work

Data models have been key components in Business Intelligence and Analytics (BI&A) systems, ensuring that the analytical needs of the business are properly integrated and considered, allowing data analysis through different perspectives [27, 28]. In a traditional BI&A context, dimensional data models are the most popular ones [17], including star schemas for the different considered business processes. Although very useful, these logical models are not usually appropriate for Big Data contexts, requiring the adoption of new logical constructs that address the characteristics of NoSQL databases and the associated technologies available in the Hadoop environment [14]. In the work of [6, 25], the authors highlight that the design of a BDW should focus not only on the physical layer (the technological infrastructure), but also on a logical layer, giving an overall perspective on the data models, the logical components and how the data flows throughout the components. According to [21], the design methodology of a BDW should be highly agile and iterative, integrating as many data sources as possible (either internal or external to the organization), and may or may not use a rigid data model, aiming for a fast understanding and perception of the data.

Currently, SQL-on-Hadoop systems are popular solutions for querying data available in a Hadoop cluster, including Hive, Presto, Spark SQL, Drill and Impala. Due to their popularity, several benchmarks compare their performance, for instance those available in [9, 29]. However, SQL-on-Hadoop benchmarks do not usually consider the impact of the data models, mostly addressing how fast these systems are under different workloads.

In the context of a BDW, and taking into consideration that Hive is the main Data Warehousing solution in Hadoop, supporting queries in HiveQL, it is important to understand how the way data is stored and organized in this system affects the performance of the solution. As previously mentioned, this system supports three types of data structures, namely tables, partitions and buckets [12, 31], which are contained in databases. The concept of tables in Hive is similar to the concept of tables in relational databases (common structures with columns and rows), and each table corresponds to an HDFS directory. A Hive table can have one or more partitions that define the distribution of the data within subdirectories of the table’s directory, splitting the data horizontally and speeding up query processing. The buckets correspond to file segments in HDFS and can only be applied to a single attribute. These structures help to organize data in each table/partition by dividing it into several files. To identify the segment to which a data record must be assigned, a hash function is applied to the bucketing column. Consequently, bucketing is a technique for grouping data vertically, segmenting data records by a given attribute. Each bucket is stored as a file within the table’s directory or the partitions’ directories [12, 15, 27, 31].
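
The layout described above can be sketched in a few lines of Python. This is only an illustrative model, not Hive's implementation: CRC32 stands in for Hive's own hash function, the `bucket_NNNNN` file name is hypothetical, and the table path, column names and bucket count are assumptions chosen for the example.

```python
import zlib

def partition_dir(table_dir, partition_values):
    """Build the subdirectory of a record from (column, value) pairs, e.g. od_year=1997."""
    parts = [f"{col}={val}" for col, val in partition_values]
    return "/".join([table_dir] + parts)

def bucket_index(value, num_buckets):
    """Map a bucketing-column value to a bucket number.

    CRC32 is a stand-in (assumption); Hive applies its own hash function.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

def record_location(table_dir, record, partition_cols, bucket_col, num_buckets):
    """Return the (hypothetical) file a record would be written to."""
    directory = partition_dir(table_dir, [(c, record[c]) for c in partition_cols])
    bucket = bucket_index(record[bucket_col], num_buckets)
    return f"{directory}/bucket_{bucket:05d}"

# Illustrative record; the column names are assumptions for this sketch.
record = {"od_year": 1997, "s_region": "ASIA", "c_custkey": 4711}
location = record_location(
    "/warehouse/lineorder", record, ["od_year", "s_region"], "c_custkey", 8
)
print(location)
```

The partition columns decide the directory, while the hash of the bucketing column decides the file inside that directory, which is why the two mechanisms can be combined.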

Regarding data modelling, an evaluation of different data modelling and organization strategies for Hive-based DWs is described in [9], showing the benefits of implementing a BDW based on a fully denormalized table when compared with a dimensional structure (star schema). Moreover, [4, 5, 35] analyzed the implementation of BDWs based on NoSQL databases. While [4] studied the implementation of a DW based on a document-oriented NoSQL database and [5] explored implementations of DWs on top of column-oriented NoSQL databases, [35] proposed a transformation process for moving from a dimensional DW into a column-oriented and document-oriented NoSQL data model.

Regarding the data organization strategies, the creation of partitions and buckets in Hive has already been addressed in the literature. Kumar [19] presented a brief performance analysis and comparison of MySQL partitions, Hive partitioning/bucketing and Apache Pig, highlighting Hive’s advantages when partitioning and bucketing techniques are used. According to [30], Hive partitioning can be used for improving the performance of a very specific set of queries, as long as the partitions are aligned with the attributes used in the queries’ filters. Moreover, in [27], it is recommended that the attribute, or attributes, used for partitioning have low cardinality, avoiding the creation of a very large number of subdirectories, which would overload HDFS. Furthermore, according to [2], partitioning can improve query performance in large datasets when, as already mentioned, the partition scheme considers the attributes used in the queries’ filters. These benefits were also shown in [9], presenting the advantages of creating data partitions using two different data organization strategies (star schemas and fully denormalized tables).

Partitioning requires the use of an attribute that does not create a large number of small partitions, avoiding a large number of small files that typically slow down the processing time of Hadoop [30], while bucketing clusters large data sets into more manageable parts, corresponding to file segments in HDFS [2]. This means that bucketing is an ideal technique for sampling and joining tables more efficiently. According to [27], buckets help to organize the data in each partition, distributing the data into several segments, which is useful for attributes with high cardinality. The work of [30] highlights other useful considerations for using bucketing in Hive, namely: it is useful for fact tables in a star schema; map-side joins can be more efficient if the joining attribute is bucketed; the bucket file size should be at least 1 GB; the number of buckets cannot be changed after the creation of the table; and processing times can also be improved by combining bucketing with sort techniques. In general, bucketing may also optimize execution times, namely when bucketing by the attributes used in the queries’ “group by” and “order by” clauses and when a bucket has at least the size of one HDFS block or a multiple of that size. Besides these contexts, the use of bucketing is usually discouraged. However, all these considerations are theoretical, not corroborated by any type of practical work or performance analysis, which emphasizes the lack of studies about the real impact of the implementation of bucketing techniques.
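
The size guideline reported from [30] can be turned into a rough sizing check. The following sketch is a heuristic of our own and not a rule taken from Hive or from the cited works: it assumes that the hash function spreads records evenly and that the input size refers to the data of one table or of one partition (since buckets are created inside each partition). The function names are hypothetical.

```python
GIB = 1024 ** 3

def suggest_bucket_count(data_bytes, min_bucket_bytes=GIB):
    """Largest bucket count that keeps the average bucket at or above min_bucket_bytes."""
    if data_bytes < 0 or min_bucket_bytes <= 0:
        raise ValueError("data_bytes must be >= 0 and min_bucket_bytes must be > 0")
    return max(1, data_bytes // min_bucket_bytes)

def worth_bucketing(data_bytes, min_bucket_bytes=GIB):
    """Bucketing only splits the data if at least two minimum-size buckets fit."""
    return suggest_bucket_count(data_bytes, min_bucket_bytes) >= 2

print(suggest_bucket_count(10 * GIB))      # 10
print(worth_bucketing(512 * 1024 ** 2))    # False
```

Because the number of buckets cannot be changed after the table is created, such an estimate would have to be made against the expected data volume rather than the initial load.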

Nowadays, and due to the youth of this research area, scientific papers related to data organization strategies in a BDW are scarce. Although some of the mentioned studies already consider some partitioning strategies, there is a significant absence of works analyzing the impact of bucketing, the combination of partitioning and bucketing on Hive’s data models, and how the use of these techniques can be optimized. Therefore, this work, extending the work previously presented in [10], seeks to fill these scientific gaps by addressing different data organization strategies, i.e., by benchmarking different combinations of partitions and buckets for two different data modelling patterns, based on star schemas and fully denormalized tables, as these are the most common modelling approaches used when implementing Hive-based BDWs. To accomplish this task, several workloads were tested using different scale factors (SFs), providing a clear overview of the impact of partitioning and bucketing strategies in these data modelling patterns.

Methods/experimental

Considering that the main goal of this work is the proposal of some best practices for modelling and organizing Hive-based BDWs, it is important that the guidelines and considerations here provided are adequately validated and the results are replicable. Therefore, a benchmark that includes several workloads was conducted to evaluate the performance of a Hive BDW in different scenarios. This section describes the materials and methods used in this research process.

Technological infrastructure

For this study, a Hadoop cluster including five nodes with similar configurations was used. Each node is composed of the following components:

(i) 1 Intel Core i5, quad core, with a clock speed ranging between 3.1 GHz and 3.3 GHz;
(ii) 32 GB of 1333 MHz DDR3 Random Access Memory (RAM), with 24 GB available for query processing;
(iii) 1 Samsung 850 EVO 500 GB Solid State Drive (SSD) with up to 540 MB/s read speed and up to 520 MB/s write speed;
(iv) 1 Gigabit Ethernet card connected through Cat5e Ethernet cables and a gigabit Ethernet switch;
(v) The operating system installed in all nodes is CentOS 7 with an XFS file system.

In this infrastructure, one of the nodes is configured with the HDFS NameNode and YARN ResourceManager, assuring the typical management roles in Hadoop, and the other four nodes are configured as HDFS DataNodes and YARN NodeManagers.

The Hadoop distribution used in this work is the Hortonworks Data Platform (HDP) 2.6.0 with the default configurations, excluding the HDFS replication factor, which was set to 2. Besides Hadoop (including Hive), Presto 0.180 is also available, with the coordinator installed on the NameNode and the workers on the four remaining nodes (the DataNodes). All Presto configurations were left at their defaults, except the memory configuration, which was set to use 24 GB of the 32 GB available in each worker (similar to the memory available for YARN applications in each DataNode/NodeManager).

Dataset and queries

In this work, the well-known star schema benchmark (SSB) was used, which considers a traditional sales data mart modeled according to dimensional structures (star schemas). This benchmark is based on the TPC-H Benchmark [33], with the necessary adaptations to transform the data model into a star schema, as can be seen in [22]. The data schema used here differs from the proposal of [22] in two respects: (i) the original TPC-H scale factor of the customer and supplier tables was left unchanged, since in real contexts it is possible to have large customer and supplier dimensions, as happens in large e-commerce enterprises and social media networks; (ii) a temporal dimension with fewer attributes than the one used by [22] was created, maintaining only the attributes that are relevant for executing the workloads available in [22], in order to keep a level playing field between the two types of data modelling strategies evaluated in this work (star schemas and denormalized tables).

Therefore, both SSB’s relational tables and the fully denormalized table were implemented in the Hive BDW, stored using the Optimized Row Columnar (ORC) format and compressed using ZLIB. Besides the dataset, this work also uses the 13 queries included in the SSB benchmark, measuring the performance of the BDW in typical OLAP workloads. The 13 queries are available in the work of [22] and also in [8], which provides all the scripts used in this work to run the queries in Hive and Presto. To give an overview of the queries and their patterns, the following listing shows the first query of each group, as the SSB includes four groups of queries, as will be seen in the following subsection.

Test scenarios

In order to understand the impact in query processing times when using different strategies for data partitioning and bucketing, several test scenarios were defined (Fig. 1). In these scenarios, two different data models (star schema and denormalized table) are tested for three different SFs (30, 100 and 300), following the application of three main data organization strategies: partitioning by multiple attributes, bucketing and the combination of both. For each SF, the SSB data is stored in HDFS, and Hive tables are created for both data organization strategies. The queries are executed in Presto and Hive (on Tez). The selection of these two SQL-on-Hadoop engines takes into consideration the results in [29]. Moreover, considering the work of [9], the broadcast join strategy was used for Presto to optimize the star schema processing times, in order to assure that they are comparable to the results of the denormalized table.
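
To make the three data organization strategies concrete, the sketch below generates HiveQL `CREATE TABLE` statements for an ORC table compressed with ZLIB, optionally partitioned and/or bucketed. It is an illustration under stated assumptions and not the scripts of [8]: the table name, the reduced column list, the bucketing column and the bucket count are hypothetical, and `create_table_ddl` is a helper written for this example.

```python
def create_table_ddl(name, columns, partition_cols=None, bucket_cols=None, num_buckets=None):
    """Return a HiveQL CREATE TABLE statement for an ORC table compressed with ZLIB."""
    partition_cols = partition_cols or []
    bucket_cols = bucket_cols or []
    partition_names = {col for col, _ in partition_cols}
    # Hive declares partition columns only in PARTITIONED BY, not in the column list.
    body = ", ".join(f"{col} {typ}" for col, typ in columns if col not in partition_names)
    ddl = [f"CREATE TABLE {name} ({body})"]
    if partition_cols:
        ddl.append("PARTITIONED BY (" + ", ".join(f"{c} {t}" for c, t in partition_cols) + ")")
    if bucket_cols:
        if not num_buckets or num_buckets < 1:
            raise ValueError("num_buckets must be a positive integer when bucketing")
        ddl.append(f"CLUSTERED BY ({', '.join(bucket_cols)}) INTO {num_buckets} BUCKETS")
    ddl.append("STORED AS ORC")
    ddl.append('TBLPROPERTIES ("orc.compress"="ZLIB")')
    return "\n".join(ddl)

# Hypothetical, heavily reduced column list for illustration only.
columns = [("lo_orderkey", "BIGINT"), ("lo_revenue", "BIGINT"),
           ("od_year", "INT"), ("s_region", "STRING")]
partitions = [("od_year", "INT"), ("s_region", "STRING")]

baseline = create_table_ddl("sales", columns)
partitioned = create_table_ddl("sales_p", columns, partition_cols=partitions)
bucketed = create_table_ddl("sales_b", columns, bucket_cols=["lo_orderkey"], num_buckets=16)
combined = create_table_ddl("sales_pb", columns, partition_cols=partitions,
                            bucket_cols=["lo_orderkey"], num_buckets=16)
print(combined)
```

Generating the statements from one helper keeps the four variants identical apart from the organization clauses, which is the property a comparison between scenarios depends on.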

The study of the cardinality and distribution of the attributes available in the dataset was done to choose the attributes and the several combinations among them, in order to adequately plan partitioning strategies, bucketing strategies and the combinations of both. Regarding the denormalized table, for the highest SF used in this work, it was not possible to replicate all the scenarios due to the memory limitations of the infrastructure used in this work.

To obtain more rigorous results, several scripts were developed to sequentially execute each query four times. The results in this work are presented as the average of the four executions. These scripts were adapted according to the SQL-on-Hadoop system in use (Presto or Hive), the applied data model (denormalized or star schema) and the data organization strategy (with or without partitions and buckets).
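
The measurement procedure can be sketched as a small timing harness. This is a minimal illustration and not the authors' scripts: `run_query` is a hypothetical callable that would submit one query to Hive or Presto and block until it finishes, and a short sleep stands in for it here.

```python
import statistics
import time

def average_runtime(run_query, repetitions=4):
    """Run a query sequentially several times and return (mean seconds, all timings)."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), timings

def fake_query():
    """Stand-in for submitting a real query (assumption for this sketch)."""
    time.sleep(0.01)

mean_seconds, runs = average_runtime(fake_query)
print(f"{len(runs)} runs, mean {mean_seconds:.3f} s")
```

Keeping the individual timings alongside the mean makes it possible to spot an outlier run, such as a first execution that is slower than the following ones.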

Results

Following the work of [9], which showed the advantages of simple partitioning using the attributes more frequently used in the query filters, and considering the work described in [10], this paper extends that previous work and presents the results obtained with: (i) the use of a multiple partitioning strategy; (ii) the use of different bucketing strategies (simple and multiple); and (iii) the combination of partitioning (simple and multiple) and bucketing strategies.

Despite the results depicted in [9], regarding the advantages of using a fully denormalized table over a dimensional model based on a star schema in Hive, this work also extends the comparison between these two data modelling techniques by applying different partitioning and bucketing strategies not only to a denormalized table but also to a star schema.

To give a global overview of the efficiency of the different strategies, extending the focus of the analysis beyond query processing time, the impact of the data organization strategies on CPU usage was also studied. Therefore, after presenting the time needed for processing the several workloads, each subsection ends with a study of CPU usage, taking as examples some scenarios used for the processing time analysis.

All the processing times for each query and for the several scenarios are presented in the next subsections without decimal places, for the sake of clarity and simplification in the visualization of results.

Multiple partitioning

As previously mentioned, the work of [9] showed that simple partitioning, using an attribute that frequently appears in the “where” clause of the queries, has benefits in terms of processing time. With that in mind, this subsection presents the results obtained when tables are partitioned by more than one attribute, continuing to study the impact of this type of data organization strategy. Throughout this subsection, the fastest processing time for each query, workload, tool and data model is highlighted in italics when illustrating the results of the benchmark. Table 1 shows the results when the attributes “Od_Year” (order year) and “S_Region” (supplier region) were considered as partitioning attributes. These attributes are used as filters in 11 out of 13 SSB queries, either appearing individually or combined in the queries’ “where” clauses. As can be seen, this scenario highlights the advantages of multiple partitioning when compared with no specific data organization strategy in terms of partitions and/or bucketing. In a star schema context, the decrease in the overall processing time reaches 42% in Hive and 46% in Presto. In the context of a denormalized table, the decreases in Hive vary between 16% and 45%, while with Presto the decrease can be over 50% (54% in the best scenario).
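
The reason why filters on the partitioning attributes reduce processing time can be illustrated by counting the partitions a query has to read. The sketch below is a simplified model: the seven order years and five supplier regions are assumptions based on the SSB specification, and the timings passed to `percent_decrease` are made-up inputs, not results from Table 1.

```python
from itertools import product

YEARS = range(1992, 1999)  # assumption: seven order years in the SSB data
REGIONS = ["AFRICA", "AMERICA", "ASIA", "EUROPE", "MIDDLE EAST"]  # assumption: SSB regions

def pruned_partitions(years=None, regions=None):
    """Partitions a query must read when filtering on the given years and/or regions."""
    return [
        (year, region)
        for year, region in product(YEARS, REGIONS)
        if (years is None or year in years) and (regions is None or region in regions)
    ]

def percent_decrease(baseline_seconds, new_seconds):
    """Relative reduction in processing time, in percent."""
    return 100.0 * (baseline_seconds - new_seconds) / baseline_seconds

print(len(pruned_partitions()))                                       # 35
print(len(pruned_partitions(years={1993})))                           # 5
print(len(pruned_partitions(years={1997, 1998}, regions={"ASIA"})))   # 2
print(percent_decrease(100, 58))                                      # 42.0 (hypothetical timings)
```

A query that filters on neither attribute still reads all partitions, which is consistent with the observation that the benefit depends on how often the partitioning attributes appear in the queries’ “where” clauses.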

Table 1 SSB execution times (in seconds): partitioning by “Od_Year” and “S_Region” (star schema (SS), star schema with partitions (SS-P), denormalized table (DT), denormalized table with partitions (DT-P))
