Analyzing performance of Apache Tez and MapReduce with hadoop multinode cluster on Amazon cloud
© The Author(s) 2016
Received: 18 June 2016
Accepted: 29 August 2016
Published: 18 October 2016
Big Data is the term used for larger data sets that are very complex and not easily processed by the traditional devices. Today is the need of the new technology for processing these large data sets. Apache Hadoop is the good option and it has many components that worked together to make the hadoop ecosystem robust and efficient. Apache Pig is the core component of hadoop ecosystem and it accepts the tasks in the form of scripts. To run these scripts Apache Pig may use MapReduce or Apache Tez framework. In our previous paper we analyze how these two frameworks different from each other on the basis of some parameters chosen. We compare both the frameworks in theoretical and empirical way on the single node cluster. Here, in this paper we try to perform the analysis on multinode cluster which is installed at Amazon cloud.
KeywordsBig Data Hadoop HDFS MapReduce Apache Tez Apache Pig Apache Hive
The age of Big Data has begun. Data on servers increased very rapidly and current technologies unable to retrieve some useful information from already stored data . These complex data sets require new technologies so that some useful information is retrieved in timely manner. Many companies invest millions on research to overcome challenges related to Big Data. Apache Hadoop is among the technologies to handle Big Data and it is an open source project maintained by many people around the world . Apache Hadoop foundation has developed many components with different versions. Hortonworks Data Platform is an organization which provides single platform for all the hadoop components . HDP provides us options to install hadoop on different platforms like Microsoft Azure, Amazon cloud, Local site or on own network. Apache Pig is among one of the core components of the hadoop ecosystem. It accepts jobs submitted in the form of scripts. Pig script is saved like notepad file and it is processed line by line using MapReduce or Apache Tez framework. User may choose any framework to run particular pig script. In our previous paper we compare both the frameworks in both theoretical and empirical way on the basis of some parameters. We perform our experiment on the single node cluster and also put more effort on theoretical parameters. Here in this paper we put emphasis on both theoretical empirical parameters and try to analyze that how these two frameworks react when particular job is submitted to multinode cluster installed on amazon cloud.
Firstly we take closer look at the hadoop ecosystem and some of its components. Then we try to explain parameters used for analysis of both the frameworks. After all the theoretical explanation we try to put some light on Dataset used and experimental setup required for running of Apache Pig Script. We run our script on multinode cluster installed on AWS cloud. Then all the results were shown in the form of graphs and tables. At last we conclude our paper by giving some idea about work yet to be done.
Apache Pig is a tool that is used to pre structure the data before hive uses it . It is used for analyzing and transforming datasets. Apache Pig uses procedural and scripting language. Pig job is a series of operations processed in Pipelines and automatically converted into MapReduce Jobs. Pig uses ETL (extract transform model) while extracting data from different sources . Then pig transforms it and stores into HDFS. Pig scripts run on both MapReduce and Apache Tez frameworks. User has three choices to submit pig jobs by grunt shell, UI or java server class.
A few years back we require the single machine for the processing of the larger datasets. Processing data on bigger machines is called scaling up. But this scaling has many bottle necks due to financial and technical issues. To solve this problem the concept of cluster of machines is introduced and this is known as scaling out. To make the concept of distributed processing feasible we have to write new programs. MapReduce is a framework which helps in writing programs for processing of data in parallel across thousands of machines . MapReduce is divided into two tasks Map and Reduce. Map phase is followed by the Reduce phase. Reduce phase is always not necessary. MapReduce programs are written in different programming and scripting languages.
Distributed processing is the base of hadoop. Hive and Pig relies on MapReduce framework for distributed processing. But MapReduce is Batch Oriented. So it is not suitable for interactive queries. So Apache Tez is alternative for interactive query processing. It is available in 2.x versions of Hadoop . Tez is prominent over map reduce by using hadoop containers efficiently, multiple reduce phases without map phases and effective use of HDFS.
Difference between MapReduce and Apache Tez on the basis of different parameters
Types of queries
MapReduce supports batch oriented queries 
Apache Tez supports interactive queries
MapReduce is the backbone of hadoop ecosystem and Apache Pig relies on this framework
Apache Tez also works for Apache Pig but it is very useful in interactive scenarios
MapReduce always requires a map phase before the reduce phase
A single Map phase and we may have multiple reduce phases
MapReduce is backbone of hadoop available in all hadoop versions
Apache Tez is available in Apache Hadoop 2.0 and above
Slower due to the access of HDFS after every Map and Reduce phase
High due to lesser job splitting and HDFS access
Temporary data storage
Stores temporary data into HDFS after every map and reduce phase 
Apache Tez doesn’t write data into HDFS, so it is more efficient
Usage of hadoop containers
MapReduce divide the task into more jobs. So more containers required for more jobs
Apache Tez reduces this inefficiency by dividing the task into lesser no of jobs and also by using existing containers
Apache Tez and MapReduce are two frameworks used by Apache Pig in analysis of particular Dataset . These two frameworks have their own merits and demerits.
Firstly, we discuss about the dataset used in our experiment. Then we explain the experimental setup used for processing of our dataset. At last we discuss the results of analysis of data after running Apache Pig script on both the frameworks.
Datasets used in experiments
No of records
No of attributes
Experimental results and metrics
Effect on runtime with increase in configuration of cluster nodes
In our case we have n = 10 and by picking the different values of x from Fig. 8, we got average time of 27,710.7 ms for Apache Tez and 60,713.7 for MapReduce, This shows that MapReduce takes almost double time then Apache Tez.
No of jobs
No of containers required
Showing difference on the basis of some parameters
No of jobs
No of containers
1st job = 2 containers
1 job = 3 containers
2nd job = 3 containers
Conclusion and future work
This paper briefly explains both the frameworks used for execution of Pig Scripts. We try to perform both theoretical and empirical analysis on the basis of some parameters. With the help of chosen parameters we are able to understand that how these frameworks differ from each other. Results show that Apache Tez is a better choice for execution of Apache Pig scripts as MapReduce requires more resources in the form of time and storage. But MapReduce is also the backbone of hadoop ecosystem and can be used efficiently in various scenarios.
In future we try to go into more detail of Apache Tez framework and try to explore new things so that it becomes more efficient. Lots of work is yet to be done on this framework.
RS performed the primary literature review, data collection, experiments, and also drafted the manuscript. PJK worked with RS and helps in analyzing the frameworks. Both authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- Big Data, http://www.sen.wikipedia.org/wiki/Big_data.
- Hadoop, http://www.apache.org.
- Hadoop, http://www.hortonworks.com.
- Ouaknine K, Carey M, Kirkpatrick S. The Pig mix benchmark on Pig, MapReduce, and HPCC systems. In: 2015 IEEE international congress on Big Data (BigData Congress); 2015. p. 643–8.
- Bansal SK. Towards a semantic extract-transform-load (ETL) framework for Big Data integration. In: 2014 IEEE international congress on Big Data (BigData Congress); 2014. p. 522–9.
- Maitrey S, Jha CK. Handling Big Data efficiently by using MapReduce technique. In: IEEE international conference on computational intelligence & communication technology (CICT); 2015. p. 703–8.
- Singh R, Kaur PJ. Theoretical and empirical analysis of usage of MapReduce and Apache Tez in Big Data. In: Proceedings of first international conference on information and communication technology for intelligent systems: volume 2, smart innovation, systems and technologies 51, doi: 10.1007/978-3-319-30927-9_52.
- Ravindra P. Towards optimization of RDF analytical queries on MapReduce. In: IEEE 30th international conference on data engineering workshops (ICDEW); 2014. p. 335–9.
- Fuad A, Erwin A, Ipung HP. Processing performance on Apache Pig, Apache Hive and MySQL cluster. In: 2014 international conference on information, communication technology and system (ICTS); 2014. p. 297–302.
- Azzedin F. Towards a scalable HDFS architecture. In: 2013 international conference on collaboration technologies and systems (CTS); 2013. p. 155–61.
- Gates AF, Dai J, Nair T. Apache Pig’s optimizer. IEEE Data Eng Bull. 2013;36(1):34.Google Scholar
- Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U. Building a high-level dataflow system on top of Map-Reduce: the Pig experience. Proc VLDB Endow. 2009;2(2):1414–25.View ArticleGoogle Scholar
- Cluster, http://www.amazon.com.
- Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Zhang N, Antony S, Liu H, Murthy R. Hive-a petabyte scale data warehouse using hadoop. In: 2010 IEEE 26th international conference on data engineering (ICDE). New York: IEEE; 2010. p. 996–1005.
- Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R. Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endow. 2009;2(2):1626–9.View ArticleGoogle Scholar