Estimating runtime of a job in Hadoop MapReduce

Hadoop MapReduce is a framework for processing vast amounts of data on a cluster of machines in a reliable and fault-tolerant manner. Since knowing the runtime of a job is crucial to the platform's subsequent decisions and to better management, in this paper we propose a new method to estimate the runtime of a job. For this purpose, after precisely analyzing the anatomy of job processing in Hadoop MapReduce, we consider two cases: a job that runs for the first time, and a job that has run previously. In the first case, we consider the essential parameters that have the highest impact on runtime, formulate each phase of the Hadoop execution pipeline, and express the phases as mathematical expressions to calculate the runtime of a job. In the second case, the runtime is estimated by referring to the profile or history of the job in the database and applying a weighting system. The results show that the average error rate is less than 12% when estimating the runtime for the first run and less than 8.5% when the profile or history of the job exists.


Background and related work
Hadoop is a data storage and processing platform based on two main concepts: HDFS and MapReduce. HDFS (Hadoop Distributed File System) is a file system that provides high-throughput access to data, and MapReduce is a framework for the parallel processing of large data sets. Hadoop follows a master/slave architecture: a Hadoop cluster has one master node and many slave nodes. The master node manages, maintains, and controls the slave nodes, while the slave nodes carry out the actual work of each task. The master node stores only metadata, while the slave nodes store the data, so data storage is distributed across the cluster. The Hadoop framework executes a job in a well-defined sequence of processing phases [1][2][3]. Figure 1 illustrates the pipeline and processing phases in Hadoop MapReduce. In brief, the phases shown in Fig. 1 are [1][2][3]:
• Read phase: Each map task typically reads a fixed-size block (128 MB) of input data from HDFS.
• Map phase: The user-defined Map function is applied to the data and generates a set of (key, value) pairs.
• Collect phase: The output of the Map phase is buffered and sorted.
• Spill phase: The results of the Collect phase are written to disk.
• Merge phase: The results of the Spill phase are merged and partitioned.
• Shuffle phase: The partitions are sorted, and data with the same key are brought together.
• Reduce phase: The user-defined Reduce function is applied to the data from the previous phase to obtain condensed data.
• Write phase: Finally, the output is written to HDFS.
Fig. 1 MapReduce processing pipeline [2]
The execution time of a job depends on the above phases, and several parameters affect the speed of each phase. Figure 2 shows some of the parameters that impact each phase of the Hadoop execution pipeline. These parameters and their functions are explained in Table 2.
Other factors, such as the amount of data flowing through each phase, the performance of the underlying Hadoop cluster, the user-defined functions (Map, Reduce), and the status of the network, also affect the runtime.
The aim of this paper is to estimate the runtime of a job in Hadoop MapReduce version 2 by analyzing each phase and investigating the effect of some important parameters. Since knowing the runtime of a job is crucial to subsequent decisions, several studies have been conducted in this field.
In [6], the authors propose ROUTE, a runtime workload prediction method for MapReduce. It samples the partition sizes of the mappers that complete early and performs the estimation at runtime. In this method, a sampler is developed that automatically determines the minimum number of samples needed to satisfy the accuracy requirement specified by users. Furthermore, ROUTE requires no a priori knowledge of the map function or the input datasets, and it makes no assumptions about the underlying distribution of the intermediate data.
In [7,8], the proposed model is based on historical job execution records and uses the Locally Weighted Linear Regression (LWLR) technique to estimate the runtime of a job. LWLR is an instance-based nonparametric method that assigns a weight to each instance according to its Euclidean distance from the query instance. It assigns a high weight to instances that are close to the query and a low weight to instances that are far from it.
Liu et al. [9] proposed a two-phase regression (TPR) method to predict the finishing time of each job precisely. Detailed data for each job were produced together with a detailed analysis report. TPR is divided into four steps: data preprocessing, data smoothing, data regression, and data prediction.
Fig. 2 Parameters that affect Hadoop pipeline [1,11]
Ramanathan and Latha [10] proposed a regression-based performance model to predict the MapReduce job completion time by using the Scale-Out strategy. The authors used this method for optimizing resource provisioning. They proposed five typical situations by varying resources, data size, and time constraints.
Chen et al. [11] used Petri nets to estimate the execution time of MapReduce jobs. They analyzed each phase of MapReduce job execution, investigated the mean delay time of each phase, and combined the delay times of all phases to estimate the execution time of a MapReduce job.
Kozyrev [12] presented a survey of the available techniques for estimating the worst-case execution time. The author divided the approaches into three overall categories: static, dynamic, and hybrid. Each category can use the following techniques: control flow analysis, context analysis, estimate calculation, symbolic computation parameterization of estimates. The author investigated the advantages and disadvantages of each one.
Amannejad et al. [13] proposed a model to quickly predict the runtime of a job when few cluster resources are available. They used logs from two executions of an application with small sample data and different resource settings, explored the accuracy of the predictions for other resource allocation settings and input data sizes, and then extended their model to Spark waves.
Kecskemeti et al. [14] proposed a technique for predicting background workload in a cloud environment. They simulated a workflow according to the runtime properties specified in its plan, such as job start time, completion time, and the times for creating virtual machines. For the simulation, they imitated long-running workflows, investigated behavior discrepancies, and then tried to replicate these in a simulated cloud.
Lu et al. [15] proposed and evaluated IoTDeM, an extended IoT Big Data-oriented model for predicting MapReduce performance. It can predict the total execution time of MapReduce jobs in a general implementation scenario with varying cluster scales. The model relies on historical job execution records and the Locally Weighted Linear Regression (LWLR) technique to predict the runtime of a job. The LWLR model assigns a weight coefficient to each job sample.
Uvaneshwari and Senthil Kumar [16] described a predictor of the likelihood that a job completes within a user-specified time. If no user-defined time is available, a job runtime estimator is run. For a MapReduce job that needs to be completed within a certain time, the job profile is built from the past runtimes of the job or by executing a smaller block of the job, thereby predicting the expected execution time of the entire job.

Methodology
To estimate the runtime of a job in Hadoop MapReduce, we first investigated the anatomy of a Hadoop job and the stages of running a job in detail [1][2][3][4][5]. Since Hadoop typically runs the same applications repeatedly on the same type of data [17], we use a profiling method: the database holds a separate table for each application. After each job is executed, its input data size and runtime are stored in the database as the history of the job. The tables are updated with each new run.
There are two possible situations when estimating the runtime:
1. The job, with the specified data size and application, runs for the first time, and no information about its execution is available in the database.
2. A job with this data size and application has already run, and its results are recorded in the database.
Each case is explained below.

Estimating runtime for the first run
If a job has not been executed before and no information about its runtime is available, the runtime must be estimated. Running a job in Hadoop can be divided into two general stages: Map and Reduce. So, the runtime of a job is the total execution time of the Map and Reduce stages (Eq. (1)).
The Map stage includes the following steps [18]: read, map, collect, spill, and merge, so the execution time of the Map stage is equal to the total time of these steps (Eq. (2)).
The Reduce stage includes [18] the shuffle, reduce, and write steps, so the runtime of the Reduce stage is calculated from Eq. (3). Therefore, the time of each step must be calculated and summed. Since Hadoop runs on a distributed system, many factors and parameters affect T_map and T_reduce [1][2][3][4][5][22]. So, we investigate the parameters with the greatest impact on runtime.
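The three sums described in the preceding paragraphs can be written compactly as follows. This is a reconstruction from the prose only: T_map and T_reduce follow the text, while the symbol T_job and the lowercase per-step symbols are notational assumptions made here for illustration.

```latex
\begin{align}
T_{job}    &= T_{map} + T_{reduce} \tag{1} \\
T_{map}    &= t_{read} + t_{map} + t_{collect} + t_{spill} + t_{merge} \tag{2} \\
T_{reduce} &= t_{shuffle} + t_{reduce} + t_{write} \tag{3}
\end{align}
```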

Parameters affecting the runtime in Hadoop
Many hardware and software parameters affect the runtime in Hadoop. In general, the influencing factors fall into three categories [11]:
1. Hardware factors: Factors such as the number of nodes in the network, processor power, the number of CPU cores, main and secondary memory size and speed, network bandwidth, and the number of containers affect the speed of running a job. Table 1 shows these factors and the values used for testing.
2. The settings of each node: The configuration of each node in the cluster affects the execution speed of jobs. Items such as the size of the data block, the number of Map and Reduce tasks that can run simultaneously on a node, the buffer size, and the replication value have a significant impact on the execution time of a job. Table 2 shows these parameters and the default value of each one.
3. Application properties: Items such as the complexity of the Map and Reduce functions, the size of the input and output data, and the runtime of each split as a sample of data alter the execution time of a job. Table 3 shows the notation of the application's parameters, for example Sel_r, the output-to-input ratio in the Reduce stage.
According to the operation of Hadoop MapReduce [1][2][3][4][5], when a request to execute a job is issued, the input file is broken into blocks of a predefined size (in Hadoop version 2, the default size of each block or split is 128 MB) and distributed among the nodes; each block is processed as a task, or map. The number of maps is determined by dividing the input data size by the size of the data block (Eq. (4)). Moreover, the Map and Reduce functions are copied to each node. Since the default replication value in Hadoop is 3, each block has three copies, so the data may be stored on a local node or a remote node.
Since the data transmission rate differs between local and remote nodes, and data is sometimes transferred from remote nodes [19], a coefficient called ∂_j is defined for the data access rate (Eq. (5)).
This coefficient is equal to the number of replications (N_rep) divided by the number of nodes (N).
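As a minimal sketch of the two quantities just described, the following Python functions compute the number of map tasks and the data access coefficient. The function names are hypothetical, and rounding the number of maps up to the next integer is an assumption made here (the text only states that the input size is divided by the block size).

```python
import math


def number_of_map_tasks(input_size_mb: float, block_size_mb: float = 128.0) -> int:
    """Number of map tasks as described for Eq. (4): input size / block size.

    Rounding up is an assumption: a partially filled last block is
    counted as one additional map task.
    """
    if input_size_mb <= 0 or block_size_mb <= 0:
        raise ValueError("sizes must be positive")
    return math.ceil(input_size_mb / block_size_mb)


def data_access_coefficient(n_rep: int, n_nodes: int) -> float:
    """Data access coefficient as described for Eq. (5): N_rep / N."""
    if n_rep <= 0 or n_nodes <= 0:
        raise ValueError("n_rep and n_nodes must be positive")
    return n_rep / n_nodes


if __name__ == "__main__":
    # 1 GB of input with the default 128 MB block size
    print(number_of_map_tasks(1024))
    # default replication value of 3 on a cluster with 12 data nodes
    print(data_access_coefficient(3, 12))
```

With the default replication value of 3 and twelve data nodes, the coefficient is 0.25; it grows as the cluster shrinks, which matches the intuition that a replica is more likely to be local on a small cluster.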

The runtime of the Map stage
The Map stage consists of the following steps:
a. Read step: The input data is read from HDFS. Since the data splits are spread across the nodes, data may be read from local or remote nodes. If data is read remotely, the data transfer rate should also be considered. Data that is read from the hard disk must be written to main memory [20]. Equation (6) gives the estimated time of reading the data.
b. Map step: The Map function runs in every map task, so the data must first be read from memory or the hard drive before the Map function is executed. The map time of a job is estimated from the time measured on a sample. In this study, the sample size is equal to the default split size (128 MB).
Since the Map function is the main function of a program, the complexity of its algorithm also affects the runtime. In Eq. (7), θ represents the complexity of the Map function. θ is a conventionally chosen coefficient (impact factor) between 0 and 1; its values, listed in Table 4, are determined based on the degree of complexity of each application's algorithm. Equation (7) gives the estimated time of the map step.
c. The spill and merge step: In the spill step, the output of the map step, whose records are key-value pairs, is placed and sorted in the memory buffer [21]. The buffer size is set by the sort.mb parameter, whose default value is 100 MB. When the memory buffer fills to the threshold value, the data is transferred to disk. The threshold is set by the spill.percent parameter, whose default value is 0.8; that is, the spill action starts when 80% of the buffer is filled. If more than one file is spilled, the merge operation starts. Up to ten spill files can be merged at the same time; this value is adjustable through the sort.factor parameter, whose default value is 10. In general, the output of the Map stage is continually sorted and spilled, which also requires merge operations.
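The interaction of the three parameters above can be illustrated with a rough Python sketch. It is a simplification under stated assumptions, not the paper's equations: it ignores the share of the buffer used for record metadata, assumes every spill file is as large as the threshold allows, and models merging as repeated passes that combine up to sort_factor files at a time. The function and argument names are hypothetical.

```python
import math
from typing import Tuple


def estimate_spills_and_merge_passes(
    map_output_mb: float,
    sort_mb: float = 100.0,       # buffer size (sort.mb default)
    spill_percent: float = 0.8,   # spill threshold (spill.percent default)
    sort_factor: int = 10,        # files merged at once (sort.factor default)
) -> Tuple[int, int]:
    """Return (number of spill files, number of merge passes).

    Assumption: a spill is triggered each time spill_percent of the
    buffer is filled, and metadata overhead is ignored.
    """
    if map_output_mb < 0 or sort_mb <= 0:
        raise ValueError("sizes must be non-negative / positive")
    if not 0.0 < spill_percent <= 1.0:
        raise ValueError("spill_percent must be in (0, 1]")
    if sort_factor < 2:
        raise ValueError("sort_factor must be at least 2")

    spill_threshold_mb = sort_mb * spill_percent
    # Even a small output is flushed to disk once at the end of the map task.
    n_spills = max(1, math.ceil(map_output_mb / spill_threshold_mb))

    # Merging only starts when more than one file has been spilled.
    passes = 0
    files = n_spills
    while files > 1:
        files = math.ceil(files / sort_factor)
        passes += 1
    return n_spills, passes


if __name__ == "__main__":
    for size_mb in (60, 128, 1000):
        print(size_mb, estimate_spills_and_merge_passes(size_mb))
```

Under these assumptions, a map output below the 80 MB threshold produces a single spill file and needs no merge, while a larger output needs at least one merge pass.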
The time spent on the spill and merge steps depends on the number of times the spill or merge action is performed. The execution of the spill step depends on the metadata and on the size of the buffer that holds this metadata. By default, 16 bytes of metadata are stored for each key-value record. Accordingly, Eq. (8) gives the metadata size, Eq. (9) the metadata buffer size, Eq. (10) the data size, and Eq. (11) the data buffer size.
The transfer of data from mappers to reducers, which can be executed locally or remotely, is called the shuffle stage.
The size of the Map stage output that is passed to each reducer is estimated by Eq. (16).
In Eq. (16), sel_m is the ratio of the number of output records of the map stage to the number of its input records (Eq. (17)), and N_r is the number of reducers, which is obtained from Eq. (18).

$$sel_m = \frac{\text{MapOutputRecords}}{\text{MapInputRecords}} \tag{17}$$
Data transfer is performed in the shuffle stage. Equation (19) calculates it based on the amount of data and its location. In the shuffle step, the sort operation is also performed and the data is transmitted, so the estimated time of the shuffle stage can be calculated from Eq. (20).
b. Reduce step: After the data has been transferred, sorted, and merged, the Reduce function is executed. The time of this step is estimated from the time spent on the tested sample. Equation (21) gives the estimated time of this step.
c. Write step: At this point, the output of the Reduce step is written to disk and placed on HDFS. Equation (22) gives the estimated time of this step.

Estimating runtime of a job that has executed previously
The status of each node affects the runtime of a job. If a node is busy with a background job, each run may take a different time. So, the different runtimes are saved in a database, organized by input data size and type of application, for use in later estimations. Based on this database, the Exponential Averaging method [23,24] is used to estimate the runtime of a new job if its input data size and type of application are similar to those of previous jobs.
In the Exponential Averaging method (Eq. (23)), more recent runtimes are more important and are therefore given more weight. This weight is expressed by the coefficient α, where 0 < α < 1; α is calculated from Eq. (24).
Heap_r * shuffle.buffer.percent * shuffle.merge.percent * D_shuffle * 1
In Eq. (24), n is the number of terms of the sequence in Eq. (23).
In the tests carried out in this study, n = 5 is used, so the sequence is calculated up to its fifth term.
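The weighting idea can be sketched in Python using the standard recursive form of exponential averaging, in which each new observation is blended with the running estimate. This form is an assumption: the exact shape of Eq. (23) and the formula for α in Eq. (24) are not reproduced here, so α is simply passed in as a parameter, and the function name is hypothetical.

```python
from typing import Sequence


def exponential_average(runtimes: Sequence[float], alpha: float) -> float:
    """Estimate the next runtime from past runtimes, oldest first.

    Standard recursive exponential averaging (assumed form):
        estimate = alpha * latest_observation + (1 - alpha) * previous_estimate
    A larger alpha gives more weight to the most recent runs.
    """
    if not runtimes:
        raise ValueError("at least one past runtime is required")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")

    estimate = float(runtimes[0])  # seed with the oldest observation
    for observed in runtimes[1:]:
        estimate = alpha * observed + (1.0 - alpha) * estimate
    return estimate


if __name__ == "__main__":
    # Five stored runtimes in seconds (illustrative values, not measured data)
    history = [310.0, 298.0, 305.0, 320.0, 301.0]
    print(exponential_average(history, alpha=0.5))
```

Expanding the recursion over five stored runtimes yields a weighted sum of five terms whose weights decay geometrically with age, which corresponds to evaluating the sequence up to its fifth term.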

Results and discussion
In this section, we evaluate the proposed method and discuss its error rate.
To evaluate our claim, after estimating a job's runtime with the proposed method, we measure the actual runtime of each job in the university lab and then examine the error between the actual and estimated runtimes. RMSE and MAPE, described in Eqs. (25) and (26), respectively, are selected as our evaluation indicators [25][26][27]. RMSE (Root-Mean-Square Error) is a frequently used measure of the differences between the values predicted or estimated by a model and the values observed. MAPE (Mean Absolute Percent Error) measures the size of the error in percentage terms.
They are standard statistical metrics for comparing model performance and measuring model error.
In Eqs. (25) and (26), T_obs is the observed (actual) value, T_est is the estimated value, and n is the number of observations.
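For reference, the two indicators can be computed as in the following Python sketch, which uses the standard textbook definitions of RMSE and MAPE. The function names are hypothetical, and the sample values in the usage example are illustrative, not measurements from this study.

```python
import math
from typing import Sequence


def _check(observed: Sequence[float], estimated: Sequence[float]) -> None:
    if len(observed) != len(estimated) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")


def rmse(observed: Sequence[float], estimated: Sequence[float]) -> float:
    """Root-mean-square error: sqrt(mean((T_obs - T_est)^2))."""
    _check(observed, estimated)
    squared = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    return math.sqrt(squared / len(observed))


def mape(observed: Sequence[float], estimated: Sequence[float]) -> float:
    """Mean absolute percent error: (100 / n) * sum(|T_obs - T_est| / |T_obs|)."""
    _check(observed, estimated)
    if any(o == 0 for o in observed):
        raise ValueError("observed values must be non-zero for MAPE")
    total = sum(abs(o - e) / abs(o) for o, e in zip(observed, estimated))
    return 100.0 * total / len(observed)


if __name__ == "__main__":
    t_obs = [100.0, 200.0]  # illustrative actual runtimes in seconds
    t_est = [110.0, 190.0]  # illustrative estimated runtimes in seconds
    print(rmse(t_obs, t_est), mape(t_obs, t_est))
```

RMSE is expressed in the same unit as the runtimes, so it grows with job size, whereas MAPE is scale-free, which makes it the more convenient indicator for comparing different data sizes.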
The following benchmarks are used in the evaluation:
• WordCount: counts the number of words in a text.
• Sort: sorts the input files.
• Inverted index: looks up the set of documents that contain a given word or term; it is often used in text processing.
All tests were performed in the university laboratory on 13 systems with the specifications given in Table 5. Of the 13 systems, one serves as the master and twelve serve as slaves. Hadoop 2.9.1 is installed on the systems.
$$\frac{|T_{obs} - T_{est}|}{T_{obs}}$$

Evaluation of estimating runtime for the first run
If a job runs for the first time and no profile or history exists for it, we estimate its runtime using the method explained above.

Fig. 3 RMSE values
To evaluate this method, we need the runtimes of the samples. Table 6 shows the sample runtimes for the three benchmarks. Each benchmark was run five times, and the average values are used. Table 7 shows the actual average runtime of a job over five runs and the values estimated by the proposed method for different data sizes.
Figures 3 and 4 show the values of RMSE and MAPE, computed according to Eqs. (25) and (26), respectively, for the three benchmarks. The number of observations for each data size is five.
As Fig. 4 shows, the error percentage decreases with increasing data size, which indicates that our proposed method is suitable for big data. The maximum error rate is less than 14%. For more confidence, we repeated the tests while varying the number of nodes and computed the error rate for each test separately using Eq. (27) [11].
Fig. 12 The percentage error rate on 10 nodes

Evaluation of estimating runtime of a job that has already been run
After each job is executed, its runtime is stored in the database according to the type of application and the size of the input data. Each table stores five runtimes. Using Eq. (23) and the information in the database, the runtime of a new job is estimated.
For instance, Table 8 shows the runtime history for WordCount on 13 nodes. (The database contains a similar table for each application.) Figures 9 and 10 show the RMSE and MAPE, respectively, for the three benchmarks. According to Fig. 10, the error percentage decreases with increasing data size, which shows that our proposed method is suitable for big data. The maximum error rate is less than 12%.
Figures 11, 12, 13, and 14 show the error rate, computed according to Eq. (27), between the runtime estimated by the Exponential Averaging method and the actual runtime for WordCount, TeraSort, and Inverted index on different numbers of nodes. The results show that the error rate is less than 5%.
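A bounded history of this kind can be sketched in Python with a fixed-length queue per job profile. This is an illustrative in-memory stand-in for the database tables, not the storage layer used in this study: the class name, the method names, and the choice of keying entries by application name and input size are all assumptions.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple


class RuntimeHistory:
    """Keeps the most recent runtimes for each (application, input size) pair."""

    def __init__(self, max_runs: int = 5) -> None:
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._tables: Dict[Tuple[str, int], Deque[float]] = defaultdict(
            lambda: deque(maxlen=max_runs)
        )

    def record(self, application: str, input_size_mb: int, runtime_s: float) -> None:
        """Store the runtime of a finished job."""
        self._tables[(application, input_size_mb)].append(runtime_s)

    def lookup(self, application: str, input_size_mb: int) -> List[float]:
        """Return stored runtimes, oldest first; empty if the job is new."""
        key = (application, input_size_mb)
        return list(self._tables[key]) if key in self._tables else []


if __name__ == "__main__":
    history = RuntimeHistory()
    for runtime in (310.0, 298.0, 305.0, 320.0, 301.0, 299.0):
        history.record("WordCount", 1024, runtime)
    print(history.lookup("WordCount", 1024))  # only the five latest runs remain
    print(history.lookup("Sort", 1024))       # no history: first-run case
```

An empty lookup result corresponds to the first-run case, in which the phase-by-phase model is used instead of the stored history.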

Conclusions and future works
Apache Hadoop is one of the most popular open-source frameworks for processing big data. For better management of this framework, estimating the runtime of a job is a critical issue.
In this paper, after precisely studying and analyzing the anatomy of job processing in Hadoop MapReduce, we considered two cases for estimating the runtime of a job: a job that runs for the first time and has no history, and a job that has run previously and whose profile is available. In the first case, we considered the essential parameters that have the highest impact on runtime, formulated each phase of the Hadoop execution pipeline, and expressed the phases as mathematical expressions to calculate the runtime of a job. In the second case, the runtime is estimated by referring to the profile or history of the job in the database and applying a weighting system. RMSE and MAPE were selected as evaluation indicators. The results show that the average error rate is less than 12% when estimating the runtime for the first run and less than 8.5% when the profile or history of the job exists.
For future work, we plan to extend the proposed scheme with machine learning techniques and to optimize Hadoop parameter values with metaheuristic algorithms. We also intend to work on estimating the runtime of Spark jobs that process stream data.