There are many scheduling algorithms that address the main issues of MapReduce scheduling with different techniques and approaches. As mentioned above, some of these algorithms focus on improving data locality and some aim to provide processing synchronization; many are also designed to decrease completion time. LATE [16], SAMR [17], CREST [18], LARTS [19], Maestro [20] and the matchmaking algorithm all focus on data locality. What follows is a brief description of some of the most important algorithms. The Longest Approximate Time to End (LATE) scheduler launches backup tasks for the tasks with the longest remaining execution time, using a set of fixed weights to estimate that remaining time. The scheduler tries to identify slow-running tasks and, once they are identified, sends them to another node for execution; if that node performs the task faster, overall system performance improves. The advantage of this method is that it calculates both the remaining execution time of a task and the rate of job progress, which increases the system's response rate. A disadvantage of LATE is that, in some cases, tasks are selected for re-execution incorrectly because their remaining execution time is miscalculated. To address this, Chen et al. [17] proposed the Self-Adaptive MapReduce (SAMR) scheduling algorithm, inspired by LATE. SAMR uses the history of job executions to estimate the remaining execution time more accurately: the task tracker reads the historical information and tunes its parameters accordingly. The SAMR algorithm improves MapReduce by saving execution time and system resources. It defines fast and slow nodes as nodes that can finish a task in a shorter or longer time, respectively, than most other nodes.
SAMR also considers how to account for data locality when launching backup tasks, because data locality can remarkably accelerate data loading and storing. Lei et al. [18] proposed CREST (Combination Re-Execution Scheduling Technology), a novel approach that can achieve the optimal running time for speculative map tasks and decrease the response time of MapReduce jobs. To mitigate the impact of straggler tasks, it is common to run a speculative copy of a straggler; CREST's main idea is that re-executing a combination of tasks on a group of computing nodes may progress faster than directly speculating the straggler task on a target node, thanks to data locality. Hammoud and Sakr [19] presented another approach to the data locality problem that deals specifically with reduce tasks. Their scheduler, named Locality-Aware Reduce Task Scheduler (LARTS), uses a practical strategy that leverages network locations and partition sizes to exploit data locality. LARTS attempts to schedule reducers as close as possible to the bulk of their input data and conservatively switches to a relaxation strategy that seeks a balance among scheduling delay, scheduling skew, system utilization, and parallelism. Ibrahim et al. [20] developed Maestro, a scheduling algorithm that avoids the problem of non-local map task execution by relying on replica-aware execution of map tasks. Maestro keeps track of chunk and replica locations, along with the number of other chunks hosted by each node, and uses this information to compute the probability that each node can execute all of its hosted chunks locally; map tasks can then be scheduled with low impact on other nodes' local map task execution. Scheduling map tasks in two waves in this manner, Maestro provides higher locality in the execution of map tasks and a more balanced intermediate data distribution for the shuffling phase.
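LATE's estimation step can be made concrete with a small sketch. The heuristic below follows the well-known LATE formulation (progress rate = progress score / elapsed time; time left = remaining progress / progress rate); the data structures and helper names are illustrative, not the authors' code.

```python
# Sketch of LATE's remaining-time heuristic. The task with the longest
# estimated time left becomes the candidate for a speculative backup copy.

def time_left(progress_score: float, elapsed: float) -> float:
    """Estimate remaining execution time of a running task.

    progress_score: fraction of the task completed, in (0, 1].
    elapsed: wall-clock seconds since the task started.
    """
    progress_rate = progress_score / elapsed          # fraction per second
    return (1.0 - progress_score) / progress_rate     # seconds remaining

def pick_backup_candidate(tasks):
    """Return the running task with the longest estimated time left."""
    return max(tasks, key=lambda t: time_left(t["progress"], t["elapsed"]))

tasks = [
    {"id": "m1", "progress": 0.9, "elapsed": 90.0},   # nearly done
    {"id": "m2", "progress": 0.2, "elapsed": 80.0},   # straggler
]
print(pick_backup_candidate(tasks)["id"])  # m2
```

Note how the fixed-weight progress score is the weak point LATE is criticized for above: if `progress_score` is inaccurate, `time_left` misranks tasks and the wrong task is re-executed.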
He et al. [9] proposed a matchmaking scheduling algorithm, which is close to our HybSMRP. Local map tasks are always preferred over non-local map tasks, no matter which job a task belongs to, and a locality marker is used to mark nodes so that each node gets a fair chance to grab its local tasks.
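The matchmaking idea can be sketched as follows. This is a simplified illustration under our own assumptions (dict-based jobs and a simple marker set), not the authors' implementation: on a heartbeat, a local task from any queued job wins; if none exists, the node is marked and skips one assignment so it keeps a fair chance to grab a local task later, and only then accepts a non-local task.

```python
# Illustrative matchmaking sketch: locality first, across all jobs.
def assign_task(node_host, job_queue, marked_nodes):
    # Pass 1: a local map task from ANY job wins, regardless of job order.
    for job in job_queue:
        for task in job["pending"]:
            if node_host in task["locations"]:
                return task
    # No local task found: mark the node and skip this heartbeat once,
    # giving it a fair chance to find a local task next time.
    if node_host not in marked_nodes:
        marked_nodes.add(node_host)
        return None
    # Node already marked: relax locality and hand out a non-local task.
    for job in job_queue:
        if job["pending"]:
            return job["pending"][0]
    return None
```

A node whose data holds no pending task thus idles for at most one heartbeat before falling back to non-local work, which bounds the locality-induced delay.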
Another algorithm, the one closest to our proposed algorithm, is HybS, proposed by Nguyen et al. [10]. HybS is based on dynamic priority: it reduces the delay for variable-length concurrent jobs and relaxes the order of jobs to maintain data locality. It also provides a user-defined service-level value for quality of service. The algorithm is designed for data-intensive workloads and tries to maintain data locality during job execution.
Wang et al. [21] proposed a map task scheduling algorithm for MapReduce with data locality, deriving an outer bound on the capacity region and a lower bound on the expected number of backlogged tasks in steady state. Their focus is to strike the right balance between data locality and load balancing so as to simultaneously maximize throughput and minimize delay. Naik et al. [22] proposed a novel data-locality-based scheduling algorithm that enhances MapReduce performance in heterogeneous Hadoop clusters. Their scheduler dynamically divides the input data and assigns data blocks according to each node's processing capacity, and likewise schedules map and reduce tasks according to the processing capacity of the nodes in the heterogeneous cluster. Liu et al. [23] proposed partitioning the available resources of a Hadoop MapReduce cluster among the multiple organizations that collectively fund the cluster, based on their computing needs. MapReduce adopts a job-level scheduling policy for a balanced distribution of tasks and effective utilization of resources; their paper introduces a two-level query scheduling scheme that maximizes intra-query job-level concurrency while speeding up query-level completion time based on accurate prediction and queuing of queries.
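The capacity-aware division used by Naik et al. can be illustrated with a minimal sketch. The proportional split rule below is our own illustrative scheme under the stated idea (blocks assigned in proportion to node capacity), not their exact formula.

```python
# Illustrative capacity-proportional block assignment for a heterogeneous
# cluster: faster nodes receive proportionally more input blocks.

def divide_blocks(num_blocks, capacities):
    """capacities: dict mapping node name -> relative processing capacity."""
    total = sum(capacities.values())
    shares = {n: int(num_blocks * c / total) for n, c in capacities.items()}
    # Hand any remainder (from integer truncation) to the fastest nodes first.
    leftover = num_blocks - sum(shares.values())
    for n in sorted(capacities, key=capacities.get, reverse=True)[:leftover]:
        shares[n] += 1
    return shares

print(divide_blocks(10, {"fast": 3.0, "medium": 2.0, "slow": 1.0}))
# {'fast': 6, 'medium': 3, 'slow': 1}
```

With such a split, map tasks scheduled on their local blocks automatically track node capacity, which is the load-balancing effect the heterogeneous-cluster schedulers above aim for.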
Tran et al. [24] proposed a new data layout scheme that can be implemented in HDFS. Their layout algorithm assigns data blocks to a high-performance set and an energy-efficient set based on data size, and keeps the replicas of data blocks in the energy-efficient servers.
The proposed algorithm is a data distribution method via which input data is dynamically distributed among the nodes of the cluster on the basis of their processing capacity.
Recently, several scholarly papers have proposed scheduling models that minimize communication costs. For instance, Selvitopi et al. [25] proposed an offline scheduling algorithm based on graph models, which correctly encodes the interactions between map and reduce tasks. Choi et al. [26] addressed a setting in which a map split consists of multiple data blocks distributed and stored across different nodes. Beaumont et al. [27] proposed two data-locality-aware task scheduling algorithms that optimize makespan, defined as the time required to complete a set of jobs from beginning to end, i.e. the maximum completion time over all jobs. Furthermore, the scheduling algorithms proposed by Li et al. [28] aimed to optimize locality-enhanced load balance across the map, local reduce, shuffle, and final reduce phases. Unlike approaches that maximize data locality for its own sake, the approach presented in the current paper aims to minimize job completion time through a higher data locality rate.
Default scheduler in Hadoop MapReduce
One of the main differences between schedulers in Hadoop and other types of schedulers is the varying execution time of tasks on different machines: depending on the required amounts of storage, computation, and processing power, a task may run in n time units on one machine and in 2n time units on another. Moreover, due to the considerable data growth in data centers and the characteristics of MapReduce applications, achieving the desired scheduling goals is complicated. In Hadoop, jobs must be run with the resources provided by the cluster, so there is a specific scheduling policy according to which jobs are executed based on the available resources. To enhance the performance of job scheduling, multiple jobs can enter the cluster as a batch and share system resources. To support the parallel execution of jobs, the focus should be on a job scheduling mechanism based on resource sharing. There are two types of resource sharing in the MapReduce framework. In the first type, several jobs use shared computational resources: CPU, memory, and disk. The second type is the parallel execution of jobs, i.e. a shared process, meaning that one data file is processed by multiple jobs. If the number of tasks is smaller than the number of available slots, the scheduler can assign all tasks to free slots in a single wave; if it is greater, the allocation occurs in several waves.
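The wave-based allocation described above can be stated directly: with m pending tasks and s free slots, dispatch happens over ceil(m / s) waves.

```python
import math

def num_waves(num_tasks: int, num_slots: int) -> int:
    """Number of scheduling waves needed to place num_tasks into num_slots."""
    return math.ceil(num_tasks / num_slots)

print(num_waves(8, 10))   # 1 -> all tasks fit in a single wave
print(num_waves(25, 10))  # 3 -> allocation happens over three waves
```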
Scheduling is regarded as an important research area in computational systems, and in Hadoop it is particularly important for achieving high cluster performance. Several scheduling algorithms have therefore been proposed [7, 8]. Hadoop has three default schedulers: FIFO, Fair and Capacity. What follows is a brief description of the FIFO and Fair algorithms, along with their positive and negative points.
FIFO scheduler
The FIFO scheduler places applications in a queue and runs them in order of submission. Requests from the first application in the queue are allocated first; once they have been satisfied, the next application in the queue is served, and so on. The scheduler will not allocate any task from other jobs while the first job in the queue still has an unassigned map task. This rule has a negative effect on data locality, because another job's local tasks cannot be assigned to a slave node until all map tasks of the previous job have been completed. Therefore, in a large cluster executing many small jobs, the data locality rate can be quite low. The FIFO scheduler also has broader limitations [6]; most importantly, it is not suitable for shared clusters: large applications use all the resources in the cluster, so every other application has to wait its turn. On a shared cluster, it is better to use the Fair scheduler.
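The locality problem described above can be seen in a minimal FIFO sketch (illustrative data structures, not Hadoop's code): the head job monopolizes assignment until all of its map tasks are placed, even when a later job has a data-local task for the requesting node.

```python
from collections import deque

def fifo_assign(node_host, job_queue: deque):
    """Return the next map task for node_host under strict FIFO ordering."""
    if not job_queue:
        return None
    head = job_queue[0]
    if head["pending"]:
        # Prefer a local task of the head job, but never look past the head.
        for task in head["pending"]:
            if node_host in task["locations"]:
                head["pending"].remove(task)
                return task
        return head["pending"].pop(0)   # non-local assignment
    job_queue.popleft()                 # head job has no tasks left to assign
    return fifo_assign(node_host, job_queue)
```

Even if the second job's task `b1` is local to the requesting node, FIFO hands out the head job's non-local task first, which is exactly the locality loss discussed above.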
Fair scheduler
Fair scheduling is a method of assigning resources to jobs such that, on average, all jobs get an equal share of resources over time. The objective of the Fair scheduling algorithm is to distribute computing resources equally among the users/jobs in the system. The scheduler organizes jobs into resource pools and shares resources fairly between these pools; by default, there is a separate pool for each user. Each pool has a guaranteed capacity, and the Fair scheduler can limit the number of concurrently running jobs per user and per pool, as well as the number of concurrently running tasks per pool. Within a pool, jobs are scheduled using either fair sharing or first-in-first-out, and a minimum share is assigned to each pool. The most serious drawback of this algorithm is that it does not consider the job priority of each node, which is an important disadvantage.
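The pool-selection step of fair sharing can be sketched as follows. This is a deliberately simplified illustration under our own assumptions (equal pool weights, no minimum shares or per-pool limits): when a slot frees up, serve the runnable pool that is furthest below its fair share of running tasks.

```python
# Simplified fair-share pool selection: most-starved runnable pool first.
def pick_pool(pools, total_slots):
    """pools: dict name -> {'running': int, 'pending': int}."""
    runnable = {n: p for n, p in pools.items() if p["pending"] > 0}
    if not runnable:
        return None
    fair_share = total_slots / len(runnable)
    # Largest deficit below the fair share wins the freed slot.
    return max(runnable, key=lambda n: fair_share - runnable[n]["running"])

pools = {"alice": {"running": 4, "pending": 2},
         "bob":   {"running": 1, "pending": 5}}
print(pick_pool(pools, 10))  # bob (furthest below its fair share)
```

Repeating this choice on every freed slot drives all pools toward equal running shares over time, which is the behavior described above; the real Fair scheduler additionally honors minimum shares, weights, and the per-pool/per-user limits.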