Hybrid gradient descent spider monkey optimization (HGDSMO) algorithm for efficient resource scheduling for big data processing in heterogenous environment

Big Data, built on advances in distributed computing and virtualization, is a current emerging trend in data analytics. It supports effective utilization of computing resources with a focus on on-demand services and resource scalability. In particular, resource scheduling is the process of distributing resources through an effective decision-making process with the objective of facilitating the required tasks over time. The incorporation of heterogeneous computing resources by Big Data consumers also offers the option of reducing energy usage and enhancing resource efficiency. Further, optimal scheduling of resources is an NP-hard problem due to the dynamic characteristics of the resources and the fluctuating demands of users. In this paper, a Hybrid Gradient Descent Spider Monkey Optimization (HGDSMO) algorithm is proposed for efficient resource scheduling that handles the issues and challenges in the heterogeneous Hadoop environment. The proposed HGDSMO algorithm combines Gradient Descent with the foraging and social behavior of the Spider Monkey Optimization algorithm for the objective of effective resource allocation. It is designed as an efficient task scheduling approach that balances the load of the cloud by allocating tasks to appropriate VMs depending on their requirements. It is also proposed as a dynamic resource management scheme for efficiently allocating cloud resources for effective execution of clients' tasks. The simulation results confirm that the proposed HGDSMO algorithm is potent in throughput, load balancing and makespan compared to the baseline hybrid meta-heuristic resource allocation algorithms used for investigation.

since it is a time-consuming and labor-intensive task and hence stretches existing infrastructure to its limits. Many studies are now emerging that explore the possibility of using the cloud computing paradigm for Big Data processing [2]. Those works are driven by the fact that Big Data processing requires scalable and parallel computing resources rather than on-hand database management tools or traditional data processing applications. The core aim of scheduling in Big Data processing is to plan the processing and completion of as many diversified tasks as possible with a restricted amount of data handling and alteration, achieved in an effective manner. In general, different methods are preferable for resource allocation since they possess specialized architectural properties. In this context, identifying the best scheduling method for each specific data processing job is an important challenge [3]. This challenge is even more complex because Big Data processing consists of large batch jobs that run over a High Performance Computing (HPC) cluster by partitioning a job into smaller tasks and distributing the work to the cluster nodes. However, Big Data processing models need to be aware of the locality in which the data resides when transferring the data to the nodes used for computation. Currently, jobs are allocated to each computing node based on two processes: investigating the realistic contexts of resource utilization, and the static or dynamic scheduling enforced in MapReduce clusters. Further, the job scheduling process can estimate the resource utilization associated with each allocated job, which may not be achieved by investigating the completed jobs.
The area of Big Data poses a diverse set of research challenges that includes data volume handling, Big Data analysis, security and data privacy, data visualization, storage, fault tolerance, job scheduling and energy optimization. Big Data analysis is difficult due to the heterogeneous and incomplete nature of the data. This challenge is also due to the different structures, variations and formats of the collected data [4]. Furthermore, the dynamic scheduling of jobs with demands for distributed computing also necessitates resource scheduling across diversified geographical areas. In specific, Hadoop uses round method of scheduling when the number of lower priority jobs is comparatively higher than the number of higher priority jobs. The scheduler also enforces weights and dynamically updates rules based on its estimation of the situation. The Hadoop platform uses the aforementioned approach for job tracking and task allocation in the heterogeneous Big Data environment. However, the Hadoop scheduler suffers from performance degradation under heterogeneous environments. To resolve the limitations that are common in job tracking and task allocation, LATE (Longest Approximate Time to End) is utilized in the Hadoop environment for introducing high robustness in task scheduling under heterogeneity [5]. It is identified that the response time of Hadoop increases with the incorporation of the LATE scheduler. Delay scheduling is a method adopted in Big Data for enhancing the locality of data in an effective and efficient way. Delay scheduling also concentrates on improving the throughput for diversified types of tasks. It also ensures fairness in processing and completing tasks based on its policies and the simplicity involved in sharing the resources in the heterogeneous Hadoop environment.
In this context, computing services are considered to possess virtual data centers that are highly optimized for providing software, hardware and information resources depending on the demand requested by the users [6]. However, the Hadoop environment handles the fluctuations in workload and enables resources for computing and managing huge amounts of multimedia data and multimedia development environments in most circumstances [7]. Moreover, the growing number of larger datacenters introduces new challenges at the infrastructure management and monitoring level [8]. At this juncture, resource scheduling is determined to be a vital procedure for making conclusive decisions on the distribution of resources over time [9]. However, these resource scheduling problems also pose certain challenges due to the dynamic characteristics of the resources and the fluctuating demands of the users [10]. Moreover, the fitness function computed for resource scheduling reflects the objectives associated with both users and providers [11]. From the providers' perspective, resource utilization needs to be improved with the resources available in the environment in order to increase profit and revenue growth [12]. On the other hand, users concentrate on deriving maximum performance from their requisite services with reduced expenditure and cost [13]. In the literature, most of the resource scheduling algorithms were identified to consider the parameters of cost, load balancing, availability, throughput, reliability, makespan, energy and fault tolerance for confirming an optimal scheduling and utilization process. In addition, most of the meta-heuristic resource scheduling algorithms proposed in the literature are identified to be potent in either the local searching or the global searching process [14].
Thus, the hybridization of a potent local searching approach with a capable global optimization scheme is essential [15]. In this proposed scheme, a Hybrid Gradient Descent Spider Monkey Optimization (HGDSMO) algorithm is presented by integrating the local search ability of Gradient Descent (GD) and the global search potential of the Spider Monkey Optimization (SMO) algorithm for an effective and efficient resource scheduling process in the heterogeneous Hadoop environment.
The major contributions of the proposed HGDSMO resource allocation algorithm are as follows: a) Hybridization of Gradient Descent and the Spider Monkey Optimization algorithm for the optimal resource allocation process in a heterogeneous environment. b) Formulation of objective functions for optimal resource scheduling through mathematical models that derive throughput, makespan and imbalance degree. c) Design of the HGDSMO resource allocation algorithm for the purpose of addressing the issues and challenges inherent in the proposed scheduling models. d) Implementation of the proposed HGDSMO resource allocation algorithm using the Hadoop simulation tool. e) Performance investigation of the baseline meta-heuristic algorithms against the proposed HGDSMO algorithm using the metrics of throughput, imbalance degree and makespan.

Related work
In this section, the complete categories of meta-heuristic resource scheduling approaches [8] contributed over the decades are presented in Fig. 1. This section also presents a comprehensive review of the state-of-the-art meta-heuristic resource scheduling approaches proposed in recent years, with an extract of the literature depicting its shortcomings. A Cuckoo Search (CS) meta-heuristic algorithm-based resource scheduling scheme was proposed for heterogeneous environments [16]. This CS-based resource scheduling scheme used the factors of throughput, makespan and response time for estimating the performance. Then, an enhanced Multi-Objective Cuckoo Search Optimization (MOCSO) scheme was proposed for handling the issues of resource scheduling in heterogeneous environments as portrayed in Fig. 1. The core objective of the MOCSO scheme is to minimize the cost to the user and improve the performance by reducing the makespan. The minimization of makespan in MOCSO in turn concentrates on enhancing the profit or revenue of providers through maximized utilization of resources [17]. This MOCSO approach is also potent in solving resource scheduling issues through the formulation of multiple objectives in Big Data. It balances the objectives through the determination of two factors, the expected completion time and the expected cost, for computing completion matrices. A Global League Championship Algorithm (GLCA) was proposed for the effective task of resource allocation through the enforcement of scientific applications [18]. The simulation results of the GLCA scheme derived through the Hadoop simulator confirmed its potential in improving the makespan by 14.44-46.41%. This GLCA scheme was also shown to reduce the response time under effective resource scheduling processes.
This GLCA scheme also confirmed a superior performance over the compared Genetic Algorithm, Min-Min, Ant Colony and Max-Min optimization algorithm-based scheduling approaches. A resource scheduling algorithm based on Discrete Symbiotic Organism Search (SSOS) was proposed to improve load balancing and resource sharing in a heterogeneous environment [19]. The characteristic properties of SSOS such as commensalism, mutualism and parasitism contributed to achieving the objective of optimizing resource scheduling. This SSOS technique was identified to possess a faster convergence rate, which makes it adaptable to large-scale scheduling issues. An integrated PSO and ACO-based task scheduling optimization scheme was proposed for the heterogeneous Hadoop environment for handling the inadequacies realized under the deployment of heterogeneous environments [20]. This integrated PSO and ACO-based resource scheduling scheme was determined to be potent in sustaining the particles at the fitness level with a specific concentration. This PSO-ACO scheme was capable of guaranteeing population diversity to the maximum degree for achieving the global best solution. A Genetic Algorithm-based resource allocation technique was proposed for reducing the completion time of tasks through the enforcement of potent allocation decisions [21]. The simulation experiments conducted for this GA-based resource allocation technique using the Hadoop tool confirmed its predominance over the simple and greedy resource allocation schemes in terms of throughput and makespan.
A hybrid GA and PSO-based resource allocation approach was proposed, based on the principle of on-demand queues [22]. This GA-PSO algorithm incorporated the process of analyzing the new tasks by storing them into the on-demand queue. It also computed the priority that played the vital role in synchronized allocation of tasks to hosts or VMs. A Mean Grey Wolf Optimization Algorithm (MGWO) was proposed for improving the performance by eliminating the issues that are highly related to scheduling in the Hadoop system [23]. The objective function of the MGWO algorithm concentrated on reducing the energy consumption and makespan under resource scheduling processes. An Integrated Harmony Search and Cuckoo Search (IHS-CSO) algorithm-based resource scheduling scheme was proposed to sustain the balance between the exploitation rate and the exploration related to the availability of hosts or VMs [24]. This IHS-CSO scheme used energy consumption, credit, penalty, memory usage and cost for devising the objective function. It was confirmed to be superior to the standalone Gravity Search, Cuckoo Search (CS) and Harmony Search (HS) schemes. In addition, a Hybrid Gradient Descent and Cuckoo Search Optimization (HGDCSO) algorithm was proposed for resource scheduling based on makespan, throughput and degree of imbalance in the heterogeneous environment [25]. This HGDCSO scheme was proposed for the mutual integration of the gradient descent and cuckoo search optimization algorithms, which are capable of exploitation and exploration respectively. The parameters of makespan, throughput and degree of imbalance were used by the HGDCSO scheme for the formulation of the objective function. The number of migrated tasks during the deployment of the HGDCSO scheme was identified to be comparatively better than that of the existing approaches, independent of the number of tasks and VMs available in Hadoop.
A resource management scheme for Big Data processing was proposed for distributing the loads in the cloud environment [26]. It used an algorithm with load balancer matching that derives its input from the demands of the Big Data under processing. The proposed management algorithm works well in the cloud computing scenario independent of the number of VMs and physical host machines.
It was proposed to address the issue of low availability associated with computational significance and energy. It was also identified to enhance the response time associated with the Big Data processing tasks. A localization identity and dynamic priority-based hybrid scheduling algorithm was proposed for concentrating on the scalability issue of the data locality rate with reduced completion time [27]. It was compared with the default schedulers of Hadoop, FIFO and Fair, which are capable of executing concurrent workloads that incorporate the TeraSort and WordCount benchmarks. It was determined to be rapid, on par with the Fair scheduler and the FIFO methodology, in ensuring a maximized data locality rate by preventing resource wastage. A Binary Particle Swarm Optimization (BPSO) algorithm-based resource management scheme was proposed for handling the demand for higher computational power, as it is the major challenge to the energy requirement in a cloud scenario [28]. It was proposed for distributing the load of Big Data processing among the VMs in a fair manner by optimizing QoS parameters that satisfy the end user goals. It inherited the merits of a modified transfer function that facilitated balanced exploitation and exploration during the process of optimizing the QoS parameters. However, there is still room for improvement.
A resource allocation framework was proposed for outsourcing the incoming Big Data tasks to external clouds [29]. This architecture did not use any inter-cloud agreements formulated for the federation of clouds. It was proposed to benefit the allocation of user tasks towards the maximization of profits, which in turn ensured a high degree of QoS. It was formulated as a reliable integer programming model that is capable of internally updating itself towards the requirements of Big Data processing. It was also identified to minimize the computation time predominantly, but failed to do so when a large number of requests arrived at the cloud environment. In addition, an autonomic resource management approach was proposed for ensuring intelligent QoS in Big Data processing [30]. This autonomic resource management approach supported resources that offered self-configuration of applications with self-healing properties. It was proposed to provide self-protection against security attacks, resistance to sudden failures and self-optimizing characteristics. It also confirmed better performance in execution time, contention of resources, SLA violation and response time.

Extract of the literature
The review of the existing works in the literature confirmed the following limitations.
i. The majority of the existing approaches maximized the degree of data locality, but they are not able to reduce the job completion time through the higher data locality rate. ii. Most of the reviewed approaches that use PSO, GA, ACO and ABC have the limitation of getting trapped in a local optimum during the process of scheduling. iii. The existing methodologies, despite their own merits, also possess limitations in balancing the tradeoff between local search and global search during the process of resource management and scheduling.

Problem formulation and the mathematical model considered for resource scheduling in the proposed HGDSMO algorithm
Scheduling is the process of assigning the starting and completion times for the set of operations to be performed. Similar to other scheduling issues, resource scheduling is a method applied for effective distribution of potential resources that generally includes networks, processors, virtual machines and storages. The resource scheduling is responsible for satisfying the demands of the users under the interaction with the providers. It is highly suitable for load balancing for the purpose of facilitating the uniform distribution of resources based on the demand of the users and to provide some priority through the formulated set of rules. It also needs to ensure a heterogeneous environment, which is significant in serving the requests of the users with some specific quality of service. The problem of resource scheduling is better portrayed with the aid of Eq. (1).
where 'm' is the number of tasks, CT(i) = (CT_1, CT_2, ..., CT_m), that are assigned to the n available physical resources in the data centers of Hadoop, AR(i) = (AR_1, AR_2, ..., AR_n). Further, the available virtual resources in the Hadoop environment are considered to range between S(i) and N(i), respectively. In addition, the fitness value of each objective has the probability to be improved for the users. The Hadoop environment considered in the deployment of the proposed HGDSMO algorithm consists of different data centers. Each data center in turn consists of inter-correlated VMs with diversified specifications. If there is a collection of task CT Thus, the core objective of the proposed HGDSMO algorithm is to integrate the GD approach into the local searching process of the optimized SMO algorithm for effectively mapping the tasks onto VMs with reduced Expected Time of Completion (ETC). A minimized ETC is essential for attaining reduced makespan, stable imbalance degree and enhanced throughput. The proposed HGDSMO algorithm considers the throughput, degree of imbalance and makespan as the objective function for enabling an optimal process of resource scheduling in Hadoop. Hence, the fitness value of the proposed HGDSMO algorithm is computed based on throughput, degree of imbalance and makespan as represented in Eq. (4, 5, 6), respectively.

Makespan
where CT_Task(i) represents the time of completion associated with a specific task.
In this context, a stabilized and smooth imbalance degree with reduced makespan and increased throughput depicts the vitality of the proposed HGDSMO algorithm. Hence, this proposed scheme focuses on reducing the time to completion of the tasks on VMs for achieving reduced makespan, stable imbalance degree and enhanced throughput.
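The three quantities that make up the fitness value can be illustrated with a short sketch. Since the equations themselves are not reproduced in this text, the sketch assumes the definitions commonly used in the scheduling literature: makespan as the largest completion time, throughput as the number of tasks completed per unit of makespan, and imbalance degree as (T_max - T_min) / T_avg over the per-VM completion times. The function names, the ETC matrix layout and the assumption that tasks on one VM run sequentially are illustrative choices, not taken from the paper.

```python
from statistics import mean


def completion_times(assignment, etc):
    """Per-VM completion time for a task-to-VM mapping.

    assignment[i] is the VM index chosen for task i, and etc[i][j] is the
    expected time of completion of task i on VM j (assumed layout).
    Tasks mapped to the same VM are assumed to run one after another.
    """
    loads = [0.0] * len(etc[0])
    for task, vm in enumerate(assignment):
        loads[vm] += etc[task][vm]
    return loads


def makespan(loads):
    # Time at which the most heavily loaded VM finishes.
    return max(loads)


def throughput(n_tasks, loads):
    # Tasks completed per unit of time over the whole schedule.
    return n_tasks / makespan(loads)


def imbalance_degree(loads):
    # Spread of the VM completion times relative to their average.
    return (max(loads) - min(loads)) / mean(loads)


if __name__ == "__main__":
    etc = [[2.0, 4.0], [3.0, 6.0], [4.0, 2.0]]  # 3 tasks, 2 VMs
    loads = completion_times([0, 0, 1], etc)
    print(loads, makespan(loads), throughput(3, loads), imbalance_degree(loads))
```

A lower makespan and imbalance degree together with a higher throughput would correspond to a better schedule under these assumed definitions.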

Methodology
In this section, the description of local and global search optimization algorithms, the primitive structure of Gradient Descent (GD) method, standard Spider Monkey Optimization (SMO) algorithm and the proposed HGDSMO algorithm is presented.

Comparison of local search and global search
A search method that always reaches the same local optimal solution from the same starting point is generally considered a local search method. In contrast, the performance of a global search method needs to be less dependent on its initial position. A local search method focuses on the nearby local optima, whereas a global search technique needs to be capable of identifying optima at any point in the search space. However, both local search and global search concentrate on estimating a solution that optimizes the cost criterion. In specific, local search methods start exploring the state space at an arbitrary point and iteratively attempt to estimate a better solution, evaluated in terms of the cost function. The arbitrary starting point used by local search methods can be chosen through a large number of techniques that rely heavily on the problem domain and the local search strategy.
Traditionally, local search techniques are faster than global search methods, and they are capable of producing quite good solutions when the initialization step is adequate for the problem domain concerned. Moreover, both local and global search algorithms are iterative in nature and always focus on identifying the best estimated solution possible at each iteration. These algorithms provide complete freedom in selecting the termination criteria. Local search algorithms provide local optimal points which may incur a much higher cost than the global optimal points, and which also depend on the initial solution from which the exploration is initiated. Ideally, a global search method focuses on estimating the best global solution, which is attained mostly at the expense of a long searching time. However, in practice its execution terminates when the termination criterion is met. Some specific instances of global search include GA, SA, PSO, ACO, ABC, etc. Local search algorithms do not search the complete space, but try to move from a current solution to a neighboring updated solution. This transformation of the current solution by the local search algorithm depends on the initial solution and the initial search space. An instance of a local search method is the hill climbing algorithm, which starts with a random solution and iteratively attempts to identify the optimal solution by incrementally changing a single element of the solution. If changing a single element of a solution yields a better solution, then the incremental change is applied to that solution for constructing a newer solution. This process of incremental transformation is repeated until the solution can no longer be improved. Thus, there exist NP problems in which identifying one definite and one optimal solution is not feasible.
From the viewpoint of classification, these NP problems differ in the methods that are generally utilized for locating a solution. At this juncture, any method that explores the vicinity of an initial starting point and has the possibility of getting trapped in a local optimal point is considered a local search method. One good example is the Gradient Descent method. A global method treats the complete feature space as a single unit in the process of estimating the best optimal solution. A suitable example of this category would be exhaustive search.
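The hill climbing procedure described above can be sketched as follows. This is a minimal illustration, assuming a solution is a list of integers (for instance, a VM index per task) and that lower cost is better; the function name hill_climb and its parameters are hypothetical and not part of the original work.

```python
def hill_climb(cost, start, n_values, max_sweeps=1000):
    """Local search that changes one element of the solution at a time.

    cost      -- function mapping a solution (list of ints) to a number
    start     -- initial solution; each element lies in range(n_values)
    n_values  -- number of admissible values per element
    Returns the final solution and its cost. The result is only a local
    optimum and depends on the starting solution.
    """
    current = list(start)
    best = cost(current)
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(current)):
            for value in range(n_values):
                if value == current[i]:
                    continue
                candidate = current[:i] + [value] + current[i + 1:]
                candidate_cost = cost(candidate)
                if candidate_cost < best:
                    # Keep the single-element change only if it helps.
                    current, best, improved = candidate, candidate_cost, True
        if not improved:
            break  # no single-element change improves the solution
    return current, best
```

Because the search only ever accepts improving single-element changes, two different starting solutions may end in different local optima, which is the dependence on initialization discussed above.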

The local search method of Gradient Descent
The local search method of Gradient Descent (GD) is a first-order iterative optimization approach. GD is primarily utilized for identifying a local minimum of a function by moving in steps proportional to the negative of the gradient of the function at the current point of search. In contrast, if one moves in steps proportional to the positive of the gradient in order to attain a local maximum, the method is termed the Steepest Ascent approach. GD rests on the observation that if a multi-variable function G(x) is defined and differentiable in the neighborhood of a point r, then G decreases most rapidly from r in the direction of the negative gradient of G at r. It states that the next position s to the current position r is defined by Eq. (7).

s = r − δ∇G(r)    (7)

where ∇G(r) is the gradient (the direction of steepest ascent) and δ is the weight factor. In this context, the condition G(r) > G(s) is satisfied when the weight factor δ is sufficiently small. In other words, δ∇G(r) is subtracted from r, since the search process wants to move against the gradient towards the local minimum. Keeping this in mind, a sequence a_0, a_1, a_2, ... is generated from a guessed arbitrary point a_0 towards the local minimum of the function G based on Eq. (8, 9) respectively.
The sequence is such that G(a_0) ≥ G(a_1) ≥ G(a_2) ≥ ... is satisfied. Thus, the sequence starting from a_0 hopefully converges towards the expected local optimum. It is interesting to note that the step size S_C is permitted to change dynamically with each iteration. With specific assumptions on the function G and the step size S_C, the convergence of GD towards the local minimum can be guaranteed.
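The iteration described above can be sketched in a few lines. The sketch assumes the standard form a_(n+1) = a_n − S_C ∇G(a_n) for the sequence, uses a fixed step size for simplicity (the text allows S_C to change per iteration), and stops when the update becomes smaller than a tolerance. The function name gradient_descent, the tolerance and the iteration cap are illustrative assumptions.

```python
def gradient_descent(grad, a0, step=0.1, tol=1e-8, max_iters=10_000):
    """Iterate a_(n+1) = a_n - step * grad(a_n) from the start point a0.

    grad -- function returning the gradient of G at a point (list of floats)
    a0   -- arbitrary starting point (list of floats)
    step -- fixed step size, playing the role of S_C in the text
    Returns the last iterate, which approximates a local minimum of G
    when the step size is small enough.
    """
    a = list(a0)
    for _ in range(max_iters):
        g = grad(a)
        nxt = [ai - step * gi for ai, gi in zip(a, g)]
        # Stop once the move is negligible in every coordinate.
        if max(abs(x - y) for x, y in zip(nxt, a)) < tol:
            return nxt
        a = nxt
    return a


if __name__ == "__main__":
    # G(x, y) = (x - 3)^2 + (y + 1)^2 has its minimum at (3, -1).
    grad_g = lambda a: [2 * (a[0] - 3), 2 * (a[1] + 1)]
    print(gradient_descent(grad_g, [0.0, 0.0]))
```

For this convex example every start point leads to the same minimum; on a multi-modal function the result depends on a0, which is why GD is classified as a local search method here.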

The Spider Monkey Optimization (SMO) algorithm
The Spider Monkey Optimization (SMO) algorithm is one of the recently proposed nature-inspired stochastic optimization methods. SMO in particular is regarded as a superior method in the category of swarm intelligence approaches. Spider monkeys are a monkey species that belongs to the category of animals with a fission-fusion social structure. These spider monkeys always live in groups and exhibit intelligent foraging behavior during the food searching process. They search for food in diversified directions through an appropriate information sharing process with the other members of the group. The SMO algorithm was proposed based on the inspiration derived from the intelligent food foraging strategy followed by spider monkeys. The search space of the optimization problem (for the resource scheduling problem, the complete set of data centers, hosts and virtual machines (VMs)) is considered as the food searching area of the spider monkeys. Each solution (the subset of hosts or VMs that can be allocated to the tasks) is considered as a spider monkey's position in the food searching area. The complete collection of solutions (subsets of hosts or VMs) that can facilitate resource scheduling is termed the swarm. The fitness of a solution refers to the degree to which the spider monkey is near to the food source; here, the fitness value depicts the availability of hosts or VMs derived using throughput, degree of imbalance and makespan at a specific point of time, depending on the number of tasks entering the Hadoop processing environment.
This SMO algorithm includes a set of information rules that aid in sharing information and continuous learning (sharing information and continuously learning about the under-utilization and over-utilization of VMs) for updating positions in the complete search space. The SMO algorithm updates its current position (the current VM allocated with tasks) with a superior one (a new VM to be allocated) based on the experience probability of VM thresholds in processing tasks in a heterogeneous Hadoop environment. It uses four user-defined factors: the maximum number of groups, the perturbation rate, the local leader limit and the global leader limit. Similar to other metaheuristic approaches, SMO is initiated with random initial positions of the spider monkeys (the locations of hosts or VMs) generated in a uniform manner. The positions of the spider monkeys keep on updating (the allocations of tasks to the hosts or VMs are updated) in each iteration. The complete set of hosts or VMs is partitioned into smaller groups when no improvement in the global leader is identified (global threshold limit). The potential VMs or hosts pertaining to each of the partitioned groups constitute the local leader (local threshold limit). However, during initialization there is only one group, with the same local and global leader (the local threshold and global threshold limits are the same). Moreover, six iterative steps, namely the local leader phase, the global leader phase, the learning phase of the local leader, the learning phase of the global leader, the deciding phase of the local leader and the deciding phase of the global leader, are utilized in addition to the initialization phase for improving the allocation of hosts or VMs to the incoming tasks of the Hadoop environment. Each of the aforementioned phases has its own objective that contributes to the execution of the complete SMO algorithm while focusing on the task of resource scheduling.
The local and global leader phases of SMO are responsible for generating a new trial position for each spider monkey (position of hosts or VMs). If the newly generated position is superior to the existing position, the spider monkey replaces the old position with the new position and informs all the members of the group (the allocation probability of the newly identified VM is better than the allocation probability of the currently used VM). The learning phases of the local leader and the global leader identify the potential leaders that can control the entire group as well as the divided local groups (identifying the VMs with high allocation probability from the complete set of VMs and determining the VMs with high allocation probability after partitioning them into subsets). In addition, the deciding phase of the local leader and the deciding phase of the global leader are included for verifying and resolving the issues of premature convergence and stagnation that are quite common in the local and the whole search space.
A short description about the various phases of SMO is explained as follows.

Local leader phase
In the local leader phase, the current position that is to be explored in the search space is updated based on Eq. (10), where r(0, 1) is a random value generated for each dimension of investigation in the search space and the rate of perturbation is taken into account. In this context, if the fitness value of the spider monkey in the newly generated position is greater than the fitness value of the spider monkey in the currently existing position, then the newly generated position is utilized. Otherwise, the current position of the spider monkey is retained.
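The update rule of this phase is not reproduced in the text, so the following sketch assumes the position update of the standard SMO formulation: for each dimension j where a random draw is at least the perturbation rate pr, new_j = x_j + U(0, 1) (LL_j − x_j) + U(−1, 1) (x_rj − x_j), with LL the local leader and x_r a randomly chosen other member of the group. The function name, the real-valued position encoding and the higher-is-better fitness convention are assumptions made for illustration.

```python
import random


def local_leader_phase(group, local_leader, fitness, pr, rng):
    """One pass of the (assumed) standard SMO local leader phase.

    group        -- list of positions (lists of floats), at least 2 members
    local_leader -- position of the group's local leader
    fitness      -- function scoring a position; higher is better
    pr           -- perturbation rate in [0, 1]
    rng          -- random.Random instance, passed in for reproducibility
    Returns the new list of positions after greedy selection.
    """
    updated = []
    for i, pos in enumerate(group):
        # Random partner from the same group, different from member i.
        partner = group[rng.choice([k for k in range(len(group)) if k != i])]
        trial = list(pos)
        for j in range(len(pos)):
            if rng.random() >= pr:
                trial[j] = (pos[j]
                            + rng.uniform(0, 1) * (local_leader[j] - pos[j])
                            + rng.uniform(-1, 1) * (partner[j] - pos[j]))
        # Keep the trial position only if it is strictly fitter.
        updated.append(trial if fitness(trial) > fitness(pos) else list(pos))
    return updated
```

The greedy comparison at the end mirrors the rule stated above: the new position is adopted only when its fitness exceeds that of the existing one.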

Global leader phase
Similar to the local leader phase, the global leader phase is also responsible for updating the current search space, but the update is achieved in a different way. This phase updates a search solution in only one randomly chosen dimension. The chance that a spider monkey (host or VM) is updated in the randomly chosen single dimension depends on the probability value derived based on Eq. (11).
At this juncture, the trial position is again generated based on Eq. (12).
Then, the fitness value of the existing hosts or VMs and the newly generated position of hosts or VMs are compared in order to identify the best one from the search space for the potential adoption process.
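The global leader phase can be sketched in the same way. The forms used below for the selection probability and the single-dimension update are those of standard SMO; that Eq. (11) and Eq. (12) take these forms is an assumption, and all function names are hypothetical. Fitness values are assumed to be strictly positive.

```python
import random

def selection_probabilities(fitnesses):
    """Selection probability per monkey (standard SMO form, assumed for Eq. 11)."""
    best = max(fitnesses)
    return [0.9 * (f / best) + 0.1 for f in fitnesses]

def global_leader_phase(population, global_leader, fitness, rng):
    """One global-leader-phase pass; each update touches a single dimension."""
    fits = [fitness(p) for p in population]
    probs = selection_probabilities(fits)
    updated = []
    for i, monkey in enumerate(population):
        candidate = monkey
        if rng.random() < probs[i]:
            j = rng.randrange(len(monkey))  # the one dimension to update
            others = [k for k in range(len(population)) if k != i]
            partner = population[rng.choice(others)]
            trial = list(monkey)
            # Standard SMO update (assumed for Eq. 12).
            trial[j] = (monkey[j]
                        + rng.random() * (global_leader[j] - monkey[j])
                        + rng.uniform(-1.0, 1.0) * (partner[j] - monkey[j]))
            # Keep the better of the existing and the trial position.
            if fitness(trial) > fits[i]:
                candidate = trial
        updated.append(candidate)
    return updated
```

Under this form, fitter monkeys are more likely to be selected for an update, while every monkey keeps a selection probability of at least 0.1.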

Learning phase of global leader
In this phase, the spider monkey (host or VM) whose position has the best fitness value among all monkeys (hosts or VMs) in the population is selected as the global leader. Further, if the position of the selected global leader is not updated, the global limit counter is incremented by 1. This counter keeps track of the number of iterations in which the global leader has not been updated.

Learning phase of local leader
In this phase, a local leader is selected for each individual group by greedy selection, in the same way as the global leader. Further, if the position of the selected local leader is not updated, the local limit counter is incremented by 1. This counter tracks the number of iterations in which the local leader has not been updated.

Deciding phase of local leader
In this phase, the local limit counter is utilized for checking whether there is any possibility of premature convergence or stagnation in the group, such that re-initialization of the local leader for possible updating of the search space could be initiated.

Deciding phase of global leader
This phase uses the global limit counter for checking whether there is any possibility of premature convergence or stagnation in the entire population, such that the partition of population into groups could be initiated.
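The bookkeeping shared by the learning and deciding phases can be sketched as follows. Resetting the counter to zero when a leader is replaced follows standard SMO and is an assumption here, as are the names `learn_leader`, `needs_decision` and `limit`.

```python
def learn_leader(leader, members, fitness, counter):
    """Greedy leader selection with a stagnation counter.

    Returns the (possibly new) leader and the updated counter.
    """
    best = max(members, key=fitness)
    if fitness(best) > fitness(leader):
        return best, 0           # leader replaced; counter reset (assumed)
    return leader, counter + 1   # leader not updated; count the iteration

def needs_decision(counter, limit):
    """True when the leader has stagnated for more than `limit` iterations."""
    return counter > limit
```

The same two helpers serve both levels: applied to one group with the local limit they trigger re-initialization of that group, and applied to the whole population with the global limit they trigger its partition into groups.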

Significance of SMO algorithm
The SMO algorithm essentially relies on two search controls, the local limit counter and the global limit counter, in the local and global search processes, respectively. The local search of SMO is applied intensively for approximately one fifth of the search time, and the global search process for the remaining four fifths. This indicates that about 80% of the search is devoted to exploration on the global scale, compared to about 20% for exploitation on the local scale. The SMO algorithm is regarded as a significant method because of its potential to reduce the convergence time and the number of iterations. By conducting the local search and the global search simultaneously, it eliminates the issue of decreased output data when the number of input data and the number of computations increase. These reasons motivate researchers to focus on this optimization process for deriving optimal results.

HGDSMO algorithm
The existing meta-heuristic approaches proposed in the literature for optimizing resource scheduling in heterogeneous Hadoop have shortcomings that reduce the speed and accuracy of task processing. Each of these approaches is effective for resource scheduling in some scenarios, but not in all, because of its limited adaptability to the behavioral change introduced by heavy incoming task loads. The unbalanced handling of incoming load by the existing meta-heuristic approaches in the heterogeneous Hadoop environment indirectly degrades the objective function and performance. A considerable number of optimization algorithms have recently been proposed for effective resource scheduling in heterogeneous Hadoop, but no single approach resolves the complete set of issues that commonly arise when optimizing the factors that influence resource scheduling. Hence, the HGDSMO algorithm is proposed to overcome these limitations in the heterogeneous Hadoop environment. The proposed HGDSMO algorithm uses GD to achieve rapid optimization and the foraging behavior of the SMO algorithm to maintain the global optimum.
The proposed HGDSMO scheme organizes the local and global search and controls the search process through a switching factor $S_\alpha$. The local search is performed with the GD methodology based on Eq. (7), (8) or (9) for evaluating the fitness value, whereas the global search is performed using Levy flights based on Eq. (13). A Levy flight is a random walk whose step lengths are drawn from a heavy-tailed probability distribution. Moreover, the step direction of the Levy flight is random and isotropic.
$$S_i^{j+1} = S_i^{j} + \alpha \oplus \mathrm{Levy}(\delta) \qquad (13)$$
where $S_i^{j+1}$ and $S_i^{j}$ represent the newly generated solution and the current solution in the search space, with $\alpha \oplus \mathrm{Levy}(\delta)$ as the transition probability.
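A heavy-tailed step of this kind can be drawn, for example, with Mantegna's algorithm. The sketch below is one common way to do this and is not claimed to be the sampler used in the paper; the product with α is applied entry-wise, and the names `levy_step` and `levy_move` as well as the value 1.5 for δ are assumptions.

```python
import math
import random

def levy_step(delta, rng):
    """Draw one heavy-tailed step length (Mantegna's algorithm), 0 < delta < 2."""
    num = math.gamma(1 + delta) * math.sin(math.pi * delta / 2)
    den = math.gamma((1 + delta) / 2) * delta * 2 ** ((delta - 1) / 2)
    sigma_u = (num / den) ** (1 / delta)
    u = rng.gauss(0.0, sigma_u)   # numerator sample
    v = rng.gauss(0.0, 1.0)       # denominator sample
    return u / abs(v) ** (1 / delta)

def levy_move(position, alpha, delta, rng):
    """Move every dimension of the current solution by a scaled Levy step."""
    return [x + alpha * levy_step(delta, rng) for x in position]

# Usage: perturb a two-dimensional solution with step scale alpha = 0.01.
rng = random.Random(1)
new_position = levy_move([0.0, 1.0], alpha=0.01, delta=1.5, rng=rng)
```

Most steps drawn this way are small, but occasional very long jumps occur, which is what lets the global search escape local optima.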
The pseudocode of the proposed HGDSMO scheme is presented in Fig. 2, and its flowchart is shown in Fig. 3.
The first step of the proposed HGDSMO scheme is the initialization phase, in which the hosts or VMs are randomly initialized. Fitness quantifies the degree of availability of the hosts or VMs for task processing. If the availability of the hosts or VMs changes with the number of tasks to be processed, their degree of availability is re-evaluated.
Further, the hosts or VMs are categorized into groups based on their degree of availability, which changes reactively with the rate of incoming tasks to the heterogeneous Hadoop environment. The groups are re-initialized based on the updated fitness values until the maximum number of iterations is reached or the termination condition is met.
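The interplay of the GD local search and the Levy-flight global search can be illustrated on a toy problem. The sketch below is not the authors' implementation: treating the switching factor as the probability of taking a GD step, the numerical gradient, the cost function and every name and default value are assumptions made only for illustration.

```python
import math
import random

def numeric_gradient(cost, x, h=1e-6):
    """Central-difference gradient of cost at x."""
    grad = []
    for j in range(len(x)):
        up = x[:j] + [x[j] + h] + x[j + 1:]
        down = x[:j] + [x[j] - h] + x[j + 1:]
        grad.append((cost(up) - cost(down)) / (2 * h))
    return grad

def levy_step(delta, rng):
    """Heavy-tailed step length via Mantegna's algorithm."""
    num = math.gamma(1 + delta) * math.sin(math.pi * delta / 2)
    den = math.gamma((1 + delta) / 2) * delta * 2 ** ((delta - 1) / 2)
    sigma_u = (num / den) ** (1 / delta)
    return rng.gauss(0.0, sigma_u) / abs(rng.gauss(0.0, 1.0)) ** (1 / delta)

def hybrid_search(cost, population, s_alpha, iters, rng,
                  lr=0.1, alpha=0.1, delta=1.5):
    """Minimize cost by switching between GD and Levy-flight moves."""
    population = [list(p) for p in population]
    for _ in range(iters):
        for i, x in enumerate(population):
            if rng.random() < s_alpha:
                # Local search: one gradient descent step.
                grad = numeric_gradient(cost, x)
                trial = [xj - lr * gj for xj, gj in zip(x, grad)]
            else:
                # Global search: one Levy-flight move.
                trial = [xj + alpha * levy_step(delta, rng) for xj in x]
            # Greedy acceptance keeps only improving moves.
            if cost(trial) < cost(x):
                population[i] = trial
    return min(population, key=cost)

# Usage on a toy quadratic cost with five random starting points.
rng = random.Random(42)
sphere = lambda p: sum(v * v for v in p)
start = [[rng.uniform(-5, 5) for _ in range(2)] for _ in range(5)]
best = hybrid_search(sphere, start, s_alpha=0.2, iters=50, rng=rng)
```

With `s_alpha=0.2` roughly one move in five is a local GD step, mirroring the one-fifth local, four-fifths global split described for SMO. In a scheduling setting the toy cost would be replaced by the availability-based fitness of the hosts or VMs.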

Simulation results and discussion
The simulation experiments of the proposed HGDSMO scheme are conducted using a Hadoop simulator. The simulation parameters used for the proposed HGDSMO and the benchmarked HGDCSO, IHS-CSO and MGWO schemes are listed in Table 1. The Hadoop task scheduler used for processing Big Data accounts for the data transmission overhead by exploiting the principle of data locality. The task completion time is influenced by the speed of the storage devices, namely the Hard Disk Drives (HDDs) and Solid State Drives (SSDs) on which the data is stored in heterogeneous clusters. The poor utilization of fast devices caused by ignoring the differing speeds of storage devices is also an important issue that needs to be addressed when classifying storage categories and choosing a scheduling strategy.
The key idea of a scheduler in a heterogeneous Hadoop environment is to prioritize different classes so as to minimize the execution time. The scheduling process generally improves the locality rate, which reduces the processing time of map tasks by reducing network traffic, since fewer map tasks fetch their data remotely. Managing Hadoop clusters with multiple Map Reduce tasks to be computed over multiple nodes also requires efficiency to attain good utilization and performance. Resource scheduling is additionally influenced by specific issues related to energy and synchronization. Hadoop requires large amounts of energy to process the data held within the data center, which makes energy a complex issue to resolve. In addition, the total energy cost of the data center is tied to its energy consumption. Synchronization, which transfers the intermediate outputs of the map phase to the input of the reduce phase, is also essential.
The performance metrics used for investigating the proposed HGDSMO scheme and the benchmarked HGDCSO, IHS-CSO and MGWO schemes are throughput, degree of imbalance and makespan, as defined in [25]. The performance of the HGDSMO scheme is evaluated along the following dimensions: (i) mean response time and execution time under a varying number of tasks and executable instruction length, (ii) the migrated task count with different numbers of VMs under a constant number of tasks, (iii) the migrated task count with different numbers of tasks under a constant number of VMs and (iv) throughput, degree of imbalance and makespan under different numbers of tasks. In the first fold of investigation, the proposed HGDSMO scheme is evaluated using the mean response time under a varying number of tasks and executable instruction length handled during processing. Figure 4 [...] proposed HGDSMO scheme analyzed with an executable instruction length of 7000 also confirmed an increase from 9.15 s to 48.94 s with an increase in the number of tasks. In addition, the proposed HGDSMO scheme with an executable instruction length of 10,000 confirmed a mean response time increasing from 10.26 s to 79.12 s under an increasing number of tasks. This improvement in mean response time facilitated by the proposed HGDSMO scheme is mainly due to the appropriate estimation of the under-utilization and over-utilization factors used for allocating the VMs during resource scheduling. Figure 5 portrays the mean response time of the proposed HGDSMO scheme estimated under different executable instruction lengths ranging [...] instruction length. This improvement in the execution time enabled by the proposed HGDSMO scheme is mainly due to the switching factor and Levy flights, which alternate between the local and global optimization processes introduced by the SMO algorithm in resource scheduling.

[Fig. 2. Pseudocode of the proposed HGDSMO scheme (fragment): the VMs are initialized; n_best is set as global_best; a monkey (i) is chosen randomly based on Eq. (7), (8) or (9).]
In the second dimension of investigation, the proposed HGDSMO scheme and the benchmarked HGDCSO, IHS-CSO and MGWO schemes are evaluated based on the migrated task count with different numbers of VMs under a constant number of tasks. Figures 8 and 9 show the number of migrated tasks of the proposed HGDSMO scheme under different numbers of VMs, with the task count set to 500 and 1000, respectively. The number of migrated tasks under the proposed HGDSMO scheme is considerably reduced owing to the local and global search processes inherent in the GD and SMO approaches. The scheme also moves towards an effective allocation of tasks on the VMs independent of the tasks entering Hadoop. With different numbers of VMs and the task count set to 500, the number of migrated tasks under the proposed HGDSMO scheme is reduced by 6.52%, 7.18% and 8.65% compared to the baseline HGDCSO, IHS-CSO and MGWO schemes. With different numbers of VMs and the task count set to 1000, the number of migrated tasks is reduced by 6.12%, 7.03% and 8.21% compared to the baseline HGDCSO, IHS-CSO and MGWO schemes.
In the third dimension of investigation, the proposed HGDSMO scheme and the benchmarked HGDCSO, IHS-CSO and MGWO schemes are evaluated based on the migrated task count with different numbers of tasks under a constant number of VMs. Figures 10 and 11 show the number of migrated tasks of the proposed HGDSMO scheme under different numbers of tasks, with the number of VMs set to 5 and 10, respectively. The number of migrated tasks under the proposed HGDSMO scheme is minimized owing to the Levy flights included in the global search capability of SMO, since the objective functions are exploited locally and explored globally independent of the number of tasks to be allocated to a specific number of VMs. With different numbers of tasks and the number of VMs set to 5, the number of migrated tasks under the proposed HGDSMO scheme is reduced by 5.76%, 6.94% and 7.62% compared to the baseline HGDCSO, IHS-CSO and MGWO schemes. With different numbers of tasks and the number of VMs set to 10, the number of migrated tasks is reduced by 5.68%, 7.16% and 8.92% compared to the baseline HGDCSO, IHS-CSO and MGWO schemes.
In addition, Tables 2, 3 and 4 present the makespan, degree of imbalance and throughput of the proposed HGDSMO scheme under different numbers of tasks in the heterogeneous Hadoop environment. The results indicate that the makespan of the HGDSMO resource allocation algorithm is better by 4.82%, 5.78% and 6.94% than that of the compared HGDCSO, IHS-CSO and MGWO schemes. Further, the degree of imbalance of the HGDSMO resource allocation algorithm is better by 4.98%, 5.89% and 7.14% than that of the compared HGDCSO, IHS-CSO and MGWO schemes.
In addition, the throughput of the HGDSMO resource allocation algorithm is better by 5.16%, 6.84% and 7.78% than that of the compared HGDCSO, IHS-CSO and MGWO schemes.
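For reference, the three metrics can be computed from per-VM completion times as sketched below. The formulas are the definitions commonly used in the scheduling literature; that they match the definitions in [25] is an assumption, and the function name and inputs are hypothetical.

```python
def schedule_metrics(vm_finish_times, tasks_completed):
    """Return (makespan, degree of imbalance, throughput) for one schedule.

    vm_finish_times : completion time of each VM's assigned work, in seconds
    tasks_completed : total number of tasks finished by all VMs
    """
    makespan = max(vm_finish_times)                        # slowest VM
    mean_time = sum(vm_finish_times) / len(vm_finish_times)
    imbalance = (makespan - min(vm_finish_times)) / mean_time
    throughput = tasks_completed / makespan                # tasks per second
    return makespan, imbalance, throughput

# Usage: three VMs finishing at 10 s, 20 s and 30 s after 60 tasks in total.
makespan, imbalance, throughput = schedule_metrics([10.0, 20.0, 30.0], 60)
```

A lower makespan and a lower degree of imbalance are better, while a higher throughput is better.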

Conclusions
The proposed HGDSMO resource allocation algorithm is presented as a reliable attempt at resource allocation with reduced energy consumption and execution time in the heterogeneous Hadoop environment. It uses makespan, degree of imbalance and throughput in the objective function that quantifies the availability of the hosts or VMs to the tasks in the heterogeneous Hadoop environment. It facilitates a suitable balance between exploitation and exploration by combining GD and SMO with Levy flights, which ensure switching and enforce maximum global optimization between the processes. The simulation results and statistical investigation confirmed that the proposed HGDSMO algorithm outperforms the baseline meta-heuristic resource scheduling techniques proposed for the heterogeneous Hadoop environment. The number of migrated tasks of the proposed HGDSMO algorithm is reduced by 5.32%, 6.78% and 7.98% compared to the benchmarked HGDCSO, HCHS and MGWO algorithms with different numbers of tasks under a constant number of initialized VMs. The number of migrated tasks of the proposed HGDSMO algorithm is reduced by 4.87%, 5.98% and 6.74% compared to the benchmarked HGDCSO, HCHS and MGWO algorithms with different numbers of VMs under a constant number of initialized tasks. It is also planned to formulate a Hybrid Gradient Descent Emperor Penguin Optimization algorithm for investigating