Predictive modelling of MapReduce job performance in cloud environments using machine learning techniques

Within the Hadoop ecosystem, MapReduce stands as a cornerstone for managing, processing, and mining large-scale datasets. Yet, the absence of efficient solutions for precise estimation of job execution times poses a persistent challenge, impacting task allocation and distribution within Hadoop clusters. In this study, we present a comprehensive machine learning approach for predicting the execution time of MapReduce jobs, encompassing data collection, preprocessing, feature engineering, and model evaluation. Leveraging a rich dataset derived from comprehensive Hadoop MapReduce job traces, we explore the intricate relationship between cluster parameters and job performance. Through a comparative analysis of machine learning models, including linear regression, decision tree, random forest, and gradient-boosted regression trees, we identify the random forest model as the most effective, demonstrating superior predictive accuracy and robustness. Our findings underscore the critical role of features such as data size and resource allocation in determining job performance. With this work, we aim to enhance resource management efficiency and enable more effective utilisation of cloud-based Hadoop clusters for large-scale data processing tasks.


Introduction
The significance of Big Data has prominently risen across diverse sectors, influencing both business and research domains. Numerous enterprises and organisations engage in the continuous collection of extensive datasets from diverse origins, including the web, sensor networks, social platforms, and various instruments. For instance, major internet companies like Google and Facebook systematically accumulate terabytes of data daily within their data warehouses, leveraging these datasets for purposes such as enhancing website design, detecting fraud, and optimising advertising strategies [1,2].
Meeting the increasing demand to manage large and complex datasets has led to the creation of a diverse array of systems for data analysis. Among these, Hadoop [3] stands out as a widely adopted solution. It is an open-source implementation of the MapReduce framework [4], originally introduced by Google. This framework aims to provide efficient and cost-effective data processing across clusters of machines. Hadoop offers a functional programming interface that abstracts computations and reduces the effort required for programming, providing a powerful and simple solution for users. It automatically handles the parallelisation and distribution of computing tasks, enhancing system throughput and ensuring fault tolerance. Over time, the Hadoop ecosystem has evolved, comprising a software stack around Hadoop to tackle diverse challenges in Big Data. This ecosystem addresses issues such as unstructured data storage [5], stream processing [6], graph processing [7], query analytics [2], workflow management [8], and machine learning [9], among others.
*Correspondence: mohammed.bergui@usmba.ac.ma
While programming on Hadoop is relatively straightforward, efficiently managing resource allocation and job execution within a Hadoop cluster presents multiple challenges. Previous works [10-13] indicate that Hadoop involves over a hundred runtime parameters controlling job executions, including parameters like the number of mappers, the number of reducers and the container size. Properly configuring these parameters can significantly impact execution time, with reported speed-ups ranging from 2 to 8 times when given the same amount of resources [10]. However, optimising Hadoop parameters to minimise job execution time demands a solid understanding of both specific applications and the Hadoop framework. Execution time is impacted by multiple factors such as input data size, code/query complexity, and system environment. The challenge is further complicated by modern computing clusters, which are frequently heterogeneous and managed through virtualisation layers. Consequently, the resource pool is shared among users, especially in the cloud, and allocated in units of different types of resource instances, such as virtual machines. This paper aims to address a fundamental question in this context: How can we predict the runtime of a Hadoop MapReduce job within a specified Hadoop configuration and resource setup?
Time prediction is of utmost importance in managing systems and resources. Predicting the execution time of jobs provides valuable insights for various purposes, including Hadoop parameter tuning to minimise execution time, establishing the minimum resource needs for a job, devising scheduling algorithms that consider execution time for enhanced resource usage fairness and meeting job deadlines, and identifying anomalies in software or hardware when actual job time deviates from expectations. This predictive capability supports informed decisions across these critical aspects, contributing to the overall efficiency and reliability of the system.
Significant research has been done to predict Hadoop job execution time, with a comprehensive overview provided in Section "Related works". The predominant approach in these studies involves model-based prediction [11, 14-22], where the objective is to formulate a specific equation for time prediction based on varying levels of Hadoop knowledge. Certain methods [14-16] offer equations that explicitly model each phase of operations in Hadoop. Others [11, 17-22] rely on statistical counters of overall execution, such as input, output, and intermediate dataset sizes. Machine learning techniques, like linear regression, are then employed to determine suitable coefficient values for different application classes or job types. The literature reports promising outcomes, with over 90% prediction accuracy in some cases [16].
In this paper, we present an attempt to predict Hadoop 2.x job execution time in a shared resource cloud environment with limited network bandwidth using machine learning techniques. Through Google Cloud Platform [23], we deployed different configurations of a Hadoop cluster and ran benchmarks using complex Hive QL queries on structured, semi-structured, and unstructured data. This experiment resulted in an execution history of MapReduce jobs, which we parsed using a Python toolkit that we developed. The result of the experiment is a dataset of Hadoop job traces (more details are provided in our previous work [24]). We then used this dataset to predict the execution time of MapReduce jobs, experimenting with multiple machine learning models such as linear regression and tree-based models. The results of our machine learning models show over 95% prediction accuracy with limited data points.
The remainder of the paper is structured as follows. Section "Background" provides the necessary background on Hadoop. Section "Related works" presents, categorises and discusses existing approaches for predicting Hadoop MapReduce job execution time. Section "Methodology" describes our methodology, including data collection, data pre-processing, feature selection and engineering, and machine learning models. Section "Results analysis" presents the results and shows the effectiveness of our approach. Finally, Section "Conclusion and future work" presents the conclusion, limitations and future work directions.

Background
Apache Hadoop, an open-source implementation of MapReduce, facilitates the distributed storage and parallel processing of large datasets across clusters of nodes [3]. The core components include MapReduce, Hadoop Distributed File System (HDFS), and Yet Another Resource Negotiator (YARN). HDFS, a distributed file system, divides the input into blocks of equal size, replicated across DataNodes for fault tolerance [25]. YARN, the cluster manager, coordinates resource management and job scheduling, supporting various data processing engines like Hive and Spark on the same Hadoop cluster [26].
Each MapReduce job encompasses map and reduce phases, initiating the reduce phase by default when 80% of the map phase concludes [25]. Alternatively, the 'slowstart' parameter can be set in the configuration file to allow immediate reduce phase execution after the end of the map phase [27]. Figure 1 illustrates the sequential processing steps in both phases. When a file is stored in Hadoop, it is partitioned by default into fixed-size blocks of 128 MB. HDFS adopts a master-slave setup, with Slaves/Data Nodes (DN) housing data blocks and the Master/Name Node (NN) managing metadata. Each input split triggers the generation of a map task. Subsequently, the split is subdivided into records, and each record undergoes processing by the mapper. MapReduce orchestrates the processing of input splits through eight phases. The execution time of a job depends on these phases and on certain Hadoop configuration parameters that influence the speed of each phase.

Related works
In recent years, numerous studies have targeted the prediction of Hadoop execution times to optimise job executions and resource allocations. In this section, we categorise these studies based on their prediction method and summarise their methods and accuracy in Table 1.
The majority of existing approaches for predicting Hadoop job execution time are model-based, relying on historical job execution information to build prediction models.

Table 1 Existing approaches, by prediction method and accuracy
Study                  Approach   Method                                             Accuracy (%)
[...]                  White-box  Regression equations for MapReduce phases          ≈90
Khan et al. [17]       Black-box  Locally Weighted Linear Regression (LWLR)          ≈85
Song et al. [18]       Black-box  Locally Weighted Linear Regression (LWLR)          ≈90
Tariq et al. [19]      Black-box  Decision tree models                               ≈88
Zhang et al. [20]      Grey-box   Hybrid of analytical and linear regression models  ≈90
Sangroya et al. [21]   Grey-box   Regression-based analytical model                  ≈90
Kadirvel et al. [11]   Grey-box   Machine learning regression techniques             ≈85
Ceesay et al. [22]     Grey-box   Linear regression and SVM                          ≈88
Gandomi et al. [28]    Grey-box   Linear regression                                  ≈92

These model-based strategies can be categorised into white-box [14-16], black-box [17-19] and grey-box [11, 20-22, 28] modelling. White-box modelling involves the development of detailed mathematical equations that characterise job execution time based on an understanding of the internal workings of the computing framework. For instance, Verma et al. [14] utilise job profiling to predict MapReduce job completion times by applying the Makespan Theorem, which calculates the lower and upper performance bounds for map and reduce phases based on their duration and resource allocation. This method enables the prediction of job completion times by analysing the average and maximum task durations and the number of map and reduce slots allocated to each job. Both Gandhi et al. [16] and Zhang et al. [15] propose equations for individual phases of MapReduce execution, employing regression methodology to adjust coefficients in the execution time model for various workloads. White-box modelling proves effective in predicting Hadoop performance, given its strict adherence to the MapReduce programming paradigm, allowing the division of execution into distinct phases. However, considering all Hadoop configurations, including buffer size and sort percentage, makes it impractical to formulate an equation comprehensively capturing the performance impact of these parameters. This approach also tends to be inflexible when faced with changes in job characteristics or cluster configurations.
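To make the bounds-based idea concrete, the following is a minimal sketch of the makespan bounds in the form commonly stated for greedy task assignment: for n tasks on k slots with average duration avg and maximum duration max, the lower bound is n·avg/k and the upper bound is (n−1)·avg/k + max. The function name and the sample numbers are illustrative assumptions, and the exact formulation used in [14] should be checked against that work.

```python
def makespan_bounds(n_tasks, avg_duration, max_duration, n_slots):
    """Lower and upper bounds on the completion time of one phase.

    Assumes tasks are assigned greedily to the first free slot.
    """
    lower = n_tasks * avg_duration / n_slots
    upper = (n_tasks - 1) * avg_duration / n_slots + max_duration
    return lower, upper


# Hypothetical map phase: 10 tasks, 2 s on average, 5 s at most, 4 slots.
low, up = makespan_bounds(n_tasks=10, avg_duration=2.0, max_duration=5.0, n_slots=4)
print(low, up)
```

Applying the same function to the map and reduce phases separately and summing the results gives bounds for the whole job under these assumptions.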
Black-box methods are used for systems that are too complex to be modelled in detail and mathematically. Khan et al. [17] introduced a model that estimates MapReduce job execution time based on historical job execution records, using the Locally Weighted Linear Regression (LWLR) technique. LWLR is a non-parametric instance-based function assigning weights to instances according to their Euclidean distance from the query instance, with a high weight for close instances and a low weight for distant ones. On a similar note, Song et al. [18] proposed a framework for predicting Hadoop job performance. Their approach involves a Hadoop job analyser and a prediction module using the LWLR method. The job analyser samples input data, runs map and reduce functions, and collects relevant features, estimating complexities and data ratios. The LWLR model assigns weights to features, generates a training set based on weighted distances, and selects similar job types. The training set is then used for model training and predictions. Tariq et al. [19] proposed a machine learning model to estimate resource utilisation, such as CPU and memory, and job execution time on Hadoop 2.x based on cluster configuration and dataset structure. Decision tree models are trained for predicting the usage of CPU, disk, memory, network and job execution time. Four different datasets are used for training and validation to examine the relationship between data structure, data size, columns involved in the query and the estimated results. These black-box models offer flexibility and adaptability to various job and data characteristics but often lack the interpretability of white-box models.
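The weighting idea behind LWLR can be illustrated with a small one-dimensional sketch. It assumes a Gaussian kernel with bandwidth tau, which is a common choice but is not stated in the works cited above; the function name and sample data are likewise hypothetical.

```python
import math


def lwlr_predict(xs, ys, x0, tau=1.0):
    """Predict y at x0 with a weighted least-squares line fitted around x0."""
    # Gaussian kernel (assumed): close instances get high weight, distant ones low.
    w = [math.exp(-((x - x0) ** 2) / (2 * tau ** 2)) for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx
    return my + slope * (x0 - mx)


# Toy data: execution time grows linearly with input size.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
print(lwlr_predict(xs, ys, x0=2.5))
```

A separate local model is fitted for every query point, which is what makes the method instance-based and non-parametric.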
Grey-box modelling refers to methods in which a combination of internal parameters and system behaviour measurements is used for modelling. Zhang et al. [20] proposed a hybrid approach combining analytical and machine learning models to estimate MapReduce job execution time. Their method involves a job profiler that collects data for each phase of the MapReduce job, and linear regression models are applied to estimate the execution time of individual tasks. Micro-benchmarks on a smaller cluster generate training data. The map task completion time is estimated as the sum of map phase durations, and the reduce task completion time is estimated as the sum of shuffle and write phases. An analytical MapReduce performance model [14] is then used to estimate job execution time, leveraging knowledge about average and maximum task durations to compute bounds on completion time based on allocated resources (map and reduce slots). Sangroya et al. [21] introduced a regression-based analytical model to predict the execution time of Hive queries as data volume increases. The authors computed linear regression models for various sub-phases of MapReduce job execution and then built a consolidated model for predicting execution time. The proposed model considers crucial parameters, including record selectivity, the number of map tasks, and the ratio of output record size to input record size. It also incorporates the number of map waves as an additional sensitive parameter to enhance the prediction accuracy of MapReduce job execution time. Kadirvel et al. [11] developed a grey-box approach that utilises machine learning regression techniques to estimate the performance of Hadoop jobs. The proposed approach includes systematic feature selection and performance prediction under degraded modes of operation that result from component faults. Ceesay et al. [22] proposed parametric models for estimating the execution time of a given MapReduce job for a given cluster of nodes running YARN. The MapReduce code was edited to allow the collection of the execution time of six phases of a MapReduce job (read, collect, spill, merge, shuffle and write). Linear regression performed the best for most of the phases, while Support Vector Machine (SVM) performed the best for the shuffle and write phases. These models were then used to predict the completion time for a new MapReduce job. Gandomi et al. [28] proposed a model based on job profiling and log analysis to predict the execution time of MapReduce tasks. Linear regression is used to predict the execution time based on detailed logs of the MapReduce job execution phases. The model accounts for heterogeneity in the cluster by considering the speed of each node and using upper and lower bounds for task execution times. Grey-box models aim to balance the interpretability of white-box approaches with the flexibility of black-box methods, providing a robust framework for performance prediction.
Most of the proposed methods are based on a job profiler. It synthesises the properties of the job and the performance of the map and reduce functions. The job profiler captures properties inherent to the application, such as the selectivity of the map (reduce) function, i.e. the ratio between the output of the map (reduce) function and the input to the map (reduce) function. This parameter describes the amount of data produced as the output of the user-defined map (reduce) function, and therefore predicts the amount of data processed by the other generic phases. The problem with the selectivity parameter is that the output of the map function is unknown at the time the job is submitted, and this information is only available at the end of the reduce phase. We argue that the size of the map (reduce) output should either be predicted before predicting the job execution time or otherwise omitted in the model design. Furthermore, designing a model for each of the eight phases of the MapReduce pipeline will provide a more accurate prediction, but at the cost of a significant overhead when collecting metrics on each phase.

Methodology
This section outlines our approach to MapReduce job execution time prediction in a cloud-based cluster with constrained network bandwidth. We detail the process of generating the dataset, data pre-processing, feature engineering, and machine learning model building. This step-by-step methodology aims to unveil the complex relationship between cluster parameters and job performance in resource-limited cloud environments. The following subsections describe each stage of our methodology.

Data collection
The foundation of our MapReduce job execution time prediction methodology lies in the comprehensive Hadoop MapReduce job traces dataset, which was meticulously crafted in our previous work [24]. This dataset captures vital information about applications, jobs, counters, and tasks in a cloud environment with constrained network bandwidth.
The data collection process involved the deployment of multiple Hadoop cluster configurations on Google Cloud Platform, utilising the Dataproc service for easy cluster management [23]. Table 2 provides an overview of the VM types used in our experiments, detailing the number of vCPUs, memory, maximum egress bandwidth, and persistent storage for each type.
Additionally, we experimented with various cluster resource allocation parameters to understand their impact on MapReduce job performance. These parameters, including CPU cores, memory allocation, network bandwidth, and the number of map and reduce tasks, are summarised in Table 3. This comprehensive setup allowed us to capture a wide range of configurations and their corresponding job execution times, forming the basis for our predictive modelling efforts.
A big data benchmark, specifically the TPCx-BB Express Benchmark BB [29], was employed to generate synthetic data with diverse structures and characteristics. This benchmark executed 30 analytic queries, spanning structured, semi-structured, and unstructured data types, providing a rich and varied dataset for our analysis.
Our custom Python toolkit, built on the REST APIs provided by the Hadoop framework, played a crucial role in extracting MapReduce job traces and cluster configuration parameters. Utilising ResourceManager REST APIs and connecting through SSH to access JSON responses and XML configuration files, our toolkit collected comprehensive information about applications, jobs, counters, tasks, cluster metrics, and framework configuration.
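As an illustration of this kind of extraction, the sketch below parses a response shaped like the one returned by the ResourceManager Cluster Applications endpoint (`/ws/v1/cluster/apps`). It is not the toolkit itself: the sample payload, the application ID and the output column names are made up for the example, and no network call is performed.

```python
import json

# Hypothetical payload in the shape of a ResourceManager /ws/v1/cluster/apps response.
SAMPLE_RESPONSE = """
{"apps": {"app": [
  {"id": "application_0000000000000_0001", "name": "q01",
   "applicationType": "MAPREDUCE", "finalStatus": "SUCCEEDED",
   "startedTime": 1000, "finishedTime": 61000, "elapsedTime": 60000}
]}}
"""


def parse_apps(payload):
    """Flatten the JSON response into one dict per application."""
    # The "apps" value can be null when no application matches the query.
    apps = (json.loads(payload).get("apps") or {}).get("app", [])
    return [
        {
            "app_id": a["id"],
            "name": a["name"],
            "final_status": a["finalStatus"],
            "elapsed_s": a["elapsedTime"] / 1000.0,  # reported in milliseconds
        }
        for a in apps
    ]


rows = parse_apps(SAMPLE_RESPONSE)
print(rows)
```

In practice the payload would be fetched from the ResourceManager web address of the cluster rather than taken from a string constant.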
The resulting dataset, comprising 3232 rows, serves as a valuable resource for understanding the dynamics of MapReduce job execution in cloud environments with limited network bandwidth. This dataset forms the basis for our predictive modelling efforts, aiming to enhance the accuracy of estimating job execution times and improve overall resource management in cloud-based Hadoop clusters. More details about the dataset collection, including the experimental setup, cluster resource allocation, and a comprehensive description of the features, can be found in our previous work [24].

Data pre-processing
The dataset consists of four distinct CSV files: applications, jobs, counters, and tasks. These files are organised based on various cluster configurations and network bandwidth values. For the purpose of understanding overall job execution time, detailed task-level information was excluded.
To consolidate this data into a unified CSV file, a data integration process was performed. Information from the applications, jobs, and counters files was joined using the unique job ID. This procedure was repeated for each network bandwidth setting and cluster configuration.
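The join on job ID can be sketched with the standard library alone. The column names and sample rows below are hypothetical, and an inner join (keeping only job IDs present in every file) is an assumption, since the text does not say how unmatched rows were handled.

```python
import csv
import io


def read_by_job_id(text, key="job_id"):
    """Index the rows of a CSV document by their job ID."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def join_on_job_id(*tables):
    """Inner-join several indexed tables on their shared job IDs."""
    common = set(tables[0]).intersection(*tables[1:])
    merged = []
    for job_id in sorted(common):
        row = {}
        for table in tables:
            row.update(table[job_id])
        merged.append(row)
    return merged


# Hypothetical file contents.
applications = "job_id,queue\njob_1,default\njob_2,default\n"
jobs = "job_id,elapsed_time_ms\njob_1,60000\njob_2,90000\n"
counters = "job_id,hdfs_bytes_read_map\njob_1,1048576\n"

joined = join_on_job_id(
    read_by_job_id(applications), read_by_job_id(jobs), read_by_job_id(counters)
)
print(joined)
```

Repeating the same call for each cluster configuration and bandwidth setting, then concatenating the results, would yield the unified file described above.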
The resulting dataset encompasses a total of 3232 rows and 64 distinct features. To prepare the data for analysis, a series of pre-processing steps was executed to ensure its quality and suitability for subsequent analysis.

Data quality assessment
Data quality assessment is a critical process used to evaluate the quality of a database. Its purpose is to identify anomalies and issues within the data, such as inappropriate data types, mixed data types, outliers, and missing data. Due to the rigorous data generation and collection processes, the dataset is free of missing values, data outliers, or mixed data values. During this phase, the focus was primarily on data formatting. Specifically, all features related to data storage sizes or network transfer rates (expressed in GB, MB, KB, Kbps, Mbps) were uniformly converted to bytes or bytes per second (bps). Additionally, features originally expressed in milliseconds (ms) were appropriately transformed into seconds (s). This meticulous data standardisation process ensures consistency and compatibility for subsequent analyses.
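A minimal sketch of this standardisation step is given below. It assumes binary multiples (1 KB = 1024 bytes), which the text does not specify, and it leaves out transfer rates because the conversion depends on whether the source values are expressed in bits or bytes. The helper names are illustrative.

```python
# Assumed binary multiples; decimal multiples (1000) would be an equally valid convention.
SIZE_FACTORS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def size_to_bytes(value, unit):
    """Convert a storage size expressed in KB, MB or GB to bytes."""
    return value * SIZE_FACTORS[unit.upper()]


def ms_to_seconds(milliseconds):
    """Convert a duration from milliseconds to seconds."""
    return milliseconds / 1000.0


print(size_to_bytes(128, "MB"))  # an HDFS block of 128 MB, in bytes
print(ms_to_seconds(1500))
```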

Data cleaning
Data cleaning, a crucial data preparation stage, involves addressing missing data and rectifying or eliminating inaccurate or inconsequential entries within a dataset. During this phase, data irrelevant to our study, such as application type and application final status, was excluded.

Feature selection and engineering
Feature selection is the process of determining which variables hold the most significance for the analysis. Our feature selection was informed by prior work in Section "Related works" and research related to MapReduce parameter optimisation [30,31]. The chosen features are detailed in Table 4.
Out of the initial pool of 64 features, 17 were retained plus the target. Certain features, such as the number of records processed by all map tasks (map-input-records), were excluded due to their high correlation with the bytes read by map tasks from the local file system (file-bytes-read-map). Additionally, features not available at the time of job submission, such as file-bytes-written-map, file-bytes-written-reduce, hdfs-bytes-written-map, and hdfs-bytes-written-reduce, were omitted.

Table 4 Selected features
Feature               Description
total-virtual-cores   The cluster total number of virtual cores.
total-cluster-memory  The cluster amount of total memory in MBs.
total-nodes           The cluster total number of nodes.
node-manager-vcores   Number of vcores that can be allocated for containers.
node-manager-memory   Amount of physical memory, in MB, that can be allocated for containers.
scheduler-min-memory  The minimum allocation for every container request at the RM in MBs.
scheduler-max-memory  The maximum allocation for every container request at the RM in MBs.
network-bandwidth     The maximum possible egress and ingress bandwidth per node.
dfs-block-size        The default block size for new files, in bytes.
io-sort               The total amount of buffer memory to use while sorting files in MBs.
slow-start            Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job.
reduce-tasks/job      The default number of reduce tasks per job.
map-tasks/job         The default number of map tasks per job.
map-memory            The amount of memory to request from the scheduler for each map task.
reduce-memory         The amount of memory to request from the scheduler for each reduce task.
file-bytes-read-map   The number of bytes read by map tasks from the local file system.
hdfs-bytes-read-map   The number of bytes read by map tasks from HDFS.
In addition to the original features, we introduced handcrafted features to capture specific aspects of the Hadoop MapReduce operation. The feature "file-bytes-read-map/dfsblocksize" was created to account for the number of map tasks required to process the input data. In the Hadoop MapReduce operation, the input data of a job is stored in HDFS as blocks of 128 MB. Each block serves as the input for a map task. Therefore, the number of map tasks required to process the input data is determined by dividing the size of the input data by the size of the data block, making "file-bytes-read-map/dfsblocksize" a meaningful feature [32].
Furthermore, the feature "1/NetworkBandwidth(bps)" was introduced to potentially influence the shuffle time. As some map phase outputs are transferred through the network to reducers, the available network bandwidth may play a role in data transmission time and the overall execution time. It is important to note that while this feature loosely relates to data transmission time, the intricacies of the Hadoop MapReduce process make the relationship complex.
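The two handcrafted features can be computed as in the sketch below. The dictionary keys are illustrative names, and the rounded-up task count is an extra value added here only to show the link between the ratio and the number of map tasks; the feature described above is the plain ratio.

```python
import math


def engineered_features(file_bytes_read_map, dfs_block_size, network_bandwidth):
    """Compute the two handcrafted features from raw values in consistent units."""
    ratio = file_bytes_read_map / dfs_block_size
    return {
        "file_bytes_read_map_per_block": ratio,
        # Illustration only: a partial block still needs its own map task.
        "estimated_map_tasks": math.ceil(ratio),
        "inverse_network_bandwidth": 1.0 / network_bandwidth,
    }


# Hypothetical job: 300 MB of input, 128 MB blocks, bandwidth of 1e9 units per second.
features = engineered_features(300 * 1024 ** 2, 128 * 1024 ** 2, 1e9)
print(features)
```

Using the inverse of the bandwidth means that a linear model can express a transfer time proportional to data volume divided by bandwidth once the feature is crossed with a size feature.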

Machine learning models
To address the challenge of predicting job execution time in a cloud-based cluster with constrained network bandwidth, we initially explored simpler machine learning models, including simple linear regression and polynomial regression with Least Absolute Shrinkage and Selection Operator (LASSO) and Ridge regularisation. However, these models exhibited suboptimal performance in terms of prediction accuracy.
Our focus then shifted towards more sophisticated models in response to the dataset's complexity and the pursuit of enhanced predictive capabilities. In our research, the primary objective is to identify models that not only deliver accurate predictions but are also fast and lightweight. This is especially important as the predictions are used within a task-scheduling algorithm, which demands rapid computation to ensure efficient job scheduling and resource allocation.
The models we chose for in-depth investigation (decision tree, random forest, and gradient-boosted regression trees, specifically XGBoost) were selected for their computational efficiency, scalability, performance, and suitability for resource management in cloud environments.

Decision tree (DT)
A decision tree is a versatile model that predicts continuous values by recursively splitting the data into subsets based on feature values, aiming to minimise variance in the target variable. The tree is constructed from the root node down to the leaves. At each node, the algorithm selects the feature and threshold that best split the data into two subsets by minimising a cost function, such as the Mean Squared Error (MSE) for regression. This criterion aims to maximise the reduction in variance. This process is repeated recursively for each child node, typically creating a binary tree structure, until a stopping criterion is met (e.g., a maximum tree depth, a minimum number of samples per leaf, or no further reduction in variance).
For a new input vector $x$, the input is passed down the tree, following the decision rules at each node, until a leaf node is reached. The predicted value $\hat{y}$ is the average of the target values in that leaf, as given by Eq. (1):
$$\hat{y} = \frac{1}{N_m} \sum_{i=1}^{N_m} y_i \tag{1}$$
where $N_m$ is the number of instances in the leaf node, and $y_i$ are the target values of those instances.
The splitting criterion is based on minimising the Mean Squared Error (MSE), as shown in Eq. (2):
$$\text{MSE} = \frac{1}{N_m} \sum_{i=1}^{N_m} \left( y_i - \hat{y} \right)^2 \tag{2}$$
where $\hat{y}$ is the mean target value in the split node.
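The split search described above can be sketched for a single feature as follows. Each candidate threshold is scored by the squared error of the two resulting subsets around their own means, averaged over all instances in the node. This is a simplified illustration with hypothetical data, not the implementation used in the experiments.

```python
def sum_squared_error(values):
    """Sum of squared deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, cost) of the best binary split on one feature."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best_threshold, best_cost = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = (sum_squared_error(left) + sum_squared_error(right)) / n
        if cost < best_cost:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_cost = cost
    return best_threshold, best_cost


# Toy data: small inputs finish in 5 s, large inputs in 20 s.
threshold, cost = best_split([1, 2, 3, 10, 11, 12], [5, 5, 5, 20, 20, 20])
print(threshold, cost)
```

A full tree applies the same search over every feature at every node and recurses on the two subsets until a stopping criterion is met.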

Random forest (RF)
Random forest is an ensemble technique that improves prediction accuracy and robustness by combining multiple decision trees. The algorithm begins by creating multiple subsets of the data using bootstrap sampling, which involves random sampling with replacement. For each bootstrap sample, a decision tree is built. Each tree is grown using a random subset of features at each split, introducing diversity among the trees and reducing the risk of overfitting.
The Random Forest algorithm operates as follows: It generates $B$ bootstrap samples from the original dataset, each sample having the same size as the original dataset but created by random sampling with replacement. This means some instances may appear multiple times in a bootstrap sample, while others may not appear at all. For each bootstrap sample, a decision tree is constructed. At each node in the tree, a random subset of $m$ features from the total $p$ features is selected. The best feature and threshold from this subset are chosen based on the chosen splitting criterion, such as MSE for regression, to split the node. This process is repeated recursively to grow the tree until a stopping criterion is met, such as a maximum depth or a minimum number of samples per leaf.
For a new input vector $x$, each of the $B$ trees in the forest provides a prediction $\hat{y}^{(b)}(x)$. The overall prediction of the random forest is the average of these individual tree predictions, as given by Eq. (3):
$$\hat{y}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{y}^{(b)}(x) \tag{3}$$
where $B$ is the number of trees in the forest.
Additionally, since each tree is trained on a bootstrap sample, about one-third of the original data is not used in training (out-of-bag data). This OOB data can be used to estimate the prediction error, providing an unbiased error estimate without requiring a separate validation set. Furthermore, random forests can measure the importance of each feature in making predictions. This is done by evaluating the increase in prediction error when the values of a particular feature are permuted. This permutation breaks the relationship between the feature and the true outcome, allowing the model to assess the significance of the feature. Random forests leverage the power of multiple decision trees to improve prediction accuracy and robustness. By averaging the predictions of many trees, the ensemble model reduces variance and mitigates the risk of overfitting, making it a powerful tool for regression tasks.
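The bootstrap and out-of-bag mechanics can be checked with a short simulation. The sketch below draws one bootstrap sample the size of the dataset, measures the share of rows left out, and averages a few hypothetical per-tree predictions; the seed and prediction values are arbitrary.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Sample n_rows row indices with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def oob_fraction(n_rows, rng):
    """Share of rows that never appear in one bootstrap sample."""
    in_bag = set(bootstrap_indices(n_rows, rng))
    return 1 - len(in_bag) / n_rows


def forest_predict(tree_predictions):
    """Average the predictions of the individual trees."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(0)
fraction = oob_fraction(3232, rng)
print(round(fraction, 3))  # close to 1/e, i.e. roughly one-third
print(forest_predict([10.0, 12.0, 14.0]))
```

The out-of-bag share tends towards 1/e (about 0.368) as the dataset grows, which is the "about one-third" figure mentioned above.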

Gradient-boosted regression trees (GBRT)
Gradient boosting is an ensemble technique that builds models sequentially, with each new model correcting the errors of the previous ones. XGBoost is a popular implementation of this method. The process starts with an initial model $F_0$ (e.g., the mean of the target values). A series of decision trees is then built sequentially. At each stage $m$, the residuals (errors) of the current model $F_{m-1}$ are computed. A new decision tree $h_m$ is fitted to the negative gradient of the loss function (the residuals). The model is updated by adding the new decision tree, scaled by a learning rate $\nu$, as shown in Eq. (4):
$$F_m(x) = F_{m-1}(x) + \nu \, h_m(x) \tag{4}$$
For a new input vector $x$, the final prediction is the initial model plus the scaled sum of the predictions from all the decision trees, given by Eq. (5):
$$\hat{y}(x) = F_M(x) = F_0(x) + \nu \sum_{m=1}^{M} h_m(x) \tag{5}$$
where $M$ is the number of decision trees. Each decision tree $h_m$ is trained to minimise the loss function $L$ by fitting the negative gradient of the loss function, as described in Eq. (6):
$$h_m = \arg\min_{h} \sum_{i=1}^{N} \left( -\frac{\partial L\left(y_i, F_{m-1}(x_i)\right)}{\partial F_{m-1}(x_i)} - h(x_i) \right)^2 \tag{6}$$
where $N$ is the number of training instances. By iteratively adding decision trees that correct the errors of the previous predictions, gradient boosting produces a powerful predictive model. XGBoost enhances this process with optimisations such as regularisation, which helps to prevent overfitting, and efficient handling of sparse data.
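The boosting loop can be sketched in a few lines for the squared-error loss, where the negative gradient is simply the residual. The example uses one-split trees (stumps) on a single feature and assumes that feature takes at least two distinct values; it illustrates the procedure and is not XGBoost, which adds regularisation and other optimisations.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals (squared error)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue
        left = [residuals[i] for i in order[:k]]
        right = [residuals[i] for i in order[k:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        cost = sum((r - left_mean) ** 2 for r in left)
        cost += sum((r - right_mean) ** 2 for r in right)
        if best is None or cost < best[0]:
            best = (cost, (lo + hi) / 2, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_trees=100, learning_rate=0.1):
    """Return a predictor F_M built from an initial mean and scaled stumps."""
    f0 = sum(ys) / len(ys)  # initial model: mean of the targets
    trees = []
    predictions = [f0] * len(xs)
    for _ in range(n_trees):
        # For squared error, the negative gradient equals the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x) for p, x in zip(predictions, xs)]
    return lambda x: f0 + learning_rate * sum(tree(x) for tree in trees)


# Toy data with a step in execution time.
model = gradient_boost([1, 2, 3, 4, 5, 6], [1, 1, 1, 5, 5, 5])
print(model(2), model(5))
```

With a learning rate of 0.1, each round removes about 10% of the remaining residual on this toy data, so the predictions approach the targets gradually rather than in a single step.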

Evaluation
The dataset, comprising 3232 rows, was divided into a training set containing 2262 rows and a test set containing 970 rows, maintaining a 70-30 split ratio. Given the moderate size of the dataset, a conventional train-test split was chosen for model evaluation.
Hyperparameter tuning was performed using Scikit-learn's GridSearchCV utility, which systematically explored the hyperparameter space using 5-fold cross-validation. This method allowed for efficient parameter optimisation while mitigating the risk of overfitting.
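The evaluation protocol can be illustrated without any external dependency. The sketch below is a simplified stand-in for the split and grid-search procedure, not the GridSearchCV call used in the experiments: the toy target values, the one-parameter "model" and the helper names are all hypothetical.

```python
import math
import random


def train_test_split_indices(n_rows, test_fraction=0.3, seed=0):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, validation


def grid_search(param_grid, fit, score, indices, k=5):
    """Return the parameter with the lowest mean validation error."""
    best_param, best_score = None, float("inf")
    for param in param_grid:
        scores = [score(fit(tr, param), va) for tr, va in k_fold(indices, k)]
        mean_score = sum(scores) / k
        if mean_score < best_score:
            best_param, best_score = param, mean_score
    return best_param, best_score


# Toy target and toy model: a mean predictor shrunk by a tunable factor.
ys = [10.0 + (i % 3) for i in range(3232)]


def fit(train_idx, shrinkage):
    mean = sum(ys[i] for i in train_idx) / len(train_idx)
    return mean / (1.0 + shrinkage)


def score(prediction, val_idx):
    return sum((ys[i] - prediction) ** 2 for i in val_idx) / len(val_idx)


train_idx, test_idx = train_test_split_indices(3232)
best_shrinkage, cv_mse = grid_search([0.0, 0.5, 1.0], fit, score, train_idx)
print(len(train_idx), len(test_idx), best_shrinkage)
```

The test indices are never seen during the grid search; they are reserved for the final evaluation of the selected configuration.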

Results analysis
In this section, we present a detailed analysis of the results obtained from our experiments on predicting MapReduce job execution time in a cloud-based cluster with constrained network bandwidth. We evaluated several machine learning models, including linear regression, decision tree, random forest, and gradient-boosted regression trees.

Initial visualisation and linear regression
The exploration started with an initial visualisation depicting the relationship between the map's input data (i.e., "file bytes read map") and "elapsed time" under different "total cluster memory" and "total virtual cores" as shown in Fig. 2.
The figure shows a positive correlation between the size of the input data and elapsed time, indicating an increase in processing time with larger datasets. Furthermore, a noticeable reduction in elapsed time is evident with higher values of cluster vcores and memory. This behaviour aligns with expectations, as larger datasets naturally require more computational time, while increased cluster resources contribute to faster processing. The insights derived from this visual analysis guided the exploration of linear regression. However, the performance of simple linear regression proved inadequate, motivating the pursuit of more sophisticated techniques.
To enhance the modelling capabilities, we employed feature engineering techniques such as crafting additional features and performing polynomial feature crossing (degree 2). The introduction of these engineered features expanded the dimensionality of the feature space, contributing to a more nuanced representation of the underlying patterns. Additionally, data normalisation using Min-Max scaling was applied to ensure consistent feature scales.
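A degree-2 feature crossing followed by Min-Max scaling can be expressed as a scikit-learn pipeline. This is a minimal sketch: the two input columns, the synthetic target, and the order of the two preprocessing steps are assumptions, since the text does not specify them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Two hypothetical columns: input size and number of vcores.
rng = np.random.default_rng(0)
size = rng.uniform(1.0, 10.0, size=400)
vcores = rng.uniform(2.0, 16.0, size=400)
X = np.column_stack([size, vcores])
# Synthetic target with a squared term and an interaction term.
y = 5.0 * size ** 2 - 3.0 * size * vcores + rng.normal(0.0, 1.0, size=400)

# Baseline: linear regression on the raw features.
plain = LinearRegression().fit(X, y)

# Extended model: degree-2 crossing, then Min-Max scaling, then regression.
crossed = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    MinMaxScaler(),
    LinearRegression(),
).fit(X, y)

n_expanded = crossed.named_steps["polynomialfeatures"].n_output_features_
r2_plain = plain.score(X, y)
r2_crossed = crossed.score(X, y)
```

With two input columns, the crossing yields five features (the two originals, their squares, and their product), which is what lets a linear model capture the interaction between data size and resources.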
The results of this extended linear regression approach demonstrated improvement over the simple linear model. The Mean Squared Error (MSE) was reduced from 41182 to 23796, and the R² increased from 0.45 to 0.68.

Tree-based models
Motivated by the limitations of linear regression, we turned our attention to decision trees, random forests, and gradient-boosted regression trees. These non-linear models excel at capturing complex relationships within the data.

Fig. 2 Map input data and elapsed time visualisation
The initial decision tree model, without hyperparameter tuning, exhibited promising results but required optimisation. Hyperparameter tuning (see Table 5), specifically adjusting the maximum depth, minimum samples split, and minimum samples leaf, significantly enhanced the model's predictive capabilities.
The tuned decision tree model resulted in an MSE of 1550, Root Mean Squared Error (RMSE) of 39, and an R² of 0.98 for the test set. Tuning appears to have made the model more robust, which indicates that it is likely to perform well on new, unseen data. For the training set, the corresponding metrics were an MSE of 690, RMSE of 26, and an R² of 0.99. The better results on the training set than on the test set indicate slight overfitting, suggesting that the model may be somewhat too complex for the given data.
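The three metrics reported throughout this section follow directly from their definitions. The sketch below computes them in plain Python on a handful of hypothetical elapsed times; the helper name `regression_metrics` and the toy values are assumptions, not data from the study.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R^2) for two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mse = ss_res / n
    return mse, math.sqrt(mse), 1.0 - ss_res / ss_tot

# Toy elapsed times in seconds (hypothetical values).
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]
mse, rmse, r2 = regression_metrics(y_true, y_pred)
```

Applying the same helper to the training and test predictions of one model exposes the gap used here to diagnose overfitting. As a consistency check on the reported values, the square root of the test MSE of 1550 is about 39.4, in line with the stated RMSE of 39.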
The random forest model we trained (see the results of the hyperparameter tuning in Table 6) further improved predictive performance. The ensemble nature of random forests, combining multiple decision trees, proved effective in mitigating overfitting.
The best random forest model yielded an MSE of 1393, RMSE of 37, and an R² of 0.98 for the test set. The training set metrics were as follows: MSE of 965, RMSE of 31, and an R² of 0.99. The comparable metrics on the training and testing datasets indicate that the tuned random forest model generalises well to new, unseen data. Overall, the random forest model with hyperparameter tuning is a strong candidate for predicting job execution time in a cloud-based cluster.
The GBRT(XGB) model demonstrated good predictive power. Through hyperparameter tuning, we optimised parameters such as the learning rate, maximum depth, and the number of estimators (see Table 7). The best GBRT(XGB) model achieved an MSE of 3098, RMSE of 55, and an R² of 0.96 for the test set. The training set metrics were as follows: MSE of 113, RMSE of 10, and an R² of 0.99. Like the decision tree and random forest models, the GBRT(XGB) model fits the training set better than the test set, but its overfitting is more pronounced: the test MSE is roughly 27 times the training MSE (3098 versus 113). This disparity suggests a need for further model refinement to address overfitting.
The feature importance table (see Table 9) provides a comprehensive view of the significance of different input features for the decision tree, random forest, and GBRT(XGB) models in predicting job execution time in the cloud-based cluster. The importance scores are normalised values representing the contribution of each feature to the model's decision-making process.
For the Decision Tree model, "file-bytes-read-map" emerges as the most influential feature, with a substantial importance score of 0.6669, indicating that the volume of map input data strongly influences the prediction. It is followed by "node-manager-vcores", highlighting the impact of the number of vcores on the node manager. Other features, such as "total-virtual-cores" and "hdfs-bytes-read-map", also contribute meaningfully to the Decision Tree model.
In the Random Forest model, the top three features are "file-bytes-read-map", "file-bytes-read-map/dfs-block-size", and "hdfs-bytes-read-map", with importance scores of 0.3430, 0.3414, and 0.1319, respectively. Notably, the Random Forest model places a higher emphasis on the ratio of map input size to block size, indicating that it is sensitive to the relationship between the two. The features "node-manager-vcores" and "total-virtual-cores" continue to be influential in this model as well.
In the GBRT(XGB) model, which like Random Forest is an ensemble, the most important feature is "node-manager-vcores", with an importance score of 0.6151, followed by "file-bytes-read-map" (0.1907) and "total-virtual-cores" (0.1096). The feature "file-bytes-read-map/dfs-block-size" also contributes, albeit to a lesser extent.
It is interesting to note that some features, such as "slow-start", "io-sort", and "dfs-block-size", exhibit negligible importance across all models, suggesting that they have limited impact on predicting job execution time in this context.
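Normalised importance scores of the kind discussed above can be read from fitted tree-based models in scikit-learn via the `feature_importances_` attribute, which sums to 1 across features. The sketch below assumes impurity-based importances, since the text does not state how the scores in Table 9 were computed. It borrows three feature names from the table but uses synthetic values, so the resulting scores are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Feature names borrowed from Table 9; the values are synthetic.
feature_names = ["file-bytes-read-map", "node-manager-vcores", "slow-start"]
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(600, 3))
# Synthetic target: depends on the first two columns only.
y = 50.0 * X[:, 0] + 10.0 * X[:, 1] + rng.normal(0.0, 0.5, size=600)

models = {
    "DT": DecisionTreeRegressor(max_depth=6, random_state=0),
    "RF": RandomForestRegressor(n_estimators=100, max_depth=6, random_state=0),
}

importances = {}
for name, model in models.items():
    model.fit(X, y)
    # Impurity-based importances, normalised to sum to 1 per model.
    importances[name] = dict(zip(feature_names, model.feature_importances_))
```

Because each model's scores are normalised separately, they are comparable within a model but only loosely across models.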
Additionally, visualisations such as Fig. 3, which plots true values against predicted values on the test set for the tuned models, and Fig. 4, which compares their errors, provide insights into the models' performance and the distribution of errors across predictions.

Conclusion and future work
In this study, we have undertaken a thorough investigation into the prediction of the execution time of Hadoop MapReduce jobs within a cloud-based cluster environment with limited network bandwidth. Through an iterative process encompassing data collection, preprocessing, feature engineering, and the evaluation of various machine learning models, we have gained valuable insights into the complex relationship between cluster parameters and job performance.
Our analysis revealed that the random forest model emerges as the most effective approach for predicting job execution time, outperforming other models in terms of accuracy and robustness. The significance of features such as the volume of map input data and cluster resource parameters underscored the critical role of data size and resource allocation in determining job performance. Furthermore, our investigation highlighted the importance of rigorous model evaluation and hyperparameter tuning in optimising predictive performance. While models like GBRT(XGB) demonstrated remarkable predictive power, mitigating overfitting remains a key challenge for future refinement.
In future work, we plan to expand our dataset by increasing the number of nodes and incorporating variable network bandwidth and latency to better simulate real-world cloud environments.

Fig. 3 True values compared to predicted values (Test set)

Table 1
Summary of related works

Table 2
The different VM types for Hadoop cluster deployments

Table 6
RF hyperparameters

Based on the results summarised in Table 8, the random forest model emerges as the top-performing model for predicting job execution time in a cloud-based cluster. With optimised hyperparameters, it achieves the lowest test-set error of the three tree-based models (MSE of 1393, RMSE of 37), and the gap between its training and test metrics is the smallest. The ensemble nature of random forests proves effective in mitigating overfitting, as evidenced by the comparable performance on the training and testing datasets.

Table 8
Model performance metrics

Table 9
Feature importance for DT, RF, and GBRT(XGB). Bold values are each model's top 3 most important features

Fig. 4 Error comparison (Test set)