
Runtime prediction of big data jobs: performance comparison of machine learning algorithms and analytical models

Abstract

Due to the rapid growth of available data, various platforms offer parallel infrastructure that efficiently processes big data. One of the critical issues is how to use these platforms to optimise resources, and for this reason, performance prediction has been an important topic in the last few years. There are two main approaches to the problem of predicting performance. One is to fit data into an equation based on an analytical model. The other is to use machine learning (ML) in the form of regression algorithms. In this paper, we have investigated the difference in accuracy between these two approaches. While our experiments used an open-source platform called Apache Spark, the results obtained by this research are applicable to any parallel platform and are not constrained to this technology. We found that gradient boost, an ML regressor, is more accurate than any of the existing analytical models as long as the range of the prediction follows that of the training. We have investigated analytical and ML models based on interpolation and extrapolation methods with k-fold cross-validation techniques. Using the interpolation method, two analytical models, namely the 2D-plate and fully-connected models, outperform the older analytical models and the kernel ridge regression algorithm, but not the gradient boost regression algorithm. Using interpolation, the average accuracies of the 2D-plate and fully-connected models are 0.962 and 0.961, respectively. However, when using the extrapolation method, the analytical models are much more accurate than the ML regressors, particularly the two most recently proposed models (2D-plate and fully-connected), both of which are based on the communication patterns between the nodes. Using extrapolation, the average accuracies of kernel ridge, gradient boost and the two proposed analytical models are 0.466, 0.677, 0.975, and 0.981, respectively. This study shows that practitioners can benefit from analytical models by being able to accurately predict the runtime outside of the range of the training data using only a few experimental runs.

Introduction

Due to the massive amount of data generated by social media [1], public health [2], industry and natural language processing [3], data storage and processing have become challenging tasks for organisations [4]. Organisations require fast, intelligent systems that can quickly process the data and present its insights. Big data platforms have therefore become the natural choice in many organisations, and they are available either as physical clusters or as cloud computing services. In recent times, cloud computing services such as Amazon EC2, Google Cloud and Microsoft Azure have attracted tremendous attention. All these platforms allow users to deploy virtual clusters and to choose and allocate resources according to their requirements, and such virtualised platforms offer resources at very low prices. However, enterprises need to consider data security concerns before selecting cloud computing services. On the other hand, the deployment of a physical Spark cluster is complex and expensive [5], although physical cluster infrastructures offer numerous benefits and mitigate security concerns.

The deployment of these types of cluster infrastructures heavily depends on distributed parallel computing frameworks such as Apache Hadoop and Apache Spark. Thanks to its open-source licence, real-time data processing, and fault tolerance [6], Apache Spark has become an attractive framework after Hadoop. Spark supports various components, namely MLlib for machine learning (ML), GraphX for graph processing and Spark SQL [7] for structured data processing. More than 180 configurable Spark parameters play an essential role in supporting various types of jobs. Although a default deployment relies on the default parameter values, Spark's performance heavily depends on correct parameter selection and configuration. The user must understand the relationship between the parameters and the available cluster hardware, because configuring the parameters to achieve optimum performance is challenging and complex. Cluster parameter configuration is also tedious, as it requires a vast amount of time to configure the system and process the data.

Because of these complexities, performance prediction for such systems is very challenging. In order to mitigate these challenges, several prediction approaches, such as trial-and-error [8, 9], analytical models [10] and machine learning [11,12,13,14], have been proposed by researchers, but all of them have limitations. In practice, to predict the runtime of a certain job on a Hadoop cluster, one can use either machine learning regression algorithms or equation fitting. Both methods need a certain amount of empirical data because there is no general analytic method that would cover different hardware and different configurations for a given cluster. In general, ML regression methods need more data to be accurate, especially if the predictions are made by extrapolation. On the other hand, equation fitting can be very accurate with very little data, but only if the equation reflects the true patterns of inter-node communication that emerge from the job execution. It is difficult to find a generic equation for a cluster because even specific algorithm implementations can influence the communication between nodes, and therefore a given forecasting equation can completely break down for a certain application.

In our previous works [15, 16], we concluded that two parameters are crucial when determining the runtime: the size of the workload and the number of executors available to run the job. We tested two main models to generate equations that can fit empirical data. The first model assumed that limited communication happens between the nodes, each working only with a certain number of neighbouring nodes. The second model assumed that the communication pattern is reflected by a fully-connected graph between the nodes. In both models, the complexity of the algorithm was also taken into consideration. Only two types of workloads were tested in terms of algorithmic complexity: those with linear and those with quadratic growth of the runtime as a function of the workload size.

The motivation and the key contribution of this paper are as follows:

  • We carried out an extensive comparison of performance prediction accuracy between machine learning and existing analytical models, and achieved very good accuracy when only limited empirical data is available. Our underlying intention is that practitioners would run a few jobs, preferably with short runtimes, and then be able to predict the runtime for larger, untested dataset sizes.

  • We investigated the relationship between the alpha and degree parameters of KRR regression. Our analysis found that, for most of the workloads, the best R-squared can be achieved by selecting a small degree together with a suitable alpha. It also found that higher degrees can produce a better R-squared, but overfitting then becomes a major limitation. For the GBR regression, we kept all parameters at their default values.

  • We extensively measured the accuracy of the analytical and ML regression models for both interpolation and extrapolation using k-fold cross-validation. ML methods are not accurate when one tries to extrapolate predictions from small amounts of data. However, ML methods are much better at making adjustments to existing data, and can do interpolation very well [17]. The equations are derived to fit the data well, and using the correct one can yield more accurate extrapolated runtime forecasts than the ML methods.

The remainder of the paper is organised as follows: “Apache Spark architecture” section provides a brief overview of the Apache Spark architecture. “Related work” section discusses some notable recent advances on Spark performance prediction using machine learning algorithms. “Prediction methods” section explains evaluation methods of both the analytical models and the ML regressors. “Experimental setup” section discusses the experimental setup while “Performance evaluations and analysis” section presents the performance analysis for the two approaches using interpolation method with cross-validation technique. “Performance analysis using extrapolation” section shows a detailed analysis of the extrapolation method, splitting the data into two categories, size and number of executors. “Discussion” section discusses the limitations of each approach, and the consequences of extrapolating data with a small number of experiments. Finally, “Conclusion” section concludes the paper with hints for extending the work in future.

Apache Spark architecture

Apache Spark is a parallel data processing framework that can rapidly process large amounts of data, often in real-time [18]. It can also perform data processing on distributed cluster platforms. Apache Spark is an open-source [6] project and a popular data processing engine in many organisations. Its development started in 2009 at the University of California, Berkeley's AMPLab, in the group of researchers led by Matei Zaharia [19]. The Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. In 2010, Spark introduced the Resilient Distributed Dataset (RDD) [20], which mitigates the limitations of the MapReduce cluster computing paradigm. An RDD is an immutable collection of objects: the input data set is split into logical partitions, and the partition data are stored in memory where the worker nodes perform parallel operations. RDDs support two types of operations: transformations and actions. A transformation takes an existing RDD as input and produces a new RDD from it, while an action triggers the computation and operates on the actual dataset. A typical Apache Spark architecture representation is shown in Fig. 1.
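As a brief illustration of the lazy transformation/action behaviour described above, the following PySpark sketch builds an RDD pipeline; the word-count example and session name are purely illustrative and are not part of the benchmarked workloads.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Transformations (map, reduceByKey) are lazy: they only extend the RDD lineage
# and create new RDDs without touching the data.
words = sc.parallelize(["spark", "hadoop", "spark", "hibench"])
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# An action (collect) triggers the actual computation on the executors.
print(counts.collect())  # e.g. [('spark', 2), ('hadoop', 1), ('hibench', 1)]

spark.stop()
```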

Fig. 1: A typical Apache Spark architecture, modified from [21]

A Spark application has three important components: the Spark driver program, the Spark executors, and the resource manager; the results of action operations flow from the executors back to the driver. The driver converts the user code into tasks, and the executors run these tasks on the worker nodes. The cluster manager is responsible for resource allocation in the cluster: it allocates resources whenever the Spark driver program [21] requests them and shares this information with the worker nodes. In Spark, the workflow is managed by a directed acyclic graph (DAG) [22]. The DAG consists of sequences of vertices and edges, where the vertices represent RDDs and the edges represent the operations applied to them. The DAG scheduler breaks a submitted job into stages, and each task at a stage operates on one RDD partition of the input data. Spark creates two kinds of stages for a submitted job: ShuffleMapStages, whose output data are stored for the following stages in the DAG, and ResultStages, where a function is applied to one or more partitions of the RDD to compute the final result. Spark can operate with many programming languages, such as Java, Scala, Python and R, and supports Spark SQL, ML, GraphX processing, and Spark Streaming. These language libraries offer comprehensive benefits for users developing applications. Spark allows the integration of various tools from the Hadoop technology ecosystem, where resource management and job scheduling are handled by Apache YARN (Yet Another Resource Negotiator) [23]. A cluster monitoring tool such as Ambari assists with monitoring the workloads running in the cluster.

Related work

The runtime performance prediction of big data processing on a cluster is a challenging task. In the recent past, many prediction techniques [8,9,10, 24], gray-box techniques [25,26,27] and auto-tuning techniques [12, 13, 28,29,30] have been proposed by researchers. However, the ML approach has become very popular and has received significant attention. In the following section, we present recently published works based on ML techniques.

Prediction using machine learning

Douglas de Oliveira et al. [31] proposed an interpretable predictive ML model based on decision trees from which patterns are extracted. The decision tree model is used to classify the parameter performance by considering the training data. They used the extracted patterns to configure the system parameters for the workflow execution, which significantly improved the system performance. They also considered two essential aspects: input data partitions and the distribution of data partitions across nodes. They found that the proposed predictive model can achieve 70% accuracy, and that accurate knowledge of data partitioning can help in choosing the workflow configuration.

Christoph Boden et al. [32] presented an interesting study using ML algorithms in a large-scale distributed setting to assess Apache Spark and Flink performance. They implemented analytical models which are similar to ML algorithms and tuned the parameters to assess the scalability of the system with respect to the size and dimensionality of the data, complemented by a single-node implementation for comparison. They found that several ML problems exhibit high dimensionality due to data scaling and model size scaling, so they employed supervised learning algorithms (batch gradient descent and TreeAggregate) for Flink and Spark, respectively, and kmeans clustering as the unsupervised learning algorithm. The proposed benchmark was placed on top of Apache Flink and Apache Spark, and performance was also analysed using non-representative workloads such as Wordcount, Grep, and Sort. They found that when the data size increased, the runtime increased linearly. They concluded that both Flink and Spark perform robustly with increasing data size, up to 4.6 billion data points, but that Spark fails to train when the model dimensionality exceeds 6 million dimensions. They concluded that current data flow systems can process an increasing number of data points but are incapable of coping with high-dimensional data, which is a key requirement for large-scale ML algorithms.

In their second work, Christoph Boden et al. [33] proposed a novel benchmark of data processing systems based on ML algorithms. This work categorised the evaluated tasks into three major groups: clustering, classification, and recommender systems. For data pre-processing, the raw data is transformed into extracted features, and the training data set is represented as a numerical data matrix. For this implementation, they used kmeans, batch gradient descent, and matrix factorization algorithms. As per their suggestion, logistic regression is a compelling choice for the prediction problem, as it can easily handle many data sets. They concluded that the latest data processing systems require more hardware resources to obtain a comparable prediction quality.

Ali Mostafaeipour et al. [34] presented an empirical analysis of the Hadoop and Spark frameworks that considers three criteria: runtime, memory usage, and network usage. They implemented the K-nearest neighbour (KNN) algorithm on various datasets for both frameworks. This analysis demonstrated that with small data sets, Spark offers faster data processing than Hadoop, and that Spark is suitable for quick data processing because it processes the data in-memory. As for resource utilisation, Hadoop requires less memory than Spark, while Spark uses less network bandwidth than Hadoop. Another empirical study of Apache Spark performance prediction based on ML algorithms was presented by Mehdi Assefi et al. [35]. The authors examined both qualitative and quantitative attributes of the framework. This study leverages the Apache Spark ML library to handle big data analytics and evaluates the impact of multiple big data ML models, such as classification and clustering, on different hardware and software configurations with big data analysis tasks. ML algorithms such as Support Vector Machine, Decision Tree, Naïve Bayes, Random Forest and kmeans were evaluated to analyse the capability of MLlib 2.0. They found that Apache Spark MLlib demonstrates better performance, in particular a noteworthy performance in terms of execution time.

Javaid [36] proposed a robust Spark performance prediction model based on ML algorithms. In this analysis, the authors carried out substantial experimental work with applications featuring various data characteristics. In order to build the performance model, they implemented four ML algorithms and found that the gradient boost and random forest algorithms performed better than the other algorithms on their datasets. In [37], the authors proposed a tool to predict the Spark application runtime before the deployment of the cluster. They claimed that the tool can be used for extensive Spark job profiling, determining the execution time in advance and identifying system bottlenecks, and that it could predict runtime within a 20% error bound for the selected workloads. In [38], the authors proposed an ML-based auto-tuning model for cluster parameter selection based on the Support Vector Regression (SVR) model, and a practical end-to-end auto-tuning model that combines existing models with a smart search algorithm. They found that the overall performance of ML is much better than that of the traditional models; in particular, the SVR displayed the best performance for Sort. They concluded that the proposed model is robust, flexible, and adaptable to changes.

Guoli Cheng [12] proposed a model based on the Adaboost ML algorithm. Adaboost is implemented at the stage level, and classic projective sampling, including a data mining technique, was applied to predict the Spark performance accurately. They used six benchmark workloads and five different data sizes, and concluded that the proposed model reduces runtime cost by 9% compared to the previous model. In their recently published work [13], they stated that the performance trade-off heavily depends on the optimum configurations, where cost is an influential factor. They therefore proposed a multi-objective optimisation model based on the Adaboost ML algorithm for Spark performance prediction, again applying six benchmark workloads and five different data sizes to evaluate the system performance. They claimed that the model can find the appropriate configuration setup and minimise time and cost, and concluded that the proposed method can improve execution time by 30% and cost by 40%.

Table 1 Various models on Spark performance prediction

Table 1 summarises some notable studies by considering the models used and their performance based on selected workloads. It can be noted that most of the works used ML and very few works proposed analytical models, but the workloads and model performance metrics are not the same across these works, which makes it difficult to compare their accuracy. To the best of the authors' knowledge, the literature has not presented any comparative performance analysis based on standard performance metrics, because no standard performance metrics have been recommended.

Unlike the reviewed ML studies described in the related work section, we compare analytical models [15] with ML regressors (kernel ridge regression (KRR) [39] and Gradient Boost Regression (GBR) [40]) as well as with the ERNEST [41], Amdahl [42] and Gustafson [43] models. Runtime prediction based on ML models shows satisfactory performance in the published work, but all ML models require large amounts of input data. On the other hand, we have seen that our previously published 2D-plate (4) and fully-connected (5) models are very effective and can predict runtime accurately with limited data points.

Prediction methods

Machine learning algorithms

Many studies have explored supervised ML models for runtime performance prediction of large systems. These techniques are known as black-box solutions because they make predictions purely from previously collected data. In these supervised ML models, the training phase uses experimental data collected under different system configuration parameters. The collection of these data is tedious and requires significant time and resources. In this paper, we used two regression algorithms, kernel ridge regression (KRR) and Gradient Boost Regression (GBR), based on their scikit-learn implementations. We chose the gradient boost algorithm because it is a popular method for large cluster setups [44], whereas kernel ridge regression can perform cross-validation and estimate predictive variance efficiently on both small and large data sets [45]. The Python programming language is used throughout to evaluate the models' performance. We introduce the algorithms and their operating principles below.

  1.

    Kernel ridge regression

    In 2000, Cristianini and Shawe-Taylor [39] proposed the kernel ridge regression (KRR) algorithm. KRR combines ridge regression with the kernel trick. For the linear kernel, it learns a linear function in the space induced by the respective kernel and the data; for non-linear kernels, it learns a non-linear function in the original space. KRR is a simplified version of Support Vector Regression (SVR) and is also known as the least squares support vector machine (LS-SVM); it differs from SVR in the loss function used, combining a squared-error loss with L2 regularisation. The regularisation parameter takes positive floating point values, improves the conditioning of the problem, and reduces the variance of the estimates. In KRR, the kernel mapping works internally, and the parameters are passed through the pairwise kernel. A kernel is a function \({K :{\mathcal {X}} \times {\mathcal {X}} \rightarrow {\mathbb {R}}}\) that is symmetric, i.e. \({K(x_1, x_2) = K(x_2, x_1)}\), and positive definite. In this study, we employed the polynomial kernel from the scikit-learn implementation [46]. The polynomial kernel [47] has parameters \(\alpha\), c and d, which must be positive, and the kernel function is \({K(X_{1}, X_{2}) = (\alpha \, X_{1}^{T} X_{2} + c)^{d}}\).

  2.

    Gradient boost regression

    The Gradient Boost Regression (GBR) algorithm is a popular algorithm for building predictive models and for large cluster setups [44]. In 2002, Friedman [40] proposed a modified version of the GBR algorithm based on regression trees of fixed size. For regression problems, boosting works as a form of "functional gradient descent": it is an optimisation technique that minimises a loss function over the training data, where the loss function measures the difference between the predicted values and the training data. The GBR algorithm generates its learners iteratively, combining many weak learners into a single strong learner; multiple decision trees of fixed size are used as the weak learners. In this study, GBR was used with the default parameters of the scikit-learn [48] implementation. GBR can be used either as a regressor or as a classifier; the former was used in this study to predict the system runtime.
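As an illustration of how the two regressors are set up, the following sketch instantiates them with scikit-learn, assuming the training features are the workload size and the number of executors and the target is the measured runtime; the data points and hyper-parameter values shown here are placeholders rather than the tuned values used in our experiments.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import GradientBoostingRegressor

# One row per measured job: [workload size (GB), number of executors];
# y holds the corresponding runtimes in seconds (placeholder values).
X = np.array([[1, 2], [1, 4], [2, 2], [2, 4], [4, 2], [4, 4]], dtype=float)
y = np.array([120.0, 75.0, 230.0, 140.0, 460.0, 260.0])

# Kernel ridge regression with a polynomial kernel; degree and alpha are the
# hyper-parameters explored in this study (the values here are illustrative).
krr = KernelRidge(kernel="polynomial", degree=4, alpha=1.0, coef0=1)
krr.fit(X, y)

# Gradient boost regression with the scikit-learn defaults, as in this study.
gbr = GradientBoostingRegressor(random_state=1)
gbr.fit(X, y)

# Predict the runtime of an unseen configuration (8 GB on 4 executors).
query = np.array([[8.0, 4.0]])
print(krr.predict(query), gbr.predict(query))
```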

Prediction models based on specific equations for parallel systems

For any parallel system, including Hadoop clusters running Spark, two parameters are the most influential in determining runtime: the size of the workload and the number of executors. In Spark, other parameters can degrade performance if misconfigured. However, once these parameters relinquish enough resources for running a certain job, they cannot speed up the execution of that job any further. Therefore, while most parameters have a minimum threshold for the job to use the cluster's resources appropriately, one cannot improve the performance of a job beyond a certain point, which is limited by factors such as the size of the workload and the number of executors available [9].

Also, in any parallel system the runtime has two components: a parallelisable and a non-parallelisable portion of time [49]. The parallelisable portion can be expressed as a function of the size of the job and the number of executors used. The non-parallelisable portion is more difficult to model, as it depends on the implementation and on the communication between nodes.

Since the early days of parallel systems, several models have been proposed for equations that describe the runtime. Three important ones are Amdahl's law, Gustafson's law and ERNEST. In our previous works, we proposed two new models [15, 16] and tested them against Amdahl [42], Gustafson [43] and ERNEST [41]. In order to compare the models, we adapted Amdahl's law and Gustafson's law as equations that determine runtime given the size and the number of executors. These models use simple equations that can fit experimental data, and can be used to predict the runtime of jobs on different clusters with specific hardware.

For completeness, we summarise each model and the corresponding equations. For all the equations in this section, the following notations apply:

  • S is the size of the workload (usually in GB),

  • f(S) is the function that expresses the runtime complexity of the algorithm,

  • E is the number of executors, and

  • a, b, c and d are the coefficients of the equations that need to be found via data fitting.

For f(S), all the workloads used in this work were either linear or quadratic. Therefore, either \(f(S)=S\) or \(f(S)=S^2\). If the time complexity of the algorithm is known, its equation can replace f(S).

Amdahl’s law

In the early days of parallel systems, Amdahl proposed a performance model where the number of executors and the percentage of non-parallelisable time drive the speedup of a job running with multiple executors when compared to the same job running on a single executor [42]. The equations can be modified to predict runtime given S and E:

$$\begin{aligned} runtime = a\, f(S)\, \biggl (\frac{(1-b)}{E} + b \biggr ) + c \end{aligned}$$
(1)

Gustafson’s law

Gustafson proposed an alternative model to that proposed by Amdahl [43]. The modified equation to predict runtime is:

$$\begin{aligned} runtime = \frac{a \,f(S)}{E+b\,(1-E)} + c \end{aligned}$$
(2)

ERNEST

More recently, Venkataraman et al. [41] proposed a model specifically for big data clusters called ERNEST. Their equation to predict runtime is:

$$\begin{aligned} runtime = \frac{a \, S}{E} + b \, \, log(E) + c \, \, E + d \end{aligned}$$
(3)

2D-plate model

We proposed a 2D-plate model in which the nodes communicate only with their direct neighbours [16]. This model was based on insights by Wilkinson and Allen [49] (chapter 6, section 6.3.2, page 180). The details of how we derived Eq. (4) are in [16]. The equation is:

$$\begin{aligned} runtime = \frac{a \, f(S)}{E} + b \; S \, {E}^c + d \end{aligned}$$
(4)

Fully-connected node model

We also proposed an alternative model where the communication between nodes is assumed to work like a fully-connected graph. Both the 2D-plate and the fully-connected models were as accurate as or more accurate than the alternative models [15]. The equation for the fully-connected model is:

$$\begin{aligned} runtime = \frac{a \, f(S)}{E} + b \,S^c \biggl ( \frac{E (E-1)}{2} \biggr ) + d \end{aligned}$$
(5)

A special case of this equation was considered when the communication growth is linear in relation to the size S, i.e., \(c=1\):

$$\begin{aligned} runtime = \frac{a \, f(S)}{E} + b \,S \biggl ( \frac{E (E-1)}{2} \biggr ) + d \end{aligned}$$
(6)
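As a minimal sketch of how these equations can be fitted to measured runtimes, the snippet below uses SciPy's non-linear least squares to estimate the coefficients of the 2D-plate model (Eq. 4) for a linear workload, i.e. f(S) = S; the measured points are placeholder values, not data from our cluster.

```python
import numpy as np
from scipy.optimize import curve_fit

def plate_2d(X, a, b, c, d):
    """2D-plate model (Eq. 4) with a linear complexity term f(S) = S."""
    S, E = X
    return a * S / E + b * S * E**c + d

# Measured (size in GB, executors, runtime in s) triples -- placeholder values.
S = np.array([1, 1, 2, 2, 4, 4], dtype=float)
E = np.array([2, 4, 2, 4, 2, 4], dtype=float)
t = np.array([120.0, 75.0, 230.0, 140.0, 460.0, 260.0])

# Fit the coefficients a, b, c and d to the empirical data.
coeffs, _ = curve_fit(plate_2d, (S, E), t, p0=[1.0, 1.0, 0.5, 1.0], maxfev=10000)

# Extrapolate to an untested configuration (8 GB on 6 executors).
print(plate_2d((8.0, 6.0), *coeffs))
```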

Experimental setup

All our experiments were conducted on a high-end Hadoop cluster designed and built in 2016 by a group of academics and researchers at Massey University's Auckland campus. The cluster uses a dedicated switch and its own network infrastructure, similar to a Beowulf cluster [50]. In order to reduce network latency and unwanted network resource utilisation, all other machines on the network were isolated from this infrastructure.

A schematic diagram of the cluster is presented in Fig. 2, and the specifications for the servers and nodes are presented in Table 2.

Fig. 2: A schematic diagram of the Hadoop cluster master and slave nodes used in the experiment

Table 2 Experimental configuration of the Hadoop cluster

HiBench workloads

Evaluating the performance of a cluster is a challenging task. In the recent past, researchers presented the CloudSuite [51] and CloudStone [52] benchmarks for cluster performance. Intel also proposed the HiBench suite under the Apache Licence [53]. Since then, HiBench has become a heavily used cluster performance testing tool, especially for the Hadoop and Spark frameworks. Existing benchmarks can be divided into three categories: micro-benchmarks, end-to-end benchmarks, and benchmark suites [54]. The HiBench suite itself is organised into four categories: Micro-Benchmark, Web Search, SQL, and ML. In this experiment, five workloads were used, namely WordCount, Kmeans, SVM, Pagerank, and NWeight, drawn from four categories: Micro-Benchmark, ML, Web Search and Graph. Table 3 presents the Spark HiBench workloads, while Table 4 presents the workload application characteristics.

Table 3 Spark HiBenchmark workload considered for this study
Table 4 Workload application characteristics

Cluster parameters configuration

In this work, a set of configuration parameters was considered to evaluate the performance of the system. Spark has more than 150 configurable parameters [8, 9], and system performance heavily depends on correct parameter selection. Therefore, we have judiciously selected for evaluation only the parameters that are closely bound to system performance. However, cluster performance depends not only on the right parameter selection but also on tuning the parameters to achieve optimum system performance. We have seen that the configuration of these parameters heavily depends on the cluster hardware, the workload characteristics and the size of the workloads. Out of the numerous parameters, the most important are the number of executors, the executor memory, the executor core count, and the driver memory. This experiment therefore chose a subset of only the impactful parameters and tuned their values to achieve the best cluster performance.

Recently, several notable studies [26, 55] presented the importance and effectiveness of the outlined tunable parameters. Our study revealed that the right parameter selection is the primary requirement to get the best cluster performance. The chosen parameters are listed in Table 5. The default column in Table 5 presents the system default values; we tuned several parameter values, including the defaults, over the ranges listed in the range column. Our investigation found that in most cases the default values are not appropriate for our cluster. In some cases, for example spark.memory.fraction and spark.memory.storageFraction, there is no performance difference even when using values lower than the defaults. On the other hand, for parameters such as spark.driver.memory, spark.driver.cores, spark.shuffle.file.buffer and spark.reducer.maxSizeInFlight, higher values showed better performance than the defaults. Therefore, we considered only the tuned values listed in the "value used in the experiment" column; the meaning of each parameter is given in the description column. The selection of these parameters and their values is based firstly on the fact that Spark performance heavily depends on the available hardware resources, and secondly on the fact that these parameters control pivotal resources such as CPU, disk reads and writes, and memory [56].

Table 5 Description of selected Spark configuration parameters selected as the input of the proposed model
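For concreteness, the sketch below shows how a subset of the parameters in Table 5 might be set when creating a Spark session for one of the workloads; the values shown are illustrative placeholders, not the tuned values reported in the table, and in practice HiBench passes the equivalent settings through its own configuration files.

```python
from pyspark.sql import SparkSession

# Illustrative configuration only; the tuned values used in the experiments
# are the ones listed in Table 5.
spark = (
    SparkSession.builder
    .appName("hibench-workload")
    .config("spark.executor.instances", "4")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .config("spark.driver.memory", "4g")
    .config("spark.memory.fraction", "0.6")
    .config("spark.shuffle.file.buffer", "64k")
    .config("spark.reducer.maxSizeInFlight", "96m")
    .getOrCreate()
)
```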

Performance evaluations and analysis

In this section, we present the comparative results between the analytical and ML models. To validate the system performance, we used five HiBench workloads with various data sizes. The system runtime characteristics were obtained by running jobs for the five workloads using different numbers of executors and different data sizes.

The proposed work is organised into six stages: loading runtime data, data preprocessing, cross-validation and extrapolation methods, proposed models, ML models, and lastly, performance measurements (Fig. 3). To avoid overfitting and selection bias, a threefold cross-validation process was used. The workload execution times are extracted from the Ambari history server log files using a Python script. For the final graphs, each experiment was repeated at least three times, and the average time is reported as the final result.

Fig. 3: The workflow of the performance analysis

We calculate the job execution time from the job log files. We collected all job log files from the Ambari history server and used a Python script to calculate the execution times. We found a small difference between the times computed by the Python script and those reported by the Ambari server; one possible reason is that the Python script processes the log files independently, while Ambari saves the execution time to the server, where network latency can play a role. In stage two of Fig. 3, data preprocessing is an essential step to achieve the best results from the models, since well-structured data is required to get the best performance from them.
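A minimal sketch of the execution-time calculation is shown below, assuming a hypothetical log format in which each application record carries explicit start and finish timestamps; the actual Ambari/Spark history logs used in this study may encode these times differently.

```python
from datetime import datetime

def execution_time(start: str, finish: str) -> float:
    """Return the job runtime in seconds from two timestamps.

    The timestamp format below is a hypothetical example.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    return (datetime.strptime(finish, fmt) - datetime.strptime(start, fmt)).total_seconds()

print(execution_time("2021-05-01 10:00:05", "2021-05-01 10:07:42"))  # 457.0
```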

In stage three of Fig. 3, two types of data split were used. For the threefold cross-validation, a balanced split was used, in which all sizes and numbers of executors are present in both the training and test data. For the extrapolation split, the training data receives all measured points in the middle of the range of either size or number of executors, keeping the same proportions: 66% of the data for training and the remaining 34% for testing.
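A sketch of the two splitting strategies is given below, assuming the measurements are stored as (size, executors, runtime) rows; the runtimes and the exact way the "middle of the range" is selected are illustrative assumptions, while the 66%/34% proportions follow the text.

```python
import numpy as np
from sklearn.model_selection import KFold

# Each row is one measured job: [size (GB), number of executors, runtime (s)].
# The runtimes here are placeholders; the real values come from the cluster logs.
data = np.array([[s, e, 100.0 * s / e + 10.0] for s in (1, 2, 4, 8) for e in (2, 4, 6)])

# Balanced split for the threefold cross-validation: every size and executor
# count can appear in both the training and the test folds.
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(data):
    train, test = data[train_idx], data[test_idx]

# Extrapolation split on size: train on the points in the middle of the size
# range (about 66% of the data) and test on the points outside it (about 34%).
order = np.argsort(data[:, 0])
n = len(order)
middle = order[n // 6 : n - n // 6]
outside = np.concatenate([order[: n // 6], order[n - n // 6 :]])
train_extra, test_extra = data[middle], data[outside]
```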

In stage four, we applied the fitting to the equations of the analytical models (both the proposed ones and those from the literature), and used the two ML regression algorithms, namely KRR and GBR. Finally, the performance of the models and the ML regressions is measured using R-squared and the relative residual standard error (RRSE). Only the most accurate results have been used for the graphs in Figs. 6, 7, 8, 9, 10.

Evaluation metrics

The choice of model evaluation metrics is an important factor in conducting a comparative analysis. To verify performance, researchers in the literature rely on a variety of metrics. In this study, R-squared \((R^2)\) and the relative residual standard error (RRSE) are used. R-squared is the dominant index in regression analysis for verifying the accuracy of the predicted results, while the RSE is used to determine the goodness-of-fit. The \(R^2\) value (also known as the coefficient of determination) is defined as follows.

$$\begin{aligned} R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \end{aligned}$$
(7)

where \(SS_{res}\) is the sum of the squares of the residuals and \(SS_{tot}\) is the total sum of the squares relative to the mean of the data. The R-squared value lies between 0 and 1, and higher values indicate a better fit. The residual standard error is defined as follows:

$$\begin{aligned} RSE = \sqrt{\frac{\sum _{i=1}^{n}(Y_i-{\hat{Y}}_i)^2}{df}} \end{aligned}$$
(8)

where \(Y_i-{\hat{Y}}_i\) is the difference between the observed value and the value predicted by the model, and df is the degrees of freedom, given by the sample size minus the number of parameters being fitted. The relative residual standard error (RRSE) is:

$$\begin{aligned} RRSE = \frac{RSE}{\mu } \end{aligned}$$
(9)

where \(\mu\) is the mean of the observed values. The relative residual standard error (RRSE) allows us to quantify the error between the observed points and the ones generated by the model: the smaller the RRSE, the better the fit.
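A compact sketch of how the two metrics can be computed from observed and predicted runtimes, following Eqs. (7)-(9), is shown below; the sample values are placeholders, and the number of fitted parameters is an assumption of the example.

```python
import numpy as np

def r_squared(y_obs: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination, Eq. (7)."""
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rrse(y_obs: np.ndarray, y_pred: np.ndarray, n_params: int) -> float:
    """Relative residual standard error, Eqs. (8) and (9)."""
    df = len(y_obs) - n_params                      # degrees of freedom
    rse = np.sqrt(np.sum((y_obs - y_pred) ** 2) / df)
    return rse / y_obs.mean()

y_obs = np.array([120.0, 75.0, 230.0, 140.0, 460.0, 260.0])
y_pred = np.array([118.0, 80.0, 225.0, 150.0, 450.0, 270.0])
print(r_squared(y_obs, y_pred), rrse(y_obs, y_pred, n_params=4))
```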

Kernel ridge models

We used the KRR implementation from scikit-learn [46], considering only the polynomial kernel and ignoring the others. For simplicity, we kept the coefficient and gamma parameters at their default value of 1, and only the degree and alpha values were varied to improve the model's accuracy. It can be noted that the model produced the best R-squared results when \(degree = 70\). Table 6 presents the best results for the individual workloads measured by \(R^2\), standard deviation and relative residual standard error (RRSE). Our study found that, except for the Kmeans and Graph workloads, the three remaining workloads produce their best results with degree 70 and alpha 1, whereas Kmeans and Graph show their best results with \(degree = 30\); in all cases, \(alpha = 1\) achieves the best performance. Our study also revealed that small values of alpha improve the model performance and reduce the variance of the estimates for three of the workloads. We noticed that the model performance for the individual workloads is satisfactory. The R-squared comparison of the KRR algorithm across different degrees and alphas for the selected HiBench workloads is shown in Fig. 4.

Fig. 4: Performance comparison of the KRR algorithm across different degrees and alphas on the selected HiBench workloads

Table 6 Kernel ridge regression algorithm statistical parameters using set of different workloads

Despite showing a better R-squared value for higher degrees, polynomial regression has a known issue related to overfitting [57]. We showed the results of higher degrees to make the point that ML approaches using polynomials can easily overfit. One can see that the fitting follows the experimental data very well and encompasses the full range of the parameters. However, the fitting example also shows that a higher-degree polynomial can create instability in the model. In Fig. 5, four KRR fittings show values interpolated between the experimental data for different sizes, with the interpolation points placed in the middle of the measured data points. What can clearly be seen for the higher-degree polynomials is that the interpolated values extend outside the value range of the plot, even though the training and test data are still predicted well by the model. This means that even with a high R-squared value, predictions for new sizes can be very inaccurate. For this reason, we kept the degree to a maximum of 4 in the experiments in the "Performance comparison of ML and analytical models" and "Performance analysis using extrapolation" sections.
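To make the overfitting risk concrete, the sketch below evaluates KRR over a small grid of degrees and alpha values using cross-validated R-squared; the data points and grid values are illustrative rather than the full set used in our experiments, and very high degrees are numerically ill-conditioned, which is part of the point.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score

# X: [size (GB), number of executors]; y: runtime in seconds (placeholders).
X = np.array([[s, e] for s in (1, 2, 4, 8) for e in (2, 4, 6)], dtype=float)
y = np.array([90, 60, 50, 170, 110, 90, 330, 210, 170, 650, 410, 330], dtype=float)

# A high cross-validated R-squared at a large degree does not guarantee stable
# interpolation, which is why degrees above 4 were discarded in later sections.
for degree in (2, 4, 30, 70):
    for alpha in (1.0, 0.1):
        krr = KernelRidge(kernel="polynomial", degree=degree, alpha=alpha, coef0=1)
        score = cross_val_score(krr, X, y, cv=3, scoring="r2").mean()
        print(f"degree={degree:2d} alpha={alpha:.1f} mean R2={score:.3f}")
```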

Fig. 5: Kernel ridge regression models for Kmeans with increasing degrees. Despite a better R-squared, interpolation is very inaccurate for higher degrees

Gradient boost models

The GBR algorithm is a popular technique for building prediction models. GBR builds an additive model in a forward stage-wise fashion, optimising arbitrary differentiable loss functions. In this experiment, we used GBR from scikit-learn [48] with the default parameters and obtained the best results with random state 1 for all five workloads. We further investigated the model by increasing the random state, but the results were unsatisfactory and are out of scope for this study. In Table 7, the statistical results are shown in terms of the \(R^2\), standard deviation, and RRSE scores.

Table 7 GBR algorithm statistical parameters using set of different workloads

Performance comparison of ML and analytical models

This section illustrates the models' accuracy in terms of the \(R^2\), standard deviation and RRSE values. It presents the performance comparison between the well-known parallelisation models (Amdahl, Gustafson and ERNEST) and the ML algorithms KRR and GBR, where the KRR parameters were optimised. The optimised parameters help the KRR model maximise its prediction accuracy. The k-fold cross-validation technique was applied to all models to achieve the highest prediction accuracy; all table results and figures were obtained using cross-validation with k = 3. The best results obtained for the individual workloads by the proposed models and the ML models are described in the following subsections.

Wordcount

The comparative statistical results between the ML and analytical models for the Wordcount workload, using cross-validation, are presented in Table 8. From this analysis we can see that, except for ERNEST, all the analytical models have an accuracy of 0.995, which is better than the KRR algorithm's 0.974. The GBR algorithm shows the best performance, with an accuracy of 0.998. The best analytical results and the GBR results are presented in Fig. 6. The RRSE results in Table 9 show a very low error for the GBR algorithm, confirming the best fit.

Fig. 6: Comparison of ML (GBR) and analytical model Eq. (5), showing the best R-squared for Wordcount

SVM

The runtime prediction results for the SVM workload, comparing the ML and analytical models with cross-validation, are shown in Table 8. For this workload, both ML algorithms show significantly better results than the analytical models, and the GBR algorithm clearly outperforms the KRR algorithm. It may be noted that the GBR accuracy and RRSE are 0.995 and 0.064 respectively, with corresponding standard deviations of 0.001 and 0.010. The best comparative results are plotted in Fig. 7.

Fig. 7: Comparison of ML (GBR) and analytical model Eq. (5), showing the best R-squared for SVM

Pagerank

The Pagerank performance was evaluated for the ML algorithms and the analytical models with the cross-validation approach in Tables 8 and 9. Our results revealed that the model of Eq. (4) either outperforms or equals all other analytical models and the KRR algorithm, but the GBR algorithm achieves the best results among all models. In Fig. 8, the best results are plotted for GBR and Eq. (4).

Fig. 8: Comparison of ML (GBR) and analytical model Eq. (4), showing the best R-squared for Pagerank

Kmeans

The comparative performance measurements of the ML algorithms and analytical models for the Kmeans workload are presented in Table 8. The GBR algorithm is the most effective model, showing the best accuracy and producing the lowest RRSE among all the models. It can be noted that the analytical models outperform the KRR algorithm, whose accuracy is 0.981, while the analytical models reach a higher accuracy of 0.992. Among the analytical models, performance is equal or only slightly different; for example, Gustafson's accuracy is equal to that of the other analytical models, but its RRSE is slightly better, as shown in Table 9. The best results obtained among the models are shown in Fig. 9.

Fig. 9: Comparison of ML (GBR) and analytical model Eq. (5), showing the best R-squared for Kmeans

Graph

The ML algorithms (KRR and GBR) show excellent results for the Graph workload. From the statistical results shown in Tables 8 and 9, the GBR algorithm records the best results among all models and outperforms KRR. Equation (4) shows a significant performance improvement among the analytical models, and the equations previously proposed by us clearly outperform the ERNEST, Amdahl and Gustafson models. The best results from the ML model and an analytical model are shown in Fig. 10.

Fig. 10: Comparison of ML (GBR) and analytical model Eq. (4), showing the best R-squared for Graph

Table 8 R-squared values for a different set of workloads and models
Table 9 Relative residual standard error (RRSE) values for a different set of workloads and models

In summary, the above results demonstrate the model performances for the selected workloads. The GBR algorithm achieves excellent performance in comparison to all other models. On the other hand, Eqs. (4) and (5) show excellent results among the analytical models, and both equations are better than KRR for the WordCount, SVM, PageRank and Kmeans workloads. For the Graph workload, both ML algorithms demonstrated the best results. The above analysis nevertheless shows the effectiveness of the analytical models compared with the ML approaches.

Performance analysis using extrapolation

Data obtained from the results in the "Performance comparison of ML and analytical models" section are used for the extrapolation analysis in this section. However, rather than carrying out cross-validation on the entire dataset, we reserved part of the data set to test how well each model can deal with extrapolation. This is a very important aspect of the prediction models, as extrapolation would allow practitioners to make accurate predictions from a very small number of experiments on a specific cluster and workload. Our hypothesis was that, despite the better results of the ML models presented in the "Performance comparison of ML and analytical models" section, equations that represent the cluster behaviour could be more accurate for extrapolation. In other words, ML is an effective approach for predictions that fall within the range of values captured in the existing data, but can yield poor results if not enough data is available to describe runtimes beyond a given range. In these scenarios, ML methods may not be able to capture the actual pattern of communication that drives the runtime.

Due to the limited number of data points, we employed a linear extrapolation approach to estimate data values that are close to the existing data. In general, nonlinear models achieve higher fitting accuracy than linear models, but they are also more likely to overfit the training data set, which results in poor performance on unseen data [58]. In contrast, linear models fit the data more conservatively, so better fitting can be achieved on the unseen data. We observe from the results presented in Table 10 that the performance on the linear workloads is better than on the quadratic workloads. Two extrapolation scenarios were considered: extrapolation by size, and extrapolation by number of executors.

Table 10 R-squared values for extrapolation on size

Extrapolation by size

Wordcount

Figure 11 presents the Wordcount workload results using the extrapolation approach. In this case, when extrapolating by size, the analytical models yielded the best results. The comparative results are presented in Table 10, and it can be noted that Eq. (1) shows the best fit to the data compared with the other models, while the ML algorithms show a decreased data fitting accuracy. Our analysis concludes that the performance of the proposed equations is better than that of ML when the data are extrapolated.

Fig. 11: Extrapolation relationship showing the Wordcount workload by size using Eq. (1) and GBR

SVM

For the SVM workload, the extrapolation by size achieved its best results using Eq. (5), where the model accuracy is 0.981, while the KRR and GBR accuracies are lower than those of all the analytical equations. We used 3D charts to illustrate the relationship between size, number of executors and runtime, because they show this relationship more clearly. The best analytical and ML results are plotted in Fig. 12. In this case, we found that KRR performs better than the GBR algorithm.

Fig. 12: Extrapolation relationship showing the SVM workload by size using Eq. (5) and KRR

Pagerank

In the case of the Pagerank workload, the analytical models completely outperform KRR and GBR. Equations (5) and (6) achieve high accuracy, proving the models' effectiveness over ML when the data are extrapolated by size. The maximum accuracy of the analytical models is 0.996, whereas that of ML is 0.875; KRR and GBR handle the extrapolation less effectively. For comparison, Fig. 13 shows the fitting for Eq. (5) and for GBR.

Fig. 13: Extrapolation relationship showing the Pagerank workload by size using Eq. (5) and GBR

Kmeans

In the case of the Kmeans workload, when the data are extrapolated by size, Eq. (6) and ERNEST show an equally excellent performance, and the other analytical models also demonstrate better performance than the ML models. As shown in Table 10, Eq. (6) achieved a notable accuracy of 0.998, while the KRR and GBR accuracies proved relatively poor at 0.836 and 0.875 respectively. We can conclude from these results that Eq. (6) is a very effective model for unseen data points. For comparison, Fig. 14 shows the fitting for Eq. (6) and for GBR.

Fig. 14: Extrapolation relationship showing the Kmeans workload by size using Eq. (6) and GBR

Graph

In the case of the Graph workload, when the data are extrapolated by size, Eq. (4) is at par with or better than the other analytical models, and both ML algorithms appear to be less effective. We found that Eq. (4) and KRR perform better than the other models, with accuracies of 0.940 and 0.904, respectively. For comparison, Fig. 15 shows the fitting for Eq. (4) and for KRR.

Fig. 15: Extrapolation relationship showing the Graph workload by size using Eq. (4) and KRR

Extrapolation by number of executors

Wordcount

The results of extrapolation by number of executors for the Wordcount workload are presented in Table 11. As expected, the ML performance is considerably lower than that of the analytical models. The performances of Eqs. (4), (6) and (1) (Amdahl) are equal to or better than those of the other models. The best accuracy achieved by these equations is 0.997, whereas the accuracies for the KRR and GBR algorithms are 0.786 and 0.535, respectively. We compared all the results and plotted the best ones (Eq. (6) and KRR) in Fig. 16. These results demonstrate that extrapolation by number of executors yields the best results with the equations, while the ML algorithms are less accurate.

Fig. 16: Extrapolation for Wordcount by Nexec using Eq. (6) and KRR

SVM

In the case of SVM, the extrapolation by number of executors yielded its best results with Eq. (4). The KRR and GBR regression performance is poor, as expected. The effectiveness of Eq. (4) is excellent, with an accuracy of 0.893. The best comparative performance in Fig. 17 shows the fitting for Eq. (4) and for KRR. These results again demonstrate that extrapolation by number of executors yields the best results with the equations, while the ML algorithms are less accurate.

Fig. 17: Extrapolation for SVM by Nexec using Eq. (4) and KRR

Pagerank

Equations (4), (5), (6) and Amdahl's (1) show a remarkable performance improvement for the Pagerank workload when the data are extrapolated by number of executors. All four equations obtained an accuracy of 0.994, while the KRR and GBR regressions demonstrated relatively poor accuracies of 0.619 and 0.408, respectively. We compared the models' performance and plotted the best results (Eq. (5) and KRR) in Fig. 18.

Fig. 18: Extrapolation for Pagerank by Nexec using Eq. (5) and KRR

Kmeans

In the case of the Kmeans workload, Eqs. (5) and (6) produce a better fit than all the other models and show a significant performance improvement when the data are extrapolated by number of executors. As shown in Table 11, both equations performed well, achieving an accuracy of 0.995, while KRR and GBR achieved poorer accuracies of 0.917 and 0.510, respectively. These results indicate that ML is less accurate when training on limited data and when having to extrapolate beyond the values seen in the training data. We examined and compared the performance of the models and plotted the best results (Eq. (6) and KRR) in Fig. 19.

Fig. 19: Extrapolation for Kmeans by Nexec using Eq. (6) and KRR

Graph

In the case of the Graph workload, when the data are extrapolated by number of executors, Eq. (4) shows the best performance among all the models, while the KRR and GBR performances are inferior. As shown in Table 11, the accuracy of Eq. (4) is 0.945, while the KRR and GBR accuracies are 0.150 and 0.371, respectively. These results indicate that the selected ML algorithms are not suitable when extrapolation is required. Figure 20 shows the performance comparison for Eq. (4) and GBR.

Fig. 20: Extrapolation for Graph by Nexec using Eq. (4) and GBR

Table 11 R-squared values for extrapolation on number of executors

Discussion

To evaluate the performance of the prediction models, this paper used analytical and ML models and presented a comparative analysis between them. Several experiments were performed to predict and evaluate Spark runtime performance. The comprehensive comparison of the results presented in Tables 6, 7, 8, 9, 10, and 11 leads us to the following three observations.

Firstly, we presented the relationship between the alpha and degree parameters of KRR regression. This study found that, for most of the workloads, the best R-squared can be achieved by selecting a small degree together with a suitable alpha. Our analysis also found that higher degrees can produce a better R-squared, but overfitting then becomes a major limitation. For GBR, we kept all parameters at their default values; we examined different random states, but they had no effect on the accuracy. The detailed results are presented in the "Performance evaluations and analysis" and "Performance analysis using extrapolation" sections.

Secondly, interpolation experiments were carried out. The data split for the cross-validation used data points for all the available sizes and numbers of executors in both the test and training sets. The presented results showed that the proposed analytical models are better than KRR regression and produce accuracy similar to that of ERNEST, Amdahl and Gustafson. However, the GBR model was the most accurate of all the models.

Finally, we used the extrapolation method with the cross-validation technique, and the analysis was carried out by size and by number of executors. We observed that the performance of the ML models is poor in both cases, while the analytical models are more accurate and effective. The results presented in Tables 10 and 11 show that, among the ML models, the linear workloads are predicted more accurately than the quadratic workloads, for which accuracy is significantly poorer. The average accuracies of KRR, GBR, and the 2D-plate (Eq. 4) or fully-connected (Eq. 5) models are 0.466, 0.677 and 0.950, respectively. These results indicate that both the 2D-plate and the fully-connected models are more effective and accurate when extrapolating data, either over size or over number of executors.

Conclusion

This work aimed to compare five analytical models against ML regression algorithms for Spark performance prediction. We investigated two ML algorithms, namely the KRR and GBR algorithms, and five analytical models, namely the 2D-plate, fully-connected, ERNEST, Amdahl and Gustafson models. The key challenge was how to use limited data points to generate models that fit the data accurately and generalise well when extrapolating.

To address these challenges, we used interpolation and extrapolation methods with k-fold cross-validation technique for both ML and analytical models. Using the interpolation method, 2D-plate and fully-connected models outperformed the KRR algorithm, ERNEST, Amdahl and Gustafson, but the GBR showed a better fitting accuracy \((R^2)\) than all other models.

Due to the limited input data available when using the extrapolation method, the ML algorithms proved not to be as accurate as the 2D-plate and fully-connected models. Our experimental findings confirm that both the 2D-plate and fully-connected models reduce the percentage error significantly and can accurately fit the data for prediction purposes. For this reason, the 2D-plate and fully-connected models stand out as very effective approaches for predicting Spark performance, as well as parallel system performance in general, when only limited input data is available.

For future work, we plan to study additional ML algorithms for comparative analysis and to perform more robust experiments with different HiBench workloads in order to yield further conclusive findings.

Availability of data and materials

Data are contained within the article. However, the corresponding author can be contacted for more details.

Abbreviations

SVM: Support vector machines
SVR: Support vector regression
GBR: Gradient boost regression
GBM: Gradient boost machine
DT: Decision tree
LR: Logistic regression
KRR: Kernel ridge regression
MF: Matrix factorization
NB: Naïve Bayes
NN: Neural networks
MLR: Multiple linear regression
TSt: Two-stage tree
LS-SVM: Least squares support vector machine
API: Application programming interface
SQL: Structured query language
HDFS: Hadoop distributed file system
RDD: Resilient distributed datasets
ML: Machine learning
MLlib: Machine learning library
AMPLab: Algorithms, machines and people lab
CPU: Central processing unit
I/O: Input/output
UC: University of California
AMP: Algorithms, machines and people
DAG: Directed acyclic graph
YARN: Yet another resource negotiator
NEXEC: Number of executors
SQRT: Square root
RRSE: Root relative squared error
2D: Two-dimensional
GHz: Gigahertz
TB: Terabyte
RAM: Random access memory
DDR: Double data rate
GB: Gigabyte
MB: Megabyte
WC: Wordcount
Exec: Executor
Amazon EC2: Amazon Elastic Compute Cloud

References

  1. Ghani NA, Hamid S, Hashem IAT, Ahmed E. Social media big data analytics: a survey. Comput Hum Behav. 2019;101:417–28.

  2. Fang R, Pouyanfar S, Yang Y, Chen S-C, Iyengar S. Computational health informatics in the big data age: a survey. ACM Comput Surv. 2016;49(1):1–36.

  3. Hirschberg J, Manning CD. Advances in natural language processing. Science. 2015;349(6245):261–6.

  4. Maros A, Murai F, da Silva APC, Almeida JM, Lattuada M, Gianniti E, Hosseini M, Ardagna D. Machine learning for performance prediction of spark cloud applications. In: 2019 IEEE 12th international conference on cloud computing (CLOUD). New York: IEEE; 2019. p. 99–106.

  5. Salloum S, Dautov R, Chen X, Peng PX, Huang JZ. Big data analytics on apache spark. Int J Data Sci Anal. 2016;1(3):145–64.

  6. Awan MJ, Khan RA, Nobanee H, Yasin A, Anwar SM, Naseem U, Singh VP. A recommendation engine for predicting movie ratings using a big data approach. Electronics. 2021;10(10):1215.

  7. Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai D, Amde M, Owen S, et al. Mllib: machine learning in apache spark. J Mach Learn Res. 2016;17(1):1235–41.

  8. Petridis P, Gounaris A, Torres J. Spark parameter tuning via trial-and-error. In: INNS conference on big data. Berlin: Springer; 2016. p. 226–37.

  9. Ahmed N, Barczak AL, Susnjak T, Rashid MA. A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench. J Big Data. 2020;7(1):1–18.

  10. Herodotou H, Babu S. Profiling, what-if analysis, and cost-based optimization of mapreduce programs. Proc VLDB Endow. 2011;4(11):1111–22.

  11. Mustafa S, Elghandour I, Ismail MA. A machine learning approach for predicting execution time of spark jobs. Alex Eng J. 2018;57(4):3767–78.

  12. Cheng G, Ying S, Wang B, Li Y. Efficient performance prediction for apache spark. J Parallel Distrib Comput. 2021;149:40–51.

  13. Cheng G, Ying S, Wang B. Tuning configuration of apache spark on public clouds by combining multi-objective optimization and performance prediction model. J Syst Softw. 2021;180:111028.

  14. Luo N, Yu Z, Bei Z, Xu C, Jiang C, Lin L. Performance modeling for spark using svm. In: 2016 7th international conference on cloud computing and big data (CCBD). New York: IEEE; 2016. p. 127–31.

  15. Ahmed N, Barczak AL, Rashid MA, Susnjak T. An enhanced parallelisation model for performance prediction of apache spark on a multinode hadoop cluster. Big Data Cogn Comput. 2021;5(4):65.

  16. Ahmed N, Barczak ALC, Susnjak T, Rashid MA. A parallelization model for performance characterization of spark big data jobs on Hadoop clusters. J Big Data. 2021;8(107):1–28. https://doi.org/10.1186/s40537-021-00499-7.

  17. Dogan A, Birant D. Machine learning and data mining in manufacturing. Expert Syst Appl. 2021;166:114060.

  18. Mavridis I, Karatza H. Performance evaluation of cloud-based log file analysis with apache hadoop and apache spark. J Syst Softw. 2017;125:133–51.

  19. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauly M, Franklin MJ, Shenker S, Stoica I. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: 9th USENIX symposium on networked systems design and implementation (NSDI 12). 2012. p. 15–28.

  20. Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I, et al. Spark: cluster computing with working sets. HotCloud. 2010;10(10–10):95.

  21. Shahul A. Spark architecture: Apache Spark tutorial. 2021. https://www.learntospark.com/2020/02/spark-architecture.html.

  22. Chen J, Li K, Tang Z, Bilal K, Yu S, Weng C, Li K. A parallel random forest algorithm for big data in a spark cloud computing environment. IEEE Trans Parallel Distrib Syst. 2016;28(4):919–33.

  23. Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S et al. Apache hadoop yarn: yet another resource negotiator. In: Proceedings of the 4th annual symposium on cloud computing, 2013. p. 1–16.

  24. Ousterhout K, Rasti R, Ratnasamy S, Shenker S, Chun B-G. Making sense of performance in data analytics frameworks. In: 12th USENIX symposium on networked systems design and implementation (NSDI 15). 2015. p. 293–307.

  25. Al-Sayeh H, Hagedorn S, Sattler K-U. A gray-box modeling methodology for runtime prediction of apache spark jobs. Distrib Parallel Databases. 2020;38(4):819–39.

  26. Chao Z, Shi S, Gao H, Luo J, Wang H. A gray-box performance model for apache spark. Future Gener Comput Syst. 2018;89:58–67.

  27. Lattuada M, Gianniti E, Hosseini M, Ardagna D, Maros A, Murai F, Couto da Silva AP, Almeida JM. Gray-box models for performance assessment of spark applications. In: 9th international conference on cloud computing and services science. SciTePress; 2019. p. 609–18.

  28. Prats DB, Portella FA, Costa CH, Berral JL. You only run once: spark auto-tuning from a single run. IEEE Trans Netw Serv Manag. 2020;17(4):2039–51.

  29. Jia Z, Xue C, Chen G, Zhan J, Zhang L, Lin Y, Hofstee P. Auto-tuning spark big data workloads on POWER8: prediction-based dynamic SMT threading. In: 2016 international conference on parallel architecture and compilation techniques (PACT). New York: IEEE; 2016. p. 387–400.

  30. Nikitopoulou D, Masouros D, Xydis S, Soudris D. Performance analysis and auto-tuning for spark in-memory analytics. In: 2021 design, automation & test in Europe conference & exhibition (DATE). New York: IEEE; 2021. p. 76–81.

  31. de Oliveira D, Porto F, Boeres C, de Oliveira D. Towards optimizing the execution of spark scientific workflows using machine learning-based parameter tuning. Concurr Comput Pract Exp. 2021;33(5):5972.

  32. Boden C, Spina A, Rabl T, Markl V. Benchmarking data flow systems for scalable machine learning. In: Proceedings of the 4th ACM SIGMOD workshop on algorithms and systems for mapreduce and beyond. 2017. p. 1–10.

  33. Boden C, Rabl T, Schelter S, Markl V. Benchmarking distributed data processing systems for machine learning workloads. In: Technology conference on performance evaluation and benchmarking. Berlin: Springer; 2018. p. 42–57.

  34. Mostafaeipour A, Jahangard Rafsanjani A, Ahmadi M, Arockia Dhanraj J. Investigating the performance of Hadoop and Spark platforms on machine learning algorithms. J Supercomput. 2021;77(2):1273–300.

  35. Assefi M, Behravesh E, Liu G, Tafti AP. Big data machine learning using apache spark MLlib. In: 2017 IEEE international conference on big data (big Data). New York: IEEE; 2017. p. 3492–8.

  36. Javaid MU, Kanoun AA, Demesmaeker F, Ghrab A, Skhiri S. A performance prediction model for spark applications. In: International conference on big data. Berlin: Springer; 2020. p. 13–22.

  37. Singhal R, Phalak C, Singh P. Spark job performance analysis and prediction tool. In: Companion of the 2018 ACM/SPEC international conference on performance engineering. 2018. p. 49–50.

  38. Yigitbasi N, Willke TL, Liao G, Epema D. Towards machine learning-based auto-tuning of mapreduce. In: 2013 IEEE 21st international symposium on modelling, analysis and simulation of computer and telecommunication systems. New York: IEEE. 2013. p. 11–20.

  39. Cristianini N, Shawe-Taylor J, et al. An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press; 2000.

  40. Friedman JH. Stochastic gradient boosting. Comput Stat Data Anal. 2002;38(4):367–78.

  41. Venkataraman S, Yang Z, Franklin M, Recht B, Stoica I. Ernest: efficient performance prediction for large-scale advanced analytics. In: 13th USENIX symposium on networked systems design and implementation (NSDI 16). 2016. p. 363–78.

  42. Amdahl GM. Validity of the single processor approach to achieving large scale computing capabilities. In: AFIPS ’67 (spring): spring joint computer conference. 1967. p. 483–5. https://doi.org/10.1145/1465482.1465560.

  43. Gustafson JL. Reevaluating Amdahl’s law. Commun ACM. 1988;31(5):532–3. https://doi.org/10.1145/42411.42415.

  44. Boden C, Rabl T, Markl V. Distributed machine learning-but at what cost. In: Machine learning systems workshop at the 2017 conference on neural information processing systems. 2017.

  45. Cawley GC, Talbot NL, Chapelle O. Estimating predictive variances with kernel ridge regression. In: Machine learning challenges workshop. Berlin: Springer; 2005. p. 56–77.

  46. Kernel ridge regression. https://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html.

  47. Ashcroft M. Advanced machine learning: basics and kernel regression. https://www.futurelearn.com/info/courses/advanced-machine-learning/0/steps/49560.

  48. sklearn.ensemble.GradientBoostingRegressor. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html.

  49. Wilkinson B, Allen M. Parallel programming. New Jersey: Prentice Hall; 1999.

  50. Barczak ALC, Messom CH, Johnson MJ. Performance characteristics of a cost-effective medium-sized Beowulf cluster supercomputer. In: LNCS 2660. Berlin: Springer; 2003. p. 1050–9.

  51. Vazquez C, Krishnan R, John E. Cloud computing benchmarking: a survey. In: Proceedings of the international conference on grid, cloud, and cluster computing (GCC), 2014. p. 1.

  52. Sobel W, Subramanyam S, Sucharitakul A, Nguyen J, Wong H, Klepchukov A, Patil S, Fox A, Patterson D. Cloudstone: multi-platform, multi-language benchmark and measurement tools for web 2.0. In: Proc. of CCA, Vol. 8. 2008. p. 228.

  53. Intel-bigdata: HiBench benchmark suite. https://github.com/Intel-bigdata/HiBench.

  54. Han R, John LK, Zhan J. Benchmarking big data systems: a review. IEEE Trans Serv Comput. 2017;11(3):580–97.

  55. Zhao Y, Hu F, Chen H. An adaptive tuning strategy on spark based on in-memory computation characteristics. In: 2016 18th international conference on advanced communication technology (ICACT). New York: IEEE; 2016. p. 484–8.

  56. Marcu O-C, Costan A, Antoniu G, Pérez-Hernández MS. Spark versus flink: understanding performance in big data analytics frameworks. In: 2016 IEEE international conference on cluster computing (CLUSTER). New York: IEEE; 2016. p. 433–42.

  57. NIST/SEMATECH e-handbook of statistical methods. National Institute of Standards and Technology (NIST). 2018. https://www.itl.nist.gov/div898/handbook/pmd/section8/pmd811.htm

  58. Ding Y, Pervaiz A, Carbin M, Hoffmann H. Generalizable and interpretable learning for configuration extrapolation. In: Proceedings of the 29th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering. 2021. p. 728–40.

Acknowledgements

This work was supported in part by the Massey University Doctoral Scholarship.

Funding

Not applicable. This research received no funding from any agency.

Author information

Contributions

NA and ALCB were the main contributors to this work. NA carried out the initial literature review and data collection, ran the experiments, prepared the results, and drafted the manuscript. Both NA and ALCB fitted the data into the models and analysed the results. ALCB and TS deployed and configured the physical Hadoop cluster. TS and MAR helped to improve the final paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Nasim Ahmed.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Ahmed, N., Barczak, A.L.C., Rashid, M.A. et al. Runtime prediction of big data jobs: performance comparison of machine learning algorithms and analytical models. J Big Data 9, 67 (2022). https://doi.org/10.1186/s40537-022-00623-1

Keywords