Runtime prediction of big data jobs: performance comparison of machine learning algorithms and analytical models

Ahmed, Nasim; Barczak, Andre L. C.; Rashid, Mohammad A.; Susnjak, Teo

doi:10.1186/s40537-022-00623-1

Journal of Big Data

Table 1 Various models on Spark performance prediction


Published work	Workloads and data sets	Models	Metrics (error and accuracy)
Cheng [12]	WordCount, Kmeans, TeraSort, PageRank, Bayes, and Nweight	Adaboost, ensemble learners, multiple learners, and projective sampling	Average accuracy error: Adaboost (30 cases): 9.02%, ensemble learners: 18.63%, multiple learners: 21.98%, projective sampling: 14.09%
de Oliveria [31]	Data sets: astronomy and bioinformatics; data partitions: 3	Decision tree (DT)	Prediction accuracy: best 3 scenarios out of 7: SC1: 90.4%, 88.8%, and 86.5%
Boden [32]	WordCount, Grep, and Sort	Logistic regression (LR) and Kmeans	High data dimensionality
Boden [33]	Data sets: Criteo Click Logs and Netflix Prize; Kmeans, logistic regression (LR), matrix factorization (MF), and gradient boost regression (GBR)	Kmeans, logistic regression (LR), matrix factorization (MF), and gradient boost regression (GBR)	MF: required more time than single LibMF; LR: Spark MLlib required more hardware resources; GBR: better than LR
Assefi [35]	Data sets: HEPMASS, SUSY, HIGGS, LIGHT, HETROACT I and II	Support vector machine (SVM), decision tree (DT), Kmeans, Naïve Bayes (NB), Weka, and random forest (RF)	t-test: p < 0.01
Javaid [36]	KMeans, PageRank, sorting, WordCount, binomial logistic regression, linear regression, groupby, decision tree classifier, single source shortest path, and breadth first search	Linear regression (LR), random forest (RF), gradient boost machine (GBM), and neural networks (NN)	Average accuracy error: LR, GBM, RF, and NN: approx. 10%
Singhal [37]	WordCount, TeraSort, Kmeans, and SQL	Multi linear regression (MLR), MLR-quadratic (MLRQ), support vector machine (SVM), and analytical model	Prediction accuracy error: MLR, SVM, and MLRQ: MAPE 22%, analytical models: 80%
Cheng [13]	WordCount, Kmeans, TeraSort, PageRank, Bayes, and Nweight	AB-MOEA/D, random forest (RF), and two-stage tree (TSt)	Prediction error: AB-MOEA/D: 3.6%, RF: 8.97%, TSt: 14.57%
