Table 1 Various models on Spark performance prediction

From: Runtime prediction of big data jobs: performance comparison of machine learning algorithms and analytical models

| Published work | Workloads and data sets | Models | Metrics (error and accuracy) |
| --- | --- | --- | --- |
| Cheng [12] | WordCount, Kmeans, TeraSort, PageRank, Bayes, and Nweight | Adaboost, ensemble learners, multiple learners, and projective sampling | Average accuracy error: Adaboost (30 cases): 9.02%, ensemble learners: 18.63%, multiple learners: 21.98%, projective sampling: 14.09% |
| de Oliveria [31] | Data sets: astronomy and bioinformatics; data partitions: 3 | Decision tree (DT) | Prediction accuracy: best 3 scenarios out of 7: SC1: 90.4%, 88.8%, and 86.5% |
| Boden [32] | WordCount, Grep, and Sort | Logistic regression (LR) and Kmeans | High data dimensionality |
| Boden [33] | Data set: CriteoClick Logs and Netflix Prize; Kmeans, logistic regression (LR), matrix factorization (MF), and gradient boost regression (GBR) | Kmeans, logistic regression (LR), matrix factorization (MF), and gradient boost regression (GBR) | MF: required more time than single LibMF; LR: Spark MLlib required more hardware resources; GBR: better than LR |
| Assefi [35] | Data sets: HEPMASS, SUSY, HIGGS, LIGHT, HETROACT I and II | Support vector machine (SVM), decision tree (DT), Kmeans, Naïve Bayes (NB), Weka, and random forest (RF) | t-test: p < 0.01 |
| Javaid [36] | KMeans, PageRank, sorting, WordCount, binomial logistic regression, linear regression, groupby, decision tree classifier, single source shortest path, and breadth first search | Linear regression (LR), random forest (RF), gradient boost machine (GBM), and neural networks (NN) | Average accuracy error: LR, GBM, RF, and NN: approx. 10% |
| Singhal [37] | Wordcount, Terasort, Kmeans, and SQL | Multi linear regression (MLR), MLR-quadratic (MLRQ), support vector machine (SVM), and analytical model | Prediction accuracy error: MLR, SVM, and MLRQ: MAPE 22%; analytical models: 80% |
| Cheng [13] | WordCount, Kmeans, TeraSort, PageRank, Bayes, and Nweight | AB-MOEA/D, random forest (RF), and two-stage tree (TSt) | Prediction error: AB-MOEA/D: 3.6%, RF: 8.97%, TSt: 14.57% |
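
Several rows of Table 1 report errors as percentages, and one row names MAPE (mean absolute percentage error) explicitly. As a reading aid, the sketch below shows one common way to compute MAPE for runtime predictions. It is a minimal illustration under stated assumptions: the function name `mape` and the sample runtimes are hypothetical, and the cited studies may define or aggregate their error metrics differently.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned in percent.

    Assumes both sequences have the same non-zero length and that
    no actual value is zero (the metric is undefined in that case).
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("actual values must be non-zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical job runtimes in seconds (illustrative only, not taken
# from any of the studies in Table 1).
measured = [100.0, 200.0, 400.0]
estimated = [110.0, 180.0, 400.0]

error = mape(measured, estimated)
print(f"MAPE: {error:.2f}%")
```

Because each term is divided by the measured runtime, short jobs with small absolute errors can dominate the average, which is worth keeping in mind when comparing percentages across workloads of very different sizes.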