Table 1 Various models on Spark performance prediction

From: Runtime prediction of big data jobs: performance comparison of machine learning algorithms and analytical models

| Published work | Workloads and data sets | Models | Metrics (error and accuracy) |
| --- | --- | --- | --- |
| Cheng [12] | WordCount, Kmeans, TeraSort, PageRank, Bayes, and Nweight | Adaboost, ensemble learners, multiple learners, and projective sampling | Average accuracy error: Adaboost (30 cases): 9.02%, ensemble learners: 18.63%, multiple learners: 21.98%, projective sampling: 14.09% |
| de Oliveria [31] | Data sets: astronomy and bioinformatics; data partitions: 3 | Decision tree (DT) | Prediction accuracy: best 3 scenarios out of 7: SC1: 90.4%, 88.8%, and 86.5% |
| Boden [32] | WordCount, Grep, and Sort | Logistic regression (LR) and Kmeans | High data dimensionality |
| Boden [33] | Data sets: Criteo Click Logs and Netflix Prize | Kmeans, logistic regression (LR), matrix factorization (MF), and gradient boost regression (GBR) | MF: required more time than single LibMF; LR: Spark MLlib required more hardware resources; GBR: better than LR |
| Assefi [35] | Data sets: HEPMASS, SUSY, HIGGS, LIGHT, HETROACT I and II | Support vector machine (SVM), decision tree (DT), Kmeans, Naïve Bayes (NB), Weka, and random forest (RF) | t-test: p < 0.01 |
| Javaid [36] | KMeans, PageRank, sorting, WordCount, binomial logistic regression, linear regression, groupby, decision tree classifier, single source shortest path, and breadth first search | Linear regression (LR), random forest (RF), gradient boost machine (GBM), and neural networks (NN) | Average accuracy error: LR, GBM, RF, and NN: approximately 10% |
| Singhal [37] | WordCount, TeraSort, Kmeans, and SQL | Multi linear regression (MLR), MLR-quadratic (MLRQ), support vector machine (SVM), and analytical model | Prediction accuracy error: MLR, SVM, and MLRQ: MAPE 22%; analytical models: 80% |
| Cheng [13] | WordCount, Kmeans, TeraSort, PageRank, Bayes, and Nweight | AB-MOEA/D, random forest (RF), and two-stage tree (TSt) | Prediction error: AB-MOEA/D: 3.6%, RF: 8.97%, TSt: 14.57% |
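
Several of the error figures in the table are percentage errors between predicted and measured job runtimes, and the Singhal [37] entry names MAPE (mean absolute percentage error) explicitly. As a reading aid, the following minimal Python sketch shows how such a figure is commonly computed. The function name `mape` and the runtimes are hypothetical illustrations, not data or code from any of the cited works, and the individual papers may define their error metrics differently.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned in percent.

    Assumes both sequences have the same non-zero length and that
    no measured (actual) runtime is zero.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Average the relative error of each prediction, then scale to percent.
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical measured and predicted job runtimes, in seconds.
actual_runtimes = [100.0, 200.0, 400.0]
predicted_runtimes = [110.0, 180.0, 400.0]

error = mape(actual_runtimes, predicted_runtimes)
print(f"MAPE: {error:.2f}%")
```

Under this definition, a reported MAPE of 22% would mean that predicted runtimes deviate from measured runtimes by 22% on average.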