Published work | Workloads and data sets | Models | Metrics (error and accuracy) |
---|---|---|---|
Cheng [12] | WordCount, Kmeans, TeraSort, PageRank, Bayes, and Nweight | Adaboost, ensemble learners, multiple learners, and projective sampling | Average accuracy error: Adaboost (30 cases): 9.02%, ensemble learners: 18.63%, multiple learners: 21.98%, projective sampling: 14.09% |
de Oliveria [31] | Data sets: astronomy and bioinformatics; data partitions: 3 | Decision tree (DT) | Prediction accuracy: best 3 scenarios out of 7: SC1: 90.4%, 88.8%, and 86.5% |
Boden [32] | WordCount, Grep, and Sort | Logistic regression (LR) and Kmeans | High data dimensionality |
Boden [33] | Data sets: Criteo Click Logs and Netflix Prize; workloads: Kmeans, logistic regression (LR), matrix factorization (MF), and gradient boost regression (GBR) | Kmeans, logistic regression (LR), matrix factorization (MF), and gradient boost regression (GBR) | MF: required more time than single LibMF; LR: Spark MLlib required more hardware resources; GBR: better than LR |
Assefi [35] | Data sets: HEPMASS, SUSY, HIGGS, LIGHT, HETROACT I and II | Support vector machine (SVM), decision tree (DT), Kmeans, Naïve Bayes (NB), Weka, and random forest (RF) | t-test: p < 0.01 |
Javaid [36] | Kmeans, PageRank, sorting, WordCount, binomial logistic regression, linear regression, groupby, decision tree classifier, single source shortest path, and breadth first search | Linear regression (LR), random forest (RF), gradient boost machine (GBM), and neural networks (NN) | Average accuracy error: LR, GBM, RF, and NN: approx. 10% |
Singhal [37] | WordCount, TeraSort, Kmeans, and SQL | Multi linear regression (MLR), MLR-quadratic (MLRQ), support vector machine (SVM), and analytical model | Prediction accuracy error: MLR, SVM, and MLRQ: MAPE 22%; analytical models: 80% |
Cheng [13] | WordCount, Kmeans, TeraSort, PageRank, Bayes, and Nweight | AB-MOEA/D, random forest (RF), and two-stage tree (TSt) | Prediction error: AB-MOEA/D: 3.6%, RF: 8.97%, TSt: 14.57% |