From: A parallelization model for performance characterization of Spark Big Data jobs on Hadoop clusters
| Benchmark category | Application | Input data size | Input samples |
|---|---|---|---|
| Micro benchmark | WordCount | 313 MB, 940 MB, 5.9 GB, 8.8 GB, and 19.2 GB | - |
| Machine learning | K-means (small job) | 1.3 MB, 2.7 MB, 4 MB, 5.3 MB, and 13.3 MB | 3000, 5000, and 7000 (samples); 1 and 3 (million samples) |
| Machine learning | K-means (large job) | 19 GB, 56 GB, 94 GB, 130 GB, and 168 GB | 10, 30, 50, 70, and 90 (million samples) |
| Machine learning | SVM | 34 MB, 60 MB, 1.2 GB, 1.8 GB, and 2 GB | 2100, 2600, 3600, 4100, and 5100 (samples) |
| Web search | PageRank (small job) | 3.8 MB, 5.7 MB, 8 MB, 10 MB, and 12.2 MB | 1, 15, 20, 25, and 30 (thousand samples) |
| Web search | PageRank (large job) | 507 MB, 1.6 GB, 2.8 GB, 4 GB, and 5 GB | 1, 3, 5, 7, and 9 (million pages) |
| Graph | Nweight | 37 MB, 70 MB, 129 MB, 155 MB, and 211 MB | 1, 2, 4, 5, and 7 (million edges) |
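For reference, the logic of the WordCount micro-benchmark in the table can be sketched in plain Python. This is a minimal stand-in for the Spark `flatMap` → `map` → `reduceByKey` pipeline, not the benchmark implementation itself; the `word_count` helper is hypothetical:

```python
from collections import Counter

def word_count(lines):
    # Stand-in for the Spark WordCount pipeline:
    # flatMap(split on whitespace) -> map to (word, 1) -> reduceByKey(sum)
    counter = Counter()
    for line in lines:
        counter.update(line.split())
    return dict(counter)

print(word_count(["spark hadoop spark", "hadoop"]))
# {'spark': 2, 'hadoop': 2}
```

In the actual benchmark, the same aggregation is distributed across executors, which is why input size (313 MB up to 19.2 GB above) is the parameter that drives parallelism.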