Table 2 Node configurations [28]

From: Estimating runtime of a job in Hadoop MapReduce

| Parameter | Description | Notation | Default value |
|---|---|---|---|
| dfs.blocksize | Size of the blocks or splits. | BS | 128 MB |
| mapreduce.task.io.sort.mb | The total amount of buffer memory to use while sorting files, in megabytes. | Sort.mb | 100 |
| mapreduce.task.io.sort.factor | The number of streams to merge at once while sorting files; this determines the number of open file handles. | Sort.factor | 10 |
| mapreduce.map.sort.spill.percent | The threshold of sort-buffer usage at which a thread will begin to spill the contents to disk in the background. | Spill.percent | 0.80 |
| io.sort.record.percent | The percentage of sort.mb used to store map output metadata. | Record.percent | 0.05 |
| mapreduce.reduce.shuffle.parallelcopies | The default number of parallel transfers run by reduce during the copy (shuffle) phase. | parallelcopies | 5 |
| mapreduce.reduce.shuffle.merge.percent | The usage threshold at which an in-memory merge is initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapreduce.reduce.shuffle.input.buffer.percent. | shuffle.merge.percent | 0.66 |
| mapreduce.reduce.shuffle.input.buffer.percent | The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle. | Shuffle.buffer.percent | 0.70 |
| Max heap size of reduce task | The maximum heap size that can be used by a reduce task. | Heapr | 1024 MB |
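To illustrate how the percentage-valued parameters above combine, the sketch below computes the buffer thresholds implied by the table's default values, following the standard semantics of these Hadoop settings (the sort buffer spills at Sort.mb × Spill.percent; the reduce-side in-memory merge starts at Heapr × Shuffle.buffer.percent × shuffle.merge.percent). The helper function names are illustrative, not Hadoop API calls.

```python
# Derived buffer thresholds implied by the defaults in Table 2.
# Helper names are illustrative only; they are not part of any Hadoop API.

def map_spill_threshold_mb(sort_mb=100, spill_percent=0.80):
    """Sort-buffer fill level (MB) at which the map-side spill thread starts."""
    return sort_mb * spill_percent

def shuffle_merge_threshold_mb(heap_mb=1024, input_buffer_percent=0.70,
                               merge_percent=0.66):
    """In-memory map-output usage (MB) at which a reduce-side merge begins."""
    shuffle_buffer_mb = heap_mb * input_buffer_percent  # portion of reduce heap for shuffle
    return shuffle_buffer_mb * merge_percent

print(map_spill_threshold_mb())                 # 80.0 MB
print(round(shuffle_merge_threshold_mb(), 1))   # 473.1 MB
```

With the defaults above, a map task starts spilling once its 100 MB sort buffer reaches 80 MB of use, and a reduce task with a 1024 MB heap begins merging in-memory outputs at roughly 473 MB.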