
Table 2 Share of profiled CUDA kernel time (in percent) spent on communication operations via AllReduce, computations with the cuDNN library, and data-loading functions

From: Large scale performance analysis of distributed deep learning frameworks for convolutional neural networks

Columns: PyTorch-DDP with the DALI data loader (DALI) and PyTorch-DDP with native data loading (Native); AllReduce [%] = communication, data [%] = I/O, cuDNN [%] = computation.

(a) Training of ResNet50 on ImageNet

| No. GPUs | DALI AllReduce [%] | DALI data [%] | DALI cuDNN [%] | Native AllReduce [%] | Native data [%] | Native cuDNN [%] |
|---|---|---|---|---|---|---|
| 4 | 15.40 | 22.00 | 32.50 | 22.40 | 21.00 | 30.80 |
| 8 | 19.00 | 21.40 | 31.75 | 23.95 | 20.05 | 29.20 |
| 16 | 21.00 | 20.95 | 30.70 | 27.15 | 18.83 | 27.35 |
| 32 | 27.09 | 18.98 | 28.14 | 31.30 | 17.26 | 25.11 |
| 64 | 30.87 | 17.76 | 26.35 | 32.75 | 16.30 | 23.55 |
| 128 | 33.61 | 17.03 | 24.99 | 49.48 | 11.77 | 17.33 |
| 256 | 37.08 | 15.78 | 23.26 | 76.77 | 5.06 | 7.14 |
| 512 | 43.48 | 13.57 | 20.02 | 82.61 | 3.66 | 5.52 |
| 1024 | 46.18 | 11.56 | 17.31 | – | – | – |

(b) Training of ResNet101 on ImageNet

| No. GPUs | DALI AllReduce [%] | DALI data [%] | DALI cuDNN [%] | Native AllReduce [%] | Native data [%] | Native cuDNN [%] |
|---|---|---|---|---|---|---|
| 4 | 13.30 | 23.00 | 46.00 | 28.65 | 22.50 | 38.12 |
| 8 | 20.55 | 21.25 | 41.45 | 30.15 | 18.28 | 35.52 |
| 16 | 24.08 | 20.37 | 39.65 | 35.67 | 16.76 | 32.71 |
| 32 | 25.36 | 18.71 | 36.99 | 35.46 | 14.59 | 28.43 |
| 64 | 37.17 | 16.69 | 33.39 | 37.69 | 15.31 | 29.88 |
| 128 | 36.29 | 16.74 | 34.02 | 42.32 | 13.39 | 26.38 |
| 256 | 39.31 | 15.54 | 31.56 | 56.43 | 11.38 | 22.83 |
| 512 | 37.73 | 15.40 | 31.59 | 59.18 | 11.87 | 24.45 |
| 1024 | 49.18 | 11.87 | 24.45 | – | – | – |

(c) Training of ResNet152 on ImageNet

| No. GPUs | DALI AllReduce [%] | DALI data [%] | DALI cuDNN [%] | Native AllReduce [%] | Native data [%] | Native cuDNN [%] |
|---|---|---|---|---|---|---|
| 4 | 16.20 | 22.40 | 44.60 | 18.41 | 21.97 | 44.17 |
| 8 | 20.55 | 21.75 | 42.35 | 20.65 | 21.95 | 40.75 |
| 16 | 25.90 | 20.05 | 39.07 | 24.77 | 20.70 | 38.62 |
| 32 | 29.16 | 18.72 | 37.15 | 30.31 | 18.77 | 35.32 |
| 64 | 33.56 | 16.90 | 33.82 | 38.34 | 16.42 | 30.80 |
| 128 | 36.16 | 16.66 | 33.73 | 45.75 | 14.02 | 26.60 |
| 256 | 38.33 | 15.51 | 31.60 | 49.39 | 15.05 | 28.46 |
| 512 | 40.36 | 14.41 | 29.43 | 51.76 | 11.16 | 25.36 |
| 1024 | 43.21 | 13.08 | 26.99 | – | – | – |

Note: The first 10 epochs of the training process are profiled with the NSys Profiler (first five epochs for four GPUs due to time limits of the profiler).
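As context for how such a breakdown can be collected, the sketch below shows one possible way to profile only the first epochs of a PyTorch-DDP training run with Nsight Systems, delimiting the capture with the CUDA profiler API. This is a minimal illustration under stated assumptions, not the authors' exact setup; the script name `train_ddp.py`, the report name, and the training-loop details are made up for the example.

```python
# Minimal sketch (assumption, not the authors' exact setup): profile only the
# first epochs of a PyTorch-DDP training run with Nsight Systems (nsys).
#
# Illustrative launch command (script and output names are hypothetical):
#   nsys profile --trace=cuda,nvtx,cudnn,cublas \
#                --capture-range=cudaProfilerApi \
#                -o resnet_ddp_profile \
#                python train_ddp.py
import torch
import torch.nn.functional as F

PROFILE_EPOCHS = 10  # per the table note; 5 when running on only 4 GPUs


def train(ddp_model, loader, optimizer, num_epochs):
    for epoch in range(num_epochs):
        if epoch == 0:
            torch.cuda.profiler.start()   # cudaProfilerStart: nsys begins capturing
        for images, targets in loader:
            torch.cuda.nvtx.range_push("iteration")  # optional NVTX marker in the timeline
            loss = F.cross_entropy(ddp_model(images), targets)
            optimizer.zero_grad()
            loss.backward()               # DDP launches NCCL AllReduce during backward
            optimizer.step()
            torch.cuda.nvtx.range_pop()
        if epoch + 1 == PROFILE_EPOCHS:
            torch.cuda.profiler.stop()    # cudaProfilerStop: nsys stops capturing
```

From the resulting report, per-category shares such as those in Table 2 could then be aggregated from the kernel and API summaries (e.g. via `nsys stats`), grouping AllReduce/NCCL kernels as communication, cuDNN kernels as computation, and data-loading functions as I/O.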