
Table 2 Share of profiled CUDA kernel time (in percent) spent on communication operations via AllReduce, computations with the cuDNN library, and data-loading functions

From: Large scale performance analysis of distributed deep learning frameworks for convolutional neural networks

Columns: PyTorch-DDP with the DALI data loader (DALI) and PyTorch-DDP with native data loading (Native); AllReduce [%] = communication, data [%] = I/O, cuDNN [%] = computation.

(a) Training of ResNet50 on ImageNet

| No. GPUs | DALI AllReduce [%] | DALI data [%] | DALI cuDNN [%] | Native AllReduce [%] | Native data [%] | Native cuDNN [%] |
|---|---|---|---|---|---|---|
| 4 | 15.40 | 22.00 | 32.50 | 22.40 | 21.00 | 30.80 |
| 8 | 19.00 | 21.40 | 31.75 | 23.95 | 20.05 | 29.20 |
| 16 | 21.00 | 20.95 | 30.70 | 27.15 | 18.83 | 27.35 |
| 32 | 27.09 | 18.98 | 28.14 | 31.30 | 17.26 | 25.11 |
| 64 | 30.87 | 17.76 | 26.35 | 32.75 | 16.30 | 23.55 |
| 128 | 33.61 | 17.03 | 24.99 | 49.48 | 11.77 | 17.33 |
| 256 | 37.08 | 15.78 | 23.26 | 76.77 | 5.06 | 7.14 |
| 512 | 43.48 | 13.57 | 20.02 | 82.61 | 3.66 | 5.52 |
| 1024 | 46.18 | 11.56 | 17.31 | – | – | – |

(b) Training of ResNet101 on ImageNet

| No. GPUs | DALI AllReduce [%] | DALI data [%] | DALI cuDNN [%] | Native AllReduce [%] | Native data [%] | Native cuDNN [%] |
|---|---|---|---|---|---|---|
| 4 | 13.30 | 23.00 | 46.00 | 28.65 | 22.50 | 38.12 |
| 8 | 20.55 | 21.25 | 41.45 | 30.15 | 18.28 | 35.52 |
| 16 | 24.08 | 20.37 | 39.65 | 35.67 | 16.76 | 32.71 |
| 32 | 25.36 | 18.71 | 36.99 | 35.46 | 14.59 | 28.43 |
| 64 | 37.17 | 16.69 | 33.39 | 37.69 | 15.31 | 29.88 |
| 128 | 36.29 | 16.74 | 34.02 | 42.32 | 13.39 | 26.38 |
| 256 | 39.31 | 15.54 | 31.56 | 56.43 | 11.38 | 22.83 |
| 512 | 37.73 | 15.40 | 31.59 | 59.18 | 11.87 | 24.45 |
| 1024 | 49.18 | 11.87 | 24.45 | – | – | – |

(c) Training of ResNet152 on ImageNet

| No. GPUs | DALI AllReduce [%] | DALI data [%] | DALI cuDNN [%] | Native AllReduce [%] | Native data [%] | Native cuDNN [%] |
|---|---|---|---|---|---|---|
| 4 | 16.20 | 22.40 | 44.60 | 18.41 | 21.97 | 44.17 |
| 8 | 20.55 | 21.75 | 42.35 | 20.65 | 21.95 | 40.75 |
| 16 | 25.90 | 20.05 | 39.07 | 24.77 | 20.70 | 38.62 |
| 32 | 29.16 | 18.72 | 37.15 | 30.31 | 18.77 | 35.32 |
| 64 | 33.56 | 16.90 | 33.82 | 38.34 | 16.42 | 30.80 |
| 128 | 36.16 | 16.66 | 33.73 | 45.75 | 14.02 | 26.60 |
| 256 | 38.33 | 15.51 | 31.60 | 49.39 | 15.05 | 28.46 |
| 512 | 40.36 | 14.41 | 29.43 | 51.76 | 11.16 | 25.36 |
| 1024 | 43.21 | 13.08 | 26.99 | – | – | – |

Note: The first 10 epochs of the training process are profiled with the NSys Profiler (first five epochs for four GPUs due to time limits of the profiler).
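As context for how such a breakdown can be collected, the sketch below shows one possible way to profile only the first epochs of a PyTorch-DDP training run with Nsight Systems, delimiting the capture with the CUDA profiler API. This is a minimal illustration under stated assumptions, not the authors' exact setup; the script name `train_ddp.py`, the report name, and the training-loop details are made up for the example.

```python
# Minimal sketch (assumption, not the authors' exact setup): profile only the
# first epochs of a PyTorch-DDP training run with Nsight Systems (nsys).
#
# Illustrative launch command (script and output names are hypothetical):
#   nsys profile --trace=cuda,nvtx,cudnn,cublas \
#                --capture-range=cudaProfilerApi \
#                -o resnet_ddp_profile \
#                python train_ddp.py
import torch
import torch.nn.functional as F

PROFILE_EPOCHS = 10  # per the table note; 5 when running on only 4 GPUs


def train(ddp_model, loader, optimizer, num_epochs):
    for epoch in range(num_epochs):
        if epoch == 0:
            torch.cuda.profiler.start()   # cudaProfilerStart: nsys begins capturing
        for images, targets in loader:
            torch.cuda.nvtx.range_push("iteration")  # optional NVTX marker in the timeline
            loss = F.cross_entropy(ddp_model(images), targets)
            optimizer.zero_grad()
            loss.backward()               # DDP launches NCCL AllReduce during backward
            optimizer.step()
            torch.cuda.nvtx.range_pop()
        if epoch + 1 == PROFILE_EPOCHS:
            torch.cuda.profiler.stop()    # cudaProfilerStop: nsys stops capturing
```

From the resulting report, per-category shares such as those in Table 2 could then be aggregated from the kernel and API summaries (e.g. via `nsys stats`), grouping AllReduce/NCCL kernels as communication, cuDNN kernels as computation, and data-loading functions as I/O.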