Accelerating Neural Network Training with Distributed Asynchronous and Selective Optimization (DASO)

With increasing data and model complexities, the time required to train neural networks has become prohibitively large. To address the exponential rise in training time, users are turning to data parallel neural networks (DPNN) to utilize large-scale distributed resources on computer clusters. Current DPNN approaches implement the network parameter updates by synchronizing and averaging gradients across all processes with blocking communication operations. This synchronization is the central algorithmic bottleneck. To combat this, we introduce the Distributed Asynchronous and Selective Optimization (DASO) method which leverages multi-GPU compute node architectures to accelerate network training. DASO uses a hierarchical and asynchronous communication scheme comprised of node-local and global networks while adjusting the global synchronization rate during the learning process. We show that DASO yields a reduction in training time of up to 34% on classical and state-of-the-art networks, as compared to other existing data parallel training methods.

Several works have investigated the use of asynchronous SGD (ASGD) [6,7,8], which updates the parameters whenever a network finishes a backward pass. Each network retrieves the current model parameters from a parameter server before performing a forward-backward pass. After finishing the backward step, the network sends its updated parameters back to the server, which determines the new global parameters using the updates from all processes. However, if a network is still computing the forward-backward pass when the parameter server is updated, the network's current parameters are outdated. The subsequently found gradients are referred to as stale. Stale gradients can be leveraged to approximate accurate network parameters, and ASGD has been shown to yield consistent convergence [9]. Recent attempts at accelerating ASGD have been made using individual network optimizers for a warm-up phase and delayed updates to the parameter server [10].
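The parameter-server protocol described above can be illustrated with a short sketch. This is a generic, sequential simulation for clarity (class names, method names, and the learning rate are all illustrative), not the implementation of any cited framework:

```python
# Generic sketch of the ASGD scheme described above, simulated sequentially
# for clarity; class names and the learning rate are illustrative.

class ParameterServer:
    def __init__(self, params, lr=0.1):
        self.params = list(params)
        self.lr = lr
        self.version = 0  # incremented on every applied update

    def pull(self):
        """A worker retrieves the current parameters before its pass."""
        return list(self.params), self.version

    def push(self, gradient, version):
        """Apply a worker's update; a gradient computed against an older
        version is stale but still applied (it approximately points downhill)."""
        stale_by = self.version - version
        self.params = [p - self.lr * g for p, g in zip(self.params, gradient)]
        self.version += 1
        return stale_by

server = ParameterServer([1.0])
params_a, v_a = server.pull()           # worker A pulls parameters
server.push([0.5], server.version)      # worker B finishes first
stale = server.push([0.5], v_a)         # worker A's gradient is now stale
# stale == 1: the server advanced once while A was computing
```

The `stale_by` counter makes the staleness explicit: a worker that pulls, computes, and pushes while other workers update in between produces a gradient computed against outdated parameters.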
PyTorch [11] and TensorFlow [12] are currently the largest machine learning frameworks. Both offer options for traditional data parallel training. For large systems, a global communication protocol, such as MPI [13], is often required to leverage specialized inter-node connections. Recently, there have been many advancements in the optimization of the global parameter synchronization operation by using MPI with multiple network topologies [14,15]. These approaches have shown promising results, but remain centered around the idea of a global synchronization for each forward-backward pass.
Currently, the most popular MPI-enabled DPNN framework is Horovod [16]. To reduce the size of data sent via the communication network, Horovod uses tensor fusion, or grouping parameters together to be communicated in a larger chunk of data, and data compression. The data compression in Horovod is frequently done by casting the network parameters into 16-bit floating-point format.

Distributed Asynchronous and Selective Optimization (DASO)
The common approach to training DPNNs is to perform a forward-backward pass on each network instance with one portion of the distributed batch, then synchronize the network parameters via a global averaging operation. The averaging of gradients is only an approximation of the true gradients that would be calculated for the entire batch if it were processed on a single GPU. This approximation is made under the assumption that each portion of the distributed batch is independent and identically distributed (iid) [17].
Under the iid assumption, another approximation can be made: the average parameters of a subset of networks are not significantly different from the average parameters of the complete set of networks. Recalling that modern HPC clusters have different inter- and intra-node communication capabilities (with different bandwidths and latencies), we can utilize this approximation to reduce the communication needed for parallel training, thereby alleviating the intrinsic bottleneck of blocking synchronizations.
We therefore propose the Distributed Asynchronous and Selective Optimization (DASO) method. Instead of a uniform communications network across multiple multi-GPU nodes, DASO uses a hierarchical network model with node-local networks and a global network.
The global network spans all GPUs on all nodes, while the node-local networks are composed of the GPUs on each individual node. The global network is divided into multiple groups, with each group containing a single GPU from every node. Global communication takes place exclusively within a group, i.e. only group members exchange data, while members of other groups do not participate. Communication between the node-local GPUs is handled by the local network, which benefits from high-speed GPU-to-GPU interconnects and optimized communication packages (e.g. NCCL [18]). Under the assumption that the cluster node configurations are homogeneous, DASO creates groups between GPUs with the same local identifier, as shown in Figure 1. With this approach, inter-node communication can be reduced by a factor equal to the minimum number of GPUs per node.

DASO utilizes a multi-step synchronization. Local synchronization (Figure 2) occurs after each batch and uses the node-local network to average gradients between the local GPUs. Global synchronization (Figure 3) occurs after one or more local synchronizations; in it, the network parameters of all members of a single global group are shared and averaged. Following every global synchronization, a local update step broadcasts the averaged parameters from the local group member to all other node-local GPUs (Figure 4). The role of global synchronization rotates between groups to overlap communication and computation. Global synchronization can be performed in a blocking or non-blocking manner. In the blocking case, all synchronization steps are performed after each batch. To reduce the amount of data transferred, parameters are cast to a 16-bit datatype representation during buffer packaging. This operation does not affect convergence, as shown by [19]. Once received, the parameters are cast back to their original datatype.
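The group construction described above can be sketched as follows, assuming a homogeneous cluster and node-major assignment of global ranks (both the assumptions and the function names are illustrative, not the HeAT implementation):

```python
# Illustrative sketch: on a homogeneous cluster with node-major global ranks,
# global group g contains the GPU with node-local id g on every node.

def local_id(rank, gpus_per_node):
    """Node-local identifier of a global rank."""
    return rank % gpus_per_node

def global_groups(world_size, gpus_per_node):
    """One global group per node-local id; each group holds one GPU per node."""
    return [
        [rank for rank in range(world_size) if local_id(rank, gpus_per_node) == g]
        for g in range(gpus_per_node)
    ]

# Example: 3 nodes with 4 GPUs each -> 4 groups of 3 ranks
groups = global_groups(world_size=12, gpus_per_node=4)
# groups[0] == [0, 4, 8]: the GPUs with local id 0 on each node
```

In an MPI setting, each such list would become an MPI group; only the group whose turn it is performs the global exchange, so inter-node traffic per synchronization shrinks by a factor of `gpus_per_node`.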
In the non-blocking case, the next forward-backward pass is started after the parameters are sent but before they are received. Datatype casting is not beneficial in this scenario, as it would delay the start of parameter communications. Each neural network will conduct B forward-backward passes, complete with local synchronization, before the group members receive the sent parameters. Hence, the updates from the global communication step are outdated upon their arrival. To compensate for this, a weighted average of the stale global parameters and the current local parameters is calculated as follows:

x_l^{t+S} = (1 / (2S + P)) (2S x_l^{t+S} + Σ_{i=1}^{P} x_i^{t+1})    (1)

where x_l^{t+S} is the model state on GPU l after waiting S batches for the global synchronization data, having started the synchronization at batch t; x_i^{t+1} are the globally exchanged model states; and P is the number of GPUs in the global network. The weighting of the local parameters was found experimentally. A detailed explanation of Equation (1) and its validity is provided in the supplementary material.
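As an illustration, the weighted average of the stale global parameters and the current local parameters can be sketched in plain Python; the lists stand in for GPU tensors and the function name is a placeholder:

```python
# Sketch of the stale-parameter weighted average of Equation (1), using plain
# Python lists in place of GPU tensors. S is the number of batches waited and
# P the number of GPUs in the global network; names are illustrative.

def merge_stale_global(local_params, global_param_sums, S, P):
    """Weight the current local parameters by 2S against the (stale) sum of
    the P globally exchanged model states, normalizing by 2S + P."""
    return [
        (2 * S * loc + glob_sum) / (2 * S + P)
        for loc, glob_sum in zip(local_params, global_param_sums)
    ]

# With S = 1 and P = 2, a local value of 1.0 and a global sum of 2.0
# (two GPUs each contributing 1.0) leave the parameter unchanged:
merged = merge_stale_global([1.0], [2.0], S=1, P=2)
```

The larger S is, the more the local state dominates, which matches the intuition that parameters waited on for longer are more stale.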
Training of a network with the DASO method can be divided into three key phases: warm-up, cycling, and cool-down. The warm-up and cool-down phases utilize blocking global synchronizations, while the cycling phase uses non-blocking global synchronizations. Given a fixed number of total epochs, the warm-up and cool-down phases occur for a set number of epochs at the beginning and end of training, respectively. The warm-up phase is used to quickly move away from the randomly initialized parameters and prepare for the cycling phase. The cool-down phase is intended to reduce the small errors which can arise from the slight deviation from the iid assumption for individual batches. In the cycling phase, the number of forward-backward passes between global synchronizations (B) and the number of batches to wait for global synchronization data (W) are varied. B is specified manually upon initialization. For W, an initial value of B/4 was found empirically to perform best. Each time the training loss plateaus, B and W are reduced by a factor of two, down to a minimum of one. When B = W = 1 and the loss has plateaued, both are reset to their initial values and the process is repeated until the cool-down phase. The synchronization steps in the cycling phase are schematically shown in Figure 5.
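The cycling-phase schedule for B and W can be sketched as follows; the class and method names are hypothetical, and plateau detection is assumed to happen elsewhere:

```python
# Hypothetical sketch of the cycling-phase schedule: B (batches between
# global synchronizations) and W (batches to wait for global data) are halved
# on each loss plateau, floored at 1, and reset together once both reach 1.

class CyclingSchedule:
    def __init__(self, initial_B):
        self.initial_B = initial_B
        self.B = initial_B
        self.W = max(1, initial_B // 4)  # W = B/4 found empirically to work best

    def on_plateau(self):
        """Called each time the training loss plateaus."""
        if self.B == 1 and self.W == 1:
            # Fully synchronized and still plateaued: restart the cycle.
            self.B = self.initial_B
            self.W = max(1, self.initial_B // 4)
        else:
            self.B = max(1, self.B // 2)
            self.W = max(1, self.W // 2)

sched = CyclingSchedule(initial_B=8)
sched.on_plateau()  # B: 8 -> 4, W: 2 -> 1
```

During the warm-up and cool-down phases this schedule would simply not be consulted, since those phases synchronize globally after every batch.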

Current Implementation
A DASO proof-of-concept is currently implemented in the HeAT framework [20] for usage with PyTorch networks.
HeAT is an open-source Python framework for distributed and GPU-accelerated data analytics which offers both low-level array computations and assorted higher-level machine learning algorithms. The local networks utilize PyTorch's DistributedDataParallel class and distributed package [21]. The global communication network utilizes HeAT's MPI backend, which handles the automatic communication of PyTorch Tensors. The global groups are implemented as MPI groups.
To use this implementation of DASO to train an existing PyTorch network, only four additional functions need to be called, and the data loaders need to be modified to distribute the data between all GPUs. The function calls are illustrated in Listing 1. First, the node-local PyTorch processes are created, which will be utilized during the local synchronization step. Next, the optimizer instance, i.e. DASO, is created with a PyTorch node-local optimizer (e.g. SGD), and the number of epochs for training is specified. The DASO instance will find the aforementioned PyTorch processes automatically.
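The wrapping pattern described above can be sketched as follows; the class and method names here are placeholders for illustration, not the actual HeAT API:

```python
# Minimal, hypothetical sketch of the wrapping pattern described above:
# DASO wraps a node-local optimizer and takes the number of training epochs.
# Class and method names are placeholders, not the exact HeAT interface.

class DASOSketch:
    def __init__(self, local_optimizer, total_epochs):
        self.local_optimizer = local_optimizer
        self.total_epochs = total_epochs
        self.batches_since_global_sync = 0

    def step(self):
        # Delegate the parameter update to the node-local optimizer; the real
        # implementation also triggers local and (periodically) global
        # synchronization around this call.
        self.local_optimizer.step()
        self.batches_since_global_sync += 1

class LocalSGDStub:
    """Stand-in for a node-local optimizer such as torch.optim.SGD."""
    def __init__(self):
        self.steps = 0
    def step(self):
        self.steps += 1

daso = DASOSketch(LocalSGDStub(), total_epochs=90)
daso.step()
```

The point of the pattern is that the training loop keeps calling a single `step()` as with any PyTorch optimizer, while the wrapper decides when local and global synchronization happen.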

Performance Evaluation
We evaluate the DASO method on two common examples of data-intensive neural network challenges: a) image classification and b) semantic segmentation. For image classification, we trained ResNet-50 [1] on the ImageNet-2012 [22] dataset. This can be considered a standard benchmark for machine learning, since pre-trained ResNet-50 networks are the backbone of many computer vision pipelines [23]. For semantic segmentation, we trained a state-of-the-art hierarchical multi-scale attention network [5] on the CityScapes [24] dataset.
All experiments were conducted on the JUWELS Booster at the Jülich Supercomputing Center [25]. This center's HPC cluster has 936 GPU nodes, each with two AMD EPYC Rome CPUs and four NVIDIA A100 GPUs, connected via an NVIDIA Mellanox HDR InfiniBand interconnect fabric. The following software versions were used: CUDA 11.0, ParaStationMPI 5.4.7-1-mt, Python 3.8.5, PyTorch 1.7.1+cu110, Horovod 0.21.1, and NCCL 2.8.3-1. The JUWELS Booster provides a CUDA-aware MPI implementation, meaning that GPUs can communicate directly with other GPUs.
We compared DASO to Horovod, as this is currently the most popular choice for MPI-based parallel training of neural networks on computer clusters. We elected not to compare with PyTorch's distributed package as it utilizes a similar approach to Horovod, namely compression and bucketing. Comparisons are done with respect to training time and accuracy.
Relevant network hyperparameters remain consistent between DASO and Horovod for each experiment. All tested networks use a learning rate scheduler. When the training loss plateaus, i.e. the training loss is not decreasing by more than a set percentage threshold, the scheduler decreases the learning rate by a set factor. The scheduler settings, as well as the local optimizer settings, were identical for both DASO and Horovod for each use-case. With respect to message packaging, Horovod was configured to use 16-bit floating-point compression, while DASO compresses to 16-bit brain floating point (bfloat16).
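For illustration, bfloat16 keeps the sign, the full 8-bit exponent, and the top 7 mantissa bits of an IEEE 754 float32, which is why casting preserves the value range at reduced precision. A minimal sketch using truncation (hardware casts typically round to nearest even rather than truncate):

```python
# Sketch of bfloat16 "compression": keep the upper 16 bits of a float32.
# This simple version truncates; hardware implementations typically round.
import struct

def to_bfloat16_bits(x):
    """Upper 16 bits of the IEEE 754 float32 representation of x."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def from_bfloat16_bits(b):
    """Expand the 16 stored bits back to a float32 value."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

# Round-trip: the value range survives, precision drops to ~3 decimal digits
restored = from_bfloat16_bits(to_bfloat16_bits(3.1415927))
# restored == 3.140625
```

This halves the communication volume relative to float32, which is the point of the cast in the blocking synchronization path.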

Image Classification -ImageNet
This experiment was conducted using the ResNet-50 architecture on the ImageNet dataset [22]. ImageNet-2012 is a large dataset containing 1.2 million labeled images. We evaluate classification quality using top-1 accuracy, i.e. the accuracy with which the model predicts the correct image label on its first attempt. For training ResNet-50 on the ImageNet dataset, we consider a 75% top-1 accuracy to be a successful training.
File loading from disk and preprocessing were done using DALI [26]. Training was conducted using cross entropy loss and SGD with a momentum of 0.9 and weight decay of 0.0001 for 90 epochs with a learning rate warm-up phase of five epochs. These values were adapted from PyTorch's example training script for ResNet-50 on ImageNet. The maximum learning rate is scaled with the number of global processes. The learning rate decays by a factor of 0.5 when the training cross entropy loss is stable for 5 epochs.
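The learning rate policy above can be sketched as follows; the plateau threshold and all names are illustrative, not the exact settings of the training script:

```python
# Sketch of the learning-rate policy described above: the maximum learning
# rate scales with the number of global processes, and the rate decays by a
# factor of 0.5 once the training loss has been stable for 5 epochs.
# The threshold value and all names are illustrative.

def scaled_max_lr(base_lr, world_size):
    """Linear scaling of the base rate with the global process count."""
    return base_lr * world_size

class PlateauDecay:
    def __init__(self, lr, factor=0.5, patience=5, threshold=1e-3):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.threshold = threshold
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, loss):
        """Call once per epoch with the training loss; returns the current lr."""
        if loss < self.best * (1 - self.threshold):
            self.best = loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr
```

This mirrors the behavior of plateau-based schedulers such as PyTorch's ReduceLROnPlateau, which the real training setup could use directly.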
Training was conducted on 4, 8, 16, 32, and 64 nodes, which equals 16, 32, 64, 128, and 256 GPUs, respectively. This corresponds to traditional strong scaling experiments for parallel algorithms, where an increase in nodes should ideally result in a proportional reduction in time.
Results of the experiment are shown in Figure 6. Both DASO and Horovod show desirable strong scaling behavior, i.e. doubling the number of GPUs roughly halves the training time. Due to DASO's optimized hierarchical communication scheme and the reduced number of synchronizations, DASO requires up to 25% less time for training compared to Horovod.
It can further be observed that up to 128 GPUs, DASO and Horovod yield similar levels of accuracy, see Figure 7. However, with more than 128 GPUs, neither approach exceeded 75% top-1 accuracy. This is because accuracy starts to decrease at larger batch sizes in a traditional network unless special allowances are made [27]. Since we keep the portion of the distributed batch processed on each individual GPU constant, larger GPU counts result in a larger distributed batch and hence lower accuracy. For DASO, the effect is more pronounced because completing batches without a global synchronization has a similar effect to increasing the batch size.

Semantic Segmentation -CityScapes
To further evaluate the performance of the DASO method, we conducted experiments on a state-of-the-art network. To this end, a hierarchical multi-scale attention network [5] was trained for semantic segmentation on the CityScapes [24] dataset. This dataset comprises a collection of images of streets in 50 cities across the world, with 5,000 finely annotated images and 20,000 coarsely annotated images. The network has an HRNet-OCR backbone, a dedicated fully convolutional head, an attention head, and an auxiliary semantic head [5]. The quality of semantic segmentation networks is often evaluated using the intersection over union (IOU) [28] score. IOU is defined as the intersection of the correctly predicted annotations with the ground truth annotations, divided by their union, and ranges from 0.0 (no overlap) to 1.0 (perfect overlap).
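A minimal sketch of the IOU score for a single class, using flat label lists in place of the per-pixel tensors a real pipeline would operate on (the function name and convention for an absent class are illustrative):

```python
# Minimal sketch of the intersection-over-union (IOU) score described above,
# computed for one class over flat label lists; real pipelines operate on
# per-pixel tensors.

def iou(prediction, target, class_id):
    """|prediction ∩ target| / |prediction ∪ target| for one class."""
    pred = {i for i, p in enumerate(prediction) if p == class_id}
    true = {i for i, t in enumerate(target) if t == class_id}
    union = pred | true
    if not union:
        return 1.0  # class absent from both: treat as perfect agreement
    return len(pred & true) / len(union)

# Two of three predicted pixels overlap the four ground-truth pixels:
score = iou([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 1, 0], class_id=1)
# score == 2 / 5 == 0.4
```

Multi-class benchmarks usually report the mean of this score over all classes (mIOU).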
The network was trained using the following parameters: 175 epochs; the region mutual information loss [29] function; a local SGD optimizer with a weight decay of 0.0001, a momentum of 0.9, and an initial learning rate of 0.0125; and a learning rate scheduler which decays the learning rate by a factor of 0.75 when the loss is judged to be stable for 5 epochs. The number of epochs, loss function, and optimizer settings were taken from the original source [5]. The learning rate scheduler deploys a warm-up phase of 5 epochs, in which the learning rate is slowly increased from 0.0 to 0.4, after which it decays as scheduled. For the DASO experiments, the synchronized batch normalization operation is conducted within the node-local process group.
In its original publication, the network was trained using supplementary data, whereas the herein presented experiments are performed using only the CityScapes dataset. To determine a baseline accuracy, the original network was trained with four GPUs on a single node using PyTorch's DistributedDataParallel package. This baseline measurement employed a polynomial decay learning rate scheduler, PyTorch's automatic mixed precision training and synchronized batch normalization layers. For more detail, see [5]. The baseline IOU of the original network was found to be 0.8258.
During the experiments, we found that for Horovod neither the automatic mixed precision nor the synchronized batch normalization functioned as intended when using the system scheduler software (SLURM [30]). Horovod requires the use of its custom launcher, horovodrun, to enable full feature functionality. However, this software is not natively available on many computer clusters, including the JUWELS Booster supercomputer. Hence, automatic mixed precision was removed and the synchronized batch normalization layers were replaced with local batch normalization layers.
Training times for various node counts are shown in Figure 8. For up to 128 GPUs, DASO completed the training process in approximately 35% less time than Horovod, demonstrating the advantage of our approach in fully leveraging the system's communication architecture together with asynchronous parameter updates. At higher GPU counts, the time savings drop to 30%, because there are fewer batches per epoch and hence skipping global synchronization operations provides less benefit.
Quality measurements (IOU) are shown in Figure 9. Although there is a clear difference between Horovod and DASO, neither matches the accuracy of the baseline network. This is due to the naive learning rate scheduler used for training. With a tuned learning rate scheduler, the 16, 32, and 64 node configurations should more accurately recreate the results of the baseline network. At 256 GPUs, training with Horovod did not yield any meaningful results. We hypothesize that this is caused by the lack of a functioning synchronized batch normalization operation in combination with a very large mini-batch.

Conclusion
In this work, we have introduced the distributed asynchronous and selective optimization (DASO) method. DASO utilizes a hierarchical communication scheme to fully leverage the communications infrastructure inherent to node-based computer clusters, which often see multiple GPUs per node. By favoring node-local parameter updates, DASO is able to reduce the amount of global communication required for full data parallel network synchronization. Thereby, our approach alleviates the bottleneck of blocking synchronization used in traditional data parallel approaches. We show that, if independent and identically distributed (iid) batches can be reasonably assumed, the global synchronization ubiquitous to the training of DPNNs is not required after each forward-backward pass.
We evaluated DASO on two common DPNN use-cases: image classification on the ImageNet dataset with ResNet-50, and semantic segmentation on the CityScapes dataset with a cutting edge multi-head attention network architecture. Our experiments show that DASO can reduce training time by up to 34% while maintaining similar prediction accuracy when compared to Horovod, the current standard for MPI-based data parallel network training.
At large node counts, DASO and Horovod both suffer a decrease in network accuracy. This is a well-known problem related to the increase in the distributed batch size. The effect is more pronounced with DASO due to the reduced number of global synchronization steps. This identifies where network modifications must be employed to handle very large node counts. We also note that DASO and Horovod will both yield sub-optimal results on datasets for which the iid assumption no longer holds. For those cases, however, data parallel training will be ineffective regardless of the communication scheme. Overall, DASO achieves close-to-optimal accuracies significantly faster than Horovod. Therefore, DASO is well suited for rapid initial training of large networks and datasets, after which the training can be fine-tuned using more traditional methods.
Ultimately, DASO improves the scalability of data parallel neural networks and demonstrates that using more GPUs does not have to be the only solution to speeding up training. With DASO, it is possible to efficiently train large models or process more training data. The strength of our approach lies in the fact that it is a generic, non-tailored, and easy-to-implement method that translates well to any large-scale, node-based computer cluster or high-performance computing system. DASO opens the door to redefining data parallel neural network training towards asynchronous, multifaceted optimization approaches.

A.1 Proof of Convergence
Proof. The following proof of DASO's global synchronization method is based on the convergence analysis shown by [31] and shows that the gradients determined with DASO are bounded.
Let X ⊂ R^n be a known set, and f : X → R a differentiable, convex, L-smooth, and unknown function. The estimator of the stochastic gradient of f(x) is a function g̃(x) for inputs x determined by the realization of a random variable ζ, such that E[g̃(x; ζ)] = ∇f(x). In the following, ζ is omitted due to space constraints. The stochastic gradient descent (SGD) algorithm updates a model's state at batch t+1 with the rule

x_{t+1} = x_t − η g̃(x_t),

where η is the learning rate.
A commonly used variant of SGD in practice is minibatching, for reasons of computational efficiency. In minibatch SGD, the true stochastic gradient is approximated by averaging across m input items x_{t,i}, i.e. G̃(x_t) = (1/m) Σ_{i=1}^{m} g̃(x_{t,i}). The model state x_{t+1} for minibatch SGD is

x_{t+1} = x_t − η G̃(x_t),    (2)

where G̃(x_t) is an estimator of ∇f(x_t).
Let us now consider that S subsequent update steps are performed. It is possible to write the model state as

x_{t+S} = x_t − η Σ_{s=0}^{S−1} G̃(x_{t+s}).    (3)

One of the primary assumptions in SGD is Lipschitz-continuous objective gradients. This has the effect that

E[f(x_{t+1})] − f(x_t) ≤ −η ∇f(x_t)ᵀ E[G̃(x_t)] + (η² L / 2) E[‖G̃(x_t)‖²],    (4)

where the Lipschitz constant L is greater than zero. Equation (4) implies that the expected decrease in the objective function f(x) is bounded above by a set quantity, regardless of how the stochastic gradients arrived at x_t [31].
In DASO, the local synchronization step is bounded via the same assumptions as minibatch SGD outlined in [31], so long as the iid assumption is upheld. However, the non-standard global synchronization step used in DASO must be shown to be bounded under the same principles. DASO's global synchronization is

x_{l:t+S} = (1 / (2S + P)) (2S x_{l:t+S} + Σ_{p=1}^{P} x_{p:t+1}),    (5)

where the subscripts l and p denote the node-local and global model states, S is the number of local update steps before a global synchronization, and P is the number of processes.
Similar to Equation (2), this can also be represented via the locally and globally calculated gradients, G̃_l(x_{l:t}) and G̃_p(x_{p:t}) respectively. The global synchronization function in the gradient representation is

x_{l:t+S} = x_t − α (2S Σ_{s=0}^{S−1} G̃_l(x_{l:t+s}) + Σ_{p=1}^{P} G̃_p(x_{p:t})),    (6)

where α = η / (2S + P). Using this, Equation (2), and the fact that the updates between t and S are local synchronizations which take the form of Equation (3), we find that the globally calculated gradient is

G̃_DASO(x_{t+S−1}) = (1 / (2S + P)) (2S Σ_{s=0}^{S−1} G̃_l(x_{l:t+s}) + Σ_{p=1}^{P} G̃_p(x_{p:t})).    (7)
As all gradient elements in Equation (7) are bounded under Equation (4), G̃_DASO(x_{t+S−1}) is similarly bounded.