This repository provides the PyTorch code for the paper "Fully Decoupled Neural Network Learning Using Delayed Gradients" (https://ieeexplore.ieee.org/abstract/document/9399673).
FDG splits a neural network into multiple modules that are trained independently and asynchronously on different GPUs. We also introduce a gradient shrinking process to reduce the stale-gradient effect caused by the delayed gradients. The proposed FDG is able to train very deep networks (>1000 layers) and very large networks (>35 million parameters) with significant speed gains, while outperforming state-of-the-art methods as well as standard BP.
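As a rough illustration of this module split (not the repository's actual API), the sketch below partitions a toy sequential network into K chunks and places each chunk on its own GPU. The helper `split_into_modules`, the toy layers, and the assumption that K GPUs are available are all made up for this example.

```python
# Illustrative sketch only: split a sequential model into K chunks,
# one per GPU, mirroring the module decoupling described above.
# `split_into_modules` is a hypothetical helper, not part of this repo.
import torch
import torch.nn as nn

def split_into_modules(layers, k):
    """Partition a list of layers into k roughly equal sequential modules."""
    chunk = (len(layers) + k - 1) // k
    return [nn.Sequential(*layers[i:i + chunk]) for i in range(0, len(layers), chunk)]

layers = [nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
          nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
          nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10)]

K = 2  # number of decoupled modules (requires K GPUs)
modules = split_into_modules(layers, K)
for k, m in enumerate(modules):
    m.to(f"cuda:{k}")  # each module lives (and trains) on its own GPU

# Forward pass: activations are handed from one module (GPU) to the next;
# in FDG the corresponding backward passes run asynchronously with delayed gradients.
x = torch.randn(8, 3, 32, 32).to("cuda:0")
for k, m in enumerate(modules):
    x = m(x.to(f"cuda:{k}"))
print(x.shape)  # torch.Size([8, 10])
```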
For any questions or comments, please contact us at [email protected].
- Python 3.6
- PyTorch 1.0
- CUDA 10.1
K is the number of split modules; β is the gradient shrinking factor (β=1 means no gradient shrinking); * indicates results that we reran. The reported numbers are classification error rates (lower is better).
We use the SGD optimizer with an initial learning rate of 0.1. The momentum is set to 0.9, and weight decay is applied as described in the paper.
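As a minimal sketch of how these settings and the shrinking factor β might look in PyTorch (assuming, for illustration only, that shrinking simply scales a module's delayed gradients by β before the SGD step; the exact rule and the weight-decay value are given in the paper):

```python
# Minimal sketch, not the repository's training loop. Assumes gradient
# shrinking simply scales a module's delayed gradients by beta before
# the optimizer step; see the paper for the exact rule and weight decay.
import torch
import torch.nn as nn

module = nn.Linear(64, 10)  # stand-in for one decoupled module
optimizer = torch.optim.SGD(module.parameters(), lr=0.1, momentum=0.9)

beta = 0.5  # gradient shrinking factor (beta = 1 disables shrinking)

def step_with_delayed_gradients(delayed_grads):
    """Apply shrunk delayed gradients to the module's parameters."""
    optimizer.zero_grad()
    for p, g in zip(module.parameters(), delayed_grads):
        p.grad = beta * g  # shrink the stale gradient
    optimizer.step()

# Usage: delayed_grads would normally arrive from the downstream module
# a few iterations late; here we fake them with random tensors.
delayed_grads = [torch.randn_like(p) for p in module.parameters()]
step_with_delayed_gradients(delayed_grads)
```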
Architecture | # params | BP | DDG | DGL | FR | FDG |
---|---|---|---|---|---|---|
ResNet-20 | 0.27M | 8.75%/7.78%* | - | - | - | 7.92%(β=1)/7.23%(β=0.2) |
ResNet-56 | 0.46M | 6.97%/6.19%* | 6.89%/6.63%* | 6.77%* | 6.07%* | 6.20%(β=1)/5.90%(β=0.5) |
ResNet-110 | 1.70M | 6.41%/5.79%* | 6.59%/6.26%* | 6.50%/6.26%* | 5.76%* | 5.79%(β=1)/5.73%(β=0.5) |
ResNet-18 | 11.2M | 6.48%/4.87%* | 5.00%* | 5.21%* | 4.80%* | 4.82%(β=1)/4.79%(β=0.8) |
WRN-28-10 | 19.4M | 7.93%/5.53%* | - | - | - | 5.50%(β=1)/5.49%(β=0.7) |
WRN-28-10 | 36.5M | 4.00%/4.01%* | - | - | - | 4.13%(β=1)/3.85%(β=0.7) |
Architecture | # params | BP | DDG | DGL | FR | FDG |
---|---|---|---|---|---|---|
ResNet-56 | 0.46M | 30.21%/27.68%* | 29.83%/28.44%* | 29.51%* | 28.39%* | 27.87%(β=1)/27.49%(β=0.4) |
ResNet-110 | 1.70M | 28.10%/25.82%* | 28.61%/27.16%* | 26.80%* | 26.31%* | 25.73%(β=1)/25.43%(β=0.5) |
ResNet-18 | 11.2M | 22.35%* | 22.74%* | 22.24%* | 22.88%* | 22.78%(β=1)/22.18%(β=0.5) |
WRN-28-10 | 36.5M | 19.2%/19.6%* | - | - | - | 20.28%(β=1)/19.08%(β=0.6) |
Split | BP | DDG | DGL | FR | FDG |
---|---|---|---|---|---|
K=2 | 6.19% | 6.60%* | 6.77%* | 6.07%* | 6.20%(β=1)/5.90%(β=0.5) |
K=3 | 6.19% | 6.50%* | 8.88%* | 6.33%* | 6.20%(β=1)/6.08%(β=0.5) |
K=4 | 6.19% | 6.61%* | 9.65%* | 6.48%* | 6.83%(β=1)/6.14%(β=0.5) |
We also report the time performance of several techniques, including data parallelism (DP). Under similar GPU utilization, our FDG is the fastest of the compared methods.
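For reference, a DP baseline can be set up with PyTorch's built-in `torch.nn.DataParallel`. The snippet below is only a sketch of timing one training step with a toy model and batch; it is not the benchmarking script used for the reported results.

```python
# Sketch of timing a data-parallel (DP) baseline step with wall-clock time.
# Toy model and batch sizes; not the paper's benchmarking code.
import time
import torch
import torch.nn as nn

model = nn.DataParallel(nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                                      nn.Linear(1024, 10))).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

x = torch.randn(256, 1024).cuda()
y = torch.randint(0, 10, (256,)).cuda()

torch.cuda.synchronize()
start = time.time()
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
torch.cuda.synchronize()  # wait for all GPUs before reading the clock
print(f"one DP step: {time.time() - start:.4f} s")
```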
@article{zhuang2021fully,
title={Fully decoupled neural network learning using delayed gradients},
author={Zhuang, Huiping and Wang, Yi and Liu, Qinglai and Lin, Zhiping},
journal={IEEE Transactions on Neural Networks and Learning Systems},
year={2021},
publisher={IEEE}
}