This repository provides the PyTorch code for the paper "Fully Decoupled Neural Network Learning Using Delayed Gradients" (https://ieeexplore.ieee.org/abstract/document/9399673).
FDG splits a neural network into multiple modules that are trained independently and asynchronously on different GPUs. We also introduce a gradient shrinking process to reduce the stale-gradient effect caused by the delayed gradients. The proposed FDG is able to train very deep networks (>1000 layers) and very large networks (>35 million parameters) with significant speed gains, while outperforming state-of-the-art methods as well as standard BP.
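As a rough illustration of this module split (not the repository's actual API), the sketch below partitions a toy sequential network into K chunks and places each chunk on its own GPU. The helper `split_into_modules`, the toy layers, and the assumption that K GPUs are available are all made up for this example.

```python
# Illustrative sketch only: split a sequential model into K chunks,
# one per GPU, mirroring the module decoupling described above.
# `split_into_modules` is a hypothetical helper, not part of this repo.
import torch
import torch.nn as nn

def split_into_modules(layers, k):
    """Partition a list of layers into k roughly equal sequential modules."""
    chunk = (len(layers) + k - 1) // k
    return [nn.Sequential(*layers[i:i + chunk]) for i in range(0, len(layers), chunk)]

layers = [nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
          nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
          nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10)]

K = 2  # number of decoupled modules (requires K GPUs)
modules = split_into_modules(layers, K)
for k, m in enumerate(modules):
    m.to(f"cuda:{k}")  # each module lives (and trains) on its own GPU

# Forward pass: activations are handed from one module (GPU) to the next;
# in FDG the corresponding backward passes run asynchronously with delayed gradients.
x = torch.randn(8, 3, 32, 32).to("cuda:0")
for k, m in enumerate(modules):
    x = m(x.to(f"cuda:{k}"))
print(x.shape)  # torch.Size([8, 10])
```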
For any questions or comments, please contact us at [email protected].
- Python 3.6
- PyTorch 1.0
- CUDA 10.1
K is the number of split modules; β is the gradient shrinking factor (β=1 means no gradient shrinking); * indicates results that we reran. The reported numbers are classification error rates (lower is better).
We use the SGD optimizer with an initial learning rate of 0.1. The momentum is set to 0.9, and weight decay is applied as described in the paper.
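As a minimal sketch of how these settings and the shrinking factor β might look in PyTorch (assuming, for illustration only, that shrinking simply scales a module's delayed gradients by β before the SGD step; the exact rule and the weight-decay value are given in the paper):

```python
# Minimal sketch, not the repository's training loop. Assumes gradient
# shrinking simply scales a module's delayed gradients by beta before
# the optimizer step; see the paper for the exact rule and weight decay.
import torch
import torch.nn as nn

module = nn.Linear(64, 10)  # stand-in for one decoupled module
optimizer = torch.optim.SGD(module.parameters(), lr=0.1, momentum=0.9)

beta = 0.5  # gradient shrinking factor (beta = 1 disables shrinking)

def step_with_delayed_gradients(delayed_grads):
    """Apply shrunk delayed gradients to the module's parameters."""
    optimizer.zero_grad()
    for p, g in zip(module.parameters(), delayed_grads):
        p.grad = beta * g  # shrink the stale gradient
    optimizer.step()

# Usage: delayed_grads would normally arrive from the downstream module
# a few iterations late; here we fake them with random tensors.
delayed_grads = [torch.randn_like(p) for p in module.parameters()]
step_with_delayed_gradients(delayed_grads)
```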
Architecture | # params | BP | DDG | DGL | FR | FDG |
---|---|---|---|---|---|---|
ResNet-20 | 0.27M | 8.75%/7.78%* | - | - | - | 7.92%(β=1)/7.23%(β=0.2) |
ResNet-56 | 0.46M | 6.97%/6.19%* | 6.89%/6.63%* | 6.77%* | 6.07%* | 6.20%(β=1)/5.90%(β=0.5) |
ResNet-110 | 1.70M | 6.41%/5.79%* | 6.59%/6.26%* | 6.50%/6.26%* | 5.76%* | 5.79%(β=1)/5.73%(β=0.5) |
ResNet-18 | 11.2M | 6.48%/4.87%* | 5.00%* | 5.21%* | 4.80%* | 4.82%(β=1)/4.79%(β=0.8) |
WRN-28-10 | 19.4M | 7.93%/5.53%* | - | - | - | 5.50%(β=1)/5.49%(β=0.7) |
WRN-28-10 | 36.5M | 4.00%/4.01%* | - | - | - | 4.13%(β=1)/3.85%(β=0.7) |
Architecture | # params | BP | DDG | DGL | FR | FDG |
---|---|---|---|---|---|---|
ResNet-56 | 0.46M | 30.21%/27.68%* | 29.83%/28.44%* | 29.51%* | 28.39%* | 27.87%(β=1)/27.49%(β=0.4) |
ResNet-110 | 1.70M | 28.10%/25.82%* | 28.61%/27.16%* | 26.80%* | 26.31%* | 25.73%(β=1)/25.43%(β=0.5) |
ResNet-18 | 11.2M | 22.35%* | 22.74%* | 22.24%* | 22.88%* | 22.78%(β=1)/22.18%(β=0.5) |
WRN-28-10 | 36.5M | 19.2%/19.6%* | - | - | - | 20.28%(β=1)/19.08%(β=0.6) |
Split | BP | DDG | DGL | FR | FDG |
---|---|---|---|---|---|
K=2 | 6.19% | 6.60%* | 6.77%* | 6.07%* | 6.20%(β=1)/5.90%(β=0.5) |
K=3 | 6.19% | 6.50%* | 8.88%* | 6.33%* | 6.20%(β=1)/6.08%(β=0.5) |
K=4 | 6.19% | 6.61%* | 9.65%* | 6.48%* | 6.83%(β=1)/6.14%(β=0.5) |
We also report the time performance of several techniques, including data parallelism (DP). Under similar GPU utilization, our FDG is the fastest of the compared methods.
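For reference, a DP baseline can be set up with PyTorch's built-in `torch.nn.DataParallel`. The snippet below is only a sketch of timing one training step with a toy model and batch; it is not the benchmarking script used for the reported results.

```python
# Sketch of timing a data-parallel (DP) baseline step with wall-clock time.
# Toy model and batch sizes; not the paper's benchmarking code.
import time
import torch
import torch.nn as nn

model = nn.DataParallel(nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                                      nn.Linear(1024, 10))).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

x = torch.randn(256, 1024).cuda()
y = torch.randint(0, 10, (256,)).cuda()

torch.cuda.synchronize()
start = time.time()
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
torch.cuda.synchronize()  # wait for all GPUs before reading the clock
print(f"one DP step: {time.time() - start:.4f} s")
```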
@article{zhuang2021fully,
title={Fully decoupled neural network learning using delayed gradients},
author={Zhuang, Huiping and Wang, Yi and Liu, Qinglai and Lin, Zhiping},
journal={IEEE Transactions on Neural Networks and Learning Systems},
year={2021},
publisher={IEEE}
}