Theano-MPI is a Python framework for distributed training of deep learning models built in Theano. It implements data-parallelism in several ways, e.g., Bulk Synchronous Parallel, Elastic Averaging SGD and Gossip SGD. This project is an extension to theano_alexnet, aiming to scale up the training framework to more than 8 GPUs and across nodes. Please take a look at this technical report for an overview of implementation details. To cite our work, please use the following bibtex entry.
@article{ma2016theano,
title = {Theano-MPI: a Theano-based Distributed Training Framework},
author = {Ma, He and Mao, Fei and Taylor, Graham~W.},
journal = {arXiv preprint arXiv:1605.08325},
year = {2016}
}
Theano-MPI is compatible with training models built in different framework libraries, e.g., Lasagne, Keras and Blocks, as long as their model parameters can be exposed as Theano shared variables. Theano-MPI also comes with a lightweight layer library for building customized models. See the wiki for a quick guide on building customized neural networks based on them. Check out the examples of building Lasagne VGGNet, Wasserstein GAN, LS-GAN and Keras Wide-ResNet.
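As a minimal sketch of what "exposed as Theano shared variables" means (the toy two-layer Lasagne network below is purely illustrative), a Lasagne model's trainable parameters can be collected like this:

```python
import lasagne
from lasagne.layers import InputLayer, DenseLayer

# a toy Lasagne network; any architecture is handled the same way
l_in = InputLayer((None, 784))
l_out = DenseLayer(l_in, num_units=10,
                   nonlinearity=lasagne.nonlinearities.softmax)

# the trainable parameters come back as a list of Theano shared variables,
# which is the form a Theano-MPI model exposes through its params attribute
params = lasagne.layers.get_all_params(l_out, trainable=True)
```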
Theano-MPI depends on the following libraries and packages. We provide some guidance on installing them in the wiki.
- OpenMPI 1.8 or an MPI-2 standard equivalent that supports CUDA.
- mpi4py built on OpenMPI.
- numpy
- Theano 0.9
- zeromq
- hickle
- CUDA 7.5
- cuDNN, a version compatible with your CUDA installation.
- pygpu
- NCCL
Once all dependencies are ready, one can clone Theano-MPI and install it as follows:
$ python setup.py install [--user]
To accelerate the training of Theano models in a distributed way, Theano-MPI needs to identify two components:
- the iterative update function of the Theano model
- the parameter sharing rule between instances of the Theano model
It is recommended to organize your model and data definition in the following way:

launch_session.py or launch_session.cfg
models/*.py
    __init__.py
    modelfile.py : defines your customized ModelClass
data/*.py
    dataname.py  : defines your customized DataClass
Your ModelClass in modelfile.py should at least have the following attributes and methods (a minimal sketch follows the list):

- self.params : a list of Theano shared variables, i.e., the trainable model parameters
- self.data : an instance of your customized DataClass defined in dataname.py
- self.compile_iter_fns : a method, your way of compiling train_iter_fn and val_iter_fn
- self.train_iter : a method, your way of using your train_iter_fn
- self.val_iter : a method, your way of using your val_iter_fn
- self.adjust_hyperp : a method, your way of adjusting hyperparameters, e.g., the learning rate
- self.cleanup : a method, necessary model and data clean-up steps
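Below is a minimal sketch of such a ModelClass. The toy logistic-regression graph, the import path of DataClass and its get_train_batch/get_val_batch helpers are assumptions made for this example, and Theano-MPI may pass extra arguments (e.g. a recorder) to these methods, so consult the complete examples under theanompi/models/ for the exact signatures.

```python
# models/modelfile.py -- a minimal, illustrative ModelClass sketch
import numpy as np
import theano
import theano.tensor as T

from data.dataname import DataClass  # your customized DataClass


class ModelClass(object):

    def __init__(self, config=None):
        self.data = DataClass()                     # required attribute
        self.W = theano.shared(np.zeros((784, 10), dtype='float32'), name='W')
        self.b = theano.shared(np.zeros((10,), dtype='float32'), name='b')
        self.params = [self.W, self.b]              # required attribute
        self.lr = theano.shared(np.float32(0.01))   # learning rate as a shared variable

    def compile_iter_fns(self):
        x, y = T.matrix('x'), T.ivector('y')
        p_y = T.nnet.softmax(T.dot(x, self.W) + self.b)
        cost = T.mean(T.nnet.categorical_crossentropy(p_y, y))
        err = T.mean(T.neq(T.argmax(p_y, axis=1), y))
        grads = T.grad(cost, self.params)
        updates = [(p, p - self.lr * g) for p, g in zip(self.params, grads)]
        self.train_iter_fn = theano.function([x, y], cost, updates=updates)
        self.val_iter_fn = theano.function([x, y], [cost, err])

    def train_iter(self, count):
        x, y = self.data.get_train_batch(count)     # hypothetical helper
        return self.train_iter_fn(x, y)

    def val_iter(self, count):
        x, y = self.data.get_val_batch(count)       # hypothetical helper
        return self.val_iter_fn(x, y)

    def adjust_hyperp(self, epoch):
        if epoch in (30, 60):                        # e.g. step decay of the learning rate
            self.lr.set_value(np.float32(self.lr.get_value() * 0.1))

    def cleanup(self):
        pass                                         # release file handles, processes, etc.
```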
Your DataClass in dataname.py should at least have the following attributes (a minimal sketch follows the list):

- self.n_batch_train : an integer, the number of training batches to go through in an epoch
- self.n_batch_val : an integer, the number of validation batches to go through during validation
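A matching minimal DataClass sketch is shown below. Only n_batch_train and n_batch_val are required; the dummy in-memory arrays and the get_train_batch/get_val_batch helpers are assumptions made to pair with the ModelClass sketch above (a real DataClass would load its dataset here, e.g. from pre-processed hickle files).

```python
# data/dataname.py -- a minimal, illustrative DataClass sketch
import numpy as np


class DataClass(object):

    def __init__(self, batch_size=128):
        self.batch_size = batch_size

        # replace these dummy arrays with your real data loading
        self.x_train = np.random.rand(12800, 784).astype('float32')
        self.y_train = np.random.randint(0, 10, 12800).astype('int32')
        self.x_val = np.random.rand(1280, 784).astype('float32')
        self.y_val = np.random.randint(0, 10, 1280).astype('int32')

        # required attributes: batches per training epoch / per validation pass
        self.n_batch_train = len(self.x_train) // batch_size
        self.n_batch_val = len(self.x_val) // batch_size

    def get_train_batch(self, i):
        s = slice(i * self.batch_size, (i + 1) * self.batch_size)
        return self.x_train[s], self.y_train[s]

    def get_val_batch(self, i):
        s = slice(i * self.batch_size, (i + 1) * self.batch_size)
        return self.x_val[s], self.y_val[s]
```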
After your model definition is complete, you can choose the desired way of sharing parameters among model instances:
- BSP (Bulk Synchronous Parallel)
- EASGD (Elastic Averaging SGD)
- GOSGD (Gossip SGD)
Below is an example launch config file for training a customized ModelClass on two GPUs.
# launch_session.cfg
RULE=BSP
MODELFILE=models.modelfile
MODELCLASS=ModelClass
DEVICES=cuda0,cuda1
Then you can launch the training session by calling the following command:
$ tmlauncher -cfg=launch_session.cfg
Alternatively, you can launch sessions within python as shown below:
# launch_session.py
from theanompi import BSP
rule=BSP()
# modelfile: the relative path to the model file
# modelclass: the class name of the model to be imported from that file
rule.init(devices=['cuda0', 'cuda1'],
          modelfile='models.modelfile',
          modelclass='ModelClass')
rule.wait()
More examples can be found here.
Training (communication) time per 5120 images in seconds [allow_gc = True, using nccl32 on copper]:

| Model | 1GPU | 2GPU | 4GPU | 8GPU |
|---|---|---|---|---|
| AlexNet-128b | 20.50 | 10.35 (0.78) | 5.13 (0.54) | 2.63 (0.61) |
| GoogLeNet-32b | 63.89 | 31.40 (1.00) | 15.51 (0.71) | 7.69 (0.80) |
| VGG16-16b | 358.29 | 176.08 (13.90) | 90.44 (9.28) | 55.12 (12.59) |
| VGG16-32b | 343.37 | 169.12 (7.14) | 86.97 (4.80) | 43.29 (5.41) |
| ResNet50-64b | 163.15 | 80.09 (0.81) | 40.25 (0.56) | 20.12 (0.57) |
More details on the benchmark can be found in this notebook.
- Test your single-GPU model with theanompi/models/test_model.py before trying a data-parallel rule.
- You may want to use the helper functions in theanompi/lib/opt.py to construct optimizers, in order to avoid common pitfalls mentioned in (#22) and get better convergence.
- Binding cores according to your NUMA topology may give better performance. Try the -bind option with the launcher (requires the hwloc dependency).
- Using the launcher script is the preferred way to start training. Starting training from Python currently causes core-binding problems, especially on a NUMA system.
- Shuffling training examples before asynchronous training makes the loss surface a lot smoother while the model converges (see the sketch after this list).
- Some known bugs and possible enhancements are listed in Issues. We welcome all kinds of participation (bug reports, discussion, pull requests, etc.) in improving the framework.
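As a small illustration of the shuffling tip above, a one-off permutation of the training arrays before training starts might look like this (the dummy x_train/y_train arrays stand in for your own data):

```python
import numpy as np

# dummy stand-ins for your real training arrays
x_train = np.random.rand(12800, 784).astype('float32')
y_train = np.random.randint(0, 10, 12800).astype('int32')

# shuffle examples once, applying the same permutation to inputs and labels
perm = np.random.permutation(len(x_train))
x_train, y_train = x_train[perm], y_train[perm]
```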
© Contributors, 2016-2017. Licensed under an ECL-2.0 license.