Ever wondered how to train a large neural network across a giant cluster? Look no further!
This is a comprehensive guide on best practices for distributed training, diagnosing errors, and fully utilizing all available resources. It is organized into sequential chapters, each containing a `README.md` and a `train_llm.py` script. Each chapter's README discusses both the high-level concepts of distributed training and the code changes introduced in that chapter.
The guide is written entirely in minimal, standard PyTorch, using `transformers` and `datasets` for models and data, respectively. No other library is used for the distributed code - it is written entirely in PyTorch.
- Chapter 1 - A standard Causal LLM training script that runs on a single GPU.
- Chapter 2 - Upgrades the training script to run on multiple GPUs using DDP (a minimal sketch of this change follows the list).
- Chapter 3 - Covers how to launch training jobs across clusters with multiple nodes.
- Chapter 4 - Upgrades the training script to use FSDP instead of DDP for better memory efficiency.
- Chapter 5 - Upgrades the training script to train Llama-405b.
- Alternative Frameworks - Covers different frameworks that all use PyTorch under the hood.
- Diagnosing Errors - Best practices and how-tos for quickly diagnosing errors in your cluster.
- Related Topics - Topics you should be aware of when doing distributed training.
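To give a flavor of the chapter-by-chapter changes, here is a minimal sketch of the single-GPU-to-DDP upgrade that Chapter 2 walks through. The tiny `nn.Linear` model is a stand-in for the real model, and the actual `train_llm.py` differs in its details:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every worker it spawns.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Stand-in for the real model; each rank holds a full copy of the weights under DDP.
model = torch.nn.Linear(1024, 1024).to(local_rank)
model = DistributedDataParallel(model, device_ids=[local_rank])

# The training loop stays nearly identical to the single GPU version: DDP averages
# gradients across all workers during loss.backward(). The DataLoader additionally
# needs a DistributedSampler so each rank sees a different shard of the data.
```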
Questions this guide answers:
- How do I update a single GPU training/fine-tuning script to run on multiple GPUs or multiple nodes?
- How do I diagnose hanging/errors that happen during training?
- My model/optimizer is too big for a single GPU - how do I train/fine-tune it on my cluster?
- How do I schedule/launch training on a cluster? (See the launch example after this list.)
- How do I scale my hyperparameters when increasing the number of workers?
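As a preview of the launching questions above (Chapters 2 and 3 cover this in detail), the training scripts are started with PyTorch's built-in `torchrun` launcher. A sketch of a multi-node invocation - the node count, GPU count per node, rendezvous endpoint, and script arguments are placeholders - looks like:

```bash
# Run this once on every node; torchrun spawns one worker process per GPU.
torchrun \
    --nnodes=2 \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=<hostname-of-node-0>:29400 \
    train_llm.py
```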
Best practices for logging stdout/stderr and wandb are also included, as logging is vitally important in diagnosing/debugging training runs on a cluster.
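For example, one common pattern - sketched here with an illustrative project name rather than the guide's exact code - is to tag every log line with the process rank and to initialize wandb only on rank 0:

```python
import logging

import torch.distributed as dist
import wandb

rank = dist.get_rank()  # assumes init_process_group() has already been called

# Prefix every log line with the rank so interleaved stdout/stderr stays attributable.
logging.basicConfig(
    format=f"[rank {rank}] %(asctime)s %(levelname)s %(message)s",
    level=logging.INFO,
)

# Only rank 0 reports to wandb, so each training run shows up once rather than once per worker.
if rank == 0:
    wandb.init(project="distributed-training-guide")  # project name is illustrative
```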
Each of the training scripts is aimed at training a causal language model (e.g. GPT, Llama).
```bash
git clone https://github.com/LambdaLabsML/distributed-training-guide.git
cd distributed-training-guide
python3 -m venv venv
source venv/bin/activate
python -m pip install -U pip
pip install -U setuptools wheel
pip install -r requirements.txt
```
This tutorial uses `wandb` as an experiment tracker; log in before launching any training runs:

```bash
wandb login
```
🦄 Other exciting ML projects at Lambda: ML Times, Text2Video, GPU Benchmark.