Train

Examples for training LLaMA 2 on Cloud TPUs with Hugging Face, Ray, and PyTorch/XLA SPMD.

This folder consists of:

  • configs: model definitions for the various LLaMA 2 configurations, including a 2B variant for quick development work,
  • llama_hf.py: an adaptation of run_clm.py with PyTorch/XLA SPMD support and light refactoring for demonstration purposes, and
  • main.py: a full-fledged production job that can run on Ray Jobs.

These examples train from scratch and serve as a reference point for you to get started with Ray on TPUs.
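
The SPMD-specific code lives in llama_hf.py and is not reproduced here, but the underlying PyTorch/XLA SPMD pattern it builds on looks roughly like the sketch below. This assumes torch_xla >= 2.1, and the mesh shape and partition spec are illustrative placeholders rather than the values the example actually uses:

import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs

# Enable SPMD execution mode before creating any XLA tensors.
xr.use_spmd()

# Build a logical device mesh over all TPU devices, here with "data" and "model" axes.
num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(np.arange(num_devices), (num_devices, 1), ("data", "model"))

# Shard an input batch along the "data" axis; the XLA compiler propagates
# shardings through the rest of the computation graph.
batch = torch.zeros(8, 2048, dtype=torch.long).to(xm.xla_device())
xs.mark_sharding(batch, mesh, ("data", None))

See llama_hf.py for the mesh and partition specs the example actually applies to the model and its inputs.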

To get started, spin up your training cluster:

$ ray up -y cluster/train.yaml

and once it's up and running, you can use

$ ./scripts/submit_train.sh

to submit the training job. Then you can run

$ ray dashboard cluster/train.yaml

and go to http://localhost:8265 to view the Job logs.
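
Under the hood, submit_train.sh presumably wraps a Ray Jobs submission. If you prefer to submit programmatically, a roughly equivalent call through the Ray Jobs Python SDK is sketched below; the entrypoint and working directory here are assumptions, not necessarily what the script passes:

from ray.job_submission import JobSubmissionClient

# Talk to the cluster's dashboard endpoint (reachable locally via the
# `ray dashboard` port forwarding shown above).
client = JobSubmissionClient("http://localhost:8265")

# Submit main.py as a Ray Job, shipping the local working directory to the cluster.
job_id = client.submit_job(
    entrypoint="python main.py",        # assumed entrypoint
    runtime_env={"working_dir": "."},   # assumed working directory
)
print(f"Submitted job: {job_id}")

You can then follow the job with client.get_job_status(job_id) or in the dashboard UI.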