Train

Examples for training LLaMA 2 on Cloud TPUs with Hugging Face, Ray, and PyTorch/XLA SPMD.

This folder consists of:

  • configs: model definitions for the various LLaMA 2 configurations, including a 2B variant for quick development work,
  • llama_hf.py: an adaptation of run_clm.py with PyTorch/XLA SPMD support and light refactoring for demonstration purposes, and
  • main.py: a full-fledged production job that can run on Ray Jobs.

These examples train from scratch and serve as a reference point for you to get started with Ray on TPUs.
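
The SPMD-specific code lives in llama_hf.py and is not reproduced here, but the underlying PyTorch/XLA SPMD pattern it builds on looks roughly like the sketch below. This assumes torch_xla >= 2.1, and the mesh shape and partition spec are illustrative placeholders rather than the values the example actually uses:

import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs

# Enable SPMD execution mode before creating any XLA tensors.
xr.use_spmd()

# Build a logical device mesh over all TPU devices, here with "data" and "model" axes.
num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(np.arange(num_devices), (num_devices, 1), ("data", "model"))

# Shard an input batch along the "data" axis; the XLA compiler propagates
# shardings through the rest of the computation graph.
batch = torch.zeros(8, 2048, dtype=torch.long).to(xm.xla_device())
xs.mark_sharding(batch, mesh, ("data", None))

See llama_hf.py for the mesh and partition specs the example actually applies to the model and its inputs.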

To get started, spin up your training cluster:

$ ray up -y cluster/train.yaml

and once it's up and running, you can use

$ ./scripts/submit_train.sh

to submit the training job. Then you can run

$ ray dashboard cluster/train.yaml

and go to http://localhost:8265 to view the Job logs.
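
Under the hood, submit_train.sh presumably wraps a Ray Jobs submission. If you prefer to submit programmatically, a roughly equivalent call through the Ray Jobs Python SDK is sketched below; the entrypoint and working directory here are assumptions, not necessarily what the script passes:

from ray.job_submission import JobSubmissionClient

# Talk to the cluster's dashboard endpoint (reachable locally via the
# `ray dashboard` port forwarding shown above).
client = JobSubmissionClient("http://localhost:8265")

# Submit main.py as a Ray Job, shipping the local working directory to the cluster.
job_id = client.submit_job(
    entrypoint="python main.py",        # assumed entrypoint
    runtime_env={"working_dir": "."},   # assumed working directory
)
print(f"Submitted job: {job_id}")

You can then follow the job with client.get_job_status(job_id) or in the dashboard UI.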