Add docker compose and change containerized setup instructions to use it (#1113)

* Add pythia 14M config
* Create 31M.yml
* Add docker compose, update readme docker instructions to utilize it
* Add logging limits to docker-compose files
* Change data mount from /gpt-neox/data to /data/
  This prevents possible errors if the user already has a /data/ directory in their /gpt-neox/ folder
* Update README.md
  Makes the code blocks into blocks in the changed parts
* Make the docker-compose spinup tidier
* Avoid config bloat by only providing the updated paths
* Apply precommit

---------

Co-authored-by: Quentin Anthony <[email protected]>
EleutherAI · Jan 9, 2024 · e6e944a
1 parent f14782a

Showing 7 changed files with 134 additions and 11 deletions.
diff --git a/Dockerfile b/Dockerfile
@@ -26,11 +26,11 @@ LABEL org.opencontainers.image.base.name="docker.io/nvidia/cuda:11.7.1-devel-ubu
 #### System package (uses default Python 3 version in Ubuntu 20.04)
 RUN apt-get update -y && \
     apt-get install -y \
-        git python3.9 python3-dev libpython3-dev python3-pip sudo pdsh \
-        htop llvm-9-dev tmux zstd software-properties-common build-essential autotools-dev \
-        nfs-common pdsh cmake g++ gcc curl wget vim less unzip htop iftop iotop ca-certificates ssh \
-        rsync iputils-ping net-tools libcupti-dev libmlx4-1 infiniband-diags ibutils ibverbs-utils \
-        rdmacm-utils perftest rdma-core nano && \
+    git python3.9 python3-dev libpython3-dev python3-pip sudo pdsh \
+    htop llvm-9-dev tmux zstd software-properties-common build-essential autotools-dev \
+    nfs-common pdsh cmake g++ gcc curl wget vim less unzip htop iftop iotop ca-certificates ssh \
+    rsync iputils-ping net-tools libcupti-dev libmlx4-1 infiniband-diags ibutils ibverbs-utils \
+    rdmacm-utils perftest rdma-core nano && \
     update-alternatives --install /usr/bin/python python /usr/bin/python3 1 && \
     update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 1 && \
     pip install --upgrade pip && \

diff --git a/README.md b/README.md
@@ -225,11 +225,69 @@ You can then kick off a training run with `sbatch my_sbatch_script.sh`
 
 ### Containerized Setup
 
-We also provide a Dockerfile if you prefer to run NeoX in a container. To use this option, first build an image named `gpt-neox` from the repository root directory with `docker build -t gpt-neox -f Dockerfile .`. We also host pre-built images on [Docker Hub at `leogao2/gpt-neox`](https://hub.docker.com/r/leogao2/gpt-neox/tags).
+We also provide a Dockerfile and docker-compose configuration if you prefer to run NeoX in a container.
+
+Requirements to run the container are to have appropriate GPU drivers, an up-to-date installation of Docker, and [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) installed. To test if your installation is good you can use their "sample workload", which is:
+
+```
+docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
+```
+
+Provided that will run, you need to export NEOX_DATA_PATH and NEOX_CHECKPOINT_PATH in your environment to specify your data directory and directory for storing and loading checkpoints:
+
+```
+export NEOX_DATA_PATH=/mnt/sda/data/enwiki8 #or wherever your data is stored on your system
+export NEOX_CHECKPOINT_PATH=/mnt/sda/checkpoints
+```
+
+And then, from the gpt-neox directory, you can build the image and run a shell in a container with
+
+```
+docker compose run gpt-neox bash
+```
+
+After the build, you should be able to do this:
+```
+mchorse@537851ed67de:~$ echo $(pwd)
+/home/mchorse
+mchorse@537851ed67de:~$ ls -al
+total 48
+drwxr-xr-x  1 mchorse mchorse 4096 Jan  8 05:33 .
+drwxr-xr-x  1 root    root    4096 Jan  8 04:09 ..
+-rw-r--r--  1 mchorse mchorse  220 Feb 25  2020 .bash_logout
+-rw-r--r--  1 mchorse mchorse 3972 Jan  8 04:09 .bashrc
+drwxr-xr-x  4 mchorse mchorse 4096 Jan  8 05:35 .cache
+drwx------  3 mchorse mchorse 4096 Jan  8 05:33 .nv
+-rw-r--r--  1 mchorse mchorse  807 Feb 25  2020 .profile
+drwxr-xr-x  2 root    root    4096 Jan  8 04:09 .ssh
+drwxrwxr-x  8 mchorse mchorse 4096 Jan  8 05:35 chk
+drwxrwxrwx  6 root    root    4096 Jan  7 17:02 data
+drwxr-xr-x 11 mchorse mchorse 4096 Jan  8 03:52 gpt-neox
+```
+
+For a long-running job, you should run
+
+```
+docker compose up -d
+```
+
+to run the container in detached mode, and then, in a separate terminal session, run
+
+```
+docker compose exec gpt-neox bash
+```
+
+You can then run any job you want from inside the container.
+
+Concerns when running for a long time or in detached mode include
+ - You will have to terminate the container manually when you are no longer using it
+ - If you want processes to continue running when your shell session ends, you will need to background them.
+ - If you then want logging, you will have to make sure to pipe logs to disk or set up wandb.
+
+If you prefer to run the prebuilt container image from dockerhub, you can run the docker compose commands with ```-f docker-compose-dockerhub.yml``` instead, e.g.,
 
-You can then run a container based on this image. For instance, the below snippet mounts the cloned repository (`gpt-neox`) directory to `/gpt-neox` in the container and uses [nvidia-docker](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) to make four GPUs (numbers 0-3) accessible to the container. [As noted by the NCCL documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#sharing-data), both `--shm-size=1g` and `--ulimit memlock=-1` are important to prevent Docker from allocating too little shared memory.
 ```
-nvidia-docker run --rm -it -e NVIDIA_VISIBLE_DEVICES=0,1,2,3 --shm-size=1g --ulimit memlock=-1 --mount type=bind,src=$PWD,dst=/gpt-neox gpt-neox
+docker compose run -f docker-compose-dockerhub.yml gpt-neox bash
 ```
 
 ## Usage
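
Reviewer note: the compose workflow described in the README changes above (interactive shell, detached start plus `exec`, and the Docker Hub variant) can be sketched as a small command builder. This is a minimal sketch, not code from the repository: `compose_command` is a hypothetical helper, and using `docker compose down` for cleanup is an assumption about how one might terminate the detached container. As far as I know, `-f` is an option of `docker compose` itself rather than of the `run` subcommand, so the sketch places it before the subcommand; check this against your Docker version.

```python
import shlex
from typing import List, Optional


def compose_command(subcommand: str, *args: str,
                    compose_file: Optional[str] = None) -> List[str]:
    """Build a `docker compose` argument list (hypothetical helper)."""
    cmd = ["docker", "compose"]
    if compose_file is not None:
        # Assumption: global options such as -f go before the subcommand.
        cmd += ["-f", compose_file]
    cmd.append(subcommand)
    cmd.extend(args)
    return cmd


# Build the image if needed and open an interactive shell.
shell_cmd = compose_command("run", "gpt-neox", "bash")
# Long-running job: start detached, then attach a shell from another terminal.
up_cmd = compose_command("up", "-d")
exec_cmd = compose_command("exec", "gpt-neox", "bash")
# Same interactive shell, using the prebuilt Docker Hub image instead.
hub_cmd = compose_command("run", "gpt-neox", "bash",
                          compose_file="docker-compose-dockerhub.yml")
# Assumed cleanup step for the detached container.
down_cmd = compose_command("down")

for cmd in (shell_cmd, up_cmd, exec_cmd, hub_cmd, down_cmd):
    # Print each command as a copy-pasteable shell line.
    print(shlex.join(cmd))
```

Each list can be passed directly to `subprocess.run(cmd, check=True)` on a machine where Docker is installed.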

diff --git a/configs/docker/paths.yml b/configs/docker/paths.yml
@@ -0,0 +1,12 @@
+{
+  "train-data-paths": ["/home/mchorse/data/pile_deduped/pile_0.87_deduped_text_document"],
+  "valid-data-paths": ["/home/mchorse/data/pile_deduped/pile_0.87_deduped_text_document"],
+  "test-data-paths": ["/home/mchorse/data/pile_deduped/pile_0.87_deduped_text_document"],
+
+  "tokenizer-type": "HFTokenizer",
+  "vocab-file": "/home/mchorse/data/tokenizers/20B_tokenizer.json",
+
+  "save": "/home/mchorse/chk/",
+  "load": "/home/mchorse/chk/",
+  "checkpoint_validation_with_forward_pass": False
+}
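
Reviewer note: every path in this config is expected to resolve inside one of the container mount points that the compose files in this commit define (`/home/mchorse/data` and `/home/mchorse/chk`). The following is a minimal sketch of that consistency check, with the values copied by hand from the diff. `unmounted_paths` is a hypothetical helper, not part of the repository, and it compares path strings only without touching the filesystem.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Tuple, Union

# Container-side mount points taken from the compose files in this commit.
MOUNTS = [PurePosixPath("/home/mchorse/data"), PurePosixPath("/home/mchorse/chk")]

# Path-valued entries copied from configs/docker/paths.yml.
CONFIG_PATHS: Dict[str, Union[str, List[str]]] = {
    "train-data-paths": ["/home/mchorse/data/pile_deduped/pile_0.87_deduped_text_document"],
    "valid-data-paths": ["/home/mchorse/data/pile_deduped/pile_0.87_deduped_text_document"],
    "test-data-paths": ["/home/mchorse/data/pile_deduped/pile_0.87_deduped_text_document"],
    "vocab-file": "/home/mchorse/data/tokenizers/20B_tokenizer.json",
    "save": "/home/mchorse/chk/",
    "load": "/home/mchorse/chk/",
}


def unmounted_paths(config: Dict[str, Union[str, List[str]]],
                    mounts: List[PurePosixPath]) -> List[Tuple[str, str]]:
    """Return (key, path) pairs that fall outside every mount point."""
    outside = []
    for key, value in config.items():
        values = value if isinstance(value, list) else [value]
        for raw in values:
            path = PurePosixPath(raw)
            # A path is covered if it equals a mount or sits below one.
            if not any(path == m or m in path.parents for m in mounts):
                outside.append((key, raw))
    return outside


# An empty list means every configured path is covered by a mount.
print(unmounted_paths(CONFIG_PATHS, MOUNTS))
```

A path outside the mounts would only exist inside the container's own filesystem, so data or checkpoints written there would presumably not be visible on the host.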
diff --git a/configs/pythia/14M.yml b/configs/pythia/14M.yml
@@ -14,7 +14,7 @@
   "no-weight-tying": true,
   "gpt-j-residual": true,
   "output-layer-parallelism": "column",
-  
+
   "attention-config": [[["flash"], 6]],
 
   "scaled-upper-triang-masked-softmax-fusion": true,

diff --git a/configs/pythia/31M.yml b/configs/pythia/31M.yml
@@ -14,7 +14,7 @@
   "no-weight-tying": true,
   "gpt-j-residual": true,
   "output-layer-parallelism": "column",
-  
+
   "attention-config": [[["flash"], 6]],
 
   "scaled-upper-triang-masked-softmax-fusion": true,
@@ -54,7 +54,7 @@
   # activation checkpointing
   "checkpoint-activations": false,
   "checkpoint-num-layers": 1,
-  "partition-activations": false, 
+  "partition-activations": false,
   "synchronize-each-layer": true,
 
   # regularization

diff --git a/docker-compose-dockerhub.yml b/docker-compose-dockerhub.yml
@@ -0,0 +1,25 @@
+version: '3'
+services:
+  gpt-neox:
+    command: nvidia-smi -q --loop=10
+    image: leogao2/gpt-neox:main
+    shm_size: 1g
+    ulimits:
+      memlock:
+        soft: -1
+        hard: -1
+    runtime: nvidia
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              capabilities: [gpu]
+    logging:
+      options:
+        max-size: "100m"
+        max-file: "3"
+    volumes:
+      - ${NEOX_DATA_PATH}:/home/mchorse/data
+      - ${NEOX_CHECKPOINT_PATH}:/home/mchorse/chk
+      - .:/home/mchorse/gpt-neox
diff --git a/docker-compose.yml b/docker-compose.yml
@@ -0,0 +1,28 @@
+version: '3'
+services:
+  gpt-neox:
+    command: nvidia-smi -q --loop=10
+    image: gpt-neox
+    build:
+      context: .
+      dockerfile: Dockerfile
+    shm_size: 1g
+    ulimits:
+      memlock:
+        soft: -1
+        hard: -1
+    runtime: nvidia
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              capabilities: [gpu]
+    logging:
+      options:
+        max-size: "100m"
+        max-file: "3"
+    volumes:
+      - ${NEOX_DATA_PATH}:/home/mchorse/data
+      - ${NEOX_CHECKPOINT_PATH}:/home/mchorse/chk
+      - .:/home/mchorse/gpt-neox
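
Reviewer note: both compose files interpolate `NEOX_DATA_PATH` and `NEOX_CHECKPOINT_PATH` into the `volumes:` entries. As I understand it, Compose substitutes an empty string for an unset variable, which would leave a volume entry with no host side, so a preflight check before `docker compose run` may save a confusing failure. Below is a minimal sketch under that assumption; `resolve_volumes` and `VOLUME_VARS` are hypothetical names, not part of the repository.

```python
from typing import Dict, List, Mapping, Tuple

# Host-side environment variable -> container path, mirroring `volumes:` above.
VOLUME_VARS: Dict[str, str] = {
    "NEOX_DATA_PATH": "/home/mchorse/data",
    "NEOX_CHECKPOINT_PATH": "/home/mchorse/chk",
}


def resolve_volumes(env: Mapping[str, str]) -> List[Tuple[str, str]]:
    """Return (host_path, container_path) pairs, or raise if a variable is unset."""
    missing = [name for name in VOLUME_VARS if not env.get(name)]
    if missing:
        # Fail early instead of letting an empty host path reach Docker.
        raise ValueError("unset environment variables: " + ", ".join(missing))
    return [(env[name], target) for name, target in VOLUME_VARS.items()]


# Example with the values used in the README; pass os.environ in real use.
example_env = {
    "NEOX_DATA_PATH": "/mnt/sda/data/enwiki8",
    "NEOX_CHECKPOINT_PATH": "/mnt/sda/checkpoints",
}
for host, container in resolve_volumes(example_env):
    print(f"{host} -> {container}")
```

The third volume entry (`.:/home/mchorse/gpt-neox`) needs no variable, since it mounts the directory the compose file lives in.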