mldev

Notes and examples on setting up a reproducible ML dev environment.

Core Components (Tested Version)

Ubuntu (22.04)
NVIDIA driver (535)
Docker (24.0.6)
NVIDIA Container Toolkit (1.14.2)
NVIDIA GPU Cloud (NGC) Container (nvcr.io/nvidia/pytorch:23.09-py3)

Setup

Install NVIDIA driver

sudo apt-get install nvidia-driver-535

After installing the NVIDIA driver, the nvidia-smi command should report driver version 535 and CUDA version 12.2 (the highest CUDA version the driver supports):

 --------------------------------------------------------------------------------------- 
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
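
One way to automate this check is to parse the banner line that nvidia-smi prints. The sketch below is a minimal, unofficial example: the function names parse_smi_header and check_versions are made up for illustration, and the regular expression assumes the banner keeps the field labels shown in the sample output above.

```python
import re

# Pattern for the nvidia-smi banner line. The field labels are copied from
# the sample output above; other driver releases may format it differently.
HEADER_RE = re.compile(
    r"NVIDIA-SMI\s+(?P<smi>[\d.]+)\s+"
    r"Driver Version:\s+(?P<driver>[\d.]+)\s+"
    r"CUDA Version:\s+(?P<cuda>[\d.]+)"
)


def parse_smi_header(text):
    """Return (driver_version, cuda_version) found in text, or None."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    return match.group("driver"), match.group("cuda")


def check_versions(text, driver_major="535", cuda="12.2"):
    """True if text reports the tested driver branch and CUDA version."""
    parsed = parse_smi_header(text)
    if parsed is None:
        return False
    driver_found, cuda_found = parsed
    return driver_found.split(".")[0] == driver_major and cuda_found == cuda


# The banner line from the sample output above.
sample = (
    "| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01"
    "   CUDA Version: 12.2     |"
)
print(parse_smi_header(sample))  # ('535.113.01', '12.2')
print(check_versions(sample))    # True
```

On a real machine the text would come from something like subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout instead of the hard-coded sample.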

Install Docker

https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-22-04

After installing Docker, you should be able to run the hello-world image:

docker run hello-world

Install NVIDIA Container Toolkit

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

See install-nct.sh.

After installing the NVIDIA Container Toolkit, you should be able to run nvidia-smi from within a Docker container:

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html

docker run --rm --gpus all ubuntu nvidia-smi
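
Before running the container test, it can help to confirm that each install step left its command-line tool on PATH. The following is a rough preflight sketch, not part of this repository: the REQUIRED_TOOLS mapping and the missing_components function are illustrative names, and the check only looks for executables (nvidia-smi, docker, and nvidia-ctk, the CLI that ships with the NVIDIA Container Toolkit). It does not prove that the tools work.

```python
import shutil

# Executable that each setup step is expected to leave on PATH.
REQUIRED_TOOLS = {
    "NVIDIA driver": "nvidia-smi",
    "Docker": "docker",
    "NVIDIA Container Toolkit": "nvidia-ctk",
}


def missing_components(which=shutil.which):
    """Return the names of components whose executable is not on PATH.

    The which argument exists so the lookup can be replaced in tests.
    """
    return [name for name, exe in REQUIRED_TOOLS.items() if which(exe) is None]


if __name__ == "__main__":
    missing = missing_components()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All components found; try the docker run command above.")
```

An empty result means only that the binaries are present; the docker run command above remains the real end-to-end test.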

NVIDIA Docker Containers

NVIDIA GPU Cloud (NGC) provides many Docker containers:

https://catalog.ngc.nvidia.com/orgs/nvidia/containers

We tested with the nvcr.io/nvidia/pytorch:23.09-py3 container. All available tags are listed at:

https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags

A reasonable default set of base flags for docker run is:

--gpus all
--ipc=host or --shm-size 1gb
--ulimit memlock=-1
--ulimit stack=67108864

An example interactive session that removes the container on exit:

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it --rm nvcr.io/nvidia/pytorch:23.09-py3
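
If the same flags are reused across scripts, the command can be assembled programmatically. This is a hedged sketch rather than anything shipped in this repository: docker_run_args and DEFAULT_IMAGE are hypothetical names. It treats --ipc=host and --shm-size as alternatives, as in the flag list above, and writes the stack limit as 64 * 1024 * 1024 to show that 67108864 bytes is 64 MiB.

```python
import shlex

# Image tag tested in this README.
DEFAULT_IMAGE = "nvcr.io/nvidia/pytorch:23.09-py3"


def docker_run_args(image=DEFAULT_IMAGE, shm_size=None, interactive=True, remove=True):
    """Build the argument list for docker run using the base flags above."""
    args = ["docker", "run", "--gpus", "all"]
    # Share the host IPC namespace, or give the container a fixed-size /dev/shm.
    args.append("--ipc=host" if shm_size is None else f"--shm-size={shm_size}")
    # memlock=-1 lifts the locked-memory limit; the stack limit is 64 MiB.
    args += ["--ulimit", "memlock=-1", "--ulimit", f"stack={64 * 1024 * 1024}"]
    if interactive:
        args.append("-it")
    if remove:
        args.append("--rm")
    args.append(image)
    return args


print(shlex.join(docker_run_args()))
# docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it --rm nvcr.io/nvidia/pytorch:23.09-py3
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting issues; shlex.join is used here only to print a copy-pasteable command.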

Customize NVIDIA base Docker container

See the Dockerfile in this repository.

Notes on efficient loading / training / inference

https://huggingface.co/docs/transformers/perf_train_gpu_one

https://huggingface.co/docs/transformers/perf_infer_gpu_one

https://huggingface.co/docs/transformers/perf_infer_cpu

https://huggingface.co/blog/hf-bitsandbytes-integration

https://huggingface.co/blog/4bit-transformers-bitsandbytes

https://huggingface.co/docs/transformers/main_classes/quantization
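
As a back-of-the-envelope illustration of why the quantization links above matter, the memory needed just to hold model weights scales with the number of bytes per parameter. The sketch below is an approximation under stated assumptions: weight_memory_gb and BYTES_PER_PARAM are illustrative names, the 7-billion parameter count is only an example size, and the estimate ignores activations, optimizer state, the KV cache, and the small overhead of quantization constants.

```python
# Approximate storage cost per parameter for common weight formats.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "4bit": 0.5,
}


def weight_memory_gb(num_params, dtype):
    """Estimate weight-only memory in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Example: a hypothetical model with 7 billion parameters.
num_params = 7_000_000_000
for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(num_params, dtype):.1f} GB")
```

For the example size this gives roughly 14 GB in 16-bit formats and 3.5 GB at 4 bits, which is the difference between needing a data-center GPU and fitting on a consumer card.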

HF Llama models

https://huggingface.co/blog/llama2

