DCGM-Exporter

This repository contains the DCGM-Exporter project, a GPU metrics exporter for Prometheus that leverages NVIDIA DCGM.

Documentation

Official documentation for DCGM-Exporter can be found on docs.nvidia.com.

Quickstart

To gather metrics on a GPU node, simply start the dcgm-exporter container:

$ docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.3-3.3.1-ubuntu22.04
$ curl localhost:9400/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
...
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 139
DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 405
DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 9223372036854775794
...
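The last sample above is not a real temperature. A value that large looks like one of DCGM's int64 "blank" sentinels, which stand in for a reading that is unavailable. The following is a minimal sketch, not part of dcgm-exporter, that parses the text output and flags such values. It assumes the sentinel range starts at 0x7ffffffffffffff0, as defined in DCGM's dcgm_fields.h, and the names parse_samples and blank_reason are hypothetical helpers.

```python
import re

# Assumption: DCGM int64 sentinels start at DCGM_INT64_BLANK (dcgm_fields.h);
# any value at or above it means "no reading", not a measurement.
DCGM_INT64_BLANK = 0x7FFFFFFFFFFFFFF0
BLANK_REASONS = {0: "blank", 1: "not found", 2: "not supported", 3: "not permissioned"}

SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) per sample line; skip comments and elisions."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # for example the "..." lines in the listing
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        raw = match.group("value")
        # Keep integers exact: a float cannot represent the sentinels precisely.
        value = int(raw) if re.fullmatch(r"-?\d+", raw) else float(raw)
        yield match.group("name"), labels, value


def blank_reason(value):
    """Return a short reason if value is a DCGM int64 sentinel, else None."""
    if isinstance(value, int) and value >= DCGM_INT64_BLANK:
        return BLANK_REASONS.get(value - DCGM_INT64_BLANK, "blank")
    return None


SAMPLE = '''# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
...
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 139
DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 405
DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 9223372036854775794
'''

for name, labels, value in parse_samples(SAMPLE):
    print(name, "gpu=" + labels["gpu"], blank_reason(value) or value)
```

Under that assumption, the memory temperature sample is reported as "not supported" rather than as a number, which suggests this GPU does not expose the field.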

Quickstart on Kubernetes

Note: Consider using the NVIDIA GPU Operator rather than DCGM-Exporter directly.

Ensure you have already set up your cluster with NVIDIA as the default runtime.

The recommended way to install DCGM-Exporter is to use the Helm chart:

$ helm repo add gpu-helm-charts \
  https://nvidia.github.io/dcgm-exporter/helm-charts

Update the repo:

$ helm repo update

And install the chart:

$ helm install \
    --generate-name \
    gpu-helm-charts/dcgm-exporter

Once the dcgm-exporter pod is deployed, you can use port forwarding to obtain metrics quickly:

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml

# Let's get the output of a random pod:
$ NAME=$(kubectl get pods -l "app.kubernetes.io/name=dcgm-exporter" \
                         -o "jsonpath={ .items[0].metadata.name}")

$ kubectl port-forward $NAME 8080:9400 &
$ curl -sL http://127.0.0.1:8080/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
...
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 139
DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 405
DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 9223372036854775794
...

To integrate DCGM-Exporter with Prometheus and Grafana, see the full instructions in the user guide. dcgm-exporter is deployed as part of the GPU Operator. To get started with integrating with Prometheus, check the Operator user guide.

TLS and Basic Auth

The exporter supports TLS and basic auth through exporter-toolkit. To enable TLS and/or basic auth, pass the --web-config-file CLI flag as follows:

dcgm-exporter --web-config-file=web-config.yaml

A sample web-config.yaml file can be fetched from the exporter-toolkit repository. The reference for the web-config.yaml format is available in the exporter-toolkit docs.
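Once TLS and basic auth are enabled, a plain curl localhost:9400/metrics no longer works, because clients must speak HTTPS and send credentials. The following is a minimal client sketch using only the Python standard library. The function names, host name, user name, password, and CA file are hypothetical placeholders, not values defined by dcgm-exporter.

```python
import base64
import ssl
import urllib.request


def build_metrics_request(url, username, password):
    """Build a GET request carrying an HTTP Basic Authorization header."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def fetch_metrics(url, username, password, ca_file=None, timeout=5.0):
    """Fetch the metrics page over HTTPS.

    ca_file is an optional PEM bundle, needed when the exporter's certificate
    is signed by a private CA; with None the system trust store is used.
    """
    context = ssl.create_default_context(cafile=ca_file)
    request = build_metrics_request(url, username, password)
    with urllib.request.urlopen(request, timeout=timeout, context=context) as response:
        return response.read().decode("utf-8")


# Hypothetical usage against an exporter started with --web-config-file:
# text = fetch_metrics("https://gpu-node.example:9400/metrics",
#                      "alice", "s3cret", ca_file="ca.pem")
```

Certificate verification stays enabled in this sketch. Pointing ca_file at the signing CA is usually preferable to disabling verification.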

Building from Source

To build dcgm-exporter from source, ensure Go and DCGM are installed, then run the following:

$ git clone https://github.com/NVIDIA/dcgm-exporter.git
$ cd dcgm-exporter
$ make binary
$ sudo make install
...
$ dcgm-exporter &
$ curl localhost:9400/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
...
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 139
DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 405
DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 9223372036854775794
...

Changing Metrics

With dcgm-exporter you can configure which fields are collected by specifying a custom CSV file. You will find the default CSV file under etc/default-counters.csv in the repository, which is copied to /etc/dcgm-exporter/default-counters.csv on your system or container.

The layout and format of this file is as follows:

# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message

# Clocks
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

A custom CSV file can be specified using the -f or --collectors option as follows:

$ dcgm-exporter -f /tmp/custom-collectors.csv

Notes:

Always make sure your entries have 2 commas (',')
The complete list of counters that can be collected can be found on the DCGM API reference manual: https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html
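A quick check of a custom CSV file before starting the exporter can catch the comma mistake mentioned in the notes. The following is a minimal sketch of such a check, not the parser dcgm-exporter itself uses. The set of accepted metric types is an assumption, and the check covers only the three-column format shown above, so lines using the aliased-metric columns described under "Princeton University changes" would be rejected.

```python
# Assumption: the metric types that appear in default-counters.csv.
VALID_TYPES = {"gauge", "counter", "label"}


def parse_counters(text):
    """Return (field, metric_type, help) tuples; raise ValueError on bad lines."""
    entries = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3:
            raise ValueError(
                f"line {number}: expected 2 commas, found {len(parts) - 1}"
            )
        field, metric_type, help_text = parts
        if metric_type not in VALID_TYPES:
            raise ValueError(f"line {number}: unknown metric type {metric_type!r}")
        entries.append((field, metric_type, help_text))
    return entries


EXAMPLE = """# Clocks
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
"""

for field, metric_type, help_text in parse_counters(EXAMPLE):
    print(field, metric_type, help_text, sep=" | ")
```

Because the sketch splits on every comma, a help message that itself contains a comma is reported as an error, which matches the "2 commas" rule above.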

What about a Grafana Dashboard?

You can find the official NVIDIA DCGM-Exporter dashboard here: https://grafana.com/grafana/dashboards/12239

You will also find the JSON file in this repo under grafana/dcgm-exporter-dashboard.json.

Pull requests are accepted!

Building the containers

This project uses docker buildx for multi-arch image creation. Follow the docker buildx documentation to get a working builder instance for creating these containers. Some other useful build options follow.

Build local images based on the machine architecture and make them available in 'docker images'

make local

Build the ubuntu image and export to 'docker images'

make ubuntu22.04 PLATFORMS=linux/amd64 OUTPUT=type=docker

Build and push the images to some other 'private_registry'

make REGISTRY=<private_registry> push

Issues and Contributing

Check out the Contributing document!

Please let us know by filing a new issue
You can contribute by opening a pull request

Reporting Security Issues

We ask that all community members and users of DCGM Exporter follow the standard NVIDIA process for reporting security vulnerabilities. This process is documented at the NVIDIA Product Security website. Following the process will result in any needed CVE being created as well as appropriate notifications being communicated to the entire DCGM Exporter community. NVIDIA reserves the right to delete vulnerability reports until they're fixed.

Please refer to the policies listed there to answer questions related to reporting security issues.

Princeton University changes

In an attempt to replace our modified nvidia_gpu_exporter, the following changes were made to dcgm-exporter.

Aliased metrics

Our current exporter uses differently named metrics, sometimes with different units (e.g. mW vs. W for power consumption, or bytes instead of megabytes for memory use), so we added a way to define new metrics derived from metrics that DCGM already collects. To use this feature, take a standard metric as defined in default-counters.csv, e.g.:

DCGM_FI_DEV_FB_TOTAL, gauge, Frame buffer memory total (in MB).

and append, comma-separated, the new metric name, its description, and a multiplier, e.g.:

DCGM_FI_DEV_FB_TOTAL, gauge, Frame buffer memory total (in MB)., nvidia_gpu_memory_total_bytes, Total memory of the GPU device in bytes, 1048576
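To illustrate how such a line can be read, here is a minimal sketch that splits the six columns and applies the multiplier to a sample value. It is not the exporter's own code. It assumes the aliased value is simply the original value multiplied by the last column, and CounterLine, parse_counter_line, and aliased_sample are hypothetical names.

```python
from typing import NamedTuple, Optional


class CounterLine(NamedTuple):
    field: str
    metric_type: str
    help_text: str
    alias: Optional[str] = None
    alias_help: Optional[str] = None
    multiplier: float = 1.0


def parse_counter_line(line):
    """Parse a 3-column (standard) or 6-column (aliased) counter line."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) == 3:
        return CounterLine(*parts)
    if len(parts) == 6:
        return CounterLine(*parts[:5], float(parts[5]))
    raise ValueError(f"expected 3 or 6 columns, found {len(parts)}")


def aliased_sample(entry, value):
    """Return (alias name, scaled value), or None if the line has no alias.

    Assumption: scaling is a plain multiplication by the last column.
    """
    if entry.alias is None:
        return None
    return entry.alias, value * entry.multiplier


LINE = (
    "DCGM_FI_DEV_FB_TOTAL, gauge, Frame buffer memory total (in MB)., "
    "nvidia_gpu_memory_total_bytes, Total memory of the GPU device in bytes, 1048576"
)

entry = parse_counter_line(LINE)
# 81920 MB is an illustrative reading for an 80 GB card, not measured data.
print(aliased_sample(entry, 81920))
```

With a multiplier of 1048576, a reading of 81920 MB becomes 85899345920 bytes under the alias name, while the original DCGM metric keeps its own name and unit.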

Collect slurm jobid and user running on the particular GPU

This feature relies on the existence of a file /run/gpustat/GPU-UUID (e.g. /run/gpustat/GPU-8b4054a4-c830-20d4-1111-222222222222) or /run/gpustat/MIG-UUID (e.g. /run/gpustat/MIG-2201f4b1-a001-5ae1-87df-c6ef1d8adfab) containing the space-separated jobid and uidnumber, e.g.:

[root@della-l01g2 ~]# cat /run/gpustat/MIG-2201f4b1-a001-5ae1-87df-c6ef1d8adfab
51234567 123456

This information will be appended as labels to the appropriate metrics, e.g.:

DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-d6dd33b9-e50e-997c-f303-c8f7312fa498",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="7",Hostname="della-l01g1",DCGM_FI_DRIVER_VERSION="535.104.05",jobid="51234567",userid="123456"} 8543

and is also exposed as separate metrics:

nvidia_gpu_jobId{minor_number="0",name="NVIDIA A100 80GB PCIe",uuid="GPU-d6dd33b9-e50e-997c-f303-c8f7312fa498"} 51234567
nvidia_gpu_jobUid{minor_number="0",name="NVIDIA A100 80GB PCIe",uuid="GPU-d6dd33b9-e50e-997c-f303-c8f7312fa498"} 123456

same as in our previously mentioned nvidia_gpu_exporter.
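The lookup described in this section can be sketched as follows. This is an illustration of the file convention only, not the exporter's implementation. The function name read_gpu_job is hypothetical, and returning no labels when the file is missing or malformed is an assumption about the desired behavior.

```python
from pathlib import Path


def read_gpu_job(uuid, base_dir="/run/gpustat"):
    """Return {'jobid': ..., 'userid': ...} for a GPU or MIG UUID.

    Assumption: a missing or malformed file means no job is running on the
    device, so an empty dict (no extra labels) is returned.
    """
    path = Path(base_dir) / uuid
    try:
        fields = path.read_text().split()
    except FileNotFoundError:
        return {}
    if len(fields) != 2:
        return {}
    jobid, uid = fields
    return {"jobid": jobid, "userid": uid}


# Example: merge the job labels into an existing label set for one device.
labels = {"gpu": "0", "UUID": "GPU-d6dd33b9-e50e-997c-f303-c8f7312fa498"}
labels.update(read_gpu_job(labels["UUID"]))
print(labels)
```

The files under /run/gpustat have to be written by something outside the exporter, for example a Slurm prolog script, and removed again when the job ends.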
