Jobstats is a job monitoring platform composed of data exporters, Prometheus, Grafana, and the Slurm database, while jobstats is a command that operates on the Jobstats platform. If you are looking to set up the Jobstats platform, see below and this manuscript.
See our PEARC 2023 presentation: "Jobstats: A Slurm-Compatible Job Monitoring Platform for CPU and GPU Clusters" (PDF). Here is our PEARC 2024 poster (PDF).
The jobstats command provides users with a Slurm job efficiency report for a given jobid:
$ jobstats 39798795
================================================================================
Slurm Job Statistics
================================================================================
Job ID: 39798795
NetID/Account: aturing/math
Job Name: sys_logic_ordinals
State: COMPLETED
Nodes: 2
CPU Cores: 48
CPU Memory: 256GB (5.3GB per CPU-core)
GPUs: 4
QOS/Partition: della-gpu/gpu
Cluster: della
Start Time: Fri Mar 4, 2022 at 1:56 AM
Run Time: 18:41:56
Time Limit: 4-00:00:00
Overall Utilization
================================================================================
CPU utilization [||||| 10%]
CPU memory usage [||| 6%]
GPU utilization [|||||||||||||||||||||||||||||||||| 68%]
GPU memory usage [||||||||||||||||||||||||||||||||| 66%]
Detailed Utilization
================================================================================
CPU utilization per node (CPU time used/run time)
della-i14g2: 1-21:41:20/18-16:46:24 (efficiency=10.2%)
della-i14g3: 1-18:48:55/18-16:46:24 (efficiency=9.5%)
Total used/runtime: 3-16:30:16/37-09:32:48, efficiency=9.9%
CPU memory usage per node - used/allocated
della-i14g2: 7.9GB/128.0GB (335.5MB/5.3GB per core of 24)
della-i14g3: 7.8GB/128.0GB (334.6MB/5.3GB per core of 24)
Total used/allocated: 15.7GB/256.0GB (335.1MB/5.3GB per core of 48)
GPU utilization per node
della-i14g2 (GPU 0): 65.7%
della-i14g2 (GPU 1): 64.5%
della-i14g3 (GPU 0): 72.9%
della-i14g3 (GPU 1): 67.5%
GPU memory usage per node - maximum used/total
della-i14g2 (GPU 0): 26.5GB/40.0GB (66.2%)
della-i14g2 (GPU 1): 26.5GB/40.0GB (66.2%)
della-i14g3 (GPU 0): 26.5GB/40.0GB (66.2%)
della-i14g3 (GPU 1): 26.5GB/40.0GB (66.2%)
Notes
================================================================================
* This job only used 6% of the 256GB of total allocated CPU memory. For
future jobs, please allocate less memory by using a Slurm directive such
as --mem-per-cpu=1G or --mem=10G. This will reduce your queue times and
make the resources available to other users. For more info:
https://researchcomputing.princeton.edu/support/knowledge-base/memory
* This job only needed 19% of the requested time which was 4-00:00:00. For
future jobs, please request less time by modifying the --time Slurm
directive. This will lower your queue times and allow the Slurm job
scheduler to work more effectively for all users. For more info:
https://researchcomputing.princeton.edu/support/knowledge-base/slurm
* For additional job metrics including metrics plotted against time:
https://mydella.princeton.edu/pun/sys/jobstats (VPN required off-campus)
For completed jobs, the data is taken from a call to sacct with several fields including AdminComment. For running jobs, the Prometheus database must be queried.
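As a minimal illustration (not the exact call that jobstats makes), the stored record for a completed job can be pulled with sacct; the field list below is only an example:
# Minimal sketch: fetch a few sacct fields, including AdminComment, for a
# completed job. The exact fields and parsing used by jobstats may differ.
import subprocess

def get_job_record(jobid):
    fields = ["JobID", "State", "AllocCPUS", "TimelimitRaw", "AdminComment"]
    cmd = ["sacct", "-P", "-X", "-n", "-j", jobid, "-o", ",".join(fields)]
    out = subprocess.run(cmd, stdout=subprocess.PIPE, universal_newlines=True,
                         check=True).stdout
    return dict(zip(fields, out.strip().split("|")))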
Importantly, the jobstats command is also used to replace smail, the Slurm executable for sending email reports that are based on seff. This means that users receive emails that are the exact output of jobstats, including the notes.
The installation requirements for jobstats are Python 3.6+, Requests 2.20+ and (optionally) blessed 1.17+, which can be used for coloring and styling text.
The necessary software can be installed as follows:
$ conda create --name js-env python=3.7 requests blessed -c conda-forge
After setting up the Jobstats platform (see below), run these commands to start using the jobstats command on your system:
$ git clone https://github.com/PrincetonUniversity/jobstats.git
$ cd jobstats
# use a text editor to create config.py (see the example configuration file below)
$ chmod u+x jobstats
$ ./jobstats 1234567
Use the config.py below as the starting point for your configuration file:
##########################
## JOBSTATS CONFIG FILE ##
##########################
# prometheus server address and port
PROM_SERVER = "http://vigilant2:8480"
# number of seconds between measurements
SAMPLING_PERIOD = 30
# threshold values for red versus black notes
GPU_UTIL_RED = 15 # percentage
GPU_UTIL_BLACK = 25 # percentage
CPU_UTIL_RED = 65 # percentage
CPU_UTIL_BLACK = 80 # percentage
TIME_EFFICIENCY_RED = 40 # percentage
TIME_EFFICIENCY_BLACK = 70 # percentage
MIN_MEMORY_USAGE = 70 # percentage
MIN_RUNTIME_SECONDS = 10 * SAMPLING_PERIOD # seconds
# translate cluster names in Slurm DB to informal names
CLUSTER_TRANS = {"tiger":"tiger2"}
#CLUSTER_TRANS = {} # if no translations then use an empty dictionary
CLUSTER_TRANS_INV = dict(zip(CLUSTER_TRANS.values(), CLUSTER_TRANS.keys()))
# maximum number of characters to display in jobname
MAX_JOBNAME_LEN = 64
# default CPU memory per core in bytes for each cluster
# if unsure then use memory per node divided by cores per node
DEFAULT_MEM_PER_CORE = {"adroit":3355443200,
                        "della":4194304000,
                        "stellar":7864320000,
                        "tiger":4294967296,
                        "traverse":7812500000}
# number of CPU-cores per node for each cluster
# this will eventually be replaced with explicit values for each node
CORES_PER_NODE = {"adroit":32,
                  "della":28,
                  "stellar":96,
                  "tiger":40,
                  "traverse":32}
#########################################################################################
## C U S T O M N O T E S ##
## ##
## Be sure to work from the examples. Pay attention to the different quote characters ##
## when f-strings are involved. ##
#########################################################################################
NOTES = []
# zero GPU utilization (single GPU jobs)
condition = 'self.gpus and (self.diff > c.MIN_RUNTIME_SECONDS) and num_unused_gpus > 0 ' \
'and self.gpus == 1'
note = ("This job did not use the GPU. Please resolve this " \
"before running additional jobs. Wasting " \
"resources prevents other users from getting their work done " \
"and it causes your subsequent jobs to have a lower priority. " \
"Is the code GPU-enabled? " \
"Please consult the documentation for the software. For more info:",
"https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing")
style = "bold-red"
NOTES.append((condition, note, style))
# zero GPU utilization (multi-GPU jobs)
condition = 'self.gpus and (self.diff > c.MIN_RUNTIME_SECONDS) and num_unused_gpus > 0 ' \
'and self.gpus > 1'
note = ('f"This job did not use {num_unused_gpus} of the {self.gpus} allocated GPUs. "' \
'"Please resolve this before running additional jobs. "' \
'"Wasting resources prevents other users from getting their work done "' \
'"and it causes your subsequent jobs to have a lower priority. Is the "' \
'"code capable of using multiple GPUs? Please consult the documentation for "' \
'"the software. For more info:"',
"https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing")
style = "bold-red"
NOTES.append((condition, note, style))
# low GPU utilization (ondemand and salloc)
condition = '(not zero_gpu) and self.gpus and (self.gpu_utilization <= c.GPU_UTIL_RED) ' \
'and interactive_job and (self.diff / SECONDS_PER_HOUR > 12)'
note = ('f"The overall GPU utilization of this job is only {round(self.gpu_utilization)}%. "' \
'f"This value is low compared to the cluster mean value of 50%. Please "' \
'f"do not create \"salloc\" or OnDemand sessions for more than 12 hours unless you "' \
'f"plan to work intensively during the entire period. For more info:"',
"https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing#util")
style = "bold-red"
NOTES.append((condition, note, style))
# low GPU utilization (batch jobs)
condition = '(not zero_gpu) and self.gpus and (self.gpu_utilization <= c.GPU_UTIL_RED) ' \
'and (not interactive_job)'
note = ('f"The overall GPU utilization of this job is only {round(self.gpu_utilization)}%. "' \
'"This value is low compared to the cluster mean value of 50%. Please "' \
'"investigate the reason for the low utilization. For more info:"',
"https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing#util")
style = "bold-red"
NOTES.append((condition, note, style))
# low CPU utilization (black, more than one core)
condition = '(not zero_cpu) and (not self.gpus) and (self.cpu_efficiency <= c.CPU_UTIL_BLACK) ' \
'and (self.cpu_efficiency > c.CPU_UTIL_RED) and int(self.ncpus) > 1'
note = ('f"The overall CPU utilization of this job is {ceff}%. This value "' \
'f"is{somewhat}low compared to the target range of "' \
'f"90% and above. Please investigate the reason for the low efficiency. "' \
'"For instance, have you conducted a scaling analysis? For more info:"',
"https://researchcomputing.princeton.edu/get-started/cpu-utilization")
style = "normal"
NOTES.append((condition, note, style))
# low CPU utilization (red, more than one core)
condition = '(not zero_cpu) and (not self.gpus) and (self.cpu_efficiency < c.CPU_UTIL_RED) ' \
'and (int(self.ncpus) > 1)'
note = ('f"The overall CPU utilization of this job is {ceff}%. This value "' \
'f"is{somewhat}low compared to the target range of "' \
'f"90% and above. Please investigate the reason for the low efficiency. "' \
'"For instance, have you conducted a scaling analysis? For more info:"',
"https://researchcomputing.princeton.edu/get-started/cpu-utilization")
style = "bold-red"
NOTES.append((condition, note, style))
# low CPU utilization (black, serial job)
condition = '(not zero_cpu) and (not self.gpus) and (self.cpu_efficiency <= c.CPU_UTIL_BLACK) ' \
'and (self.cpu_efficiency > c.CPU_UTIL_RED) and int(self.ncpus) == 1'
note = ('f"The overall CPU utilization of this job is {ceff}%. This value "' \
'f"is{somewhat}low compared to the target range of "' \
'f"90% and above. Please investigate the reason for the low efficiency. "' \
'"For more info:"',
"https://researchcomputing.princeton.edu/get-started/cpu-utilization")
style = "normal"
NOTES.append((condition, note, style))
# low CPU utilization (red, serial job)
condition = '(not zero_cpu) and (not self.gpus) and (self.cpu_efficiency < c.CPU_UTIL_RED) ' \
'and (int(self.ncpus) == 1)'
note = ('f"The overall CPU utilization of this job is {ceff}%. This value "' \
'f"is{somewhat}low compared to the target range of "' \
'f"90% and above. Please investigate the reason for the low efficiency. "' \
'"For more info:"',
"https://researchcomputing.princeton.edu/get-started/cpu-utilization")
style = "bold-red"
NOTES.append((condition, note, style))
# out of memory
condition = 'self.state == "OUT_OF_MEMORY"'
note = ("This job failed because it needed more CPU memory than the amount that " \
"was requested. The solution is to resubmit the job while " \
"requesting more CPU memory by " \
"modifying the --mem-per-cpu or --mem Slurm directive. For more info: ",
"https://researchcomputing.princeton.edu/support/knowledge-base/memory")
style = "bold-red"
NOTES.append((condition, note, style))
# timeout
condition = 'self.state == "TIMEOUT"'
note = ("This job failed because it exceeded the time limit. If there are no " \
"other problems then the solution is to increase the value of the " \
"--time Slurm directive and resubmit the job. For more info:",
"https://researchcomputing.princeton.edu/support/knowledge-base/slurm")
style = "bold-red"
NOTES.append((condition, note, style))
# excessive run time limit (red)
condition = 'self.time_eff_violation and self.time_efficiency <= c.TIME_EFFICIENCY_RED'
note = ('f"This job only needed {self.time_efficiency}% of the requested time "' \
'f"which was {self.human_seconds(SECONDS_PER_MINUTE * self.timelimitraw)}. "' \
'"For future jobs, please request less time by modifying "' \
'"the --time Slurm directive. This will "' \
'"lower your queue times and allow the Slurm job scheduler to work more "' \
'"effectively for all users. For more info:"',
"https://researchcomputing.princeton.edu/support/knowledge-base/slurm")
style = "bold-red"
NOTES.append((condition, note, style))
# excessive run time limit (black)
condition = 'self.time_eff_violation and self.time_efficiency > c.TIME_EFFICIENCY_RED'
note = ('f"This job only needed {self.time_efficiency}% of the requested time "' \
'f"which was {self.human_seconds(SECONDS_PER_MINUTE * self.timelimitraw)}. "' \
'"For future jobs, please request less time by modifying "' \
'"the --time Slurm directive. This will "' \
'"lower your queue times and allow the Slurm job scheduler to work more "' \
'"effectively for all users. For more info:"',
"https://researchcomputing.princeton.edu/support/knowledge-base/slurm")
style = "normal"
NOTES.append((condition, note, style))
# somewhat low GPU utilization
condition = '(not zero_gpu) and self.gpus and (self.gpu_utilization < c.GPU_UTIL_BLACK) and ' \
'(self.gpu_utilization > c.GPU_UTIL_RED) and (self.diff > c.MIN_RUNTIME_SECONDS)'
note = ('f"The overall GPU utilization of this job is {round(self.gpu_utilization)}%. "' \
'"This value is somewhat low compared to the cluster mean value of 50%. For more info:"',
'https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing#util')
style = "normal"
NOTES.append((condition, note, style))
# excess CPU memory
condition = '(not zero_gpu) and (not zero_cpu) and (cpu_memory_utilization < c.MIN_MEMORY_USAGE) ' \
'and (gb_per_core > (mpc / 1024**3) - 2) and (total > mpc) and gpu_show and ' \
'(not self.partition == "datascience") and (not self.partition == "mig") and ' \
'(self.state != "OUT_OF_MEMORY") and (cores_per_node < cpn) and ' \
'(self.diff > c.MIN_RUNTIME_SECONDS)'
note = ('f"This job {opening} of the {self.cpu_memory_formatted(with_label=False)} "' \
'"of total allocated CPU memory. "' \
'"For future jobs, please allocate less memory by using a Slurm directive such "' \
'f"as --mem-per-cpu={self.rounded_memory_with_safety(gb_per_core_used)}G or "' \
'f"--mem={self.rounded_memory_with_safety(gb_per_node_used)}G. "' \
'"This will reduce your queue times and make the resources available to "' \
'"other users. For more info:"',
"https://researchcomputing.princeton.edu/support/knowledge-base/memory")
style = "normal"
NOTES.append((condition, note, style))
# serial jobs wasting multiple cpu-cores
condition = '(self.nnodes == "1") and (int(self.ncpus) > 1) and (not self.gpus) and (serial_ratio > 0.85 ' \
'and serial_ratio < 1.1)'
note = ('f"The CPU utilization of this job ({self.cpu_efficiency}%) is{approx}equal "' \
'"to 1 divided by the number of allocated CPU-cores "' \
'f"(1/{self.ncpus}={round(eff_if_serial)}%). This suggests that you may be "' \
'"running a code that can only use 1 CPU-core. If this is true then "' \
'"allocating more than 1 CPU-core is wasteful. Please consult the "' \
'"documentation for the software to see if it is parallelized. For more info:"',
"https://researchcomputing.princeton.edu/support/knowledge-base/parallel-code")
style = "normal"
NOTES.append((condition, note, style))
# job ran in the test queue
condition = '"test" in self.qos or "debug" in self.qos'
note = ('f"This job ran in the {self.qos} QOS. Each user can only run a small number of "' \
'"jobs simultaneously in this QOS. For more info:"',
'https://researchcomputing.princeton.edu/support/knowledge-base/job-priority#test-queue')
style = "normal"
NOTES.append((condition, note, style))
# more details for della
condition = '(self.cluster == "della")'
note = ("For additional job metrics including metrics plotted against time:",
"https://mydella.princeton.edu/pun/sys/jobstats (VPN required off-campus)")
style = "normal"
NOTES.append((condition, note, style))
# more details for adroit
condition = '(self.cluster == "adroit")'
note = ("For additional job metrics including metrics plotted against time:",
"https://myadroit.princeton.edu/pun/sys/jobstats (VPN required off-campus)")
style = "normal"
NOTES.append((condition, note, style))
# example of a simple note that is always displayed
condition = 'True'
note = "Have a nice day!"
style = "normal"
NOTES.append((condition, note, style))
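To make the NOTES mechanism concrete, here is a rough sketch, not the actual jobstats code, of how the (condition, note, style) entries can be evaluated: each condition is a Python expression, and note fragments written as f-string source (e.g. 'f"..."') are evaluated a second time so that job attributes are interpolated.
# Rough sketch only: evaluate NOTES entries against a "context" dict that
# provides the names used in the conditions and notes (self, c,
# num_unused_gpus, zero_gpu, interactive_job, ceff, somewhat, ...).
def render_notes(notes, context):
    for condition, note, style in notes:
        if not eval(condition, {}, context):       # condition is a Python expression
            continue
        parts = note if isinstance(note, tuple) else (note,)
        rendered = []
        for part in parts:
            # fragments written as f-string source need a second evaluation pass
            if part.lstrip().startswith(('f"', "f'")):
                rendered.append(eval(part, {}, context))
            else:
                rendered.append(part)
        yield " ".join(rendered), style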
Below is an outline of the steps that need to be taken to set up the Jobstats platform for a Slurm cluster:
- Switch from Linux process accounting to cgroup-based job accounting
- Set up the exporters: cgroup, node and GPU (on the compute nodes) and, optionally, GPFS (centrally)
- Set up the prolog.d and epilog.d scripts on the GPU nodes
- Set up the Prometheus server and configure it to scrape data from the compute nodes and all configured exporters
- Set up the slurmctldepilog.sh script for long-term job summary retention
- Lastly, configure Grafana and Open OnDemand
We use these four exporters:
- node exporter: https://github.com/prometheus/node_exporter
- cgroup exporter: https://github.com/plazonic/cgroup_exporter
- nvidia gpu exporter: https://github.com/plazonic/nvidia_gpu_prometheus_exporter
- gpfs exporter: https://github.com/plazonic/gpfs-exporter
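Each exporter serves plain-text metrics over HTTP, so a quick sanity check is to fetch a node's /metrics endpoints. The sketch below uses the requests library; the hostname is only an example, and the ports are the ones used in the configuration further down:
# Minimal sketch: confirm that an exporter is up by fetching its /metrics
# endpoint and counting the non-comment metric lines it serves.
import requests

def check_exporter(host, port):
    resp = requests.get("http://{}:{}/metrics".format(host, port), timeout=5)
    resp.raise_for_status()
    return sum(1 for line in resp.text.splitlines() if not line.startswith("#"))

for port in (9100, 9306):          # node_exporter and cgroup_exporter
    print(port, check_exporter("tiger-h19c1n10", port))
# GPU nodes additionally serve nvidia_gpu_prometheus_exporter on port 9445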
What follows is an example of the production Prometheus configuration used for the Tiger cluster, which has both regular and GPU nodes:
---
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: master
scrape_configs:
  - job_name: Tiger Nodes
    scrape_interval: 30s
    scrape_timeout: 30s
    file_sd_configs:
      - files:
        - "/etc/prometheus/local_files_sd_config.d/tigernodes.json"
    metric_relabel_configs:
      - source_labels:
        - __name__
        regex: "^go_.*"
        action: drop
  - job_name: TigerGPU Nodes
    scrape_interval: 30s
    scrape_timeout: 30s
    file_sd_configs:
      - files:
        - "/etc/prometheus/local_files_sd_config.d/tigergpus.json"
    metric_relabel_configs:
      - source_labels:
        - __name__
        regex: "^go_.*"
        action: drop
tigernodes.json looks like:
[
  {
    "labels": {
      "cluster": "tiger"
    },
    "targets": [
      "tiger-h19c1n10:9100",
      "tiger-h19c1n10:9306",
      ...
    ]
  }
]
where both node_exporter (port 9100) and cgroup_exporter (port 9306) are listed for all of Tiger's nodes. tigergpus.json looks very similar except that it collects data from nvidia_gpu_prometheus_exporter on port 9445.
Note the additional cluster label.
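The cluster label lets one Prometheus server hold data for several clusters and lets queries filter on a single cluster. As a minimal sketch (not part of jobstats itself), the server's HTTP API can be queried with the requests library; node_load1 is a standard node_exporter metric and the address reuses the PROM_SERVER example from config.py above:
# Minimal sketch: query the Prometheus HTTP API for a node_exporter metric,
# restricted to one cluster via the extra "cluster" label.
import requests

PROM_SERVER = "http://vigilant2:8480"   # same address as in config.py above

def query(promql):
    resp = requests.get(PROM_SERVER + "/api/v1/query",
                        params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

for sample in query('node_load1{cluster="tiger"}'):
    print(sample["metric"]["instance"], sample["value"][1])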
In order to correctly track which GPU is assigned to which jobid, we use Slurm prolog and epilog scripts to create files in the /run/gpustat directory named after either the GPU ordinal number (0, 1, ...) or, in the case of MIG cards, the MIG UUID. Each file contains the space-separated jobid and uid of the user. For example:
# cat /run/gpustat/MIG-265a219d-a49f-578a-825d-222c72699c16
45916256 262563
These two scripts can be found in the slurm directory. For example, slurm/epilog.d/gpustats_helper.sh could be installed as /etc/slurm/epilog.d/gpustats_helper.sh and slurm/prolog.d/gpustats_helper.sh as /etc/slurm/prolog.d/gpustats_helper.sh with these slurm.conf config statements:
Prolog=/etc/slurm/prolog.d/*.sh
Epilog=/etc/slurm/epilog.d/*.sh
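For illustration only (the exporters' actual code may differ), the per-GPU files written by these prolog and epilog scripts can be read back into a mapping from GPU to (jobid, uid):
# Sketch: build a {gpu_or_mig_id: (jobid, uid)} map from /run/gpustat,
# following the file layout described above.
import os

def read_gpu_assignments(path="/run/gpustat"):
    assignments = {}
    for name in os.listdir(path):          # "0", "1", ... or "MIG-<uuid>"
        with open(os.path.join(path, name)) as fh:
            jobid, uid = fh.read().split()
        assignments[name] = (jobid, uid)
    return assignments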
The Grafana dashboard JSON that uses all of the exporters is included in the grafana subdirectory. It expects one parameter, jobid. Since it may not be easy to find the job's time range, we also use an Open OnDemand job stats helper that generates the correct time range for a given jobid; it is documented in the next section.
The following image illustrates what the dashboard looks like in use:
The ood-jobstats-helper subdirectory contains an Open OnDemand app that, given a jobid, uses sacct to generate a full Grafana URL with the job's jobid, start time, and end time.
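As a rough sketch of the same idea (not the actual OnDemand app), the start and end times reported by sacct can be converted into the epoch-millisecond from/to parameters that Grafana accepts; the dashboard path and the jobid variable name below are placeholders:
# Sketch: build a Grafana URL covering a completed job's run time. The
# dashboard path and the "jobid" template variable are placeholders.
import subprocess
from datetime import datetime

GRAFANA_DASHBOARD = "https://grafana.example.edu/d/abc123/jobstats"  # placeholder

def to_ms(stamp):
    # sacct reports local time as YYYY-MM-DDTHH:MM:SS
    return int(datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S").timestamp() * 1000)

def grafana_url(jobid):
    out = subprocess.run(["sacct", "-n", "-P", "-X", "-j", jobid, "-o", "Start,End"],
                         stdout=subprocess.PIPE, universal_newlines=True,
                         check=True).stdout.strip()
    start, end = out.split("|")            # assumes a completed job (End is not "Unknown")
    return "{}?var-jobid={}&from={}&to={}".format(GRAFANA_DASHBOARD, jobid,
                                                  to_ms(start), to_ms(end))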
Job summaries, as described above, are generated and stored in the Slurm database at the end of each job by the slurmctld epilog script, e.g.:
EpilogSlurmctld=/usr/local/sbin/slurmctldepilog.sh
The script can be found in the slurm subdirectory, named "slurmctldepilog.sh".
For processing old jobs where the slurmctld epilog script did not run, or jobs where it failed, there is a per-cluster ingest jobstats service. This is a Python-based script that runs on the slurmdbd host as a systemd timer and service, querying and modifying the Slurm database directly. The script (ingest_jobstats.py) and the systemd timer and service files are in the slurm directory.
We made heavy use of this script to generate job summaries for older jobs, but with the current version of the epilog script it should no longer be needed.
We use slurm/jobstats_mail.sh as Slurm's mail program, e.g. in slurm.conf:
MailProg=/usr/local/bin/jobstats_mail.sh
This will include jobstats information for jobs that have requested email notifications on completion.
The jobstats command analyzes each job and produces custom notes at the bottom of the output. Below are several examples:
* This job ran on the mig partition where each job is limited to 1 MIG
GPU, 1 CPU-core, 10 GB of GPU memory and 32 GB of CPU memory. A MIG GPU
is about 1/7th as powerful as an A100 GPU. Please continue using the mig
partition when possible. For more info:
https://researchcomputing.princeton.edu/systems/della#gpus
* This job completed while only needing 19% of the requested time which
was 2-00:00:00. For future jobs, please decrease the value of the --time
Slurm directive. This will lower your queue times and allow the Slurm
job scheduler to work more effectively for all users. For more info:
https://researchcomputing.princeton.edu/support/knowledge-base/slurm
* This job did not use the GPU. Please resolve this before running
additional jobs. Wasting resources prevents other users from getting
their work done and it causes your subsequent jobs to have a lower
priority. Is the code GPU-enabled? Please consult the documentation for
the code. For more info:
https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing
* This job only used 15% of the 100GB of total allocated CPU memory.
Please consider allocating less memory by using the Slurm directive
--mem-per-cpu=3G or --mem=18G. This will reduce your queue times and
make the resources available to other users. For more info:
https://researchcomputing.princeton.edu/support/knowledge-base/memory
* This job ran on a large-memory (datascience) node but it only used 117
GB of CPU memory. The large-memory nodes should only be used for jobs
that require more than 190 GB. Please allocate less memory by using the
Slurm directive --mem-per-cpu=9G or --mem=150G. For more info:
https://researchcomputing.princeton.edu/support/knowledge-base/memory
* The CPU utilization of this job (24%) is approximately equal to 1
divided by the number of allocated CPU-cores (1/4=25%). This suggests
that you may be running a code that can only use 1 CPU-core. If this is
true then allocating more than 1 CPU-core is wasteful. Please consult
the documentation for the software to see if it is parallelized. For
more info:
https://researchcomputing.princeton.edu/support/knowledge-base/parallel-code
* This job did not use the CPU. This suggests that something went wrong at
the very beginning of the job. Check your Slurm script for errors and
look for useful information in the file slurm-46987157.out if it exists.
* The Tiger cluster is intended for jobs that require multiple nodes. This
job ran in the serial partition where jobs are assigned the lowest
priority. On Tiger, a job will run in the serial partition if it only
requires 1 node. Consider carrying out this work elsewhere.
* For additional job metrics including metrics plotted against time:
https://mystellar.princeton.edu/pun/sys/jobstats (VPN required off-campus)
* For additional job metrics including metrics plotted against time:
https://stats.rc.princeton.edu (VPN required off-campus)
One can also output the raw JSON:
$ jobstats -j 39798795 | jq
{
  "gpus": 4,
  "nodes": {
    "della-i14g2": {
      "cpus": 24,
      "gpu_total_memory": {
        "0": 42949672960,
        "1": 42949672960
      },
      "gpu_used_memory": {
        "0": 28453568512,
        "1": 28453568512
      },
      "gpu_utilization": {
        "0": 65.7,
        "1": 64.5
      },
      "total_memory": 137438953472,
      "total_time": 164480.1,
      "used_memory": 8444272640
    },
    "della-i14g3": {
      "cpus": 24,
      "gpu_total_memory": {
        "0": 42949672960,
        "1": 42949672960
      },
      "gpu_used_memory": {
        "0": 28453634048,
        "1": 28453634048
      },
      "gpu_utilization": {
        "0": 72.9,
        "1": 67.5
      },
      "total_memory": 137438953472,
      "total_time": 154135.9,
      "used_memory": 8419606528
    }
  },
  "total_time": 67316
}
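The raw JSON is convenient for scripting. For example, the overall CPU efficiency reported above (9.9%) can be recomputed from the per-node total_time values and the job's run time; this sketch shells out to jobstats and assumes the field names shown in the sample output:
# Recompute overall CPU efficiency from the raw JSON: CPU time used on all
# nodes divided by (run time x allocated cores). Field names as shown above.
import json
import subprocess

def cpu_efficiency(jobid):
    out = subprocess.run(["jobstats", "-j", jobid], stdout=subprocess.PIPE,
                         universal_newlines=True, check=True).stdout
    data = json.loads(out)
    cpu_seconds = sum(node["total_time"] for node in data["nodes"].values())
    cores = sum(node["cpus"] for node in data["nodes"].values())
    return 100 * cpu_seconds / (data["total_time"] * cores)

# For the job above: (164480.1 + 154135.9) / (67316 * 48) ≈ 9.9%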
In addition to jobstats, the following software tools build on the Jobstats platform:
- gpudash - A command for generating a text-based GPU utilization dashboard
- job defense shield - A tool for sending automated email alerts to users
- reportseff - A command for displaying Slurm efficiency reports for several jobs at once
- utilization reports - A tool for sending detailed usage reports to users by email
The Jobstats platform is used by a number of institutions, including:
- Brown University
- Free University of Berlin
- Princeton Computer Science
- Princeton Research Computing
- and more