A fast, asynchronous GPU monitoring tool that provides real-time status of NVIDIA GPUs across multiple machines through SSH, with support for jump hosts and per-machine credentials.
- Real-time Monitoring: Live updates of GPU status across multiple machines
- Asynchronous Operation: Fast, non-blocking checks using
asyncio
andasyncssh
- Jump Host Support: Access machines behind a bastion/jump host
- Rich Display: Beautiful terminal UI using the
rich
library - Flexible Configuration:
- YAML-based configuration
- Per-machine SSH credentials
- Pattern-based target generation
- Robust Error Handling: Graceful handling of network issues and timeouts
pip install ssh-gpu-monitor
After installation, you can run the monitor in several ways:
# Run using the command-line tool
ssh-gpu-monitor
# Or run as a Python module
python -m ssh_gpu_monitor
# Use a custom config file
ssh-gpu-monitor --config /path/to/your/config.yaml
# Get the default config path
ssh-gpu-monitor --get_config_path
- Get the default config path:
ssh-gpu-monitor --get_config_path
- Either:
- Copy the default config to your preferred location and use
--config
to specify it - Modify the default config directly
- Use command line options to override any config values (see below)
- Copy the default config to your preferred location and use
Example config file:
ssh:
username: "your_username"
key_path: "~/.ssh/id_rsa"
jump_host: "jump.example.com"
timeout: 10
targets:
individual:
- "gpu-server1"
- "gpu-server2"
display:
refresh_rate: 5
ssh:
username: "default_user" # Default username
key_path: "~/.ssh/id_rsa" # Default SSH key
jump_host: "jump.example.com"
timeout: 10 # seconds
targets:
# Individual machines
individual:
- host: "gpu-server1"
username: "different_user" # Optional override
key_path: "~/.ssh/special_key" # Optional override
- "gpu-server2" # Uses default credentials
# Pattern-based groups
patterns:
- prefix: "gpu"
start: 1
end: 30
format: "{prefix}{number:02}" # Results in gpu01, gpu02, etc.
username: "gpu_user" # Optional override
key_path: "~/.ssh/gpu_key" # Optional override
display:
refresh_rate: 5 # seconds
debug:
enabled: false
log_dir: "logs"
log_file: "gpu_checker.log"
log_max_size: 1048576 # 1MB
log_backup_count: 3
Override any configuration option via command line:
# Enable debug logging
python main.py --debug.enabled
# Override SSH settings
python main.py --ssh.username=other_user --ssh.key_path=~/.ssh/other_key
# Check specific targets
python main.py --targets gpu01 gpu02 special-server
Generate targets using patterns:
patterns:
- prefix: "compute"
start: 1
end: 100
format: "{prefix}-{number:03d}" # compute-001, compute-002, etc.
Specify different credentials for specific machines:
individual:
- host: "special-gpu"
username: "admin"
key_path: "~/.ssh/admin_key"
Enable detailed logging for troubleshooting:
debug:
enabled: true
log_dir: "logs"
log_file: "debug.log"
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License - see the LICENSE file for details.
Originally created as "some awful, brittle code to check GPU status of multiple machines at a given host address through an SSH jumpnode."
Special thanks to:
- @harrygcoppock and @minut1bc for their PRs on v1
- gpuobserver for earlier code concepts
- Stack Overflow answer for SSH connection handling insights
- Rich for the beautiful terminal interface
- asyncssh for async SSH support
- PyYAML for configuration management
The following projects are similar in spirit, but only support a single machine:
- SSH connection might timeout on very slow networks
- Some older NVIDIA drivers might return incompatible XML formats
- Add support for AMD GPUs
- Implement process name filtering
- Add web interface
- Support for custom SSH config files