Skip to content

Fast asynchronous GPU monitoring tool across multiple machines through SSH

License

Notifications You must be signed in to change notification settings

afspies/gpu-monitor

Repository files navigation

SSH GPU Monitor πŸ–₯️

Python 3.9 License: MIT

A fast, asynchronous GPU monitoring tool that provides real-time status of NVIDIA GPUs across multiple machines through SSH, with support for jump hosts and per-machine credentials.

Example Output

✨ Features

  • Real-time Monitoring: Live updates of GPU status across multiple machines
  • Asynchronous Operation: Fast, non-blocking checks using asyncio and asyncssh
  • Jump Host Support: Access machines behind a bastion/jump host
  • Rich Display: Beautiful terminal UI using the rich library
  • Flexible Configuration:
    • YAML-based configuration
    • Per-machine SSH credentials
    • Pattern-based target generation
  • Robust Error Handling: Graceful handling of network issues and timeouts

πŸš€ Installation & Usage

Install from PyPI

pip install ssh-gpu-monitor

Run the Monitor

After installation, you can run the monitor in several ways:

# Run using the command-line tool
ssh-gpu-monitor

# Or run as a Python module
python -m ssh_gpu_monitor

# Use a custom config file
ssh-gpu-monitor --config /path/to/your/config.yaml

# Get the default config path
ssh-gpu-monitor --get_config_path

Configuration

  1. Get the default config path:
ssh-gpu-monitor --get_config_path
  1. Either:
    • Copy the default config to your preferred location and use --config to specify it
    • Modify the default config directly
    • Use command line options to override any config values (see below)

Example config file:

ssh:
  username: "your_username"
  key_path: "~/.ssh/id_rsa"
  jump_host: "jump.example.com"
  timeout: 10

targets:
  individual:
    - "gpu-server1"
    - "gpu-server2"

display:
  refresh_rate: 5

πŸ“– Configuration

Basic Structure

ssh:
  username: "default_user"  # Default username
  key_path: "~/.ssh/id_rsa"  # Default SSH key
  jump_host: "jump.example.com"
  timeout: 10  # seconds

targets:
  # Individual machines
  individual:
    - host: "gpu-server1"
      username: "different_user"  # Optional override
      key_path: "~/.ssh/special_key"  # Optional override
    - "gpu-server2"  # Uses default credentials
  
  # Pattern-based groups
  patterns:
    - prefix: "gpu"
      start: 1
      end: 30
      format: "{prefix}{number:02}"  # Results in gpu01, gpu02, etc.
      username: "gpu_user"  # Optional override
      key_path: "~/.ssh/gpu_key"  # Optional override

display:
  refresh_rate: 5  # seconds

debug:
  enabled: false
  log_dir: "logs"
  log_file: "gpu_checker.log"
  log_max_size: 1048576  # 1MB
  log_backup_count: 3

Command Line Options

Override any configuration option via command line:

# Enable debug logging
python main.py --debug.enabled

# Override SSH settings
python main.py --ssh.username=other_user --ssh.key_path=~/.ssh/other_key

# Check specific targets
python main.py --targets gpu01 gpu02 special-server

πŸ”§ Advanced Usage

Custom Target Patterns

Generate targets using patterns:

patterns:
  - prefix: "compute"
    start: 1
    end: 100
    format: "{prefix}-{number:03d}"  # compute-001, compute-002, etc.

Per-Machine Credentials

Specify different credentials for specific machines:

individual:
  - host: "special-gpu"
    username: "admin"
    key_path: "~/.ssh/admin_key"

Debug Logging

Enable detailed logging for troubleshooting:

debug:
  enabled: true
  log_dir: "logs"
  log_file: "debug.log"

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

Original Contributors

Originally created as "some awful, brittle code to check GPU status of multiple machines at a given host address through an SSH jumpnode."

Special thanks to:

Libraries

  • Rich for the beautiful terminal interface
  • asyncssh for async SSH support
  • PyYAML for configuration management

πŸ” Similar Projects

The following projects are similar in spirit, but only support a single machine:

⚠️ Known Issues

  • SSH connection might timeout on very slow networks
  • Some older NVIDIA drivers might return incompatible XML formats

πŸ“Š Roadmap

  • Add support for AMD GPUs
  • Implement process name filtering
  • Add web interface
  • Support for custom SSH config files

About

Fast asynchronous GPU monitoring tool across multiple machines through SSH

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages