name: dl-gpu
description: Comprehensive monitoring and management for dl GPUs using the dlsmi command-line tool. Use when users need to monitor GPU status (temperature, memory, utilization, power), query GPU information and metrics, check running processes on GPUs, perform GPU management operations (reset, topology, P2P), troubleshoot GPU issues, or optimize GPU performance for AI/ML workloads. Includes commands for deep learning training monitoring, model inference, and performance debugging.
dl GPU
Monitor and manage dl GPUs using the dlsmi command-line interface.
Quick Start
# View the status of all GPUs
dlsmi
# Query specific GPU metrics
dlsmi --query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv
# Monitor continuously (update every 1 second)
dlsmi -l 1
# Check running processes
dlsmi --query-compute-apps=pid,used_memory,gpu_bus_id --format=csv
Common Tasks
Monitor GPU Status
Basic status overview:
dlsmi
Detailed metrics query:
dlsmi --query-gpu=index,name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.free,memory.total,power.draw,pstate --format=csv
Continuous monitoring:
# Update every second
dlsmi -l 1
# Update every 500 ms
dlsmi -lms 500
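When continuous output is redirected to a file, it helps to know when each sample was taken. The sketch below is one way to do that with standard shell tools only; the function name `stamp` is hypothetical and not part of dlsmi.

```shell
# stamp: prefix every line read from stdin with the local wall-clock time.
# Hypothetical helper; it uses only `date` and `printf`, no dlsmi options.
stamp() {
  while IFS= read -r line; do
    printf '%s,%s\n' "$(date +%Y-%m-%dT%H:%M:%S)" "$line"
  done
}

# Usage (assumes dlsmi is on PATH):
#   dlsmi --query-gpu=index,temperature.gpu,utilization.gpu \
#         --format=csv,noheader,nounits -l 1 | stamp >> gpu.log
```

The timestamp is taken when the line is read, so it can lag the actual sample slightly if the pipe is buffered.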
Find Specific Information
Hottest GPU:
dlsmi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits | sort -t',' -k2 -nr | head -1
GPU with most free memory:
dlsmi --query-gpu=index,memory.free --format=csv,noheader,nounits | sort -t',' -k2 -nr | head -1
GPU with highest utilization:
dlsmi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits | sort -t',' -k2 -nr | head -1
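The three lookups above each run a separate query. As a rough alternative, a single query can be scanned once with awk. This is a minimal sketch: the function name `gpu_extremes` is hypothetical, and it assumes rows arrive as `index, temperature.gpu, memory.free, utilization.gpu` with no header and no units.

```shell
# gpu_extremes: read "index, temperature, free memory, utilization" rows on
# stdin and report which GPU is hottest, has the most free memory, and is
# busiest. Hypothetical helper; on ties the first GPU seen wins.
gpu_extremes() {
  awk -F',' '
    {
      # "+ 0" forces numeric conversion and tolerates the space after each comma
      idx = $1 + 0; temp = $2 + 0; free = $3 + 0; util = $4 + 0
      if (NR == 1 || temp > max_temp) { max_temp = temp; hot = idx }
      if (NR == 1 || free > max_free) { max_free = free; roomy = idx }
      if (NR == 1 || util > max_util) { max_util = util; busy = idx }
    }
    END {
      if (NR == 0) exit 1   # no rows: nothing to report
      printf "hottest=%d most_free=%d busiest=%d\n", hot, roomy, busy
    }
  '
}

# Usage (assumes dlsmi is on PATH):
#   dlsmi --query-gpu=index,temperature.gpu,memory.free,utilization.gpu \
#         --format=csv,noheader,nounits | gpu_extremes
```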
Check Running Processes
# List all compute processes
dlsmi --query-compute-apps=pid,used_memory,gpu_bus_id --format=csv
# Include process names
dlsmi --query-compute-apps=pid,used_memory,gpu_bus_id,name --format=csv
Query Specific Metrics
Temperature:
dlsmi --query-gpu=temperature.gpu --format=csv,noheader,nounits
Memory usage:
dlsmi --query-gpu=memory.used,memory.free,memory.total --format=csv
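Raw used and total figures are easier to compare across GPUs as a percentage. The following sketch computes that from the query output; the function name `mem_percent` is hypothetical, and it assumes rows of `index, memory.used, memory.total` produced with `csv,noheader,nounits`.

```shell
# mem_percent: read "index, memory.used, memory.total" rows on stdin and
# print the used fraction per GPU. Hypothetical helper; rows whose total is
# zero or non-numeric are skipped to avoid dividing by zero.
mem_percent() {
  awk -F',' '
    ($3 + 0) > 0 {
      printf "GPU %d: %.1f%% used\n", $1 + 0, (($2 + 0) / ($3 + 0)) * 100
    }
  '
}

# Usage (assumes dlsmi is on PATH):
#   dlsmi --query-gpu=index,memory.used,memory.total \
#         --format=csv,noheader,nounits | mem_percent
```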
Power consumption:
dlsmi --query-gpu=power.draw,power.limit --format=csv,noheader,nounits
Clock speeds:
dlsmi --query-gpu=clocks.current.sm,clocks.current.memory --format=csv,noheader,nounits
PCIe link status:
dlsmi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv,noheader,nounits
Target Specific GPUs
# Query GPU 0 only
dlsmi -i 0 --query-gpu=temperature.gpu --format=csv,noheader,nounits
# Query GPUs 0, 1, and 2
dlsmi -i 0,1,2 --query-gpu=name,temperature.gpu --format=csv
# Query by PCI bus ID
dlsmi -i 0000:01:00.0 --query-gpu=name --format=csv,noheader
Deep Learning Monitoring
Training workload monitoring:
dlsmi --query-gpu=index,utilization.gpu,utilization.memory,memory.used,temperature.gpu,power.draw --format=csv -l 1
Memory tracking for large models:
dlsmi --query-gpu=index,memory.used,memory.free,memory.total --format=csv
Performance debugging:
dlsmi --query-gpu=clocks.current.sm,clocks.current.memory,utilization.gpu,temperature.gpu,power.draw --format=csv
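A single sample says little about a training run; an average over many samples is usually more telling. The sketch below averages utilization per GPU from a log captured with `-l 1`. The function name `avg_util` and the log file name are hypothetical, and the input is assumed to be `index, utilization.gpu` rows without header or units.

```shell
# avg_util: read repeated "index, utilization" samples on stdin and print the
# mean utilization per GPU, ordered by GPU index. Hypothetical helper.
avg_util() {
  awk -F',' '
    { g = $1 + 0; sum[g] += $2 + 0; n[g]++ }
    END { for (g in sum) printf "%d %.1f\n", g, sum[g] / n[g] }
  ' | sort -n   # awk array order is unspecified, so sort by index
}

# Usage (assumes dlsmi is on PATH; util.log is a hypothetical file name):
#   dlsmi --query-gpu=index,utilization.gpu \
#         --format=csv,noheader,nounits -l 1 > util.log   # stop with Ctrl-C
#   avg_util < util.log
```

A persistently low average during training often points at a bottleneck outside the GPU, such as data loading, though that diagnosis is beyond what this helper can confirm.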
Advanced Operations
Topology and P2P
# Display GPU topology
dlsmi topo -m
# Display P2P capabilities
dlsmi topo -p2p
# Find the GPUs nearest to GPU 0
dlsmi topo -i 0 -n 0
GPU Reset
# Reset a specific GPU (use with caution: stop any workloads running on it first)
sudo dlsmi -i 0 -r
# Reset and rescan
sudo dlsmi -i 0 -rr
Multi-Instance GPU (MIG)
# Check MIG status
dlsmi --query-gpu=mig.mode.current --format=csv,noheader
# Enable MIG (requires root)
sudo dlsmi -i 0 -mig 1
Troubleshooting
GPU Not Detected
# Check whether the GPU is visible on the PCI bus
lspci | grep -i co-processor
# Check whether the kernel driver module is loaded
lsmod | grep denglin
# Check whether the driver module is built and installed via DKMS
dkms status denglin
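The three checks above can be chained so that the first failing layer is obvious at a glance. This is a minimal sketch; `run_check` is a hypothetical wrapper, not part of dlsmi or the driver package.

```shell
# run_check: run a command quietly and print OK or FAIL with a label.
# Hypothetical helper; returns non-zero when the command fails so callers
# can stop at the first broken layer.
run_check() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# Usage, from hardware up to the management tool (pipelines need `sh -c`):
#   run_check "PCI device visible"  sh -c 'lspci | grep -qi co-processor' &&
#   run_check "kernel module loaded" sh -c 'lsmod | grep -q denglin' &&
#   run_check "dlsmi responds"       dlsmi
```

Ordering the checks from the PCI bus upward means a failure at one step makes the later steps unnecessary.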
High Temperature
# Check temperature vs limit
dlsmi --query-gpu=index,temperature.gpu,temperature.gpu.tlimit --format=csv
# Check for thermal throttling
dlsmi --query-gpu=clocks_event_reasons.hw_thermal_slowdown --format=csv,noheader
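For unattended checks, a simple threshold filter can flag GPUs that are running warm. The sketch below is one possible approach; the function name `hot_gpus` is hypothetical, and the threshold is a value chosen by the operator, not a limit reported by the hardware.

```shell
# hot_gpus: read "index, temperature" rows on stdin and print the GPUs at or
# above the threshold given as the first argument. Hypothetical helper; the
# threshold is an operator-chosen number, not a vendor limit.
hot_gpus() {
  awk -F',' -v limit="$1" '
    ($2 + 0) >= (limit + 0) {
      printf "GPU %d at %d C (threshold %d C)\n", $1 + 0, $2 + 0, limit + 0
    }
  '
}

# Usage (assumes dlsmi is on PATH; 85 is an arbitrary example threshold):
#   dlsmi --query-gpu=index,temperature.gpu \
#         --format=csv,noheader,nounits | hot_gpus 85
```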
Out of Memory
# Check memory usage by process
dlsmi --query-compute-apps=pid,used_memory,gpu_bus_id --format=csv
# Check total memory usage
dlsmi --query-gpu=index,memory.used,memory.free,memory.total --format=csv
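When several processes share a GPU, summing their memory per device shows where the pressure is. The sketch below groups the per-process output by bus ID. The function name `mem_by_gpu` is hypothetical, and it assumes that `nounits` is honored for `--query-compute-apps` so that `used_memory` is a bare number.

```shell
# mem_by_gpu: read "pid, used_memory, gpu_bus_id" rows on stdin and print the
# total used memory and process count per bus ID. Hypothetical helper; the
# memory unit is whatever the tool reports and is not converted.
mem_by_gpu() {
  awk -F',' '
    {
      bus = $3
      gsub(/^[ \t]+|[ \t]+$/, "", bus)   # trim spaces around the bus ID
      total[bus] += $2 + 0
      procs[bus]++
    }
    END { for (b in total) printf "%s used=%d procs=%d\n", b, total[b], procs[b] }
  ' | sort   # awk array order is unspecified, so sort by bus ID
}

# Usage (assumes dlsmi is on PATH):
#   dlsmi --query-compute-apps=pid,used_memory,gpu_bus_id \
#         --format=csv,noheader,nounits | mem_by_gpu
```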
Performance Issues
# Check for clock throttling reasons
dlsmi --query-gpu=clocks_event_reasons.gpu_idle,clocks_event_reasons.hw_slowdown,clocks_event_reasons.sw_power_cap --format=csv
# Check performance state
dlsmi --query-gpu=pstate --format=csv,noheader,nounits
Resources
For detailed command reference and advanced options, see COMMANDS.md.
For use case examples and workflows, see EXAMPLES.md.