GPU#

For deep learning, scientific computing, and graphics workloads, nvidia-smi is the essential tool for monitoring and managing NVIDIA GPUs. From checking memory usage to tuning power limits and managing multi-GPU configurations, these commands help maximize GPU utilization and diagnose performance bottlenecks.

An interactive demo script is available at src/bash/gpu.sh to help you experiment with the concepts covered in this cheatsheet.

./src/bash/gpu.sh           # Run all demos
./src/bash/gpu.sh query     # Run query format demo
./src/bash/gpu.sh --help    # Show all available sections

Basic GPU Information#

Commands for viewing graphics card details and driver information.

lspci | grep -iE 'vga|3d controller'  # List graphics cards (compute-only GPUs show as "3D controller")
lspci | grep -i nvidia              # NVIDIA devices
lspci -v -s $(lspci | grep -i vga | cut -d' ' -f1)  # Detailed info for the first VGA device

# OpenGL info
glxinfo | grep -i "renderer\|vendor\|version"

# Vulkan info
vulkaninfo --summary

nvidia-smi Basics#

The nvidia-smi tool provides monitoring and management capabilities for NVIDIA GPUs.

nvidia-smi                          # GPU status overview
nvidia-smi -L                       # List GPUs
nvidia-smi -q                       # Full query (all details)
nvidia-smi -q -d MEMORY             # Memory details
nvidia-smi -q -d UTILIZATION        # Utilization details
nvidia-smi -q -d TEMPERATURE        # Temperature details
nvidia-smi -q -d POWER              # Power details
nvidia-smi -q -d CLOCK              # Clock speeds
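
The full banner is meant for reading; when a script needs just one value, a query field is easier to parse. A minimal sketch pulling only the driver version (driver_version is a documented query field; the query syntax is covered below):

nvidia-smi --query-gpu=driver_version --format=csv,noheader   # Prints only the driver version string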

nvidia-smi Monitoring#

Commands for continuous GPU monitoring and process tracking.

# Continuous monitoring
nvidia-smi -l 1                     # Update every 1 second
nvidia-smi dmon -s u                # Device monitoring (utilization)
nvidia-smi dmon -s p                # Power and temperature monitoring
nvidia-smi dmon -s t                # PCIe throughput monitoring
nvidia-smi dmon -s m                # Memory monitoring

# Process monitoring
nvidia-smi pmon -s u                # Process utilization
nvidia-smi pmon -s m                # Process memory

# Watch specific metrics
watch -n 1 nvidia-smi --query-gpu=utilization.gpu,utilization.memory,temperature.gpu,power.draw --format=csv
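
For longer runs it helps to log samples to a file instead of watching live. A minimal sketch, assuming a 5-second interval and an arbitrary log file name:

# Append a timestamped sample every 5 seconds (Ctrl-C to stop)
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,temperature.gpu \
    --format=csv,noheader -l 5 >> gpu_log.csv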

nvidia-smi Query Format#

Custom queries allow extracting specific GPU metrics in various formats.

# Custom query output
nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv
nvidia-smi --query-gpu=gpu_name,gpu_bus_id,vbios_version --format=csv
nvidia-smi --query-gpu=temperature.gpu,fan.speed,power.draw,power.limit --format=csv
nvidia-smi --query-gpu=clocks.gr,clocks.mem,clocks.sm --format=csv

# Query specific GPU
nvidia-smi -i 0 --query-gpu=name,memory.used --format=csv
nvidia-smi -i 0,1 --query-gpu=name,utilization.gpu --format=csv

# Process queries
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

# Available query options
nvidia-smi --help-query-gpu         # List all GPU query options
nvidia-smi --help-query-compute-apps  # List process query options
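
The csv,noheader,nounits format is what makes these queries scriptable. A sketch of a simple memory alert loop; the 90% threshold and message wording are arbitrary:

nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits |
while IFS=', ' read -r idx used total; do
    # nounits strips the "MiB" suffix, so the values are plain integers
    if [ $((used * 100 / total)) -ge 90 ]; then
        echo "GPU $idx memory above 90% ($used / $total MiB)"
    fi
done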

Common Query Fields#

# GPU identification
name, gpu_name, gpu_bus_id, gpu_serial, gpu_uuid, vbios_version

# Memory
memory.total, memory.used, memory.free

# Utilization
utilization.gpu, utilization.memory

# Temperature and power
temperature.gpu, fan.speed, power.draw, power.limit

# Clocks
clocks.gr, clocks.mem, clocks.sm, clocks.max.gr, clocks.max.mem

# Performance
pstate
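
These fields can be combined freely in a single query. For example, a one-line health snapshot per GPU (the field selection here is only an illustration):

nvidia-smi --query-gpu=index,name,pstate,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv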

nvidia-smi Management#

Commands for configuring GPU settings and performance tuning.

# Set persistence mode (keeps the driver loaded; cuts CUDA start-up latency)
nvidia-smi -pm 1                    # Enable
nvidia-smi -pm 0                    # Disable

# Set power limit (watts)
nvidia-smi -pl 250                  # Set to 250W

# Set GPU clocks
nvidia-smi -lgc 1200,1800           # Lock graphics clock range
nvidia-smi -rgc                     # Reset graphics clocks

# Set compute mode
nvidia-smi -c 0                     # Default
nvidia-smi -c 1                     # Exclusive thread (deprecated)
nvidia-smi -c 2                     # Prohibited
nvidia-smi -c 3                     # Exclusive process

# Reset GPU (requires root and an idle GPU; use -i to target a specific GPU)
nvidia-smi -r                       # Reset GPU

# Enable/disable ECC (change takes effect after the next reboot)
nvidia-smi -e 1                     # Enable ECC
nvidia-smi -e 0                     # Disable ECC
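
Most of these settings are applied per GPU and require root. A sketch that enables persistence mode and an example 250 W power cap on every GPU; keep the value within the limits shown by nvidia-smi -q -d POWER:

for idx in $(nvidia-smi --query-gpu=index --format=csv,noheader); do
    sudo nvidia-smi -i "$idx" -pm 1      # Persistence mode on
    sudo nvidia-smi -i "$idx" -pl 250    # Example power cap in watts
done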

Multi-GPU Systems#

Commands for managing multiple GPUs, topology, and interconnects.

# Topology
nvidia-smi topo -m                  # GPU topology matrix
nvidia-smi topo -p2p r              # P2P read capability between GPUs

# NVLink status
nvidia-smi nvlink -s               # NVLink status
nvidia-smi nvlink -c               # NVLink capabilities

# MIG (Multi-Instance GPU) - A100/H100
nvidia-smi mig -lgip               # List GPU instance profiles
nvidia-smi mig -cgi 9,9            # Create GPU instances
nvidia-smi mig -dci                # Destroy compute instances
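
A rough MIG workflow sketch: MIG mode must be enabled per GPU before instances can be created, and profile ID 9 is only the example from above (check -lgip for the IDs your GPU offers):

sudo nvidia-smi -i 0 -mig 1        # Enable MIG mode on GPU 0 (may require a GPU reset)
sudo nvidia-smi mig -cgi 9,9 -C    # Create GPU instances plus their default compute instances
nvidia-smi mig -lgi                # List GPU instances
nvidia-smi mig -lci                # List compute instances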

CUDA Environment#

Environment variables and commands for CUDA configuration.

# Check CUDA version
nvcc --version
cat /usr/local/cuda/version.txt     # Older toolkits
cat /usr/local/cuda/version.json    # Newer toolkits (version.txt was dropped around CUDA 11.1)

# CUDA environment variables
export CUDA_VISIBLE_DEVICES=0       # Use only GPU 0
export CUDA_VISIBLE_DEVICES=0,1     # Use GPUs 0 and 1
export CUDA_VISIBLE_DEVICES=""      # Hide all GPUs
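
# The variable can also be set per command instead of exported, e.g. to pin
# one run to GPU 1 (train.py is a placeholder for your own script)
CUDA_VISIBLE_DEVICES=1 python train.py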

# Check CUDA device properties
nvidia-smi -q -d COMPUTE

# cuDNN version (cudnn_version.h ships with cuDNN 8+; older releases keep the defines in cudnn.h)
grep -A 2 CUDNN_MAJOR /usr/local/cuda/include/cudnn_version.h