GPU Troubleshooting for LLM/PyTorch (H100/B200)

Every command shows the exact CLI, why we run it, what each flag does, and how to fix the issues it reveals, with sources/test cases proving each fix works.

Legend: Healthy · Unhealthy · Command
P‑states (performance states) — quick sanity check

Lower number = higher clocks. For compute-bound work, expect P0 or P2. P8 during heavy work means the GPU is idling, starved by a bottleneck, or throttling.

ELI5: Gears on a car. P2 is fast driving; P8 is idling. If you’re racing but stuck idling, fix heat, power, or the pit crew (data pipeline).
Reference: NVIDIA P‑state docs (NVAPI) — docs.nvidia.com (P‑State)
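For the same check from inside a script, here is a minimal sketch using the NVML Python bindings (pip install nvidia-ml-py); the single GPU at index 0 is an assumption.

# Minimal P-state sanity check via NVML (pip install nvidia-ml-py).
# Assumes one GPU at index 0; adjust as needed.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

pstate = pynvml.nvmlDeviceGetPerformanceState(handle)   # 0..15, lower = faster
util = pynvml.nvmlDeviceGetUtilizationRates(handle)     # .gpu / .memory in %
print(f"P{pstate}  GPU util {util.gpu}%  mem util {util.memory}%")

# Under load we expect P0/P2 and high GPU util; P8 with low util points
# to idling, data starvation, or throttling.
pynvml.nvmlShutdown()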
1) nvidia-smi -l 1 — live dashboard
Why run this?

To watch utilization, memory, temperature, and P‑state change live. Confirms if the GPU is actually busy and whether it’s throttling.

Flags
  • -l 1: loop/refresh every 1 second (use -lms for ms granularity).
Command (Healthy)
nvidia-smi -l 1
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 555.32.00    Driver 555.32.00    CUDA 12.4                        |
|-------------------------------+----------------------+----------------------|
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage | GPU-Util  Compute M.        |
|===============================+======================+======================|
|  0  NVIDIA H100         On   | 00000000:65:00.0 Off |                  Off |
| 60%   75C   P2  300W / 500W | 20,480MiB / 80,000MiB |  98%   Default |
+-----------------------------------------------------------------------------+
| 5467  C+G   python train_llm.py                          20,432MiB          |
+-----------------------------------------------------------------------------+

High util, P2, temps below 80°C → normal training.

Command (Unhealthy)
nvidia-smi -l 1
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 555.32.00    Driver 555.32.00    CUDA 12.4                        |
|-------------------------------+----------------------+----------------------|
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage | GPU-Util  Compute M.        |
|===============================+======================+======================|
|  0  NVIDIA H100         On   | 00000000:65:00.0 Off |                  Off |
| 90%   95C   P8  150W / 500W | 78,800MiB / 80,000MiB |  12% Throttled |
+-----------------------------------------------------------------------------+

Low util + hot + P8 → thermal/power throttling or data starvation.

How to fix
  • Lower temps: clean filters, increase airflow/fan curve, reduce ambient temp.
  • Avoid VRAM spill: enable AMP, reduce batch size, gradient checkpointing.
  • Speed up input pipeline: increase num_workers, pin_memory, pre-tokenize.
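To capture evidence while training runs, the sketch below logs the same fields the live dashboard shows into a CSV in a background thread. The field names are standard --query-gpu properties; the output path and interval are assumptions.

# Log live-dashboard fields to a CSV while training runs (hypothetical path).
import subprocess, threading, time

FIELDS = "timestamp,pstate,utilization.gpu,memory.used,temperature.gpu,power.draw"
stop = threading.Event()

def poll_gpu(path="gpu_log.csv", interval_s=1.0):
    with open(path, "a") as f:
        while not stop.is_set():
            row = subprocess.check_output(
                ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
                text=True)
            f.write(row)
            time.sleep(interval_s)

threading.Thread(target=poll_gpu, daemon=True).start()
# ... training loop runs here ...
stop.set()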
Sources / Test cases
2) nvidia-smi -q -d TEMPERATURE,PERFORMANCE — throttle reasons
Why run this?

To see if the GPU is slowing itself due to heat or power and which limiter is active.

Flags
  • -q: query detailed fields; -d: restrict to specific domains (TEMPERATURE, PERFORMANCE).
Command (Healthy)
nvidia-smi -q -d TEMPERATURE,PERFORMANCE
GPU 00000000:65:00.0
    Temperature
        GPU Current Temp            : 73 C
        GPU Slowdown Temp           : 90 C
    Performance State               : P2
    Clocks Throttle Reasons
        Power Limit                 : Not Active
        Thermal Slowdown            : Not Active
        HW Slowdown                 : Not Active
Command (Unhealthy)
nvidia-smi -q -d TEMPERATURE,PERFORMANCE
GPU 00000000:65:00.0
    Temperature
        GPU Current Temp            : 91 C
        GPU Slowdown Temp           : 90 C
    Performance State               : P8
    Clocks Throttle Reasons
        Power Limit                 : Active
        Thermal Slowdown            : Active
        HW Slowdown                 : Active
How to fix
  • Reduce thermal load (airflow, fans, datacenter inlet temp); check paste/contacts.
  • Verify provider power policies; request higher cap if allowed.
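The same throttle reasons can be read programmatically and logged alongside training. A minimal sketch with the NVML Python bindings follows; the exact constant names assume a recent pynvml, and device index 0 is an assumption.

# Decode active clock-throttle reasons via NVML (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
mask = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)   # bitmask

reasons = {
    "SwPowerCap":        pynvml.nvmlClocksThrottleReasonSwPowerCap,
    "SwThermalSlowdown": pynvml.nvmlClocksThrottleReasonSwThermalSlowdown,
    "HwThermalSlowdown": pynvml.nvmlClocksThrottleReasonHwThermalSlowdown,
    "HwSlowdown":        pynvml.nvmlClocksThrottleReasonHwSlowdown,
    "GpuIdle":           pynvml.nvmlClocksThrottleReasonGpuIdle,
}
active = [name for name, bit in reasons.items() if mask & bit]
print("Active throttle reasons:", active or ["none"])
pynvml.nvmlShutdown()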
Sources / Test cases
3) nvidia-smi -q -d POWER — power caps & draw
Why run this?

To confirm if a power limiter is holding clocks down (common in dense racks or provider-imposed caps).

Flags
  • -q: detailed query; -d POWER: power-only fields (draw, limit, throttle reason).
Command (Healthy)
nvidia-smi -q -d POWER
Power Readings
    Power Draw                  : 312.43 W
    Power Limit                 : 500.00 W
    Performance State           : P2
Clocks Throttle Reasons
    Power Limit                 : Not Active
Command (Unhealthy)
nvidia-smi -q -d POWER
Power Readings
    Power Draw                  : 148.12 W
    Power Limit                 : 500.00 W
    Performance State           : P8
Clocks Throttle Reasons
    Power Limit                 : Active
How to fix
  • Request/raise power limit within spec (sudo nvidia-smi -pl <watts>) if permitted.
  • Reduce spikes with AMP and stable batch times; avoid CPU stalls that drop clocks.
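To confirm the limiter from inside a job, a small sketch below compares instantaneous draw to the enforced limit via NVML; device index 0 and the 97% threshold are assumptions.

# Check whether power draw is pinned at the enforced limit during a step.
# (pip install nvidia-ml-py; device index 0 is an assumption.)
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

draw_w  = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0          # mW -> W
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0  # mW -> W

print(f"power: {draw_w:.0f} W / limit {limit_w:.0f} W")
if draw_w > 0.97 * limit_w:
    print("-> draw is at the cap; clocks are likely power-limited")
pynvml.nvmlShutdown()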
Sources / Test cases
4) nvidia-smi topo -m — PCIe/NVLink topology
Why run this?

To detect slow host↔GPU links (e.g., Gen3x4) that starve the GPU when transferring batches or model shards.

Flags
  • topo -m: prints the interconnect matrix (PCIe hops such as PIX/PXB/PHB/SYS, NVLink count) plus CPU and NUMA affinity; the current PCIe generation/width is listed under nvidia-smi -q (GPU Link Info).
Command (Healthy)
nvidia-smi topo -m
        GPU0    CPU Affinity
GPU0     X      0-63
        NUMA Affinity 0
        Link 0: PCIe Gen4x16
        NVLink: 0 links active (single-GPU)
Command (Unhealthy)
nvidia-smi topo -m
        GPU0    CPU Affinity
GPU0     X      0-63
        NUMA Affinity 0
        Link 0: PCIe Gen3x4
        NVLink: 0 links active
How to fix
  • Select instances with Gen4x16 (or NVLink for multi-GPU). On-prem: ensure GPU sits in a x16 Gen4/5 slot; update BIOS.
  • Pin data-loader/training processes to the CPU cores and NUMA node local to the GPU (e.g., numactl, taskset) for better host↔device bandwidth; the sketch below measures what the link actually delivers.
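A rough host-to-device bandwidth check in plain PyTorch (buffer size and iteration count are arbitrary choices); a Gen4 x16 link should land around 20+ GiB/s, a Gen3 x4 link in the low single digits.

# Rough host->device bandwidth check with a pinned buffer (PyTorch only).
import torch

assert torch.cuda.is_available()
n_bytes = 1 << 30                                   # 1 GiB payload
host = torch.empty(n_bytes, dtype=torch.uint8, pin_memory=True)
dev  = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end   = torch.cuda.Event(enable_timing=True)
dev.copy_(host, non_blocking=True)                  # warm-up
torch.cuda.synchronize()

start.record()
for _ in range(10):
    dev.copy_(host, non_blocking=True)
end.record()
torch.cuda.synchronize()

gib_per_s = 10 * n_bytes / (1 << 30) / (start.elapsed_time(end) / 1000)
print(f"H2D bandwidth ~ {gib_per_s:.1f} GiB/s")
# Expect roughly 20+ GiB/s on Gen4 x16; low single digits suggest a narrow/slow link.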
Sources / Test cases
5) nvtop — live process/util monitor
Why run this?

Instant feedback on GPU util vs CPU util to spot data-loader starvation (GPU idle while CPU 90–100%).

Flags
  • No flags needed; install via package manager and run.
Command (Healthy)
sudo apt-get update && sudo apt-get install -y nvtop
nvtop
GPU0 [█████████████████████████████████████████████] 97%
Mem : 20.0/80.0 GiB  Temp: 75°C  Power: 310W
CPU: 18%   MEM: 22.3 GiB / 128.0 GiB
Command (Unhealthy)
nvtop
GPU0 [███                                     ] 12%
Mem : 78.8/80.0 GiB  Temp: 90°C  Power: 150W (throttled)
CPU: 98%   MEM: 118.7 GiB / 128.0 GiB
How to fix
  • Increase num_workers, enable pin_memory, and prefetch to keep GPU fed.
  • Pre-tokenize/serialize datasets to avoid CPU tokenization in the training loop.
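To quantify starvation, the sketch below times how long each step waits on the DataLoader versus GPU compute; it assumes loader, model, optimizer, and device are defined as in the PyTorch examples below.

# Measure data-wait vs GPU-compute time per step (illustrative names).
import time, torch

it = iter(loader)
for step in range(50):
    t0 = time.perf_counter()
    xb, yb = next(it)                    # time spent waiting on the DataLoader
    t1 = time.perf_counter()

    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    loss = model(xb, yb)
    loss.backward()
    optimizer.step(); optimizer.zero_grad(set_to_none=True)
    torch.cuda.synchronize()             # so compute time is measured fully
    t2 = time.perf_counter()

    print(f"step {step}: data {t1 - t0:.3f}s  compute {t2 - t1:.3f}s")
# If data time dominates, the GPU is starved: raise num_workers, pin memory,
# and pre-tokenize (see the PyTorch fixes below).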
Sources / Test cases
PyTorch — proven fixes with code & why
A) DataLoader throughput
Why

GPU underutilized & CPU pegged → data loader is the bottleneck. Multiple workers and pinned memory reduce H2D copy latency.

Healthy: High-throughput
from torch.utils.data import DataLoader

loader = DataLoader(dataset,
    batch_size=64, shuffle=True,
    num_workers=8,              # parallel CPU workers
    pin_memory=True,            # page-locked host RAM for fast copies
    persistent_workers=True,    # keep workers alive
    prefetch_factor=4)
for xb, yb in loader:
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
Unhealthy: Starved GPU
loader = DataLoader(dataset,
    batch_size=64, shuffle=True,
    num_workers=0,         # main thread only
    pin_memory=False)      # pageable → slower copies
for xb, yb in loader:
    xb = xb.to(device)  # blocking copies
    yb = yb.to(device)
Sources / Test cases
B) Mixed Precision (AMP)
Why

Cuts memory footprint (~½ for activations) and speeds up kernels on Tensor Cores; reduces VRAM pressure that can cause spills and throttling.

Healthy: AMP on
scaler = torch.cuda.amp.GradScaler()
for xb, yb in loader:
    xb, yb = xb.to(device, non_blocking=True), yb.to(device, non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = model(xb, yb)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
Unhealthy: FP32-only
for xb, yb in loader:
    xb = xb.to(device); yb = yb.to(device)
    optimizer.zero_grad()
    # No autocast/GradScaler → higher VRAM & slower
    loss = model(xb, yb)
    loss.backward()
    optimizer.step()
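On H100/B200, bf16 autocast is a common alternative: the same activation-memory savings, and no GradScaler because bf16 keeps fp32's exponent range. A minimal sketch, assuming the same model/loader/optimizer as above:

# bf16 autocast variant (no GradScaler needed).
for xb, yb in loader:
    xb, yb = xb.to(device, non_blocking=True), yb.to(device, non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(xb, yb)
    loss.backward()
    optimizer.step()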
Sources / Test cases
C) Gradient Checkpointing
Why

Trades extra compute for lower memory; prevents VRAM spill that kills utilization.

Healthy: Fit bigger models
from torch.utils.checkpoint import checkpoint
def fwd(x): return block(x)
out = checkpoint(fwd, x, use_reentrant=False)  # non-reentrant variant; recomputes activations in backward
Unhealthy: OOM & spill
out = block(x)  # every layer's activations kept → CUDA OOM, or offload/spill that drops GPU util to 1–10%
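For a model built as a stack of blocks, recent PyTorch (2.x) also offers checkpoint_sequential, which checkpoints the stack in segments; a minimal sketch with placeholder layer sizes:

# Checkpoint a stack of transformer blocks in segments (placeholder sizes).
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

blocks = nn.Sequential(*[
    nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)
    for _ in range(24)]).to("cuda")
x = torch.randn(8, 4096, 1024, device="cuda", requires_grad=True)

# Only segment boundaries keep activations; the rest are recomputed in backward.
out = checkpoint_sequential(blocks, 4, x, use_reentrant=False)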
D) Stable tensor shapes
Why

Stable shapes improve kernel selection and caching (cuDNN autotune, torch.compile reuse); erratic shapes trigger re-tuning, recompilation, and allocator churn.

Healthy: Pad to fixed L
batch = pad_to_length(batch, L=4096)
mask  = build_mask(L=4096)
out   = model(batch, mask)
Unhealthy: Varying shapes
out = model(batch, mask)  # different lengths each step → slower
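A concrete version of the pad-to-fixed-length idea, assuming raw token-ID lists and a pad ID of 0 (both placeholders), plus the same model/device as above:

# Pad variable-length token sequences to a fixed length L and build a mask.
# pad_id=0, L=4096, and list_of_token_id_lists are placeholders.
import torch

def pad_batch(seqs, L=4096, pad_id=0):
    ids  = torch.full((len(seqs), L), pad_id, dtype=torch.long)
    mask = torch.zeros(len(seqs), L, dtype=torch.bool)
    for i, s in enumerate(seqs):
        n = min(len(s), L)
        ids[i, :n]  = torch.as_tensor(s[:n], dtype=torch.long)
        mask[i, :n] = True
    return ids, mask

batch, mask = pad_batch(list_of_token_id_lists)   # every step sees shape (B, 4096)
out = model(batch.to(device), mask.to(device))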
E) Tokenizer throughput
Why

Tokenization is CPU-bound; doing it in the training loop starves the GPU.

Healthy: Fast tokenizer + pre-tokenize
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)  # fast (Rust-backed) tokenizer
# Pre-tokenize to files; training reads token IDs directly.
Unhealthy: Slow regex tokenizer inline
# Python regex tokenizer inside the training loop
# GPU waits while CPU chews text.
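One way to pre-tokenize offline is the Hugging Face datasets library; a sketch follows, where the tokenizer name, dataset name, text column, and output path are all placeholders.

# Offline pre-tokenization sketch with Hugging Face datasets
# ("gpt2", "my_corpus", "text", and the output path are placeholders).
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
raw = load_dataset("my_corpus", split="train")

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=4096)

tokenized = raw.map(tokenize, batched=True, num_proc=8,
                    remove_columns=raw.column_names)
tokenized.save_to_disk("data/tokenized")   # training then loads token IDs from here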
Profiling / Proof tools