杨艺韬 · 2026-04-28 · 11,322 characters · about 23 minutes

Chapter 21: Profiler and Performance Tuning

“Optimization without measurement is religion.”

(An old engineering proverb that applies just as well to tuning PyTorch training)

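Key points

torch.profiler.profile is PyTorch's built-in profiler; a single trace can record CPU function calls, CUDA kernel execution, memory allocation, and autograd events
Underneath are Kineto (PyTorch's own tracing library) and NVIDIA's CUPTI: kernel-level data comes from the CUDA Profiling Tools Interface
Three main outputs: a text summary (key_averages()), a chrome trace (drag it into chrome://tracing to view the timeline), and the TensorBoard plugin (friendlier visualization)
Memory Snapshot: torch.cuda.memory._record_memory_history() captures the stack of every alloc/free; visualize it with pytorch.org/memory_viz (Chapter 4, §4.11)
The Schedule API lets the profiler "record a few steps, skip a few steps, repeat" during multi-step training, avoiding 100 GB traces
record_function is the user-level annotation: insert markers around your own code so the trace has readable names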

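21.1 How to diagnose training performance problems

Training performance problems fall roughly into four categories:

Symptom	Likely cause
Low GPU utilization (< 80%)	DataLoader cannot keep up / communication bottleneck / host-side bottleneck
High GPU utilization but low throughput	Wrong dtype for the operator / too many small kernel launches (scheduling overhead)
OOM (out of GPU memory)	Too many activations / fragmentation / state not released
Training suddenly slows down	A step triggers an implicit sync (such as `.item()`) / CUDA Graph invalidated

The profiler is the standard tool for diagnosing all four.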

21.2 A minimal profiler example

from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    for step, batch in enumerate(loader):
        loss = model(batch).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        # Count steps ourselves: prof.step_num only advances when prof.step() is called
        if step >= 4:
            break

print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=20))
prof.export_chrome_trace('trace.json')

The output table looks roughly like this:

---------------------------  ------------  ------------  ----------
Name                         CUDA total    CUDA avg      CPU total
---------------------------  ------------  ------------  ----------
aten::mm                     1.234s        12.34ms       0.123s
aten::add                    0.567s        5.67ms        0.056s
aten::layer_norm             0.234s        2.34ms        0.023s
...

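Key switches:

record_shapes=True: records the input shapes of every operator call, so you can total up the time spent by ops with identical shapes
profile_memory=True: records the byte count of every alloc/free
with_stack=True: attaches the Python stack so you can see which line of code invoked an operator (higher overhead, so profiling runs slower)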

21.3 chrome trace: timeline visualization

prof.export_chrome_trace('trace.json') writes a JSON file. Drag it into chrome://tracing/ or https://ui.perfetto.dev/ to see:

CPU thread:  [Python] [aten::mm] [aten::relu] [Python] [aten::add] ...
                ↓ launch       ↓ launch
CUDA stream: ........ [mm kernel] [relu kernel] ........ [add kernel]

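The horizontal axis is time; the vertical axis lists threads and streams. Each rectangle is one event, and hovering over it shows details (duration / args).

How to read it:

Large blank stretches on the GPU stream → the CPU cannot feed the GPU fast enough; suspect a slow DataLoader or slow Python code
Dense runs of tiny dispatcher events on the CPU → too many small operator calls; consider fusing them or using torch.compile
Gaps between kernels → synchronization such as cudaStreamSynchronize (look for implicit syncs from .item() / .cpu())
Unusually many memcpy H2D / D2H events → data is not pinned, or tensors move between devices too often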
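The first and third checks in the list above can be quantified rather than eyeballed. The sketch below is a minimal, standard-library-only illustration: it takes the duration events of one stream (dicts with `ts` and `dur` in microseconds, as in the Trace Event Format) and reports the idle fraction and the individual gaps. The function name `gpu_idle_gaps` and the sample events are hypothetical, not part of any PyTorch API.

```python
def gpu_idle_gaps(kernel_events, min_gap_us=0):
    """Return (idle_fraction, gaps) for the duration events of ONE stream.

    kernel_events: dicts with "ts" (start) and "dur" (length), both in microseconds.
    gaps: list of (gap_start, gap_end) pairs longer than min_gap_us.
    """
    intervals = sorted((e["ts"], e["ts"] + e["dur"]) for e in kernel_events)
    if not intervals:
        return 0.0, []

    span_start, busy_until = intervals[0]
    idle_us = 0
    gaps = []
    for start, end in intervals[1:]:
        if start > busy_until:                 # nothing was running in between
            idle_us += start - busy_until
            if start - busy_until > min_gap_us:
                gaps.append((busy_until, start))
        busy_until = max(busy_until, end)      # tolerate overlapping events

    span_us = busy_until - span_start
    idle_fraction = idle_us / span_us if span_us else 0.0
    return idle_fraction, gaps


# Hypothetical events from a single CUDA stream
kernels = [
    {"name": "mm kernel", "ts": 0, "dur": 100},
    {"name": "relu kernel", "ts": 100, "dur": 50},
    {"name": "add kernel", "ts": 400, "dur": 100},
]
idle_fraction, gaps = gpu_idle_gaps(kernels)
print(f"idle {idle_fraction:.0%}, gaps: {gaps}")
```

A high idle fraction with a few long gaps points toward input or host-side stalls; many short gaps point toward launch overhead or frequent syncs.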

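21.4 Kineto: the data-collection layer underneath

PyTorch does not implement the low-level collection logic inside the profiler itself; it relies on Kineto, a standalone tracing library maintained under the PyTorch organization and bundled with PyTorch. Kineto in turn uses NVIDIA's CUPTI (CUDA Profiling Tools Interface) to obtain:

Launch and completion times for every CUDA kernel
H2D / D2H memcpy times
NVTX range markers
Hardware counters such as SM utilization and DRAM bandwidth

PyTorch's profiler_kineto.cpp feeds the Kineto data into PyTorch's event timeline and merges it with CPU-side dispatcher calls and autograd events into one unified trace.

CUPTI's overhead is not negligible: training typically runs 10-30% slower with the profiler on. Do not leave the profiler enabled permanently in production code; use a schedule to record a few steps at a time.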

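21.5 Schedule API: record a few steps, skip a few steps

Naively profiling a whole epoch produces trace files of tens of GB, which are unusable in practice. The Schedule API avoids this:

from torch.profiler import schedule

my_schedule = schedule(
    skip_first=1,          # skip the first step entirely (one-off startup cost)
    wait=1,                # then idle for 1 step without recording
    warmup=1,              # trace 1 warmup step and discard its results
    active=3,              # record 3 steps for real
    repeat=2,              # run the wait/warmup/active cycle twice
)

with profile(
    activities=[...],
    schedule=my_schedule,
    on_trace_ready=lambda p: p.export_chrome_trace(f'/tmp/trace_{p.step_num}.json'),
) as prof:
    for step, batch in enumerate(loader):
        # training loop body
        prof.step()    # must be called once per step so the schedule advances

prof.step() advances the profiler's state machine, which moves through SKIP → WAIT → WARMUP → RECORD → .... on_trace_ready is invoked each time a cycle's active phase finishes, and that is where the export happens. Recording in samples like this keeps the trace size under control.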
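To see which steps actually get recorded, the following is a pure-Python model of the schedule semantics described above. It is a sketch for reasoning about the parameters, not the library's code; the function name `phase_for_step` and the string phase names are chosen here for illustration.

```python
def phase_for_step(step, *, skip_first=0, wait=0, warmup=0, active=1, repeat=0):
    """Model of which phase a given step index falls into.

    Returns one of "NONE", "WARMUP", "RECORD", "RECORD_AND_SAVE".
    repeat=0 means the cycle repeats for as long as training runs.
    """
    if step < skip_first:
        return "NONE"
    step -= skip_first
    cycle = wait + warmup + active
    if repeat > 0 and step // cycle >= repeat:
        return "NONE"                      # all requested cycles are done
    pos = step % cycle
    if pos < wait:
        return "NONE"
    if pos < wait + warmup:
        return "WARMUP"
    # The last active step of a cycle also triggers the trace hand-off
    return "RECORD" if pos < cycle - 1 else "RECORD_AND_SAVE"


# Same parameters as the schedule in the text
for s in range(12):
    print(s, phase_for_step(s, skip_first=1, wait=1, warmup=1, active=3, repeat=2))
```

With these parameters only steps 3-5 and 8-10 are recorded, so a run of any length produces two small traces.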

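21.6 TensorBoard plugin

PyTorch offers a TensorBoard integration, shipped as the separate package torch_tb_profiler:

pip install torch_tb_profiler
tensorboard --logdir=/tmp/profiler_output

The log directory must contain traces written by the profiler, for example with on_trace_ready=torch.profiler.tensorboard_trace_handler('/tmp/profiler_output'). Open the browser to see the GPU Kernel View, Distributed View, Memory View, and Trace View. These go further than a raw chrome trace: the plugin computes the share of GPU idle time, the top kernels, and a memory timeline for you. For production tuning, the TB plugin is usually more efficient than reading a chrome trace directly.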

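21.7 Memory Snapshot: the go-to tool for GPU memory debugging

As covered in Chapter 4, §4.11:

torch.cuda.memory._record_memory_history(max_entries=100000)

# run a few training steps
for step, batch in enumerate(loader):
    train_step(batch)
    if step >= 5:
        break

torch.cuda.memory._dump_snapshot('mem.pickle')
torch.cuda.memory._record_memory_history(enabled=None)   # stop recording

Drag mem.pickle into https://pytorch.org/memory_viz (the official PyTorch viewer) to see:

The time, size, and call stack of every alloc
The time of every free
GPU memory usage over time
Which allocations are long-lived and which are short-lived
Fragmentation

This is the key to diagnosing OOMs and memory growth. Chapter 4, §4.12.5 has the full diagnostic workflow.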

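21.8 record_function: user-level annotations

By default the profiler only sees ATen operators. To put a marker around your own code section:

from torch.profiler import record_function

with record_function("my_data_preprocess"):
    data = preprocess(raw)

with record_function("my_attention_block"):
    out = my_attention(data)

record_function creates a parent event in the trace that encloses every child event beneath it. The custom name shows up in chrome trace / TB, which makes the trace far easier to read. For complex pipelines, annotating the key stages with record_function is strongly recommended.

NVTX ranges are the equivalent for NVIDIA's own tooling (they are visible in NVIDIA Nsight Systems):

torch.cuda.nvtx.range_push("my_section")
do_work()
torch.cuda.nvtx.range_pop()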

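21.9 Distributed View: communication analysis for DDP/FSDP

In multi-rank training, each rank records its own trace. Point TensorBoard at the folder holding every rank's trace and open the Distributed View to see:

A side-by-side comparison of each rank's timeline
The time distribution of collective communication
How long each rank waits in AllReduce (the "waiting for the slowest rank" effect)
The communication / computation overlap ratio

This view is how you find bottlenecks when tuning DDP / FSDP. "Which rank is slow?" is the first question in distributed diagnosis, and the distributed view answers it directly.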

21.9.5 `_ExperimentalConfig`：高级 profiler 选项

torch/csrc/profiler/python/init.cpp:368 注册的 _ExperimentalConfig 暴露 12 个高级开关：

字段	用途
`profiler_metrics`	CUPTI 硬件计数器列表（如 `sm__cycles_active.avg.pct_of_peak_sustained_elapsed`），让 profiler 进 CUPTI profiling mode 拿 GPU 占用率 / DRAM 带宽等
`profiler_measure_per_kernel`	硬件指标按 kernel 测还是整段测
`verbose`	trace 含完整 Python call stack
`performance_events`	进一步定制要测的事件
`enable_cuda_sync_events`	把 CUDA stream / event 同步操作也放进 trace（默认关）—— 调试隐式同步用
`adjust_profiler_step`	让 profiler.step 时长与父 Python 事件对齐
`disable_external_correlation`	关闭 CPU-GPU 事件关联
`profile_all_threads`	多线程 profile（默认只 main thread）
`capture_overload_names`	trace 里出现 `aten::add.Tensor` 而不只是 `aten::add`
`record_python_gc_info`	把 Python GC 事件加入 trace（诊断 GC 卡顿）
`expose_kineto_event_metadata`	暴露 Kineto 事件原始 metadata 到 Python 端
`custom_profiler_config`	自定义 backend（如某厂商自家硬件）的配置字符串

from torch.profiler import _ExperimentalConfig, profile

config = _ExperimentalConfig(
    enable_cuda_sync_events=True,         # 看隐式同步
    capture_overload_names=True,           # 区分 aten::add.Tensor / .Scalar
    record_python_gc_info=True,             # 找 GC pause
)

with profile(activities=[...], experimental_config=config) as prof:
    train_step()

诊断高级问题（如训练偶尔卡顿）时这些开关价值很高 —— 能看到默认 trace 里被隐藏的 GC pause、stream 同步、overload 切换等”非典型”事件。

21.9.6 chrome trace 的 JSON schema

prof.export_chrome_trace('trace.json') 输出符合 Trace Event Format（Google Chrome / Perfetto 共用）的 JSON：

{
  "traceEvents": [
    {"name": "aten::add", "ph": "X", "ts": 12345, "dur": 8,
     "tid": 0, "pid": 12345, "args": {"shape": "[4,4]", "dtype": "float"}},
    {"name": "ProfilerStep#1", "ph": "X", "ts": 100, "dur": 50000, "tid": 0},
    {"name": "Memcpy HtoD", "ph": "X", "ts": 200, "dur": 30,
     "tid": 99, "args": {"bytes": 4096}}
  ]
}

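Each event has six core fields: name / ph (X = duration, B/E = begin/end pair, i = instant) / ts (start time in microseconds) / dur (duration in microseconds) / tid (thread id; for CUDA work it identifies the stream) / args (extra attributes). pid, also present in the example, identifies the process or device the track belongs to.

The tid layout described for PyTorch profiler traces (exact values vary between PyTorch / Kineto versions, so confirm them against your own trace):

0~10000: CPU threads
100000~: CUDA streams (CUDA:0 stream:0 → tid 100000, CUDA:1 stream:0 → 200000)
200000+: memcpy streams / pin_memory thread

Knowing how tracks are numbered lets you grep all events of one stream out of a large trace instead of scrolling in the viewer.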
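Instead of relying on a particular tid numbering, a script can group events by whatever (pid, tid) pairs the trace contains. The following is a minimal sketch using only the standard library; `duration_by_track` and `events_on_track` are hypothetical helper names and the sample trace is made up for illustration.

```python
import json
from collections import defaultdict


def duration_by_track(trace):
    """Sum the dur of all duration ("X") events per (pid, tid) track."""
    totals = defaultdict(int)
    for event in trace.get("traceEvents", []):
        if event.get("ph") == "X":
            totals[(event.get("pid"), event.get("tid"))] += event.get("dur", 0)
    return dict(totals)


def events_on_track(trace, pid, tid):
    """Return the names of duration events on one track, ordered by start time."""
    selected = [
        e for e in trace.get("traceEvents", [])
        if e.get("ph") == "X" and e.get("pid") == pid and e.get("tid") == tid
    ]
    return [e["name"] for e in sorted(selected, key=lambda e: e["ts"])]


# Made-up trace; a real one would be loaded with json.load(open("trace.json"))
sample = json.loads("""
{"traceEvents": [
  {"name": "aten::add",   "ph": "X", "ts": 300, "dur": 8,  "pid": 1, "tid": 1},
  {"name": "aten::mm",    "ph": "X", "ts": 100, "dur": 20, "pid": 1, "tid": 1},
  {"name": "mm kernel",   "ph": "X", "ts": 130, "dur": 90, "pid": 0, "tid": 7},
  {"name": "thread_name", "ph": "M", "pid": 1, "tid": 1, "args": {"name": "main"}}
]}
""")
print(duration_by_track(sample))
print(events_on_track(sample, pid=1, tid=1))
```

Running `duration_by_track` first shows which tracks exist in a given trace, after which a single track can be pulled out for closer inspection.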

21.9.7 `record_function` 的 C++ 实现

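On the Python side, with record_function("foo"): maps to the C++ class at::RecordFunction (aten/src/ATen/record_function.h). It does the following:

On construction, it calls the registered start callbacks (the profiler hooks in here to record the begin time)
On destruction, it calls the registered end callbacks (the profiler records the end time)
The whole range is tied to the current thread through c10::ThreadLocalDebugInfo

On the dispatcher's main path, every operator call constructs a RecordFunction (if any callbacks are registered); Chapter 5, §5.3.1 mentioned the RECORD_FUNCTION macro in Dispatcher::call. This is how the profiler "sees every operator": RecordFunction records each op at the dispatcher level.

The cost: with the profiler on, every op pays for one extra RecordFunction construction (about 50-100 ns). This is a core source of the profiler's own 10-30% overhead. When the profiler is off in production code, the callback list is empty and RecordFunction is nearly free (construction checks that no callbacks exist and returns immediately).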

21.9.8 CUPTI 硬件计数器：从 kernel 时长走到 SM 利用率

profile() 默认只记 kernel 时长。要看更深的硬件指标（如 SM 占用率、DRAM 带宽），用 _ExperimentalConfig.profiler_metrics：

from torch.profiler import _ExperimentalConfig

config = _ExperimentalConfig(
    profiler_metrics=[
        "sm__cycles_active.avg.pct_of_peak_sustained_elapsed",       # SM 活跃率
        "smsp__sass_thread_inst_executed_op_fadd_pred_on.sum",        # FP add 总数
        "dram__bytes.sum",                                              # DRAM 读写字节
        "lts__t_sectors.sum.per_second",                                # L2 cache traffic
    ],
    profiler_measure_per_kernel=True,
)

CUPTI（CUDA Profiling Tools Interface）暴露了上千个 metric，分几大类：

类别	代表 metric	看什么
Compute	`sm__cycles_active`, `smsp__inst_executed`	SM 是否在算 vs idle
Memory	`dram__bytes`, `lts__t_sectors`	DRAM/L2 traffic
Latency	`gpu__time_active.avg`, `sm__warps_active`	warp 占用率（latency-hiding 程度）
Tensor Core	`sm__inst_executed_pipe_tensor`	TC 是否被用

实战分析路径：

flowchart TB
    Slow[kernel 执行慢]
    Slow --> Q1{SM 利用率 > 80%?}
    Q1 -->|否| MB[memory bound:<br/>DRAM 带宽是否打满?]
    Q1 -->|是| CB[compute bound:<br/>是否用 Tensor Core?]
    MB -->|是| OK1[已极限, 看能否减少 traffic<br/>fuse / 量化]
    MB -->|否| Latency[latency bound:<br/>warp occupancy 太低]
    Latency --> Block[调 block size / num_warps]
    CB -->|否| Force[force Tensor Core<br/>用对齐 shape / fp16 mm]
    CB -->|是| Try[已极限, 考虑算法改进]

    style MB fill:#fef3c7
    style CB fill:#dcfce7

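Worked example: a GEMM kernel on an H100 takes 1 ms. The metrics show 60% SM utilization and 30% Tensor Core utilization, so the kernel is limited on the compute side with the Tensor Cores badly under-used. The input shape is [4097, 768], and 4097 is not a multiple of 16. After padding to [4112, 768] (4112 = 16 × 257), Tensor Core utilization rises to 90% and the kernel drops to 0.6 ms.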
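The padding arithmetic in this example is easy to get wrong by one. A small sketch, assuming the multiple of 16 used in the text (the alignment that actually matters depends on dtype and GPU generation); `round_up` and `padded_shape` are hypothetical helper names.

```python
def round_up(n, multiple=16):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def padded_shape(shape, multiple=16):
    """Round every dimension of a shape up to the given multiple."""
    return tuple(round_up(dim, multiple) for dim in shape)


print(padded_shape((4097, 768)))   # 768 is already aligned, 4097 is not
```

The padded rows must be filled with zeros (or masked out) so they do not change the result of the GEMM.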

理解 CUPTI 让你从”猜”过渡到”测”。每个性能问题都有具体 metric 暴露根因，Nsight Compute / nsys 是更专业的 viewer。但 PyTorch profiler 内置就够 80% 调优场景。

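21.9.9 profile × torch.compile: seeing fused kernels for what they are

torch.compile fuses multiple ATen operators into a single Triton kernel. In a profile such a kernel has a name like triton_poi_fused_add_mul_relu_0, so the name itself tells you which ops were fused.

import torch
from torch.profiler import profile, ProfilerActivity

@torch.compile
def f(x, w):
    return torch.relu(x @ w + 1.0).mul(2.0)

# x and w are CUDA tensors prepared by the caller
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    f(x, w)

print(prof.key_averages().table(sort_by='cuda_time_total'))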

输出（精简）：

Name                                CUDA total
triton_poi_fused_add_mul_relu_1     0.234 ms      <- fused 三个 op 进 1 个 kernel
triton_per_fused_addmm_0            0.567 ms      <- fused matmul + bias add

对比不开 compile：

Name              CUDA total
aten::mm          0.523 ms
aten::add         0.078 ms
aten::relu        0.156 ms
aten::mul         0.043 ms

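The fusion is visible at a glance: four separate kernels become two fused ones. With the illustrative numbers above, the three pointwise ops drop from 0.277 ms (0.078 + 0.156 + 0.043) to a single 0.234 ms kernel, because intermediate results no longer round-trip through memory, which is where memory-bound operators gain. The matmul-side kernel is 0.567 ms versus 0.523 ms, so the overall totals in this example are both about 0.80 ms; larger end-to-end gains appear when pointwise work dominates. If an op you expected to fuse did not (it still appears as aten::xxx rather than triton_yyy), a fusion pattern was missed; look at output_code.py to find out why (§15.6.30).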

调优路径：

profile 看每个 fused kernel 时长 + 内含 op 名
找耗时大头 kernel
看 Inductor output_code.py 对应 Triton 源码
决定是否能进一步优化（如改 dtype、加 mark_dynamic、调 max_autotune）

这套流程让 compile 调优有迹可循。没有 profile 时 compile 是黑盒，开 profile 后每个决策都有数据支撑。

21.9.10 配合 py-spy：Python 端火焰图

profile 看 PyTorch 内部，但Python 端的瓶颈（如 Python 循环慢、numpy 操作多）profile 看不到。互补工具是 py-spy：

# 在训练进程跑起来后，单独终端
pip install py-spy
py-spy record --pid <pid> -o flame.svg --duration 60

输出 SVG 火焰图：每个矩形宽度 = 该函数 CPU 占用时间。互补 PyTorch profiler 的视角：

工具	看什么	不看什么
torch.profiler	ATen 算子 / CUDA kernel / Python 调用栈关联	纯 Python 循环细节
py-spy	Python 函数级 CPU 时间分布	GPU kernel
nsys	硬件级 timeline	高层 ATen 语义

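In practice: training GPU utilization was only 60%. The profile showed idle gaps on the GPU stream while the CPU side was busy. py-spy revealed a Python function calling np.concatenate thousands of times. Replacing it with a batched torch op removed the CPU bottleneck, and GPU utilization rose to 95%.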

py-spy 优势：

不需要修改代码：直接 attach 到 running process
低 overhead：~1% 性能影响
支持子进程：py-spy record --pid <pid> --subprocesses 自动跟 fork 出的 worker

生产建议把 py-spy 加进训练 toolkit，与 PyTorch profiler 配合用。一个看”PyTorch 内部”、一个看”Python 端”，共同覆盖完整调优空间。

21.9.11 真实案例：FSDP all_gather × forward overlap 调优

第 18 章 §18.5 提了 FSDP-2 的 all_gather × forward overlap。怎么验证 overlap 真的发生了？用 profiler。

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(3):
        out = fsdp_model(input)
        loss = out.sum()
        loss.backward()
        prof.step()

prof.export_chrome_trace('fsdp_trace.json')

打开 chrome trace，找到几条关键 stream：

default stream (CUDA:0:0)：跑 forward / backward 计算 kernel
comm stream (CUDA:0:1)：跑 NCCL all_gather / reduce_scatter
CPU thread：dispatcher 调用

理想情况：

default stream:   [layer0 fwd] [layer1 fwd] [layer2 fwd] ...
comm stream:      [ag layer1 ] [ag layer2 ] [ag layer3 ] ...
                       ↑ 与 layer0 同时跑   ↑ 与 layer1 同时跑

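If each [ag layerN+1] runs alongside [layerN fwd] and the default stream shows almost no gaps between layers, the overlap is working.

Finding 1: in one transformer training run, every layer had a 0.2 ms gap. Prefetching was not aggressive enough to hide the all_gather. In FSDP1 the relevant knobs are forward_prefetch=True and backward_prefetch=BackwardPrefetch.BACKWARD_PRE; FSDP2 exposes set_modules_to_forward_prefetch / set_modules_to_backward_prefetch on the wrapped modules. After prefetching one layer further ahead, the gap disappeared and each step ran 8% faster.

Finding 2: one all_gather on the comm stream took 30 ms while the others took 5 ms, because the same parameters were being resharded and unsharded repeatedly. The cause was an FSDP1 auto_wrap_policy that did not group one layer correctly, so that layer was unsharded again on every forward. Correcting the wrap fixed it.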

这种”profile 看 stream 关系”是 FSDP 调优最直接的工具。没看 trace 之前的 FSDP 调优都是猜测。

21.9.12 真实案例：Transformer Attention 性能优化

最经典的 profile-driven 优化案例：transformer attention 从”naive 实现”到”FlashAttention”的演进。每一步都用 profile 数据驱动。

baseline (naive attention)：

import math
import torch

def attention(q, k, v):
    d = q.size(-1)                                           # head dimension
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)          # [B, H, L, L]
    attn = torch.softmax(scores, dim=-1)
    return attn @ v

profile 显示（B=4, H=32, L=2048, d=128）：

Name                    CUDA total
aten::matmul             8.5 ms      <- Q @ K^T
aten::div                0.4 ms      <- /sqrt(d)
aten::softmax            3.2 ms      <- 含 exp + reduction + div
aten::matmul             8.3 ms      <- @ V
合计                     ~20.4 ms

memory: scores [4, 32, 2048, 2048] = 1 GB in fp16/bf16 (2 GB in fp32)  <- where intermediate memory blows up
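The size of the scores tensor follows directly from its shape. A short sketch of the arithmetic, with the element size as an explicit parameter since the text does not state the dtype; `attention_scores_bytes` is a hypothetical helper name.

```python
def attention_scores_bytes(batch, heads, seq_len, bytes_per_element):
    """Bytes needed to materialize the [B, H, L, L] attention scores tensor."""
    return batch * heads * seq_len * seq_len * bytes_per_element


GIB = 2 ** 30
fp16_bytes = attention_scores_bytes(4, 32, 2048, bytes_per_element=2)
fp32_bytes = attention_scores_bytes(4, 32, 2048, bytes_per_element=4)
print(f"fp16/bf16: {fp16_bytes / GIB:.1f} GiB, fp32: {fp32_bytes / GIB:.1f} GiB")
```

Because the size grows with the square of the sequence length, doubling L to 4096 quadruples these figures.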

优化 1：torch.compile 自动 fuse

@torch.compile 让 div 与 softmax 融合，节省一次 memory read/write。profile：~17 ms，节省 15%。但 1 GB scores tensor 仍存在。

优化 2：FlashAttention（手写算子）

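FlashAttention turns attention into a streaming computation over blocks:

for block_K, block_V in chunks(K, V):
    scores_block = Q @ block_K.T
    softmax_part = partial_softmax(scores_block)   # softmax over this block, rescaling the running max / sum
    output += softmax_part @ block_V

The full scores tensor is never materialized, so memory drops from 1 GB to a few tens of MB. Profile: ~5 ms, a 75% saving.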
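The step that makes block-wise processing possible is the online softmax: a running maximum and running sum are rescaled whenever a new block raises the maximum. The sketch below shows that idea in plain Python for a single query row with scalar values. It illustrates the numerical trick only and is not FlashAttention itself; the function names are chosen here for illustration.

```python
import math


def naive_attention_row(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def streaming_attention_row(scores, values, block=4):
    """Same result, but only one block of scores is looked at at a time."""
    running_max = float("-inf")
    running_sum = 0.0      # sum of exp(score - running_max) seen so far
    running_out = 0.0      # weighted sum of values, same scaling
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale what was accumulated under the old max (factor is 0.0 on the first block)
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        running_out *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum


scores = [0.1, 2.0, -1.0, 3.5, 0.7, 1.2, -0.3, 2.2, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
print(naive_attention_row(scores, values), streaming_attention_row(scores, values))
```

Only three running scalars per row are kept between blocks, which is why the full [L, L] scores matrix never needs to exist in memory.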

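PyTorch's F.scaled_dot_product_attention picks a backend automatically:

out = F.scaled_dot_product_attention(q, k, v)
# in 2.x this dispatches to a fused backend (FlashAttention-2 or memory-efficient attention) when the inputs allow, otherwise to the math fallback

Comparing profiles shows the difference between the three implementations immediately. This measure → optimize → measure loop is standard practice in performance engineering.

This case shows that PyTorch's built-in SDPA is not a magic API: it is the result of years of research and engineering, and profiling is how each optimization is verified to actually help.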

21.9.13 memory_viz 实战阅读

memory_viz 是 PyTorch 官方显存可视化（pytorch.org/memory_viz）。把 dump 出的 pickle 拖进去看到的核心视图：

Memory Timeline（横轴时间、纵轴显存）：

显存
 ↑
80GB ─────────────────────────────────────
70GB        ╱╲          ╱╲         ╱╲
60GB       ╱  ╲        ╱  ╲       ╱  ╲
50GB ─────╱    ╲──────╱    ╲─────╱    ╲──── (训练 step 边界)
40GB ────fwd───bwd───fwd───bwd───fwd───bwd
30GB ────                                 (常驻 weight + optim state)
                                           → 时间

每个尖峰对应”forward 累积 activation”，回落对应”backward 释放 activation”。常驻部分是 weight + optim state。

实战诊断：

尖峰持续上涨（每 step 比上一 step 高几 GB）：activation 没释放、内存泄漏。检查代码里有没有把 tensor 加到 list / 缓存里
尖峰位置不对：第 2 个 step 的尖峰本应与第 1 个一致，如果更高 → caching allocator 碎片化
常驻部分大：weight 太大 → 用 FSDP / 量化降低；optim state 大 → 用 8-bit Adam

Allocations View：

按 size 排序所有 alloc。实战看 top 10：

size      count   total       fqn / stack
2.0 GB    1       2.0 GB      transformer.layer3.attention scores
1.0 GB    32      32.0 GB     transformer.layer*.activation
512 MB    64      32.0 GB     ...

最大 alloc 一目了然。如果 attention scores 2.0 GB 出现 → naive attention，应换 FlashAttention。

Stack View：

每个 alloc 关联到 Python stack trace。鼠标悬停看哪个文件 / 哪一行触发的 alloc。直接定位到代码。

理解 memory_viz 三个视图让你在 OOM 时快速定位根因。第 4 章 §4.11 给了机制，本章给”怎么读”。两章配合让显存调试不再玄学。

21.9.14 .item() / .cpu() / .tolist()：隐式同步陷阱

trace 里看到 cudaStreamSynchronize 频繁出现 → 几乎肯定是隐式同步。常见触发点：

loss = compute_loss(out, target)

# 隐式同步！
print(loss)                    # str() 触发 .item()
loss_value = loss.item()       # 显式触发
losses.append(loss.item())     # 训练循环里累积 metric

# 也是隐式同步
if loss < 1e-4: break          # bool() 触发 .item()
preds = out.cpu().numpy()      # GPU → CPU + Python conversion

每次 .item() 让 CPU 等 GPU stream 完成 → 打破 async pipeline。在 hot loop 里调几百次就崩了。

修复模式：

# 错误：每 step 同步
losses = []
for batch in loader:
    loss = train_step(batch)
    losses.append(loss.item())   # ← 每 step 同步

# 正确：把 metric 累积在 GPU 上, epoch 末再 .item()
loss_sum_gpu = torch.zeros(1, device='cuda')
for batch in loader:
    loss = train_step(batch)
    loss_sum_gpu += loss.detach()
mean_loss = (loss_sum_gpu / len(loader)).item()   # 只同步 1 次

profile 验证：修复前 trace 每 step 之间有 5-10ms gap、修复后 gap 消失。整体训练加速 10-20%。

更隐蔽的：

tensor.numel() 不同步（返回 Python int，meta 信息）
tensor.shape 不同步（同上）
if tensor:（隐式调 bool()）同步
tensor.tolist() / tensor.numpy() 同步
assert tensor.allclose(...) 同步

“避免在 hot loop 里调 .item() / .cpu()” 是 PyTorch 训练性能 No.1 经验法则。理解 trace 上的 cudaStreamSynchronize 就是这套法则的可视化体现。

21.9.15 NVTX range + Nsight Systems 实战

torch.cuda.nvtx 比 record_function 更专业（在 Nsight Systems 看到）：

import torch.cuda.nvtx as nvtx

for step, batch in enumerate(loader):
    nvtx.range_push(f"step_{step}")          # 嵌套层 1
    nvtx.range_push("forward")                # 嵌套层 2
    out = model(batch)
    nvtx.range_pop()                           # forward end

    nvtx.range_push("backward")
    out.sum().backward()
    nvtx.range_pop()                           # backward end

    nvtx.range_push("optimizer")
    optimizer.step()
    nvtx.range_pop()                           # optimizer end

    nvtx.range_pop()                           # step end

用 nsys 录制：

nsys profile --trace=cuda,nvtx,osrt -o report python train.py
nsys-ui report.nsys-rep    # older Nsight Systems releases wrote report.qdrep

Nsight Systems 比 chrome trace 强大得多：

硬件级 timeline：含 GPU SM 占用、DRAM 带宽、PCIe traffic 实时曲线
NVTX range 层级：你定义的 step / forward / backward 自动嵌套展开
stream 关系：跨 stream 依赖以箭头显示
分析报告：自动生成”top kernel by occupancy / latency”等

生产调优 LLM 训练时 nsys 是终极工具。chrome trace 适合快速看一眼、nsys 适合深入分析。

实战工作流：

训练异常 → torch.profiler 录 trace 看大概问题
怀疑 GPU 端瓶颈 → nsys 录详细 trace 看 SM/DRAM 利用率
怀疑 Python 端 → py-spy 看火焰图

三个工具组合用，能 cover 99% 性能问题。每个都精通需要时间，但理解各自定位让你在不同场景选对工具。

21.9.16 profile 数据后处理：自定义聚合分析

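prof.events() returns the full list of events, so you can write your own analysis:

from collections import defaultdict

from torch.profiler import DeviceType

events = prof.events()

# total device time per op name
op_total = defaultdict(int)
for e in events:
    if e.device_type == DeviceType.CUDA:
        op_total[e.name] += e.cuda_time_total

# aggregate by shape: same op, different input shapes, separate totals
# (input_shapes is only populated when the profile used record_shapes=True)
op_shape_total = defaultdict(int)
for e in events:
    key = (e.name, str(e.input_shapes))
    op_shape_total[key] += e.cuda_time_total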

实战聚合：

Top kernel by total time：哪个 op 占总时间最多
Top kernel by call count：哪个 op 调用次数最多（可能小但累积大）
Per-shape analysis：发现”某个 shape 的 mm 特别慢”
Time per layer：用 record_function 标 layer 名后聚合

这种”自定义分析”在大规模训练里价值很高。预定义 view（chrome trace / TB 插件）解决 80% 问题，剩下的需要自己写脚本聚合。

例子：发现某 GEMM kernel 总耗时占 40%，但 chrome trace 里它分散在几百次调用 → 写脚本聚合按 input shape：发现某个 shape [4097, 768] 比其他都慢 2x。原因找到 → padding 到 16 倍数。这种基于聚合 metric 的优化是 profile 的高级用法。

21.9.17 H2D / D2H 拷贝优化案例

profile 看到 Memcpy HtoD 占总时间 30%+ → 几乎肯定数据传输不正确。诊断 + 修复流程：

症状：

Name                         CUDA total
Memcpy HtoD (Pageable→Device) 250 ms     <- 占整 step 30%
aten::add                     50 ms
aten::mm                      300 ms

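Pageable→Device is the slow path that bypasses pinned memory: the driver must first stage the data through its own pinned buffer before the DMA can start, so the copy blocks the CPU and cannot overlap with compute.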

诊断步骤：

检查 DataLoader 配置：pin_memory=True？没开就立刻开
检查代码：tensor.cuda() vs tensor.cuda(non_blocking=True)？后者必须配 pin_memory 才有效
检查 source dtype：CPU tensor dtype 与 GPU 一致吗？不一致触发 implicit cast → 在 CPU 端做 cast 后才能 DMA
检查 copy 频率：是否每 step 多次 .cuda() 同一份数据？应该只 cuda 一次然后复用

修复案例：

# 错误：每 step pageable copy
for batch in loader:               # batch 在 CPU pageable memory
    batch = batch.cuda()           # 触发 pageable → device, 慢路径
    out = model(batch)

# 修复 1: pin_memory + non_blocking
loader = DataLoader(dataset, ..., pin_memory=True)
for batch in loader:
    batch = batch.cuda(non_blocking=True)    # pinned → device, 异步
    out = model(batch)

# Fix 2: reuse a preallocated GPU buffer instead of creating a new tensor with .cuda() every step
buffer_gpu = torch.empty(shape, device='cuda')
for batch_cpu in loader:
    buffer_gpu.copy_(batch_cpu, non_blocking=True)   # only fills in the new data
    out = model(buffer_gpu)

profile 修复后：

Name                       CUDA total
Memcpy HtoD (Pinned→Device) 50 ms     <- 快路径, async, 与计算 overlap
aten::add                    50 ms
aten::mm                     300 ms

H2D 时间从 250ms 降到 50ms（快路径），且与计算 overlap → 整 step 时间降 30%。这种”profile 看到问题 → 立刻修复 → profile 验证”是性能调优最爽的循环。

21.9.18 长跑训练的滚动 profile + 异常检测

生产 LLM 训练跑几天到几周，偶发慢 step 不易复现。解法：rolling profile + 异常检测。

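import time
from collections import deque

from torch.profiler import ProfilerActivity, profile, schedule

step_times = deque(maxlen=100)     # step times of the most recent 100 steps
profile_buffer = deque(maxlen=5)   # keep only the latest few samples in memory

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=100, warmup=2, active=3, repeat=10000),
    on_trace_ready=lambda p: profile_buffer.append((time.time(), p.events())),
) as prof:
    for step, batch in enumerate(loader):
        t0 = time.time()
        train_step(batch)
        step_time = time.time() - t0
        prof.step()

        # anomaly detection against the mean of the previous steps
        if len(step_times) == step_times.maxlen:
            mean = sum(step_times) / len(step_times)
            if step_time > mean * 2:
                # 2x slower than the recent mean → trigger a detailed dump
                # (dump_full_profile / send_alert are user-provided helpers)
                dump_full_profile(prof, batch_id=step)
                send_alert(f"slow step at step {step}: {step_time:.2f}s vs mean {mean:.2f}s")
        step_times.append(step_time)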

整套机制：

每 100 步 sample 一次 profile（schedule 控制）
保留最近几次 sample 在内存（不上传）
异常时立刻 dump + 上报
正常时 sample 数据可丢弃 / 抽样上传

实战：长跑 7 天训练有 3 次 slow step（30s vs 平均 5s），自动 dump trace 显示根因是 NFS 突然慢，pin_memory thread 等数据 30 秒。换本地 SSD cache 后修复。

理解这套 monitoring 模式让你能在生产主动发现问题，而非等到训练失败再回头查。是大模型 SRE 的标配能力。
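The detection half of this pattern does not depend on the profiler and can be tested on its own. Below is a minimal sketch of a rolling-mean detector using only the standard library; the class name `SlowStepDetector` and its parameters are hypothetical, and excluding flagged steps from the baseline is a design choice made here, not something the text prescribes.

```python
from collections import deque


class SlowStepDetector:
    """Flags a step whose time exceeds `factor` x the mean of recent normal steps."""

    def __init__(self, window=100, factor=2.0, min_history=10):
        self.history = deque(maxlen=window)
        self.factor = factor
        self.min_history = min_history

    def observe(self, step_time):
        """Record one step time and return True if it counts as slow."""
        is_slow = False
        if len(self.history) >= self.min_history:
            mean = sum(self.history) / len(self.history)
            is_slow = step_time > self.factor * mean
        if not is_slow:
            # Keep outliers out of the baseline so one spike does not mask the next
            self.history.append(step_time)
        return is_slow


detector = SlowStepDetector(window=100, factor=2.0, min_history=5)
timings = [5.0] * 10 + [30.0] + [5.1]
flags = [detector.observe(t) for t in timings]
print([i for i, slow in enumerate(flags) if slow])
```

In a training loop, a True return value is the point at which a trace dump and an alert would be triggered.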

21.9.19 Inductor output_code + profile 联合调试

profile 看到某 fused kernel 慢，怎么深入？联合 TORCH_COMPILE_DEBUG=1（§15.6.30）拿到 output_code，对照 profile 看每个 kernel 的实际行为。

工作流：

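import os
os.environ['TORCH_COMPILE_DEBUG'] = '1'   # set before importing torch to be safe

import torch
from torch.profiler import ProfilerActivity, profile

@torch.compile
def f(x, w):
    return torch.nn.functional.gelu(x @ w)

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    f(x, w)   # x, w: CUDA tensors prepared by the caller

# 1. the profile shows triton_per_fused_addmm_gelu_0 taking 5 ms
# 2. read torch_compile_debug/.../triton_kernel_0.py for the actual Triton code
# 3. read output_code.py to see how the wrapper calls it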

诊断：

profile 显示 kernel 5ms，但 input 仅 [128, 768] → 不应该这么慢
output_code 看 grid (128) → 只起了 128 个 block, GPU 大部分 SM 闲置
改用 max_autotune（试更多 block size），让 grid 增加 → kernel 降到 1.5ms

或者：

看 triton kernel 用了 tl.float32 而非 tl.float16（amp 没生效到这层）
检查发现 forward 没 wrap autocast → 加上后 kernel 快 1.8x

理解这种”profile + output_code”联合调试让你能深入到 Inductor 内部优化。两层信息缺一不可：profile 给”哪个 kernel 慢”、output_code 给”为什么慢”。

21.9.20 step time 异常检测的”指纹”

每个训练任务都有特定 step time 指纹（mean / variance / outlier 模式）。学会识别”指纹异常”：

graph TB
    subgraph Healthy[正常]
        H1[每 step 5.0 ± 0.1 s<br/>偶尔 5.5 s 是正常]
    end

    subgraph Pattern1[周期性突变]
        P1["每 100 step 出 5.5s"]
        P1 --> R1[Eval / ckpt 周期]
    end

    subgraph Pattern2[逐渐变慢]
        P2[step time 从 5s 涨到 8s]
        P2 --> R2[memory leak / 显存碎片化]
    end

    subgraph Pattern3[随机 spike]
        P3[偶尔 30s spike]
        P3 --> R3[NCCL straggler / IO hang]
    end

    subgraph Pattern4[loss 爆炸 + 慢]
        P4[loss spike + step time 长]
        P4 --> R4[GradScaler 减半重试 / inf 出现]
    end

    style R1 fill:#dcfce7
    style R2 fill:#fef3c7
    style R3 fill:#fee2e2
    style R4 fill:#fecaca

每种 pattern 对应不同根因，profile 能验证：

Pattern1：trace 显示 100 step 一次有 ckpt write 阻塞（async ckpt 没开）
Pattern2：memory_viz 显示常驻显存涨 1GB/step → 找代码里的 leak
Pattern3：distributed view 显示某 rank trace 在 AllReduce 上等 25s
Pattern4：trace 含 GradScaler.skip 事件、inf check 触发

把这个 pattern 表存为 SOP，新人遇到性能问题对照查 → 节省时间。“性能问题不是无规律的，每种症状有对应根因”。

21.9.21 多 rank trace 时序对齐：distributed view 的内部机制

torch_tb_profiler 的 distributed view 让多 rank trace 可视化对齐。关键问题：不同 rank 的 CPU 时钟不一致（NTP 同步精度只有几毫秒），怎么把它们的时间线对齐？

机制：

训练开始前所有 rank 调一次 torch.distributed.barrier()
barrier 结束瞬间所有 rank 拿当前 CPU 时间戳 t_local
broadcast rank 0 的 t_local 到所有 rank
每 rank 计算 offset = t_rank0 - t_local
后续 trace 的所有 timestamp 加上 offset → 对齐到 rank 0 的时间

精度：barrier 同步精度 ~微秒级（NCCL barrier 用 GPU 跑），加上时钟测量误差，整体精度 100us 内。足够用来分析”哪个 rank 慢几毫秒”。
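The offset arithmetic in the steps above fits in a few lines. The sketch below uses plain dictionaries in place of real traces and real collectives; `clock_offsets`, `align_events`, and `straggler_lag` are hypothetical helper names, and all timestamps are invented for illustration.

```python
def clock_offsets(barrier_ts_by_rank, reference_rank=0):
    """offset[rank] = t_reference - t_local, both taken right after the same barrier."""
    ref = barrier_ts_by_rank[reference_rank]
    return {rank: ref - t_local for rank, t_local in barrier_ts_by_rank.items()}


def align_events(events_by_rank, offsets):
    """Shift every event's ts onto the reference rank's clock (returns new dicts)."""
    return {
        rank: [{**e, "ts": e["ts"] + offsets[rank]} for e in events]
        for rank, events in events_by_rank.items()
    }


def straggler_lag(entry_ts_by_rank):
    """How much later than the earliest rank each rank entered a collective."""
    earliest = min(entry_ts_by_rank.values())
    return {rank: ts - earliest for rank, ts in entry_ts_by_rank.items()}


# Invented local timestamps (microseconds) read right after the barrier
offsets = clock_offsets({0: 1000, 1: 1250, 2: 900})

# Invented AllReduce entry events, still on each rank's local clock
raw = {
    0: [{"name": "AllReduce", "ts": 5000}],
    1: [{"name": "AllReduce", "ts": 5260}],
    2: [{"name": "AllReduce", "ts": 4950}],
}
aligned = align_events(raw, offsets)
lag = straggler_lag({rank: events[0]["ts"] for rank, events in aligned.items()})
print(offsets, lag)
```

On local clocks rank 1 appears 260 µs late and rank 2 appears early; after alignment the picture changes, which is why the offset step has to come before any straggler analysis.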

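With the aligned traces, the distributed view does several things:

Computes the straggler lag of each collective: the spread in the times at which ranks enter an AllReduce. A large spread means some rank is slow
Overlap ratio: the share of total step time spent in communication
Wait time per rank: how long each rank waited inside collectives (this exposes the slow rank)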

实战：8 卡训练，rank 5 持续比其他慢 2ms。distributed view 显示该 rank 在每次 AllReduce 都慢 5ms 进入。原因：rank 5 所在的 GPU 因 ECC error 频繁纠错，单卡稍慢。换 GPU 修复。

理解多 rank trace 同步机制让你看到 distributed profile 不是黑盒，是工程上”测量时间 + 校准 offset”的标准实现。

21.9.22 自定义 profiler callback：Hook 进 RecordFunction

§21.9.7 提了 RecordFunction 是 profiler 实现机制。用户可以注册自家 callback 看每次 op 调用：

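// C++ side
#include <ATen/record_function.h>
#include <iostream>
#include <memory>

at::addThreadLocalCallback(
    at::RecordFunctionCallback(
        // start callback: receives only the RecordFunction and may return a context object
        [](const at::RecordFunction& fn) -> std::unique_ptr<at::ObserverContext> {
            std::cout << "op: " << fn.name() << " num inputs: " << fn.inputs().size() << "\n";
            return nullptr;
        },
        // end callback: also receives the context returned by the start callback
        [](const at::RecordFunction& fn, at::ObserverContext* ctx) {
            std::cout << "op done: " << fn.name() << "\n";
        }
    ).needsInputs(true)
);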

Python 端用 torch._C._add_profiler_callback（实验性）。

应用场景：

统计每个 op 调用次数：自家训练 monitor，不依赖完整 profile
Kernel-level data integrity checks: report a hash of each GEMM input to verify that the training data is correct
自定义 trace 后端：把 trace 推到自家观测系统（如 Prometheus / 自家 timeline 平台）
op-level alert：某 op 输出 NaN 立刻报警（在 callback 里检测）

Meta、Google 等大厂都有自家”profile 系统”基于这套 callback 机制扩展。生产 LLM 训练监控通常超出 PyTorch 内置 profiler 能力，需要这层自定义。

理解 RecordFunction callback 让你看到 PyTorch 的 profile 不是封闭系统，是开放的扩展点。

21.9.23 训练 vs 推理 profile 的不同关注点

profile 训练与推理用的指标 + 分析方法完全不同：

维度	训练 profile	推理 profile
关注	step time、GPU util、显存、通信 overlap	latency、throughput、kv cache 利用率
主要瓶颈	通信 / DataLoader	kernel launch / kv cache miss
优化目标	总训练时间最短	单 query 延迟 / 集群 QPS
关键 metric	step/sec、samples/sec	tokens/sec、TTFT、ITL
调优工具	torch.profiler + chrome trace + py-spy	nsys + vLLM 内置 metric + Triton profiler

训练 profile 重点：

step time 分布（mean / p99 / outlier）
forward / backward / optim 各阶段占比
通信 / 计算 overlap 比例
显存峰值

推理 profile 重点：

TTFT (Time To First Token)：prefill 阶段时长
ITL (Inter-Token Latency)：decode 阶段每 token 时长
KV cache hit rate：prompt 复用情况
batched throughput：单 GPU tokens/sec
prefill / decode 调度均衡（continuous batching）

实战：vLLM 的内置 profile 远比 torch.profiler 实用 —— 它专门为推理服务设计、暴露每个 batch 的 prefill / decode 时间、调度状态等。

理解两条路径让你不会”用训练经验调推理”或反过来。两套 mindset、两套工具。

21.9.24 profiler API 演进时间线

PyTorch profiler 的关键节点：

版本	改进	意义
v0.4 (2018)	torch.autograd.profiler 引入	第一代
v1.5 (2020)	集成 Kineto / CUPTI	拿到 GPU kernel 数据
v1.8 (2021)	torch.profiler 新 API（替代 autograd.profiler）	更易用
v1.9 (2021)	TensorBoard 插件稳定	可视化大幅提升
v1.11 (2022)	Memory Snapshot / memory_viz	显存调试可视化
v2.0 (2023)	与 torch.compile 集成	看 fused kernel
v2.4 (2024)	Distributed View 完善	多 rank 协作分析
v2.6 (2025)	Trace Compare 工具	对比两次 profile 的差异
v2.10 (2025)	Async profile API	不阻塞训练的录制
v2.11 (2026)	API 稳定，生态完整	调优标配

整体趋势：

v0.x-v1.x：从”看时间”到”看完整 timeline”
v1.x-v2.x：从”单机”到”分布式”+ 与编译栈集成
v2.x：从”诊断”到”可观测平台”，与生产监控融合

理解时间线让你看 PyTorch 团队对 profile 的持续投入 —— 每个 minor 版本都有 profile 改进，因为 profile 是”性能优化的入口”，所有人都需要好用的 profile。这种”基础设施持续改进”是 PyTorch 生态成熟的标志之一。

21.9.25 大 trace 文件的处理：filter + split

profile 录 5 秒训练能产生 500MB-2GB JSON 文件。chrome tracing 加载 > 200MB 就卡顿。生产实战需要”切割 / 过滤”：

Filter 工具：

import json

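def filter_events(input_path, output_path,
                  min_duration=10,        # microseconds; drop shorter duration events
                  exclude_names=None):    # op names to drop entirely
    exclude_names = set(exclude_names or [])
    with open(input_path) as f:
        data = json.load(f)

    filtered = []
    for e in data['traceEvents']:
        # Only duration events ("X") carry dur; keep metadata, flow and instant
        # events, otherwise thread / stream names disappear from the viewer
        if e.get('ph') == 'X' and e.get('dur', 0) < min_duration:
            continue
        if e.get('name', '') in exclude_names:
            continue
        filtered.append(e)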

    data['traceEvents'] = filtered
    with open(output_path, 'w') as f:
        json.dump(data, f)

filter_events('big_trace.json', 'small_trace.json',
              min_duration=100,
              exclude_names=['aten::contiguous', 'aten::view'])

实战常见 filter：

min_duration=100us：去掉 dispatcher 内部小事件
过滤 aten::contiguous / aten::view 等”零成本”算子
过滤 [Scheduler] 等运行时内部事件

Split 工具：

def split_by_step(input_path, output_dir):
    """按 ProfilerStep 切分 trace, 每步独立文件"""
    with open(input_path) as f:
        data = json.load(f)

    steps = []
    current_step = []
    for e in sorted(data['traceEvents'], key=lambda x: x.get('ts', 0)):
        if e.get('name', '').startswith('ProfilerStep#'):
            if current_step:
                steps.append(current_step)
            current_step = [e]
        else:
            current_step.append(e)
    if current_step:
        steps.append(current_step)

    for i, step_events in enumerate(steps):
        out = {'traceEvents': step_events, 'displayTimeUnit': 'ns'}
        with open(f"{output_dir}/step_{i}.json", 'w') as f:
            json.dump(out, f)

切分后每文件 30-100MB，chrome tracing 顺畅打开。这套 filter / split 工具是生产 profile 必备脚本。

Perfetto 比 chrome tracing 处理大文件能力强得多（用 sqlite 后端），生产推荐用 Perfetto（ui.perfetto.dev）。

21.9.26 profiler 的内部架构图

把全章实现细节合起来看完整架构：

graph TB
    User["用户代码<br/>with profile():"]
    User --> PyAPI["Python: torch.profiler"]
    PyAPI --> CppPyBind["C++ pybind: profiler bindings"]

    CppPyBind --> RF[RecordFunction<br/>每 op 触发]
    CppPyBind --> Kineto[Kineto 库]

    RF --> CB[Profiler callback<br/>begin / end 时间]
    CB --> Buffer[Event Buffer<br/>thread-local]

    Kineto --> CUPTI[NVIDIA CUPTI]
    CUPTI --> Kernel[CUDA kernel events]
    CUPTI --> Mem[memcpy events]
    CUPTI --> HW[硬件 metric]

    Buffer --> Merge[合并 / 时间对齐]
    Kernel --> Merge
    Mem --> Merge
    HW --> Merge

    Merge --> Out{输出选择}
    Out --> Table[key_averages]
    Out --> Chrome[chrome trace JSON]
    Out --> TB[TensorBoard]
    Out --> Custom[用户自定义后处理]

    style RF fill:#fef3c7
    style Kineto fill:#dcfce7
    style Merge fill:#dbeafe

每层职责：

Python API：用户接口，控制 profile 生命周期
pybind：C++ 与 Python 桥接
RecordFunction：在 dispatcher 主路径捕获每个 op 调用
Kineto + CUPTI：拿 GPU 端原始数据
Merge 层：把 CPU events + GPU events 按时间对齐合并
Output：多种 view 共用同一份 merged data

理解架构让你看到 profile 不是单一组件，是 PyTorch + NVIDIA + 用户代码三方协作的产物。每层独立可扩展，让 profiler 持续演进而不破坏接口。

21.9.27 第三方工具综合对比

PyTorch profiler 不是唯一选择，根据场景选不同工具：

工具	强项	弱项	用法
torch.profiler	与 PyTorch 集成最好，看 ATen + GPU	大 trace 卡顿	训练调优 first choice
Nsight Systems (nsys)	硬件级 timeline，含 PCIe/DRAM	学习曲线陡	深入 GPU 性能
Nsight Compute (ncu)	kernel 级硬件分析（occupancy / cache）	慢，单 kernel 几秒	微调单个 kernel
py-spy	Python 火焰图，零代码改动	看不到 GPU	Python 端瓶颈
scalene	CPU + GPU + 内存综合分析	PyTorch 集成不完整	通用 Python profile
line_profiler	行级 CPU 时间	只 CPU	找慢 Python 函数
memory_profiler	行级内存	只 RAM 不看 GPU	找内存泄漏
viztracer	时间线 + 火焰图，开箱即用	不看 GPU	Python 应用整体分析
vLLM profile	推理 LLM 专用，看 KV cache	仅 vLLM	LLM 服务调优

实战工具组合：

训练慢：torch.profiler → 看不出再用 nsys
训练 OOM：memory_viz → 找不到再用 memory_profiler 看 CPU
推理慢：vLLM 内置 metric → 看不出再用 nsys
kernel 慢：torch.profiler 找哪个 kernel → ncu 分析硬件 metric

这种”分层使用”是性能工程师的标配。每个工具学精都需要时间，但理解定位让你能为问题选对武器。

21.9.28 训练监控 dashboard 集成

profile 是”事件驱动”调优工具，dashboard 是”持续观测”工具。生产 LLM 训练通常两者结合：

graph LR
    Train[训练进程]
    Train --> Metrics[训练 metric<br/>loss, lr, step time, throughput]
    Train --> System[系统 metric<br/>GPU util, mem, network]
    Train --> Profile[采样 profile<br/>每 N step 一次]

    Metrics --> Wandb[Weights & Biases]
    Metrics --> TB[TensorBoard]
    System --> Prom[Prometheus + Grafana]
    Profile --> Storage[S3 / 公司存储]

    Wandb --> Alert[异常检测<br/>step time > 2x mean → alert]
    Prom --> Alert
    Storage --> OnDemand[需要时下载 trace 分析]

    style Alert fill:#fee2e2

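Integration in practice:

import time

import torch
import wandb
from torch.profiler import profile, schedule, ProfilerActivity

# loader, optimizer, train_step and upload_trace are defined elsewhere.
wandb.init(project="llm-training")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=500, warmup=2, active=3, repeat=1000),
    on_trace_ready=lambda p: upload_trace(p, wandb.run.id),
) as prof:
    for step, batch in enumerate(loader):
        t0 = time.perf_counter()
        # train_step is assumed to return plain Python numbers
        loss, throughput = train_step(batch)
        step_time = time.perf_counter() - t0

        # continuous metrics -> wandb
        wandb.log({
            'loss': loss,
            'lr': optimizer.param_groups[0]['lr'],
            'step_time': step_time,
            'throughput': throughput,
            'gpu_mem': torch.cuda.memory_allocated() / 1e9,   # GB
        }, step=step)

        # occasional profile -> uploaded to S3 by on_trace_ready
        prof.step()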
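
The snippet above leaves `upload_trace` undefined. One possible shape for such a callback is sketched below. The factory name `make_trace_handler`, the file naming scheme, and the `upload` hook are all assumptions made for illustration; only `prof.step_num` and `prof.export_chrome_trace(path)` are real `torch.profiler.profile` members. The handler is duck-typed, so it can be exercised without PyTorch installed.

```python
import os


def make_trace_handler(out_dir, run_id, upload=None):
    """Build an on_trace_ready callback that writes one trace file per cycle.

    `upload` is a hypothetical hook (for example a function that copies the
    file to S3); it receives the local path of the exported trace.
    """
    os.makedirs(out_dir, exist_ok=True)

    def handler(prof):
        # step_num is the profiler's own counter, so file names never collide.
        path = os.path.join(out_dir, f"{run_id}_step{prof.step_num}.json")
        prof.export_chrome_trace(path)
        if upload is not None:
            upload(path)
        return path

    return handler
```

It would be passed as `on_trace_ready=make_trace_handler("traces", wandb.run.id, upload=my_uploader)`, where `my_uploader` is whatever storage client the team already uses.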

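What the dashboard shows:

Continuous metrics: live loss, step time, and GPU memory curves
Sampled traces: one full profile roughly every 500 steps, downloaded for analysis when needed
Alerts: step time / loss / GPU memory crossing a threshold triggers PagerDuty / Slack

Production LLM training runs 24/7. Embedding profiles in the monitoring stack lets the team step in 5 minutes before a failure rather than an hour after it.

Seen this way, a profile is not just "a tool you reach for while debugging"; it is a core data source for production observability. Organizations that train large models typically have dedicated people maintaining this infrastructure.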

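21.9.29 A real case: tracking down an intermittent training hang

The hardest performance problems are the ones where training "occasionally stalls": once every 10 hours, for about a minute each time, then it recovers on its own. The stall cannot be reproduced on demand for a chrome trace, and py-spy does not catch it. What now?

Step 1: turn on rolling profiles plus anomaly detection

This is the setup from §21.9.18: sample every 1000 steps, and dump a trace whenever step time exceeds 5x the mean.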
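
The "step time > 5x mean" trigger can be sketched as a small rolling-window monitor. This is a minimal illustration, not the code from §21.9.18; the class name `StepTimeMonitor`, the window size, and the `min_samples` guard are assumptions.

```python
from collections import deque


class StepTimeMonitor:
    """Flag a step whose duration exceeds `factor` times the rolling mean."""

    def __init__(self, window=200, factor=5.0, min_samples=20):
        self.times = deque(maxlen=window)
        self.factor = factor
        self.min_samples = min_samples

    def update(self, step_time):
        """Record one step duration (seconds); return True if it is anomalous."""
        is_anomaly = False
        if len(self.times) >= self.min_samples:
            mean = sum(self.times) / len(self.times)
            is_anomaly = step_time > self.factor * mean
        if not is_anomaly:
            # Keep outliers out of the baseline so one hang does not mask the next.
            self.times.append(step_time)
        return is_anomaly
```

In the training loop, `if monitor.update(step_time): ...` would be the place to dump the current trace or raise an alert.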

Step 2: locate the problem once an abnormal trace is captured

normal step:   5s
abnormal step: 60s (12x)

Open the abnormal trace and compare it with a normal one:

normal:   forward 2s + backward 2.5s + optim 0.5s = 5s
abnormal: forward 2s + backward 2.5s + optim 55.5s   <- optim explodes!

Looking inside the optim phase:

normal optim:    Adam.step (0.4s) + zero_grad (0.05s) = 0.45s
abnormal optim:  cudaMemcpy DtoH (54.9s) + Adam.step (0.4s)

A cudaMemcpy DtoH that blocks for 54.9 seconds? That is a GPU → CPU copy. Check the code:

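# Code that only occasionally takes the slow path:
if step % 100 == 0 and loss.item() > 1.0:    # <- .item() forces a sync!
    log_extra_metrics(loss)

loss.item() sits in the hot loop. Normally loss < 1.0, the branch is skipped, and nothing goes wrong. Occasionally loss > 1.0, the branch is taken, and on top of the first .item() the call to log_extra_metrics(...) issues several more .item() / .cpu() calls. Each one is a blocking device-to-host (D2H) copy that waits for the GPU work queued ahead of it, and together they stall the step for about a minute.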

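Step 3: fix

# Fix: accumulate metrics and sync once per batch of steps
log_buffer.append(loss.detach())
if step % 100 == 0:
    log_extra_metrics(torch.stack(log_buffer))    # one sync, batched
    log_buffer.clear()

After the fix the intermittent hang disappeared. The whole investigation took 2 days, because profile data plus anomaly detection turned "intermittent" into "observable".

The takeaway: profiling is not "catching bugs by luck". It is an engineering loop of long-term monitoring plus precise data at the moment something goes wrong. This is a core method for SRE work on large-model training.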
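
The normal-versus-abnormal comparison from step 2 can be partly automated. The sketch below assumes both traces were saved with `export_chrome_trace` and loaded with `json.load`; it relies only on the Chrome trace format's `traceEvents` list, where complete events have `ph == "X"` and a `dur` in microseconds. The function names are illustrative, and the event names in the usage example are the labels used in the text above, not necessarily the exact names a real trace contains.

```python
from collections import defaultdict


def total_duration_by_name(trace):
    """Sum `dur` (microseconds) of complete events (ph == 'X') per event name."""
    totals = defaultdict(float)
    for event in trace.get("traceEvents", []):
        if event.get("ph") == "X" and "dur" in event:
            totals[event["name"]] += event["dur"]
    return dict(totals)


def largest_regressions(normal, abnormal, top=3):
    """Return the event names whose total time grew the most, largest first."""
    base = total_duration_by_name(normal)
    cur = total_duration_by_name(abnormal)
    deltas = {
        name: cur.get(name, 0.0) - base.get(name, 0.0)
        for name in set(base) | set(cur)
    }
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Toy traces built from the optim-phase numbers in this case study.
normal_trace = {"traceEvents": [
    {"ph": "X", "name": "Adam.step", "dur": 400_000},
    {"ph": "X", "name": "zero_grad", "dur": 50_000},
]}
abnormal_trace = {"traceEvents": [
    {"ph": "X", "name": "cudaMemcpy DtoH", "dur": 54_900_000},
    {"ph": "X", "name": "Adam.step", "dur": 400_000},
]}
worst_name, worst_delta_us = largest_regressions(normal_trace, abnormal_trace)[0]
```

Summing by name ignores nesting between events, so the result is a pointer to where to look in the timeline rather than an exact accounting.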

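21.9.30 Profiler support across hardware backends

The PyTorch profiler is not CUDA-only. Other backends are supported too, with varying capability:

Backend	Profiling capability	Limitations
CUDA (NVIDIA GPU)	Complete: CUPTI + Kineto + Memory Snapshot	Most mature
ROCm (AMD GPU)	Fairly complete: ROCm tracing tools (roctracer / rocprof) + Kineto	Some metrics missing
XPU (Intel GPU)	Basic: oneapi-pti + Kineto	Still iterating
MPS (Apple Silicon)	Basic: MPSProfiler	No hardware metrics
CPU	Basic: function-level timing	No hardware metrics (use perf instead)
PrivateUse1 (vendor accelerators)	Vendor-provided callbacks	Implementations differ by vendor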

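Cross-backend code:

import torch
from torch.profiler import profile, ProfilerActivity

# Adapt to whichever backend is available
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    # torch.xpu and ProfilerActivity.XPU only exist in recent PyTorch builds
    activities.append(ProfilerActivity.XPU)

with profile(activities=activities) as prof:
    train_step()    # defined elsewhere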

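In practice: on ROCm with AMD GPUs the PyTorch profiler works, but exposes fewer metrics than on CUDA; on Apple M-series chips the MPS profile shows kernel durations but not GPU core occupancy. If a production deployment spans several kinds of hardware, the profiling workflow has to be validated on each backend.

The backend differences show that PyTorch profiling is not "NVIDIA-exclusive": it is an abstraction layer with per-backend implementations. NVIDIA is still the de facto standard (CUPTI is the most mature), and the other backends are catching up.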

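21.9.31 Summary: from profiling to performance engineering

Taking the chapter as a whole, profiling is not a single skill but a performance engineering methodology: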

graph TB
    Goal[Set the optimization target<br/>throughput / latency / memory]
    Goal --> Baseline[Baseline measurement<br/>profile current performance]
    Baseline --> Hyp[Hypothesize the bottleneck<br/>form a hypothesis from the trace]
    Hyp --> Fix[Apply the fix]
    Fix --> Verify[Profile again<br/>verify the hypothesis]
    Verify --> Q{Improved?}
    Q -->|yes| Next[Find the next bottleneck]
    Q -->|no| Hyp
    Next --> Goal2{Target reached?}
    Goal2 -->|yes| Done[Done]
    Goal2 -->|no| Hyp

    style Baseline fill:#fef3c7
    style Verify fill:#dcfce7
    style Q fill:#dbeafe

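Key skills at each step:

Quantify the target: do not vaguely "aim for faster"; pick a concrete metric ("step time from 5s down to 3s")
Measure the baseline: profile before changing anything, so you know how much you gained
Be hypothesis-driven: after reading the trace, state "I think X is slow" rather than acting on intuition
Verify in small steps: measure after every single fix; do not batch changes and profile afterwards (you will not know which one helped)
Respect the cost: performance optimization is an engineering trade-off, not an unlimited investment

This methodology follows the same line as "measurement-driven development" in software engineering. Knowing the profiling tools is the foundation; closing the engineering loop is the higher-level skill.

In practice, tuning the performance of one LLM training job usually takes 1-2 weeks: profile a few times a day, change a few lines, profile again. At the end, write the findings up as an SOP for the team to reuse. This "measure, optimize, measure again" rhythm is the daily work of a performance engineer.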

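21.9.32 Retrospective on a complete tuning case

To close, here is a fictional but typical tuning case (a composite of several real experiences):

Goal: bring Llama-7B training step time from 1.2s down to 0.8s (a 33% reduction).

Day 1: baseline + first round of analysis

profile 5 steps, step time 1.2s
breakdown:
  - DataLoader: 200ms (17%)
  - forward: 350ms (29%)
  - backward: 500ms (42%)
  - optim: 150ms (12%)

backward is the largest share. The trace shows many small kernels plus cudaStreamSynchronize calls in backward, which suggests the model is not compiled.

Change: add @torch.compile.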

Day 2: second profile

step time 0.95s (21% better)
  - DataLoader: 200ms (21%)
  - forward: 250ms (26%, fused)
  - backward: 350ms (37%, fused)
  - optim: 150ms (16%)

DataLoader's share has gone up. The trace shows the GPU occasionally idle for 200ms, so the DataLoader cannot keep up.

Change: num_workers=4 → 16, prefetch_factor=2 → 4, and add persistent_workers=True.

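Day 3: third profile

step time 0.85s (11% better)
  - DataLoader: 50ms (6%)
  - forward: 250ms
  - backward: 350ms
  - optim: 200ms (24%)  <- optim time and share both up

optim rose from 150ms to 200ms, and its share of the step rose with it. The trace shows repeated fp32 ↔ bf16 cast kernels inside optim. Cause: the optimizer keeps fp32 state while the model is bf16, so every step converts everything.

Change: switch to bnb.optim.AdamW8bit (8-bit Adam, without the heavy casting).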

Day 4: fourth profile

step time 0.78s (target reached!)
  - DataLoader: 50ms
  - forward: 250ms
  - backward: 350ms
  - optim: 130ms

Total improvement: 35% (1.2s → 0.78s). One profile-change-verify cycle per day, target reached in 4 days.
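
The percentages in this case study are simple arithmetic on the per-phase timings, and recomputing them is a quick consistency check. The helper names below are illustrative; the numbers are the ones quoted for Day 1 to Day 4.

```python
def breakdown(phases_ms):
    """Return (total step time in ms, share of each phase in whole percent)."""
    total = sum(phases_ms.values())
    shares = {name: round(100 * ms / total) for name, ms in phases_ms.items()}
    return total, shares


def improvement(before_ms, after_ms):
    """Relative reduction in step time, in percent (one decimal)."""
    return round(100 * (before_ms - after_ms) / before_ms, 1)


days = {
    "day1": {"dataloader": 200, "forward": 350, "backward": 500, "optim": 150},
    "day2": {"dataloader": 200, "forward": 250, "backward": 350, "optim": 150},
    "day3": {"dataloader": 50, "forward": 250, "backward": 350, "optim": 200},
    "day4": {"dataloader": 50, "forward": 250, "backward": 350, "optim": 130},
}
totals = {day: breakdown(phases)[0] for day, phases in days.items()}
overall = improvement(totals["day1"], totals["day4"])
```

Note that the per-day improvements (about 21%, 11%, and 8%) are each relative to the previous day's step time, so they do not add up to the overall figure.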

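Retrospective:

Three independent optimizations stacked: compile + DataLoader + 8-bit Adam
Every optimization was driven by profile data, not intuition
One area changed per round, so the gain of each change is known
Total cost: one engineer for 4 days, i.e. a few thousand dollars

Without profiling, "guessing" and changing everything at once could mean:

Changing things that did not need changing (added complexity)
Missing a key optimization
No way to quantify each change's contribution
More total time spent

This is the real value of profiling tools: they turn performance engineering from guesswork into science. That mindset, together with the tools, is worth more than any specific technique.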

21.9.33 Source-level details of ProfilerStep and the schedule state machine

The state machine behind prof.step() is driven by the ProfilerAction enum (torch/profiler/profiler.py):

class ProfilerAction(Enum):
    NONE = 0
    WARMUP = 1
    RECORD = 2
    RECORD_AND_SAVE = 3

What each prof.step() does (simplified):

def step(self):
    prev_action = self.current_action
    self.step_num += 1
    self.current_action = self.schedule(self.step_num)
    self._transit_action(prev_action, self.current_action)

# _transit_action looks up (prev, current) in a table. The main transitions:
#   NONE            -> WARMUP : prepare the trace; warm up without recording
#   WARMUP          -> RECORD : start the trace; recording really begins
#   RECORD_AND_SAVE -> any    : stop the trace, then call on_trace_ready(self)
# The RECORD_AND_SAVE step itself is recorded in full; the trace is stopped
# and handed to the user callback on the following step() call.

schedule(step_num) returns the action that applies to the given step. For schedule(wait=1, warmup=2, active=3, repeat=2) the sequence is:

step 0: NONE
step 1: WARMUP
step 2: WARMUP
step 3: RECORD
step 4: RECORD
step 5: RECORD_AND_SAVE     <- end of the 1st repeat
step 6: NONE                 <- wait phase of the next cycle
step 7: WARMUP
...
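
The sequence above can be reproduced with a few lines of plain Python. This is a re-implementation sketch written for illustration, not the library source: `Action` is a stand-in for `ProfilerAction`, `sketch_schedule` mirrors only the `wait` / `warmup` / `active` / `repeat` arguments, and options such as `skip_first` are left out.

```python
from enum import Enum


class Action(Enum):
    """Stand-in for torch.profiler.ProfilerAction, so this runs without PyTorch."""
    NONE = 0
    WARMUP = 1
    RECORD = 2
    RECORD_AND_SAVE = 3


def sketch_schedule(wait, warmup, active, repeat=0):
    """Return a function mapping a step number to the action for that step."""
    cycle = wait + warmup + active

    def schedule_fn(step):
        # After `repeat` full cycles the profiler stays idle (0 means no limit).
        if repeat > 0 and step // cycle >= repeat:
            return Action.NONE
        pos = step % cycle
        if pos < wait:
            return Action.NONE
        if pos < wait + warmup:
            return Action.WARMUP
        # The last active step of a cycle also triggers the save.
        return Action.RECORD if pos < cycle - 1 else Action.RECORD_AND_SAVE

    return schedule_fn


fn = sketch_schedule(wait=1, warmup=2, active=3, repeat=2)
sequence = [fn(step).name for step in range(8)]
```

Printing `sequence` gives the eight actions listed above; from step 12 onward, after both repeats have finished, the function keeps returning `NONE`.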

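Knowing the state machine lets you write a custom schedule:

from torch.profiler import ProfilerAction

def my_schedule(step_num):
    # One cycle every 1005 steps: 1 warm-up step, then 5 recorded steps
    cycle_pos = step_num % 1005
    if cycle_pos == 0:
        return ProfilerAction.WARMUP
    elif cycle_pos < 5:
        return ProfilerAction.RECORD
    elif cycle_pos == 5:
        return ProfilerAction.RECORD_AND_SAVE
    else:
        return ProfilerAction.NONE

with profile(activities=[...], schedule=my_schedule, ...) as prof:
    for batch in loader:
        train_step(batch)
        prof.step()

A custom schedule like this gives long-running training precise control over profiling frequency: no wasted space and no missed key steps. Production LLM training commonly uses a sparse schedule that records once every 1000+ steps.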
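
Before handing a custom schedule to a long run, it helps to simulate it and count how many steps actually get recorded. The sketch below is an illustration under stated assumptions: `Action` stands in for `ProfilerAction` so the code runs without PyTorch, `my_schedule` repeats the logic from the text, and `audit_schedule` is a hypothetical helper name.

```python
from collections import Counter
from enum import Enum


class Action(Enum):
    """Stand-in for torch.profiler.ProfilerAction."""
    NONE = 0
    WARMUP = 1
    RECORD = 2
    RECORD_AND_SAVE = 3


def my_schedule(step_num):
    # Same logic as the custom schedule above: 1 warm-up + 5 recorded steps per 1005.
    cycle_pos = step_num % 1005
    if cycle_pos == 0:
        return Action.WARMUP
    elif cycle_pos < 5:
        return Action.RECORD
    elif cycle_pos == 5:
        return Action.RECORD_AND_SAVE
    return Action.NONE


def audit_schedule(schedule_fn, n_steps):
    """Count actions over n_steps and return (counts, fraction of steps recorded)."""
    counts = Counter(schedule_fn(step) for step in range(n_steps))
    recorded = counts[Action.RECORD] + counts[Action.RECORD_AND_SAVE]
    return counts, recorded / n_steps


counts, recorded_fraction = audit_schedule(my_schedule, 3 * 1005)
```

Over three cycles this schedule records 15 of 3015 steps, roughly 0.5% of the run, which is the kind of number to check against the storage budget for traces.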

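21.9.34 Profiling and CI/CD: detecting performance regressions

Mature engineering teams add profiling to the CI pipeline: every commit automatically runs a benchmark plus profile, and a detected performance regression blocks the merge.

# CI workflow snippet
- name: Run perf benchmark
  run: python benchmark.py --output results.json

- name: Detect regression
  # compare_with_baseline.py exits non-zero when the regression exceeds 5%
  run: python compare_with_baseline.py results.json baseline.json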

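benchmark.py uses the profiler internally:

import time

import torch
from torch.profiler import profile, ProfilerActivity

def benchmark():
    model = build_model()      # defined elsewhere
    inputs = make_input()      # defined elsewhere

    # warm-up
    for _ in range(5):
        model(inputs)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()

    # timed region (wall clock includes profiler overhead)
    t0 = time.perf_counter()
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(20):
            model(inputs)
        torch.cuda.synchronize()
    total_time = time.perf_counter() - t0

    # key_averages() is not sorted, so rank by self GPU time explicitly.
    # Some recent releases rename this attribute to self_device_time_total.
    top = max(prof.key_averages(), key=lambda e: e.self_cuda_time_total)

    # extract the key metrics
    return {
        'total_time': total_time,
        'top_kernel': top.key,
        'top_kernel_time': top.self_cuda_time_total,   # microseconds
        'memory_peak': torch.cuda.max_memory_allocated(),
    }

CI runs this on every commit, compares the result with the baseline, and raises an alert when the regression exceeds 5%.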
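
The comparison script itself is not shown in this chapter. A minimal sketch of what compare_with_baseline.py might contain follows; the function names, the 5% default, and the choice of `total_time` as the compared key are assumptions based on the benchmark output above.

```python
import json


def regression_ratio(current, baseline, key="total_time"):
    """Relative slowdown of `current` against `baseline` (0.06 means 6% slower)."""
    return (current[key] - baseline[key]) / baseline[key]


def check(results_path, baseline_path, threshold=0.05):
    """Return a process exit code: 1 if the regression exceeds `threshold`, else 0."""
    with open(results_path) as f:
        current = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)
    ratio = regression_ratio(current, baseline)
    print(f"total_time change vs baseline: {ratio:+.1%}")
    return 1 if ratio > threshold else 0
```

As a script entry point this would end with `sys.exit(check(sys.argv[1], sys.argv[2]))`, so that a non-zero exit code fails the CI step. Benchmark timings are noisy, so in practice the threshold is usually applied to a median over several runs.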

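Practical benefits:

Early detection: a commit introduces a 1ms regression and CI catches it immediately, which costs on the order of 100x less than finding it in production a week later
Easy bisecting: you know which commit caused it, and git bisect pinpoints it directly
Continuous optimization: when perf regressions are visible, the team naturally writes more efficient code

PyTorch itself maintains a setup of this kind (the pytorch/benchmark repo), running a large suite of models on a regular basis to detect performance regressions in core operators. Turning profiling into tooling for CI is a sign of engineering maturity.

Seen this way, profiling is not just a "debugging tool" but part of the quality assurance infrastructure, and an essential capability for a production ML team.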

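21.10 Practical tuning tips

1. Slow training? Turn on the profiler first: do not guess the bottleneck. Three lines of code with a schedule that records 5 steps, followed by a look at the chrome trace, usually expose the problem right away

2. Fluctuating (sawtooth) GPU utilization is almost always the DataLoader: see the full diagnostic procedure in Chapter 11 §11.11

3. Many cudaStreamSynchronize calls in the trace mean implicit synchronization: look for .item() / .cpu() / print(tensor) in the code and move them out of the training hot loop

4. autograd profiler vs profiler: the old torch.autograd.profiler.profile has been superseded by torch.profiler; use the latter in new code

5. Profiler overhead: record_shapes=True and with_stack=True each slow the profiled run down, roughly 5-10% apiece as a rule of thumb, with with_stack the more expensive of the two. Turn them off when the measurement is highly sensitive

6. Flame graphs for Python: add py-spy record -- python train.py to get a Python-side flame graph that complements the profiler

7. nsys (Nsight Systems) is the more specialized tool: NVIDIA's official system-level profiler, which can show GPU hardware counters and the full CUPTI detail. Bring it in when production tuning actually needs it

8. Rolling trace dumps for long runs: trigger on_trace_ready every 1000 steps to write and upload a trace, and monitor for abnormal steps

9. Upload traces to object storage and share by URL: this makes profile data a team asset. Put trace.json on S3 / GCS and load it by URL in perfetto.dev, so teammates can view it with zero local setup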

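10. Combine profiling with unit tests: write assert kernel_time < threshold tests for core operators. Performance regressions then surface during PR review, which is on the order of 100x cheaper than fixing them after release

11. Send profile data to wandb / mlflow: track it alongside training metrics, so you can look back later and see "which commit made step time go up"

12. For multi-GPU tuning, start with one GPU and then scale out: profile a single GPU to find forward / backward bottlenecks, then move to multiple GPUs and look at communication bottlenecks. Tuning both at once is very hard

13. Training vs inference: use torch.profiler for training and the built-in metrics of vLLM / SGLang for inference. The two call for different mindsets (§21.9.23)

21.11 The complete tuning workflow

flowchart TB
    Start[Slow training / OOM]
    Start --> P1[1 Add profiler, record 5 steps]
    P1 --> Look{Which kind of problem does the chrome trace show?}

    Look -->|GPU sawtooth idle| DL[DataLoader tuning<br/>Chapter 11 §11.11]
    Look -->|many kernel gaps| Sync[Find implicit syncs<br/>move .item / .cpu out]
    Look -->|dense kernel launches| Comp[torch.compile<br/>fuse small ops]
    Look -->|GPU bound but slow| Algo[Change dtype<br/>Chapter 20 AMP]
    Look -->|OOM| Mem[Memory Snapshot<br/>Chapter 4 §4.11]

    Mem --> Frag{Fragmentation?}
    Frag -->|yes| EXP[expandable_segments=True]
    Frag -->|no| Save[activation_checkpoint<br/>FSDP / cpu_offload]

    style P1 fill:#fef3c7,stroke:#f59e0b,stroke-width:2px

This is the standard path distilled from dozens of tuning sessions. Each step maps to a specific earlier chapter of this book, and the profiler is the diagnostic tool that ties them together.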

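21.12 Cross-references within the book

Chapter 4 §4.11 Memory Snapshot: the foundation of this chapter's GPU memory analysis
Chapter 11 §11.11 DataLoader bottleneck diagnosis: the hands-on procedure
Chapter 14 §14.x Inductor: debugging fused kernels in the profile together with output_code
Chapter 15 §15.6.30 TORCH_COMPILE_DEBUG: using profile and compile debug as a pair
Chapter 17 §17.9 NCCL debugging: distributed profiling combined with NCCL_DEBUG
Chapter 20 §20.5.16 Quantization accuracy debugging: profile a quantized model to check that INT8 GEMM kernels are really in use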

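21.13 Design lessons

The core ideas behind the profiler's design:

First, merge multiple data sources into one timeline: CPU functions, CUDA kernels, memory events, and user annotations are aligned on a single time axis, so "why is this part slow" can be seen at a glance

Second, the Schedule API makes the profiler usable in production: recording an entire run is not feasible, and sampled recording via a schedule makes it suitable for long-running training

Third, layered visualization: text tables (quick look at the top entries), chrome trace (timeline), and the TB plugin (automated analysis). Each level answers a different diagnostic question, and together they cover the full diagnostic space

Fourth, user-level annotation merged with system-level collection: record_function lets the profile reflect the structure of user code, fused with the underlying ATen operator events

Fifth, profiling and the compiler stack reinforce each other: torch.compile makes fused kernels appear in the trace, and the profile in turn verifies what compile gained. The two feed back into each other: after changing compile settings you check the effect in a profile, and bottlenecks found in the profile send you back to compile. This loop is the core performance workflow of the PyTorch 2.x era

Sixth, measurement comes before optimization: every case in this chapter profiles first, optimizes, and then profiles again to verify. That is the fundamental difference from "tuning by intuition". "Do not optimize without measuring" is the first principle of performance engineering, and the profiler is the tool that puts it into practice: running a 5-step profile and forming a hypothesis from the trace before each tuning round reaches the goal faster than editing code directly

The next chapter takes apart custom operators: when PyTorch's built-in operators are not enough, how to write your own kernel and connect it to the dispatcher / autograd / torch.compile, registering every layer completely so that profile / autograd / compile all handle your operator correctly.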