DeepSeek V4 Source Code Analysis

Chapter 9 Experts and the Shared Expert: SwiGLU + clip + Capacity Balancing

Author: 杨艺韬 (Yang Yitao) · 5,213 characters


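“An expert is not someone who has all the knowledge, but someone who has carefully bounded their domain.” (attributed to an ML systems architect)

Each of V4's 384 experts runs its own "small universe". Three tools keep these universes from collapsing into one another: SwiGLU, clip, and capacity balancing.

9.1 Introduction: The Engineering Problems of 384 SwiGLUs Working at Once

One V4 MoE layer contains 385 experts: 384 routed + 1 shared. Each expert is an independent SwiGLU FFN with its own parameters. For each token, the outputs of the selected routed experts are scaled by their routing weights and summed together with the shared expert's output.

Unrolled, this architecture raises at least four engineering problems:

Numerical stability: 384 SwiGLUs train simultaneously, so every expert's activation magnitude must stay under control; otherwise a few extreme experts destabilize the backward-pass gradients
Capacity balancing: every expert should receive a similar number of tokens; otherwise some experts go without gradient updates for long stretches
Role of the shared expert: the shared expert carries "general knowledge" but must not dominate the output; route_scale=2.5 is designed for this
TP multi-GPU distribution: the 384 experts must be split across 8 or 16 GPUs while keeping communication cost under control

V4's Expert class handles the first three problems in about 20 lines of code; TP sharding is coordinated outside it by the Block and MoE classes. This chapter takes all of it apart.

9.2 The Expert Class: Full Source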

class Expert(nn.Module):
    """Single MoE expert: SwiGLU FFN (w1, w2, w3). Computation in float32 for stability."""
    def __init__(self, dim: int, inter_dim: int, dtype=None, swiglu_limit=0):
        super().__init__()
        self.w1 = Linear(dim, inter_dim, dtype=dtype)
        self.w2 = Linear(inter_dim, dim, dtype=dtype)
        self.w3 = Linear(dim, inter_dim, dtype=dtype)
        self.swiglu_limit = swiglu_limit

    def forward(self, x: torch.Tensor, weights: Optional[torch.Tensor] = None) -> torch.Tensor:
        dtype = x.dtype
        gate = self.w1(x).float()
        up = self.w3(x).float()
        if self.swiglu_limit > 0:
            up = torch.clamp(up, min=-self.swiglu_limit, max=self.swiglu_limit)
            gate = torch.clamp(gate, max=self.swiglu_limit)
        x = F.silu(gate) * up
        if weights is not None:
            x = weights * x
        return self.w2(x.to(dtype))

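Fewer than 20 lines, yet every line is deliberate:

w1 / w3: project to inter_dim; they are SwiGLU's "gate" and "up projection" respectively
w2: projects back to dim
dtype: can be BF16 / FP8 / FP4 e2m1 (routed experts default to FP4 in V4)
swiglu_limit: the clip bound, set to 10 in V4 (up is clipped on both sides, gate only from above)
forward upcasts the activations to float32 for the computation, then casts back to the original dtype at the end

9.2·Supplement: How the 384 Experts Cooperate Inside an MoE Layer

Drawing the relationship between routed and shared experts in one MoE layer as a data-flow diagram:

flowchart LR
  X["token hidden<br/>x: [B*S, 7168]"] --> Gate{Gate}
  Gate -->|sqrtsoftplus + bias + topk| Indices["6 expert indices"]
  Gate -.normalized raw scores.-> Weights["6 routing weights"]
  
  X --> Shared["shared expert<br/>BF16, SwiGLU"]
  
  X -->|dispatch by indices| E1["routed expert 0<br/>FP4 e2m1"]
  X --> E2["routed expert 1<br/>FP4"]
  X --> Edot["..."]
  X --> EN["routed expert 383<br/>FP4"]
  
  E1 -->|× w0| Sum((+))
  E2 -->|× w1| Sum
  Edot --> Sum
  EN -->|× w5| Sum
  Shared -->|always active| Sum
  Sum --> Y["MoE output y: [B*S, 7168]"]
  
  Indices -.selects which experts are active.-> E1

Three things to note:

Each token selects 6 routed experts (not all 384), and each rank computes only the experts it holds
The shared expert is always active, giving every token a "general knowledge" baseline
Routed expert outputs are multiplied by their routing weights; the shared expert's output is not weighted and is added directly

This "sparse routed + dense shared" combination is the core design of the DeepSeekMoE paper.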
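To put numbers on "sparse routed + dense shared", the following back-of-the-envelope sketch counts expert parameters per layer and per token. The shapes dim=7168 and inter_dim=3072 are read off the diagrams in this chapter and should be treated as illustrative; expert_params is a hypothetical helper, and bias-free linear layers are an assumption.

```python
# Illustrative shapes taken from this chapter's diagrams (not verified config values).
DIM, INTER_DIM = 7168, 3072
N_ROUTED, N_SHARED, TOPK = 384, 1, 6


def expert_params(dim: int, inter_dim: int) -> int:
    """Parameter count of one SwiGLU expert: w1, w3 (dim x inter_dim) and w2 (inter_dim x dim).

    Assumes the linear layers carry no bias terms.
    """
    return 3 * dim * inter_dim


per_expert = expert_params(DIM, INTER_DIM)
total = (N_ROUTED + N_SHARED) * per_expert   # all experts stored in one layer
active = (TOPK + N_SHARED) * per_expert      # experts actually run for one token

print(f"params per expert : {per_expert:,}")
print(f"stored per layer  : {total:,}")
print(f"active per token  : {active:,} ({active / total:.2%} of the layer)")
```

Under these assumptions only 7 of the 385 experts run for any given token, which is why the routed experts can afford to be numerous.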

9.3 The Algebra and Geometry of SwiGLU

SwiGLU is the standard FFN in modern models such as V4 / V3 / Llama / Mistral:

\text{SwiGLU}(x) = w_2(\text{SiLU}(w_1 x) \odot w_3 x)

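SiLU(g) = g · sigmoid(g), also known as Swish. ⊙ denotes element-wise multiplication.

Breaking the formula into three steps:

gate = w_1 x: the gating channel
up = w_3 x: the up-projection channel
output = w_2 (SiLU(gate) ⊙ up): element-wise product after gating, then projection back to the original dimension

Why is SwiGLU more expressive than a traditional ReLU FFN? Because it has two independent paths: gate decides the "switch" and up decides the "content", and their product is richer than a single ReLU path.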
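The three steps can be written out as a minimal sketch with plain weight matrices. The function name swiglu and the toy shapes are illustrative assumptions, not part of the V4 source.

```python
import torch
import torch.nn.functional as F


def swiglu(x, w1, w2, w3):
    """SwiGLU FFN with explicit weight matrices (no biases)."""
    gate = x @ w1.T                 # gating channel, [n, inter_dim]
    up = x @ w3.T                   # up-projection channel, [n, inter_dim]
    hidden = F.silu(gate) * up      # element-wise gating
    return hidden @ w2.T            # project back to [n, dim]


torch.manual_seed(0)
dim, inter_dim = 8, 16              # toy sizes for illustration
x = torch.randn(4, dim)
w1 = torch.randn(inter_dim, dim)
w3 = torch.randn(inter_dim, dim)
w2 = torch.randn(dim, inter_dim)

out = swiglu(x, w1, w2, w3)
print(out.shape)

# SiLU is g * sigmoid(g), matching the definition above.
g = x @ w1.T
print(torch.allclose(F.silu(g), g * torch.sigmoid(g)))
```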

V4's specific handling of the arithmetic:

gate = self.w1(x).float()
up = self.w3(x).float()

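Whether the input is BF16 or FP8, the gate / up intermediates are upcast to float32. The reasons:

BF16 shares float32's exponent range, so a value like exp(15) does not overflow it; the weakness is resolution. With only 8 significant bits, BF16 rounds values coarsely (for example sigmoid outputs close to 1)
The element-wise product compounds the rounding error of both operands when it is carried out in BF16
The result is cast back to dtype only when it is handed to w2

The cost of the float32 intermediates is roughly double the memory for those tensors. But they exist only briefly per token and are never written to the KV cache, so the effect on total memory is small.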
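To get a feel for what the float32 upcast buys, this small sketch compares the range and the resolution of the two formats using torch.finfo and a couple of round-trip casts. It is an illustration of the formats themselves, not a measurement of the V4 kernels.

```python
import math

import torch

bf16 = torch.finfo(torch.bfloat16)
fp32 = torch.finfo(torch.float32)

# Range: both formats have an 8-bit exponent, so their maxima are of the same order.
print("max :", bf16.max, fp32.max)
# Resolution: machine epsilon differs by many orders of magnitude.
print("eps :", bf16.eps, fp32.eps)

# exp(15) is far inside the BF16 range and stays finite after the cast.
big = torch.tensor(math.exp(15.0)).to(torch.bfloat16)
print("exp(15) finite in BF16:", torch.isfinite(big).item())

# Resolution loss: 257 cannot be represented in BF16 and is rounded to a neighbour.
a = torch.tensor(257.0)
print("257.0 round-tripped through BF16:", a.to(torch.bfloat16).item())
```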

9.3·Supplement: dtype Flow Through the SwiGLU forward Path

V4's Expert.forward switches dtype several times. Drawn as a dtype flow diagram:

flowchart TB
  Input["x: BF16 [B,S,7168]"] --> W1["w1 GEMM (FP4 weight)"]
  Input --> W3["w3 GEMM (FP4 weight)"]
  W1 --> Gate["gate: BF16 → .float() → FP32"]
  W3 --> Up["up: BF16 → .float() → FP32"]
  Gate --> Clamp1{swiglu_limit?}
  Up --> Clamp2{swiglu_limit?}
  Clamp1 -->|clamp max=10| GateClipped["gate ≤ 10<br/>FP32"]
  Clamp2 -->|clamp -10..10| UpClipped["up ∈ -10..10<br/>FP32"]
  GateClipped --> SiLU["F.silu(gate)<br/>FP32"]
  SiLU --> Mul((×))
  UpClipped --> Mul
  Mul --> Inter["intermediate: FP32 [B,S,3072]"]
  Inter --> Wmul{weights given?}
  Wmul -->|yes| WeightedMul["× routing weight"]
  Wmul -->|no, e.g. shared expert| Skip[pass through]
  WeightedMul --> Cast["x.to(BF16)"]
  Skip --> Cast
  Cast --> W2["w2 GEMM (FP4 weight, 输入 BF16)"]
  W2 --> Output["output: BF16 [B,S,7168]"]

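Five key decisions in the precision chain:

w1/w3 take BF16 input: the GEMM uses FP4 weights + BF16 activations → FP32 accumulation → BF16 output
gate / up are upcast to FP32: the SwiGLU arithmetic is precision-sensitive, so it runs in FP32
clamp runs in FP32: avoids rounding jitter from clamping in BF16
intermediate stays FP32: up to the point just before w2
w2 input is cast back to BF16: so the w2 GEMM takes the FP4 × BF16 path

This precision chain is V4's engineering recipe for keeping FP4 weights from eroding SwiGLU's expressiveness.

9.4 swiglu_limit=10: A Seemingly Redundant clip

In V4's config.json: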

"swiglu_limit": 10.0

The corresponding source:

if self.swiglu_limit > 0:
    up = torch.clamp(up, min=-self.swiglu_limit, max=self.swiglu_limit)
    gate = torch.clamp(gate, max=self.swiglu_limit)

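gate is clamped to max=10 (no lower bound) and up is clamped to [-10, 10]. Why is this clip needed?

Reason 1: dequantization precision of FP4 experts

V4's routed expert weights are FP4 e2m1. FP4 has a narrow dynamic range (nonzero magnitudes span roughly [0.5, 6] before scaling), and the weights are dequantized to BF16. Dequantization can occasionally produce outliers: some channels end up with weights outside the range seen during training.

Without the clamp, these outlier weights could blow up individual gate / up elements to the order of e^10, and the resulting gradients would wreck training.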
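To make the "[0.5, 6]" range concrete, here is a minimal fake-quantization sketch that rounds a tensor to the e2m1 value grid. The grid values follow from the e2m1 format (2 exponent bits, 1 mantissa bit); the function fake_quant_e2m1 and its single per-tensor scale are assumptions for illustration only, since real FP4 kernels typically use finer-grained scales and this chapter does not show V4's scheme.

```python
import torch

# Non-negative magnitudes representable by FP4 e2m1 (the sign bit is handled separately).
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_e2m1(w: torch.Tensor) -> torch.Tensor:
    """Round every element to the nearest e2m1 value and return the dequantized tensor.

    Assumption: one scale for the whole tensor, chosen so that max|w| maps to 6.0.
    """
    scale = w.abs().max().clamp(min=1e-12) / 6.0
    mag = (w.abs() / scale).unsqueeze(-1)              # [..., 1]
    idx = (mag - E2M1_GRID).abs().argmin(dim=-1)       # index of nearest grid point
    return torch.sign(w) * E2M1_GRID[idx] * scale


torch.manual_seed(0)
w = torch.randn(4, 8)
w_q = fake_quant_e2m1(w)

print("distinct magnitudes after quantization:", w_q.abs().unique().numel())
print("max abs rounding error:", (w - w_q).abs().max().item())
```

With only eight magnitudes available, the rounding error per weight is large compared with BF16, which is the background for treating the activations conservatively.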

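Reason 2: numerical instability of SiLU at large inputs

SiLU(g) = g · sigmoid(g). At g = 10, SiLU(10) ≈ 10; at g = 100, SiLU(100) ≈ 100. It looks linear, but once the result is multiplied by up and projected through w2, this "linear amplification" is amplified again by the w2 weight matrix, and the final output can reach the order of e^15.

Clamping gate to 10 caps SiLU's maximum output and cuts the blow-up path at the source.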
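The effect of the clamp can be checked on a few extreme values. The sketch below reuses the two clamp calls from the source; the wrapper clipped_swiglu_act and the test inputs are illustrative. It also shows why gate needs no lower bound: SiLU is already bounded below on the negative side.

```python
import torch
import torch.nn.functional as F


def clipped_swiglu_act(gate, up, limit=10.0):
    """SiLU(gate) * up with the asymmetric clamp used in Expert.forward."""
    up = torch.clamp(up, min=-limit, max=limit)
    gate = torch.clamp(gate, max=limit)       # upper bound only
    return F.silu(gate) * up


gate = torch.tensor([-1e4, -1.0, 0.0, 5.0, 50.0, 1e4])
up = torch.tensor([1e4, -1e4, 3.0, 2.0, 1e4, -1e4])

raw = F.silu(gate) * up
clipped = clipped_swiglu_act(gate, up)
print("max |raw|    :", raw.abs().max().item())
print("max |clipped|:", clipped.abs().max().item())   # stays below limit * limit

# SiLU never goes far below zero, so negative gate values cannot blow up.
neg_min = F.silu(torch.linspace(-50.0, 0.0, 10001)).min().item()
print("min of SiLU on [-50, 0]:", neg_min)
```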

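Reason 3: stability early in training

Early in training the weights are random, and gate / up can take extreme values. The clamp absorbs these extremes into ±10 (for gate, only the upper bound applies) and avoids numerical collapse at the start of training.

Why 10 rather than 5 or 100? The V4 team arrived at it through training practice: 10 roughly matches the 99.9th-percentile safe value for BF16 + FP4 dequantization on the 1.6T model.

9.5 The Shared Expert: An Always-Active "General Knowledge Base"

One V4 MoE layer has 1 shared expert (n_shared_experts=1), which takes part in the computation for every token:

# End of MoE.forward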
y += self.shared_experts(x)

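Differences between the shared expert and routed experts:

Dimension	Shared expert	Routed expert
Activation	Always	6 of 384 per token
Input tokens	All tokens	Selected tokens
dtype	BF16 (default)	FP4 e2m1
Route weight	None	weight × 2.5
Role	General knowledge	Specialized knowledge

The shared expert is one of the core innovations of the DeepSeekMoE paper (arXiv:2401.06066). It addresses the problem that fine-grained experts are too small to hold general knowledge: general knowledge is pulled out into an always-active expert so that the 384 routed experts can focus on differentiated, specialized knowledge.

Instantiation of the shared expert in the V4 source: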

# MoE.__init__
self.shared_experts = Expert(args.dim, args.moe_inter_dim, swiglu_limit=args.swiglu_limit)

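Note that no dtype argument is passed, so dtype=None is used and the module falls back to default_dtype (BF16 or FP8, depending on ModelArgs.dtype). This contrasts with the FP4 routed experts.

Higher precision for the shared expert makes sense: it contributes to every token, so any precision loss accumulates across all tokens. The FP4 1.6T model relies on the sparsity of routed experts (each token activates only 6 of 384) to amortize precision loss; the shared expert has no such amortization and has to stay at higher precision.

9.6 MoE.forward: How 384 Experts Execute in Parallel

MoE.forward is the core that strings together the Gate, the 384 routed experts, and the 1 shared expert: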

def forward(self, x: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    shape = x.size()
    x = x.view(-1, self.dim)
    weights, indices = self.gate(x, input_ids.flatten())
    y = torch.zeros_like(x, dtype=torch.float32)
    counts = torch.bincount(indices.flatten(), minlength=self.n_routed_experts).tolist()
    for i in range(self.experts_start_idx, self.experts_end_idx):
        if counts[i] == 0:
            continue
        expert = self.experts[i]
        idx, top = torch.where(indices == i)
        y[idx] += expert(x[idx], weights[idx, top, None])
    if world_size > 1:
        dist.all_reduce(y)
    y += self.shared_experts(x)
    return y.type_as(x).view(shape)

Six key steps:

Step 1: flatten the token dimension

x = x.view(-1, self.dim)

Flattens [B, S, dim] into [B × S, dim] so tokens can be indexed directly.

Step 2: call Gate to get weights / indices

weights, indices = self.gate(x, input_ids.flatten())

weights: [B*S, 6], indices: [B*S, 6].

Step 3: bincount the number of tokens each expert receives

counts = torch.bincount(indices.flatten(), minlength=self.n_routed_experts).tolist()

If an expert's count is 0, its forward is skipped to save computation.
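A toy run of this step, assuming 8 experts and top-2 routing for readability (the real layer uses 384 and 6):

```python
import torch

n_routed_experts = 8                         # toy value; V4 uses 384
# Each row is one token, each column one of its top-k expert choices.
indices = torch.tensor([[0, 2],
                        [2, 3],
                        [1, 2],
                        [7, 0]])

counts = torch.bincount(indices.flatten(), minlength=n_routed_experts).tolist()
print(counts)  # [2, 1, 3, 1, 0, 0, 0, 1]

# Experts 4, 5 and 6 received nothing, so their forward pass would be skipped.
skipped = [i for i, c in enumerate(counts) if c == 0]
print(skipped)  # [4, 5, 6]
```

minlength matters here: without it, bincount would return a list only as long as the largest index seen, and counts[i] could go out of range for idle experts.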

Step 4: run the experts one by one (only those held by this rank)

for i in range(self.experts_start_idx, self.experts_end_idx):
    if counts[i] == 0:
        continue
    expert = self.experts[i]
    idx, top = torch.where(indices == i)
    y[idx] += expert(x[idx], weights[idx, top, None])

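experts_start_idx to experts_end_idx is the range of experts held by this TP rank (n_routed_experts // world_size experts per rank). Each rank processes only its own experts and skips the others.

torch.where(indices == i) finds which tokens selected expert i and returns (token_idx, position_in_topk). The hidden states of those tokens and the matching weights are passed to the expert, and the output is accumulated in place into y.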
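The indexing in this step is easier to follow on a small example. The tensors below are made-up toy values; only the torch.where call and the weights[idx, top, None] indexing come from the source.

```python
import torch

indices = torch.tensor([[0, 2],
                        [2, 3],
                        [1, 2],
                        [7, 0]])              # [n_tokens, topk]
weights = torch.tensor([[0.6, 0.4],
                        [0.7, 0.3],
                        [0.5, 0.5],
                        [0.9, 0.1]])          # routing weight per (token, slot)

i = 2                                         # the expert being processed
idx, top = torch.where(indices == i)
print(idx.tolist())   # tokens that picked expert 2       -> [0, 1, 2]
print(top.tolist())   # top-k slot in which they picked it -> [1, 0, 1]

# Trailing None adds a broadcast dimension so the weights scale a [n, inter_dim] tensor.
w = weights[idx, top, None]
print(w.shape)        # torch.Size([3, 1])
```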

Step 5: all_reduce across ranks

if world_size > 1:
    dist.all_reduce(y)

Each rank computes only the contributions of the experts it holds; all_reduce sums the partial outputs of all ranks to produce the complete y.
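Why a sum all_reduce is enough can be checked without any distributed setup. The sketch below simulates two ranks in a single process: each "rank" accumulates only its own experts, and adding the partial results reproduces the single-device output. The stand-in experts are plain matrices and partial_output is a hypothetical helper; no torch.distributed calls are involved.

```python
import torch

torch.manual_seed(0)
n_experts, world_size, dim, topk, n_tokens = 4, 2, 8, 2, 5

experts = [torch.randn(dim, dim) for _ in range(n_experts)]    # stand-in experts
x = torch.randn(n_tokens, dim)
# Distinct top-k expert ids per token, plus a routing weight for each choice.
indices = torch.stack([torch.randperm(n_experts)[:topk] for _ in range(n_tokens)])
weights = torch.rand(n_tokens, topk)


def partial_output(start: int, end: int) -> torch.Tensor:
    """Accumulate the contributions of experts in [start, end), as one rank would."""
    y = torch.zeros_like(x)
    for i in range(start, end):
        idx, top = torch.where(indices == i)
        if idx.numel() == 0:
            continue
        y[idx] += weights[idx, top, None] * (x[idx] @ experts[i])
    return y


n_local = n_experts // world_size
partials = [partial_output(r * n_local, (r + 1) * n_local) for r in range(world_size)]

y_all_reduced = torch.stack(partials).sum(dim=0)   # what a SUM all_reduce would produce
y_reference = partial_output(0, n_experts)          # single-device result
print(torch.allclose(y_all_reduced, y_reference, atol=1e-6))
```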

Step 6: add the shared expert output

y += self.shared_experts(x)

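The shared expert runs on all tokens and its output is added to y. This step runs independently on every rank, after the all_reduce, so the shared output is counted once. It works because every rank holds a replica of the shared expert: V4 keeps it in BF16 rather than FP4 and does not shard it across ranks.

9.7 Sharding Experts Under Tensor Parallel

With 8-GPU TP, V4 splits the 384 routed experts by rank:

rank 0: expert 0 ~ 47 (48 experts)
rank 1: expert 48 ~ 95
…
rank 7: expert 336 ~ 383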
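The split listed above can be reproduced with a small helper. expert_range is a hypothetical name, and the contiguous, equal-sized split is assumed from the ranges shown.

```python
def expert_range(rank: int, n_routed_experts: int = 384, world_size: int = 8):
    """Return the half-open [start, end) range of expert ids held by `rank`."""
    assert n_routed_experts % world_size == 0, "experts must divide evenly across ranks"
    n_local = n_routed_experts // world_size
    start = rank * n_local
    return start, start + n_local


for rank in range(8):
    start, end = expert_range(rank)
    print(f"rank {rank}: expert {start} ~ {end - 1} ({end - start} experts)")
```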

In each rank's nn.ModuleList, only the 48 experts it holds are real modules; every other slot is None:

self.experts = nn.ModuleList([
    Expert(args.dim, args.moe_inter_dim, dtype=expert_dtype, swiglu_limit=args.swiglu_limit)
    if self.experts_start_idx <= i < self.experts_end_idx else None
    for i in range(self.n_routed_experts)
])

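This "sparse ModuleList" pattern makes the code read as if all 384 experts exist, while only 48 are actually instantiated. Experts belonging to other ranks are None placeholders, and the forward loop never visits them because it only iterates over range(experts_start_idx, experts_end_idx).

The engineering trade-offs of this design:

✅ The code looks the same as "single GPU, all experts", so little logic has to change
✅ Global expert ids index self.experts directly, which keeps dispatch simple
❌ Every rank stores 336 None references; the memory cost is small, but there is some extra Python overhead

V4 chose a style that looks wasteful but simplifies the code, an engineering aesthetic that runs through the whole source.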
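The sparse ModuleList pattern is easy to try at toy scale. In this sketch nn.Linear stands in for the Expert class, and the sizes (8 slots, rank holding experts 2 and 3) are made up for illustration.

```python
import torch.nn as nn

n_experts, start, end = 8, 2, 4      # toy values: this "rank" holds experts 2 and 3

experts = nn.ModuleList([
    nn.Linear(4, 4) if start <= i < end else None   # None placeholder for remote experts
    for i in range(n_experts)
])

# The list keeps its full length, so global expert ids can be used as indices.
print(len(experts))
# Only the local slots hold real modules.
local_ids = [i for i in range(n_experts) if experts[i] is not None]
print(local_ids)
# None entries contribute no parameters: 2 experts * (4*4 weights + 4 biases).
n_params = sum(p.numel() for p in experts.parameters())
print(n_params)
```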

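9.8 Expert Capacity and the "Overflow" Problem

A common problem in MoE training is expert capacity overflow: an expert receives more tokens than it can process. How V4 handles it:

The V4 source has no explicit capacity limit: an expert processes however many tokens it receives. The noaux_tc + bias mechanism already adjusts each expert's load dynamically, so no hard capacity limit is needed.

If training does hit extreme imbalance (say one expert receives 50% of the tokens in a batch), a few things follow:

The bias pushes this expert's score down sharply at the next update
The training step is slowed down by this expert, since the other ranks wait for it
Over time the load evens out on its own

V4's measured training curves (published in the README) show a max / min load ratio of about 1.5 across the 384 experts, which is very balanced. This is the combined effect of noaux_tc, sqrtsoftplus, and hash routing in the first 3 layers.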
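As a rough illustration of how a selection bias can push load back toward the mean, here is a sign-based update in the spirit of auxiliary-loss-free balancing. The update rule, the step size, and the helper names update_bias and load_ratio are all assumptions; this chapter does not show V4's actual bias update.

```python
import torch


def update_bias(bias: torch.Tensor, counts: torch.Tensor, step: float = 0.001) -> torch.Tensor:
    """Lower the bias of overloaded experts and raise it for underloaded ones.

    Hypothetical rule: move each bias by a fixed step against the sign of the load error.
    """
    load = counts.float()
    return bias - step * torch.sign(load - load.mean())


def load_ratio(counts: torch.Tensor) -> float:
    """Max / min token count, with the minimum floored at 1 to avoid division by zero."""
    return (counts.max() / counts.clamp(min=1).min()).item()


counts = torch.tensor([500, 120, 100, 80])   # expert 0 is heavily overloaded
bias = update_bias(torch.zeros(4), counts)

print(bias.tolist())          # expert 0 is pushed down, the others up
print(load_ratio(counts))     # far above the ~1.5 reported for a balanced layer
```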

9.9 Hands-on Experiment: Run a mini Expert + MoE

import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniExpert(nn.Module):
    def __init__(self, dim, inter_dim, swiglu_limit=10):
        super().__init__()
        self.w1 = nn.Linear(dim, inter_dim)
        self.w2 = nn.Linear(inter_dim, dim)
        self.w3 = nn.Linear(dim, inter_dim)
        self.swiglu_limit = swiglu_limit

    def forward(self, x, weights=None):
        gate = self.w1(x).float()
        up = self.w3(x).float()
        if self.swiglu_limit > 0:
            up = up.clamp(-self.swiglu_limit, self.swiglu_limit)
            gate = gate.clamp(max=self.swiglu_limit)
        x = F.silu(gate) * up
        if weights is not None:
            x = weights * x
        return self.w2(x.to(torch.float32)).type_as(self.w2.weight)


class MiniMoE(nn.Module):
    def __init__(self, dim=128, inter_dim=512, n_experts=8, topk=2):
        super().__init__()
        self.dim = dim
        self.topk = topk
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([MiniExpert(dim, inter_dim) for _ in range(n_experts)])
        self.shared = MiniExpert(dim, inter_dim)

    def forward(self, x):
        B, S, _ = x.shape
        x_flat = x.view(-1, self.dim)
        scores = F.softplus(self.gate(x_flat)).sqrt()
        weights, indices = scores.topk(self.topk, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)

        y = torch.zeros_like(x_flat)
        for i, expert in enumerate(self.experts):
            mask = (indices == i).any(dim=-1)
            if not mask.any():
                continue
            sub_x = x_flat[mask]
            # For each selected token, take the weight of the top-k slot that chose expert i
            sub_w = (weights * (indices == i).float()).sum(dim=-1, keepdim=True)[mask]
            y[mask] += expert(sub_x, sub_w)

        y += self.shared(x_flat)
        return y.view(B, S, -1)


# Test
moe = MiniMoE()
x = torch.randn(2, 16, 128)
out = moe(x)
print(out.shape)   # torch.Size([2, 16, 128])

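After getting this mini version to run, look back at MoE.forward in the V4 source: the code structure for 384 experts is essentially the same as for 8 experts; only the expert count and the TP sharding change.

9.9·Supplement: Expert Training Dynamics, from Random to Differentiated

V4's 384 routed experts start from random initialization and behave almost identically. How they evolve from "all the same" to "differentiated" is the core dynamic of MoE training.

Early stage (first 5% of steps):

The Gate's routing is still warming up, and the tokens each expert receives are nearly random. The input-token distribution each expert sees is close to uniform over the whole vocab. At this stage all experts are doing the same thing: learning to produce a meaningful hidden state for arbitrary tokens.

The key at this stage is even gradient flow: every expert receives roughly the same amount of gradient, so none is over-trained or under-trained.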

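Middle stage (5%-30% of steps):

The Gate weights start to learn distinctions: some experts begin to receive mostly code tokens, others mostly Chinese tokens. Each expert's input distribution is no longer uniform and starts to show specific features.

A self-reinforcing loop kicks in:

Expert A receives more code tokens → its SwiGLU weights learn representations better suited to code
Once it has learned code representations, the Gate's scoring network sees that code tokens match expert A's hidden states well → it gives A a higher score
In the next step code tokens are more likely to be routed to A → further reinforcement

This positive feedback loop is the core driver of MoE differentiation.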

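Late stage (30%-80% of steps):

The differentiation pattern has stabilized, and each expert's area of strength is largely fixed. The SwiGLU w1/w2/w3 weights reflect the expert's specialization: run an SVD on them and you can see clear principal components that correspond to the expert's specialty.

Refinement stage (80%-100% of steps):

Learning-rate decay lets experts polish internal details within the fixed differentiation pattern. Observing expert activation patterns at this stage shows a clear per-domain response: some experts activate strongly only on code prompts, others only on math prompts.

Training dynamics, V4 vs V3:

V4 has 384 experts, 50% more than V3's 256. For the same target of training tokens per expert, V4 needs 50% more total training tokens, which is part of the reason for 32T versus V3's 14.8T. In return, 384 experts allow finer-grained differentiation: each expert can specialize in a narrower sub-domain, and the model as a whole is more expressive.

Understanding this curve matters a great deal for fine-tuning: if you do domain fine-tuning on V4 (for example training a V4-Code), do not let the learning rate get large enough to destroy the specialization built up by the refinement stage. The lr should be on the order of 1/100 of the pretraining lr.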

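9.9·Supplement·Supplement: The Shared Expert as a "Load-Bearing Wall"

The shared expert acts as a "load-bearing wall" in V4's MoE: it stores general knowledge so that the 384 routed experts can focus on differentiated knowledge. A closer look at this role:

Load-bearing role 1: a stable output baseline

Every token passes through the shared expert, so whichever 6 routed experts the router picks, the output contains at least the shared expert's baseline contribution. This guards against the disaster case where all selected routed experts "misfire" at once and the output collapses.

Load-bearing role 2: relieving routed experts of the general-knowledge burden

Without a shared expert, every routed expert would have to learn both general language ability and domain-specific knowledge, splitting its capacity. With the shared expert as a backstop, routed experts can leave general knowledge to it and spend their capacity on differentiation. This is the core benefit argued in the DeepSeekMoE paper.

Load-bearing role 3: a "base flow" of training gradients

The shared expert takes part in every token's computation, so its gradient is averaged over all tokens and forms a steady base flow. Routed experts receive gradients only when selected, so their gradient flow is pulsed and less stable. The shared expert's steady gradient flow gives globally shared layers such as the input embedding and lm_head a continuous training signal, so they do not have to rely on the pulsed gradients of routed experts alone.

Load-bearing role 4: a "latency stabilizer" at inference

At inference every token must run the shared expert, and its latency is deterministic (independent of the routing decision). This makes V4's per-token latency distribution more stable than that of a purely routed MoE: a token is less likely to see a latency spike just because it was routed to a busy rank.

The single shared expert in V4 is a deliberate trade-off: 2-3 would let general knowledge take up too much capacity, and 0 would make routed expert training unstable. The DeepSeekMoE paper argues for this number as "optimal", and V4 carries the conclusion over.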

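9.10 Further Reading

DeepSeekMoE paper (arXiv:2401.06066): the origin of the shared expert
SwiGLU paper (arXiv:2002.05202): introduces SwiGLU
GShard paper (arXiv:2006.16668): early MoE capacity management
Chapter 7 of this book: how the Gate output drives the expert selection in this chapter
Chapter 12 of this book: precision details of FP4 expert weights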

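9.10·Supplement: Implementation Differences Between V4's Expert and Other MoEs

V4's Expert implementation differs from other contemporary MoE models in several details. Knowing them helps when porting code across models.

Difference 1: FP4 vs FP8 expert weights

V4: routed experts are FP4 e2m1, the shared expert is BF16. Mixtral / Qwen3: all experts are BF16 / FP8.

Porting impact: when borrowing the Expert class from V4, the dtype path has to be handled correctly; simply setting dtype=BF16 changes the model's behavior.

Difference 2: whether swiglu_limit exists

V4: swiglu_limit=10, clip enforced. Llama 4 / Mistral: usually no clip.

Porting impact: the clip stems from V4's FP4 numerical-stability needs. If your model is not FP4, you can drop it.

Difference 3: where the weights are applied

V4: weights are multiplied into the SwiGLU output inside Expert.forward (x = weights * x). Mixtral: weighting happens outside the expert in the MoE layer; expert outputs are multiplied by weights and then summed.

The difference is where the weights are multiplied: inside the expert in V4, outside it in Mixtral. Because w2 is linear, the two are mathematically equivalent as long as w2 has no bias term, but V4's approach makes the expert self-contained, which is convenient for unit tests.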
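The equivalence of the two placements, and the bias caveat, can be verified numerically. This is a minimal sketch with random toy weights; the variable names are illustrative and not taken from either codebase.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, inter_dim, n = 8, 16, 5
w1, w3 = torch.randn(inter_dim, dim), torch.randn(inter_dim, dim)
w2 = torch.randn(dim, inter_dim)
x = torch.randn(n, dim)
weights = torch.rand(n, 1)                     # one routing weight per token

hidden = F.silu(x @ w1.T) * (x @ w3.T)

# Weight applied before w2 (inside the expert) vs. after w2 (outside the expert).
inside = (weights * hidden) @ w2.T
outside = weights * (hidden @ w2.T)
same_without_bias = torch.allclose(inside, outside, atol=1e-4)
print(same_without_bias)

# With a bias on w2 the two placements no longer agree: the bias is scaled only on the outside path.
b2 = torch.randn(dim)
inside_b = (weights * hidden) @ w2.T + b2
outside_b = weights * (hidden @ w2.T + b2)
same_with_bias = torch.allclose(inside_b, outside_b, atol=1e-4)
print(same_with_bias)
```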

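Difference 4: FP32 intermediate computation

V4: gate, up, and their product are computed in FP32 and cast back to dtype before w2. Mixtral / Llama 4: usually BF16 throughout.

Reason: V4 stacks FP4 weight quantization error on top of the activation arithmetic, and BF16's coarse resolution would add further rounding error, so the source upcasts to FP32 for stability. Models with BF16 weights are less exposed to this.

Difference 5: .float() or .to(torch.float32)

V4 uses .float(), the customary PyTorch spelling, equivalent to .to(torch.float32).

Porting impact: none, but a consistent code style is easier to read.

With these 5 differences in mind, borrowing the Expert class from V4 for another model only requires trimming the V4-specific parts (FP4 / clip / FP32 intermediates); the core SwiGLU structure is generic.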

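9.10·Supplement: Engineering Practice for Monitoring Expert Utilization

After deploying V4 you should monitor the expert utilization distribution of every MoE layer. It is a core metric for both training stability and deployment health.

Normal distribution:

For a trained V4 model in production, the share of tokens received by the 384 experts in each layer should look like this:

Ratio of the busiest expert to the least busy expert: about 1.5x - 3x
Standard deviation: about 30-50% of the mean
No expert stays at 0 load for long

Anomaly 1: long tail (some experts are barely used)

Symptom: some 10% of experts receive < 10% of the mean token count. Meaning: the bias adjustment is insufficient and some experts are under-trained. Response: check the bias configuration and roll back to the previous version if necessary.

Anomaly 2: routing collapse (a few experts dominate)

Symptom: the top 5% of experts receive > 30% of all tokens. Meaning: the router has learned a pathological distribution. Response: fail over to other instances immediately and investigate the root cause.

Anomaly 3: distribution drift (the distribution changes noticeably between forwards)

Symptom: two consecutive forwards of the same batch differ in expert load by > 50%. Meaning: numerical instability (for example sqrtsoftplus jittering at certain boundary values). Response: check the GPU hardware and check whether the input data contains special tokens.

Implementing the monitoring:

In vLLM / SGLang you can add a hook that records indices.flatten().bincount() after every MoE forward and reports it to Prometheus periodically. A Grafana dashboard showing the per-layer expert load histogram makes anomalies visible at a glance.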
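The statistics such a hook would report can be sketched independently of any serving framework. expert_load_report is a hypothetical helper; it turns the thresholds quoted above (10% of experts below 10% of the mean, top 5% of experts above 30% of tokens) into flags. How the numbers are exported to a metrics system is deliberately left out.

```python
import torch


def expert_load_report(indices: torch.Tensor, n_experts: int) -> dict:
    """Summarize expert load for one MoE forward from the router's top-k indices."""
    counts = torch.bincount(indices.flatten(), minlength=n_experts).float()
    mean = counts.mean()
    sorted_counts, _ = counts.sort(descending=True)
    top_k = max(1, int(0.05 * n_experts))                      # size of the "top 5%" group
    starved_frac = (counts < 0.1 * mean).float().mean().item()  # experts far below the mean
    top_share = (sorted_counts[:top_k].sum() / counts.sum()).item()
    return {
        "max_min_ratio": (counts.max() / counts.min().clamp(min=1)).item(),
        "cv": (counts.std() / mean).item(),                    # std as a fraction of the mean
        "zero_load": int((counts == 0).sum()),
        "long_tail": starved_frac >= 0.10,                     # anomaly 1
        "collapse": top_share > 0.30,                          # anomaly 2
    }


n_experts, topk = 384, 6
# Perfectly balanced routing: every expert appears exactly 10 times.
balanced = expert_load_report((torch.arange(3840) % n_experts).reshape(-1, topk), n_experts)
# Collapsed routing: only 8 experts are ever selected.
collapsed = expert_load_report((torch.arange(3840) % 8).reshape(-1, topk), n_experts)

print(balanced)
print(collapsed)
```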

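Implementing this monitoring takes roughly 2-3 engineer-days, but it matters a great deal for production stability. It is recommended as a required component of any V4 deployment.

9.11 Chapter Summary

V4's Expert is a standard SwiGLU FFN with an added swiglu_limit=10 clip to prevent numerical collapse
Routed experts use FP4 e2m1 and the shared expert uses BF16: tiered precision
MoE.forward schedules the 384 experts with bincount + skipping empty experts + a per-rank loop + all_reduce
TP sharding uses a "sparse ModuleList with None placeholders": concise code, and each rank instantiates only its own experts
V4 has no hard capacity limit and relies on the noaux_tc bias for dynamic balancing
The shared expert takes part in every token's computation and carries general knowledge, complementing the specialized knowledge of the 384 routed experts

In Chapter 10 we leave the MoE engine and move on to V4's most experimental design: Hyper-Connections, where hc_mult=4 replaces the traditional residual.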