<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:ap="https://www.grepcode.cn/ns/ai-papers">
  <channel>
    <title>Transformer from Zero to One on GrepCode</title>
    <link>https://www.grepcode.cn/tf/index.html</link>
    <description>Recent content in Transformer from Zero to One on GrepCode</description>
    <generator>Hugo</generator>
    <language>zh-CN</language>
    <lastBuildDate>Sun, 17 May 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://www.grepcode.cn/tf/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>[TF-001] Transformer from Zero to One: Why, What, and How to Implement It</title>
      <link>https://www.grepcode.cn/tf/001-transformer-from-scratch.html</link>
      <pubDate>Sun, 17 May 2026 00:00:00 +0000</pubDate>
      <guid>https://www.grepcode.cn/tf/001-transformer-from-scratch.html</guid>
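      <description>Starts from the serial bottleneck of RNNs, works up to the fully parallel design of Multi-Head Attention, and finishes by walking through model.py line by line. Covers the geometric intuition behind Q/K/V, scaled dot-product attention, Pre-LN, the causal mask, and the evolution from NMT to GPT.</description>
      <ap:role>nio: direction setting, paragraph-by-paragraph review, terminology correction (funnel → multichannel pipette), calibration of the architectural stance (removing Transformer bias), introduced perspectives (registers, chain → tree → graph, no silver bullet, four phases); opencode: first draft, code citations and line-number annotations, literature search, carrying out the edits</ap:role>
      <ap:review>This article was reviewed by nio over several rounds from 2026-05-17 to 05-19. Key changes: funnel → multichannel pipette; added the LSTM comparison shape table; added the data-structure observation (O(n) → O(1) is a change in matrix topology); added the register perspective; added the tree → chain → graph lineage; recalibrated the positioning of RNNs (the hero's blade); removed Transformer bias throughout. All changes have been applied and confirmed by nio a second time.</ap:review>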
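      <content:encoded><![CDATA[<h1 id="transformer-从零到一">Transformer from Zero to One</h1>
<blockquote>
<p>Not because it is smarter, but because it treats the whole sentence as one fully connected graph and computes it in a single pass.</p>
</blockquote>
<h2 id="1-rnn-为什么不行">1. Why RNNs Fall Short</h2>
<p>An RNN translates like this: read word 1 → update the hidden state → read word 2 → update the hidden state → … → read the last word → start decoding.</p>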
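<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>&#34;die katze schläft&#34;  →  h1 → h2 → h3 → decode &#34;the cat sleeps&#34;
</span></span></code></pre></div><p>There is a fundamental problem here: <strong>all 50 words share a single chain</strong>. The gradient from word 50 has to pass through 49 recurrent steps before it reaches word 1. An LSTM uses three gates (forget, input, output) to slow down vanishing gradients, but that treats the symptom, not the cause.</p>
<p>Running the Phase 1 LSTM on IWSLT14 translation, the best BLEU was only 3.06. It overfit after 5 epochs: the gradient chain is too long, information decays too quickly, and the model cannot retain long-range dependencies.</p>
<p>The Transformer's answer: <strong>stop processing serially. Feed in the whole sentence at once and let every word relate to every other word simultaneously.</strong></p>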
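<p>This design is not free. It comes with three costs: two kinds of blow-up, plus an extra encoding step.</p>
<p><strong>1. Hidden-state blow-up.</strong> The LSTM encoder output is <code>(num_layers, B, hidden_size)</code>; with 2 layers of 256 dims that is <code>(2, B, 256)</code>, so only the final state is kept: 512 floats per sample. The Transformer encoder output is <code>(B, S, d_model)</code>: every position keeps a full d_model-dimensional vector. With sentence length S=128 and d_model=512, the encoder output is 65536 floats per sample, <strong>128 times the LSTM hidden state</strong>.</p>
<p><strong>2. O(n²) memory.</strong> Self-Attention scores every pair of words, so the attention matrix is (B, heads, S, S). At S=128 that is 128² = 16384 elements per head; at S=512 it is 262144. An LSTM, by contrast, is O(n): each step only sees the current input and the hidden state. This is why vanilla Attention hits the GPU-memory wall on long sequences (long documents, high-resolution images).</p>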
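<p>The arithmetic behind these two costs can be checked in a few lines. This is a minimal sketch in plain Python that uses only the sizes quoted above; the variable and function names are illustrative and are not taken from <code>model.py</code>.</p>

```python
# Per-sample element counts, using the sizes quoted in the text.
num_layers, hidden_size = 2, 256      # LSTM final hidden state
seq_len, d_model = 128, 512           # Transformer encoder output

lstm_hidden = num_layers * hidden_size    # fixed, independent of sentence length
encoder_out = seq_len * d_model           # grows linearly with sentence length
ratio = encoder_out // lstm_hidden


def attention_elements(s):
    """Entries in one head's (S, S) attention matrix."""
    return s * s


print(lstm_hidden, encoder_out, ratio)                      # 512 65536 128
print([attention_elements(s) for s in (128, 512, 2048)])    # [16384, 262144, 4194304]
```

<p>Quadrupling the sentence length multiplies the encoder output by 4 but the attention matrix by 16, which is where the memory wall comes from.</p>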
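<p><strong>3. Positional encoding is required.</strong> Attention itself has no notion of order. An RNN gets position for free because words are fed in one at a time: position is the processing order and needs no extra marker. The Transformer swallows the whole sentence at once, loses all word order, and has to inject sin/cos position signals explicitly to get it back. This is often mislabeled &quot;temporal information&quot;, but it is not temporal. It is <strong>positional information</strong>, which has nothing to do with time and everything to do with where a word sits.</p>
<p><strong>Is the cost worth paying?</strong> An obvious counter-question: if the LSTM's hidden_size is inflated until it matches the Transformer's parameter count, can it catch up? In these experiments it could not. A BiLSTM + Attention model topped out at BLEU 3.77 on IWSLT14, while a Transformer with a comparable parameter count reached 11.49.</p>
<p>The gap is not about parameter count. It is about how information flows.</p>
<p><strong>Changing the architecture changes the shape of the matrix itself.</strong> LSTM gradients accumulate serially along time steps: the error signal travels back through n cell states one step at a time, a path of length O(n). Transformer gradients connect directly through the attention weight matrix: the gradient from position i to position j arrives after the chain rule through a single matmul, a path of length O(1). The former is a series circuit; the latter is a parallel circuit on a fully connected bus.</p>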
<blockquote>
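<p><strong>This is an observation at the data-structure level.</strong> O(n) versus O(1) is not a difference in operation count: the topology of the matrix changes from a chain to a graph. The RNN's hidden-state chain is a one-dimensional sequence, and the gradient is passed along that line stop by stop. The attention weight matrix is a two-dimensional adjacency matrix: any two points are directly connected, so a gradient reaches any position without intermediate hops. <strong>The change of shape comes before the change of algorithm.</strong> Self-Attention, Multi-Head, and LayerNorm, dissected below, are all concrete computation rules built on top of this &quot;fully connected adjacency matrix&quot; data structure.</p>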
</blockquote>
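<p>The chain-versus-graph claim can be made concrete by counting hops. The sketch below is plain Python with hypothetical helper names (<code>chain_graph</code>, <code>complete_graph</code>, <code>hops</code>); it models only the connectivity pattern, not gradients or any code in <code>model.py</code>.</p>

```python
from collections import deque


def chain_graph(n):
    """RNN-style topology: position i is linked only to i-1 and i+1."""
    return {i: [j for j in (i - 1, i + 1) if j in range(n)] for i in range(n)}


def complete_graph(n):
    """Attention-style topology: every position is linked to every other one."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def hops(graph, start, goal):
    """Breadth-first search: number of edges on the shortest path."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return dist[node]
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError("goal not reachable")


n = 50
print(hops(chain_graph(n), n - 1, 0))      # 49
print(hops(complete_graph(n), n - 1, 0))   # 1
```

<p>The 49 matches the &quot;word 50 to word 1&quot; example: the path length is a property of the adjacency structure, before any arithmetic is chosen.</p>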
<table>
  <thead>
      <tr>
          <th></th>
          <th>LSTM (even when scaled up)</th>
          <th>Transformer</th>
      </tr>
  </thead>
  <tbody>
      <tr>
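          <td><strong>Information bottleneck</strong></td>
          <td>Single-channel dropper: the whole sentence is drawn word by word into <code>(num_layers, B, hidden_size)</code>, one fixed-length vector, and the decoder can only unpack from that one tube</td>
          <td><code>(B, S, d_model)</code> laid out by position; the decoder can look up any source position directly</td>
      </tr>
      <tr>
          <td><strong>Gradient path</strong></td>
          <td>O(n): the gradient from word 50 crosses 49 RNN steps to reach word 1</td>
          <td>O(1): any two positions are directly connected through the attention weights</td>
      </tr>
      <tr>
          <td><strong>Parallelism</strong></td>
          <td>Serial: each step must wait for the previous one, leaving the GPU largely idle</td>
          <td>Fully parallel: all positions are computed in one matmul</td>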
      </tr>
  </tbody>
</table>
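<p>An LSTM hidden state is essentially a <strong>single-channel dropper</strong>: however wide the tube (hidden_size=512, 1024, 2048), it draws from one well at a time, and the meaning of 50 words has to squeeze through one by one. A Transformer is a <strong>multichannel pipette</strong>: multiple tips line up with an entire row of the matrix and one operation handles every position at once. That is the fundamental difference.</p>
<p>RNNs are not obsolete. Here they simply happened to be applied to NLP, which needs dependencies at every distance, and a chain topology is at an inherent disadvantage there. But in streaming signal processing, real-time control, and low-latency embedded inference, <strong>the time order itself is the information</strong>; serial processing is a feature rather than a defect, and the RNN is still the hero's blade.</p>
<p>How is this done? The answer is Self-Attention.</p>
<h2 id="2-self-attention--qkv">2. Self-Attention: Q, K, V</h2>
<p>The intuition behind Self-Attention: <strong>I am a word, and I bring three questions to every word in the sentence.</strong></p>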
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Meaning</th>
          <th>Intuition</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Q (Query)</strong></td>
          <td>&ldquo;What am I looking for?&rdquo;</td>
          <td>The current word's search intent</td>
      </tr>
      <tr>
          <td><strong>K (Key)</strong></td>
          <td>&ldquo;What do I have?&rdquo;</td>
          <td>Each word's label / index</td>
      </tr>
      <tr>
          <td><strong>V (Value)</strong></td>
          <td>&ldquo;What do I contribute?&rdquo;</td>
          <td>Each word's actual semantic content</td>
      </tr>
  </tbody>
</table>
<blockquote>
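<p><strong>Q, K, and V all come from the same input sequence, not from the vocabulary.</strong> The input tensor <code>(B, T, d_model)</code> goes through three separate linear projections <code>W_q</code>, <code>W_k</code>, <code>W_v</code> (each an <code>nn.Linear(d_model, d_model)</code>), producing three tensors of identical shape. The vocabulary shows up in only two places: the Embedding at the entrance (token → d_model, essentially a table lookup) and the Linear at the exit (d_model → vocab_size logits). In between, Q, K, and V operate entirely in d_model space, independent of vocabulary size.</p>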
</blockquote>
<blockquote>
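<p><strong>These W parameters are essentially registers.</strong> Whether it is the RNN's <code>W_hh</code> (the time connection h_{t-1}→h_t) or the Transformer's <code>W_q/W_k/W_v</code> (the graph connection position i→position j), what they hold is individual scalar parameters, and during backpropagation the updates accumulate in these registers. The RNN accumulates serial, time-ordered quantities; the Transformer accumulates fully connected edge weights. The topology differs; the storage logic is the same.</p>
<p><strong>The registers are arranged as matrices to fit the compute architecture of the physical world.</strong> A GPU schedules its parallel units by matrix tiles, so only when the registers are laid out as a matrix can a single <code>matmul</code> drive every channel at once. It is not that &quot;matrices are naturally suited to storing parameters&quot;; it is that &quot;parameters have to be laid out as matrices to use the hardware&quot;.</p>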
</blockquote>
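<p>The procedure:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>1. Each word goes through a linear layer → its own Q, K, V
</span></span><span style="display:flex;"><span>2. Dot Q with every K → &#34;match score between the current word and each word&#34;
</span></span><span style="display:flex;"><span>3. Divide by √d_k → scale to avoid softmax saturation (dot-product variance grows with dimension)
</span></span><span style="display:flex;"><span>4. softmax → normalize the scores into weights
</span></span><span style="display:flex;"><span>5. weights × V → weighted sum → the current word's context representation
</span></span></code></pre></div><p>The formula:</p>
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$<p><strong>Why divide by √d_k?</strong> Assume the entries of Q and K are independent and identically distributed with mean 0 and variance 1. Then the dot product <code>q·k</code> has variance d_k. When d_k is large the scores can become large in magnitude, pushing softmax into its flat region (where outputs are close to 0 or 1 and the gradient is close to 0). Dividing by √d_k brings the variance back to 1 and keeps softmax in its &quot;sensitive region&quot;.</p>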
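<p>The variance claim is easy to check empirically. The following is a small Monte Carlo sketch in plain Python under the same assumption as the text (independent N(0, 1) entries); <code>dot_variance</code> is a hypothetical helper, and the printed numbers are estimates that fluctuate with the seed and trial count.</p>

```python
import math
import random
import statistics

random.seed(0)


def dot_variance(d_k, trials=2000, scaled=False):
    """Empirical variance of q.k for q, k with i.i.d. N(0, 1) entries."""
    values = []
    for _ in range(trials):
        dot = sum(random.gauss(0, 1) * random.gauss(0, 1) for _ in range(d_k))
        values.append(dot / math.sqrt(d_k) if scaled else dot)
    return statistics.pvariance(values)


for d_k in (16, 64):
    raw = dot_variance(d_k)
    scaled = dot_variance(d_k, scaled=True)
    print(d_k, round(raw, 1), round(scaled, 2))
```

<p>The unscaled estimate should land near d_k and the scaled one near 1, which is the whole purpose of the <code>/ math.sqrt(self.d_k)</code> factor in the code.</p>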
<p>In code, <code>model.py:59</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>scores <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>matmul(Q, K<span style="color:#f92672">.</span>transpose(<span style="color:#f92672">-</span><span style="color:#ae81ff">2</span>, <span style="color:#f92672">-</span><span style="color:#ae81ff">1</span>)) <span style="color:#f92672">/</span> math<span style="color:#f92672">.</span>sqrt(self<span style="color:#f92672">.</span>d_k)
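</span></span></code></pre></div><h3 id="21-mask--边界与未来">2.1 Mask: Boundaries and the Future</h3>
<p>The mask tells the model &quot;where each sentence ends and which words it may not peek at&quot;. The encoder masks out padding; the decoder uses a lower-triangular mask to hide future words.</p>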
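<p>What a mask does to the attention weights can be seen on a single row of scores. This is a minimal sketch in plain Python, not the PyTorch code from <code>model.py</code>; <code>masked_softmax</code> and <code>causal_mask</code> are hypothetical helpers, the scores are made up, and each row is assumed to keep at least one unmasked position.</p>

```python
import math

NEG_INF = float("-inf")


def masked_softmax(scores, mask):
    """Softmax over one row of scores; positions with mask == 0 get weight 0."""
    filled = [s if m else NEG_INF for s, m in zip(scores, mask)]
    peak = max(filled)                         # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in filled]
    total = sum(exps)
    return [e / total for e in exps]


def causal_mask(t):
    """Lower-triangular T x T mask: row i may look at columns 0..i."""
    return [[1] * (i + 1) + [0] * (t - i - 1) for i in range(t)]


scores = [2.0, 1.0, 0.5, 3.0]
padding_mask = [1, 1, 1, 0]                    # last source token is padding
print(masked_softmax(scores, padding_mask))

for row_mask in causal_mask(4):
    print(row_mask, masked_softmax(scores, row_mask))
```

<p>Filling with negative infinity before the softmax means the masked positions receive exactly zero weight while the remaining weights still sum to 1.</p>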
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Encoder mask (model.py:115): padding positions must not take part</span>
</span></span><span style="display:flex;"><span>src_mask <span style="color:#f92672">=</span> (src <span style="color:#f92672">!=</span> <span style="color:#ae81ff">0</span>)<span style="color:#f92672">.</span>unsqueeze(<span style="color:#ae81ff">1</span>)<span style="color:#f92672">.</span>unsqueeze(<span style="color:#ae81ff">2</span>)  <span style="color:#75715e"># (B, 1, 1, S)</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Decoder mask (model.py:167): no peeking at future words</span>
</span></span><span style="display:flex;"><span>tgt_mask <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>tril(torch<span style="color:#f92672">.</span>ones(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1</span>, T, T, device<span style="color:#f92672">=</span>tgt<span style="color:#f92672">.</span>device))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># e.g. T=4:</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># [[1, 0, 0, 0],</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#  [1, 1, 0, 0],</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#  [1, 1, 1, 0],</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#  [1, 1, 1, 1]]</span>
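</span></span></code></pre></div><h2 id="3-multi-head--8-个专家各自打分">3. Multi-Head: 8 Experts Scoring Independently</h2>
<p>A single attention head measures &quot;similarity&quot; from only one angle. Multi-Head lets 8 heads each compute attention independently in a lower-dimensional subspace (<code>d_k = d_model / 8</code>) and then concatenates the results.</p>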
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>d_model = 512, num_heads = 8 → d_k = 64
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>Not &#34;chop the 512 input dims into 8 segments&#34;,
</span></span><span style="display:flex;"><span>but &#34;every head gets to see all 512 input dims and outputs only 64&#34;
</span></span></code></pre></div><p>Code, <code>model.py:46-65</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># four linear projections</span>
</span></span><span style="display:flex;"><span>self<span style="color:#f92672">.</span>W_q <span style="color:#f92672">=</span> nn<span style="color:#f92672">.</span>Linear(d_model, d_model)   <span style="color:#75715e"># 512 → 512 (split into 8 heads × 64)</span>
</span></span><span style="display:flex;"><span>self<span style="color:#f92672">.</span>W_k <span style="color:#f92672">=</span> nn<span style="color:#f92672">.</span>Linear(d_model, d_model)
</span></span><span style="display:flex;"><span>self<span style="color:#f92672">.</span>W_v <span style="color:#f92672">=</span> nn<span style="color:#f92672">.</span>Linear(d_model, d_model)
</span></span><span style="display:flex;"><span>self<span style="color:#f92672">.</span>W_o <span style="color:#f92672">=</span> nn<span style="color:#f92672">.</span>Linear(d_model, d_model)   <span style="color:#75715e"># project back to 512 after concatenation</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">forward</span>(self, query, key, value, mask<span style="color:#f92672">=</span><span style="color:#66d9ef">None</span>):
</span></span><span style="display:flex;"><span>    B <span style="color:#f92672">=</span> query<span style="color:#f92672">.</span>size(<span style="color:#ae81ff">0</span>)
</span></span><span style="display:flex;"><span>    Q <span style="color:#f92672">=</span> self<span style="color:#f92672">.</span>W_q(query)<span style="color:#f92672">.</span>view(B, <span style="color:#f92672">-</span><span style="color:#ae81ff">1</span>, self<span style="color:#f92672">.</span>num_heads, self<span style="color:#f92672">.</span>d_k)<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>)
</span></span><span style="display:flex;"><span>    K <span style="color:#f92672">=</span> self<span style="color:#f92672">.</span>W_k(key)<span style="color:#f92672">.</span>view(B, <span style="color:#f92672">-</span><span style="color:#ae81ff">1</span>, self<span style="color:#f92672">.</span>num_heads, self<span style="color:#f92672">.</span>d_k)<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>)
</span></span><span style="display:flex;"><span>    V <span style="color:#f92672">=</span> self<span style="color:#f92672">.</span>W_v(value)<span style="color:#f92672">.</span>view(B, <span style="color:#f92672">-</span><span style="color:#ae81ff">1</span>, self<span style="color:#f92672">.</span>num_heads, self<span style="color:#f92672">.</span>d_k)<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># → (B, num_heads, T, d_k)</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    scores <span style="color:#f92672">=</span> Q <span style="color:#f92672">@</span> K<span style="color:#f92672">.</span>transpose(<span style="color:#f92672">-</span><span style="color:#ae81ff">2</span>, <span style="color:#f92672">-</span><span style="color:#ae81ff">1</span>) <span style="color:#f92672">/</span> math<span style="color:#f92672">.</span>sqrt(self<span style="color:#f92672">.</span>d_k)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> mask <span style="color:#f92672">is</span> <span style="color:#f92672">not</span> <span style="color:#66d9ef">None</span>:
</span></span><span style="display:flex;"><span>        scores <span style="color:#f92672">=</span> scores<span style="color:#f92672">.</span>masked_fill(mask <span style="color:#f92672">==</span> <span style="color:#ae81ff">0</span>, float(<span style="color:#e6db74">&#34;-inf&#34;</span>))
</span></span><span style="display:flex;"><span>    attn <span style="color:#f92672">=</span> F<span style="color:#f92672">.</span>softmax(scores, dim<span style="color:#f92672">=-</span><span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># → (B, num_heads, T, T): one weight per query/key pair</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    out <span style="color:#f92672">=</span> attn <span style="color:#f92672">@</span> V
</span></span><span style="display:flex;"><span>    out <span style="color:#f92672">=</span> out<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>)<span style="color:#f92672">.</span>contiguous()<span style="color:#f92672">.</span>view(B, <span style="color:#f92672">-</span><span style="color:#ae81ff">1</span>, self<span style="color:#f92672">.</span>num_heads <span style="color:#f92672">*</span> self<span style="color:#f92672">.</span>d_k)
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># → (B, T, d_model)</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> self<span style="color:#f92672">.</span>W_o(out)
</span></span></code></pre></div><p>The key <code>view</code> and <code>transpose</code> steps:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>After W_q: (B, T, 512)
</span></span><span style="display:flex;"><span>view(B, T, 8, 64) → (B, T, 8, 64)
</span></span><span style="display:flex;"><span>transpose(1, 2) → (B, 8, T, 64)   # treat the head count as a batch dimension and compute in parallel
</span></span></code></pre></div><blockquote>
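<p><strong><code>view</code> allocates no new memory, copies nothing, and initializes nothing.</strong> It only puts a new shape label on the 512 floats that W_q just produced, re-describing the same block of memory as 8 groups × 64. Before the <code>view</code> they are W_q's valid results; after the <code>view</code> the same values are read back, so there are no uninitialized garbage values. 512 = 8 × 64, an exact fit with nothing left over.</p>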
</blockquote>
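<p>The relabeling can be imitated with plain index arithmetic. The sketch below is pure Python, not a tensor library: it assumes one token's projected vector is stored contiguously, and <code>head_slice</code> is a hypothetical helper that returns what head <code>h</code> would read after the <code>view</code>.</p>

```python
d_model, num_heads = 512, 8
d_k = d_model // num_heads                     # 64

# Stand-in for one token's W_q output: 512 numbers laid out contiguously.
projected = list(range(d_model))


def head_slice(flat, head):
    """The contiguous run of d_k values that head `head` reads after the view."""
    start = head * d_k
    return flat[start:start + d_k]


heads = [head_slice(projected, h) for h in range(num_heads)]
print(len(heads), len(heads[0]))               # 8 64
print(heads[0][:3], heads[1][:3])              # [0, 1, 2] [64, 65, 66]

# Flattening the heads back gives exactly the original values: nothing added, nothing lost.
assert [v for head in heads for v in head] == projected
```

<p>Each of these 64-value runs is still computed from all 512 input dimensions, because the split happens after the full <code>W_q</code> projection.</p>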
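<h2 id="4-positional-encoding--位置从哪来">4. Positional Encoding: Where Position Comes From</h2>
<p>Self-Attention is blind to word order: &ldquo;A hit B&rdquo; and &ldquo;B hit A&rdquo; are equivalent under pure attention. Position information has to be injected.</p>
<p>The Transformer uses sin/cos functions, with a different frequency for each dimension:</p>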
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
</span></span><span style="display:flex;"><span>PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
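</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>Dims 0-1: highest frequency (wavelength 2π), encodes fine differences between neighboring positions
</span></span><span style="display:flex;"><span>Dims 510-511: lowest frequency (wavelength approaching 2π·10000, close to DC), encodes coarse position across the whole sentence
</span></span></code></pre></div><p>Visualization: on the left, the sin/cos grid for 128 positions × 512 dims; on the right, how a few individual dimensions vary with position:</p>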
<p><img alt="PE grid" loading="lazy" src="/tf/pe_grid.png"></p>
<p>Code, <code>model.py:22-30</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>pe <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>zeros(<span style="color:#ae81ff">1</span>, max_len, d_model)
</span></span><span style="display:flex;"><span>pos <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>arange(<span style="color:#ae81ff">0</span>, max_len, dtype<span style="color:#f92672">=</span>torch<span style="color:#f92672">.</span>float)<span style="color:#f92672">.</span>unsqueeze(<span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>div <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>exp(torch<span style="color:#f92672">.</span>arange(<span style="color:#ae81ff">0</span>, d_model, <span style="color:#ae81ff">2</span>)<span style="color:#f92672">.</span>float() <span style="color:#f92672">*</span> (<span style="color:#f92672">-</span>math<span style="color:#f92672">.</span>log(<span style="color:#ae81ff">10000.0</span>) <span style="color:#f92672">/</span> d_model))
</span></span><span style="display:flex;"><span>pe[<span style="color:#ae81ff">0</span>, :, <span style="color:#ae81ff">0</span>::<span style="color:#ae81ff">2</span>] <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>sin(pos <span style="color:#f92672">*</span> div)
</span></span><span style="display:flex;"><span>pe[<span style="color:#ae81ff">0</span>, :, <span style="color:#ae81ff">1</span>::<span style="color:#ae81ff">2</span>] <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>cos(pos <span style="color:#f92672">*</span> div)
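</span></span><span style="display:flex;"><span>self<span style="color:#f92672">.</span>register_buffer(<span style="color:#e6db74">&#34;pe&#34;</span>, pe)  <span style="color:#75715e"># not a trainable parameter: saved with the model but receives no gradient</span>
</span></span></code></pre></div><p>Why sin/cos? Because <code>sin(a+b) = sin(a)cos(b) + cos(a)sin(b)</code>, the encoding at position <code>pos+k</code> is a linear combination of the <code>sin</code> and <code>cos</code> values at <code>pos</code>, with coefficients that depend only on the offset <code>k</code>. That lets Attention learn &quot;relative position&quot;, not just &quot;absolute position&quot;.</p>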
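<p>The relative-position property can be verified numerically. The following is a sketch in plain Python that follows the PE formulas given earlier; <code>positional_encoding</code> and <code>shift</code> are hypothetical helpers, and the small <code>d_model</code> of 8 is chosen only to keep the example short.</p>

```python
import math


def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position, following the PE formulas in the text."""
    pe = []
    for i in range(0, d_model, 2):             # i plays the role of 2i in the formula
        angle = pos / (10000 ** (i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe


def shift(pe, k, d_model):
    """Rotate each (sin, cos) pair by k * w_i; the coefficients depend on k only, not on pos."""
    out = []
    for i in range(0, d_model, 2):
        w = 1.0 / (10000 ** (i / d_model))
        s, c = pe[i], pe[i + 1]
        out.append(s * math.cos(k * w) + c * math.sin(k * w))   # sin(a + b)
        out.append(c * math.cos(k * w) - s * math.sin(k * w))   # cos(a + b)
    return out


d_model = 8
direct = positional_encoding(7 + 5, d_model)
rotated = shift(positional_encoding(7, d_model), 5, d_model)
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(direct, rotated)))  # True
```

<p>Since <code>shift</code> is a fixed linear map for a given offset, a linear layer such as <code>W_q</code> or <code>W_k</code> is in principle able to express &quot;k positions away&quot;.</p>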
<h2 id="5-ffn--residual--layernorm">5. FFN + Residual + LayerNorm</h2>
<p>Every Attention layer is followed by a <strong>Position-wise Feed-Forward Network</strong>: <strong>the same two-layer MLP applied independently at every position</strong>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>FFN(x) = ReLU(xW1 + b1)W2 + b2
</span></span><span style="display:flex;"><span>              512→2048   2048→512
</span></span></code></pre></div><p>Code, <code>model.py:72-80</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">FFN</span>(nn<span style="color:#f92672">.</span>Module):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">__init__</span>(self, d_model, d_ff, dropout<span style="color:#f92672">=</span><span style="color:#ae81ff">0.1</span>):
</span></span><span style="display:flex;"><span>        super()<span style="color:#f92672">.</span>__init__()                    <span style="color:#75715e"># required before assigning submodules</span>
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>fc1 <span style="color:#f92672">=</span> nn<span style="color:#f92672">.</span>Linear(d_model, d_ff)     <span style="color:#75715e"># expand</span>
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>fc2 <span style="color:#f92672">=</span> nn<span style="color:#f92672">.</span>Linear(d_ff, d_model)     <span style="color:#75715e"># project back down</span>
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>dropout <span style="color:#f92672">=</span> nn<span style="color:#f92672">.</span>Dropout(dropout)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">forward</span>(self, x):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> self<span style="color:#f92672">.</span>fc2(self<span style="color:#f92672">.</span>dropout(F<span style="color:#f92672">.</span>relu(self<span style="color:#f92672">.</span>fc1(x))))
</span></span></code></pre></div><p>The <strong>Residual Connection</strong> is a core reason Transformers can be stacked to 100 layers:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>x = x + Sublayer(x)
</span></span></code></pre></div><p>The gradient can reach early layers directly along the residual path <code>x</code>, without passing through the sublayer's matrix multiplications. That is the &quot;highway&quot;.</p>
<p><strong>LayerNorm</strong> normalizes each position's feature vector (the d_model dimension) independently, removing numerical drift between layers:</p>
$$\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta$$<p>This code uses <strong>Pre-LN</strong> (the modern convention): Norm first, Sublayer after:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>Original paper, Vaswani 2017 (Post-LN):
</span></span><span style="display:flex;"><span>x = LayerNorm(x + Sublayer(x))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>Modern implementations (Pre-LN, more stable):
</span></span><span style="display:flex;"><span>x = x + Sublayer(LayerNorm(x))
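</span></span></code></pre></div><p>The difference: in Post-LN, the residual gradient has to pass through a LayerNorm at every layer on its way back to the early layers, so no path is left untouched; as the stack gets deeper, training tends to become unstable and usually needs careful learning-rate warmup. Pre-LN keeps an identity path through the residual that bypasses the Norm, which is why deep stacks (on the order of 100 layers) are much easier to train with Pre-LN than with Post-LN.</p>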
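<p>The two arrangements differ only in where the normalization sits relative to the residual add. The sketch below is plain Python on a single feature vector, with gamma fixed to 1 and beta to 0; <code>sublayer</code> is a hypothetical stand-in for attention or the FFN, not the code from <code>model.py</code>.</p>

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one token's feature vector to zero mean and unit variance (gamma=1, beta=0)."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def sublayer(x):
    """Hypothetical stand-in for attention or the FFN: any function of the vector."""
    return [0.5 * v + 1.0 for v in x]


def post_ln(x):
    """Original arrangement: normalize after adding the residual."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln(x):
    """Pre-LN arrangement: normalize only the sublayer input; x itself passes through untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


x = [1.0, 2.0, 4.0, 9.0]
print(post_ln(x))
print(pre_ln(x))
```

<p>In <code>pre_ln</code> the output is literally <code>x</code> plus a correction, so the identity path survives every layer; in <code>post_ln</code> even the residual term is rescaled by the norm.</p>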
<p>The corresponding code, <code>model.py:97-99</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Pre-LN: norm on the inside, residual on the outside</span>
</span></span><span style="display:flex;"><span>x <span style="color:#f92672">=</span> x <span style="color:#f92672">+</span> self<span style="color:#f92672">.</span>dropout1(self<span style="color:#f92672">.</span>self_attn(self<span style="color:#f92672">.</span>norm1(x), self<span style="color:#f92672">.</span>norm1(x), self<span style="color:#f92672">.</span>norm1(x), mask))
</span></span><span style="display:flex;"><span>x <span style="color:#f92672">=</span> x <span style="color:#f92672">+</span> self<span style="color:#f92672">.</span>dropout2(self<span style="color:#f92672">.</span>ffn(self<span style="color:#f92672">.</span>norm2(x)))
</span></span></code></pre></div><p>That covers the parts one by one. Next they are assembled, to see how a whole sentence passes through the Encoder and the Decoder.</p>
<h2 id="6-encoder-decoder-全流程">6. Encoder-Decoder End to End</h2>
<p><img alt="Transformer Architecture" loading="lazy" src="/tf/transformer_full_arch.png">
<em>Source: Wikipedia / &ldquo;Attention Is All You Need&rdquo; (Vaswani et al., 2017)</em></p>
<p>The data flow, stage by stage:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>                    ┌─ Encoder ─┐              ┌─ Decoder ──────────┐
</span></span><span style="display:flex;"><span>src: &#34;die katze&#34;    Embed+PosEn               Embed+PosEn            tgt: &#34;&lt;sos&gt; the cat&#34;
</span></span><span style="display:flex;"><span>       │              │                            │                      │
</span></span><span style="display:flex;"><span>       ▼              ▼                            ▼                      ▼
</span></span><span style="display:flex;"><span>  [2, 10]        [2, 10, 512]                [2, 11, 512]          [2, 11]
</span></span><span style="display:flex;"><span>                     │                            │
</span></span><span style="display:flex;"><span>                     ▼                            ▼
</span></span><span style="display:flex;"><span>              N×EncoderLayer               N×DecoderLayer
</span></span><span style="display:flex;"><span>              ┌ Self-Attn ─┐              ┌ Masked Self-Attn ─┐
</span></span><span style="display:flex;"><span>              │   + FFN    │              │    Cross-Attn     │ ← Q from decoder
</span></span><span style="display:flex;"><span>              │            │              │    + FFN          │ ← K/V from encoder
</span></span><span style="display:flex;"><span>              └────────────┘              └───────────────────┘
</span></span><span style="display:flex;"><span>                     │                            │
</span></span><span style="display:flex;"><span>                     ▼                            ▼
</span></span><span style="display:flex;"><span>              Final LayerNorm              Final LayerNorm
</span></span><span style="display:flex;"><span>                     │                            │
</span></span><span style="display:flex;"><span>                     │                            ▼
</span></span><span style="display:flex;"><span>   src_mask──────────┼─→ cross_attn ──→ Linear(d_model → vocab)
</span></span><span style="display:flex;"><span>                     │                            │
</span></span><span style="display:flex;"><span>                                                  ▼
</span></span><span style="display:flex;"><span>                                            [2, 11, 32000]  ← logits
</span></span></code></pre></div><p>Tensor shapes at each stage (d_model=512, vocab=32000, sentence lengths src=10, tgt=11):</p>
<table>
  <thead>
      <tr>
          <th>Step</th>
          <th>Shape</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Embedding(src)</td>
          <td>(B, 10, 512)</td>
          <td>token → 512-dim vector</td>
      </tr>
      <tr>
          <td>×√512 + PositionalEncoding</td>
          <td>(B, 10, 512)</td>
          <td>Adds position information</td>
      </tr>
      <tr>
          <td>EncoderLayer × N</td>
          <td>(B, 10, 512)</td>
          <td>N layers of Self-Attn + FFN</td>
      </tr>
      <tr>
          <td>Final LayerNorm</td>
          <td>(B, 10, 512)</td>
          <td>Normalization</td>
      </tr>
      <tr>
          <td>Handoff to the decoder</td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>Embedding(tgt)</td>
          <td>(B, 11, 512)</td>
          <td>The target side is embedded too</td>
      </tr>
      <tr>
          <td>×√512 + PositionalEncoding</td>
          <td>(B, 11, 512)</td>
          <td></td>
      </tr>
      <tr>
          <td>DecoderLayer × N</td>
          <td>(B, 11, 512)</td>
          <td>Self-Attn(masked) + Cross-Attn + FFN</td>
      </tr>
      <tr>
          <td>Final LayerNorm</td>
          <td>(B, 11, 512)</td>
          <td></td>
      </tr>
      <tr>
          <td>Linear(512 → 32000)</td>
          <td>(B, 11, 32000)</td>
          <td>Projection onto the vocabulary</td>
      </tr>
  </tbody>
</table>
<blockquote>
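<p><strong>Key insight: the feature dimension never changes.</strong> From Embedding to the Final LayerNorm, every layer outputs <code>(B, T, d_model)</code>. There is no &quot;compress into a hidden state, then unpack&quot; step, and that is the most fundamental architectural difference between the Transformer and the LSTM.</p>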
</blockquote>
<p>Compared with the LSTM (the Phase 1/2 Encoder-Decoder):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>                 LSTM                              Transformer
</span></span><span style="display:flex;"><span>                 ────                              ───────────
</span></span><span style="display:flex;"><span>Encoder:  src → [LSTM × N] → hidden         src → [Self-Attn × N] → enc_out
</span></span><span style="display:flex;"><span>          (B,10,256) → (2, B, 256)          (B,10,512) → (B,10,512)   ← preserved!
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>Decoder:  hidden → [LSTM × N] → logits      tgt → [Self/Cross-Attn × N] → logits
</span></span><span style="display:flex;"><span>          (2, B, 256) → (B, 11, 32000)      (B,11,512) → (B,11,512) → (B,11,32000)
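</span></span></code></pre></div><p>The LSTM's <code>hidden</code> is <code>(num_layers, B, hidden_size)</code>, a <strong>fixed-size</strong> vector. The meaning of 50 words is compressed into 256 floats per layer, and the decoder has to unpack the translation from that archive word by word.</p>
<p>The Transformer's <code>enc_out</code> is <code>(B, S, d_model)</code>, a matrix <strong>laid out by position</strong>. The decoder's Cross-Attention can look directly at every position of the source sentence instead of guessing from a compressed archive.</p>
<p>This is the root reason &quot;association&quot; works better than &quot;compression&quot;.</p>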
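<p>How a decoder position &quot;looks up&quot; source positions can be shown with a toy single-head attention. This is a sketch in plain Python with made-up numbers; it omits the learned <code>W_q/W_k/W_v</code> projections, masking, and multiple heads, and <code>attention</code> is a hypothetical helper rather than the function in <code>model.py</code>.</p>

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on plain lists of vectors."""
    d_k = len(keys[0])
    out, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out, weights


# Toy numbers, not real model activations: 3 source positions, 2 target positions, d = 2.
enc_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec_state = [[4.0, 0.0], [0.0, 4.0]]

context, weights = attention(dec_state, enc_out, enc_out)
print(len(context), len(context[0]))   # 2 2: one context vector per target position
print(len(weights[0]))                 # 3: each target position scores every source position
```

<p>The number of rows follows the target length and the number of weights per row follows the source length, which is why the two sentences may have different lengths.</p>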
<p>In code, <code>model.py:189-191</code>, the whole flow is condensed into three lines:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">Transformer</span>(nn<span style="color:#f92672">.</span>Module):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">forward</span>(self, src, tgt, src_len):
</span></span><span style="display:flex;"><span>        enc_out, src_mask <span style="color:#f92672">=</span> self<span style="color:#f92672">.</span>encoder(src, src_len)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> self<span style="color:#f92672">.</span>decoder(tgt, enc_out, src_mask)
</span></span></code></pre></div><p>The key to Cross-Attention (<code>model.py:143-144</code>): Q comes from the decoder, K/V come from the encoder:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># the decoder's cross-attention:</span>
</span></span><span style="display:flex;"><span>x <span style="color:#f92672">=</span> x <span style="color:#f92672">+</span> self<span style="color:#f92672">.</span>dropout2(self<span style="color:#f92672">.</span>cross_attn(
</span></span><span style="display:flex;"><span>    self<span style="color:#f92672">.</span>norm2(x),   <span style="color:#75715e"># ← Q: decoder 的当前状态（&#34;我想翻译出什么&#34;）</span>
</span></span><span style="display:flex;"><span>    enc_out,         <span style="color:#75715e"># ← K: encoder 的输出（&#34;源句子里有什么&#34;）</span>
</span></span><span style="display:flex;"><span>    enc_out,         <span style="color:#75715e"># ← V: encoder 的输出（&#34;源句子的语义&#34;）</span>
</span></span><span style="display:flex;"><span>    src_mask         <span style="color:#75715e"># ← 屏蔽源句子的 padding</span>
</span></span><span style="display:flex;"><span>))
</span></span></code></pre></div><p>That covers the forward data flow. Next: how training gets the model to &quot;learn&quot; to translate.</p>
<h2 id="7-训练三件套">7. The Training Trio</h2>
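<p><strong>Teacher Forcing</strong>: during training the decoder is not fed its own predictions; it is fed <strong>the previous token of the correct answer</strong>. When decoding <code>&quot;&lt;sos&gt; the cat&quot;</code>, step one takes <code>&lt;sos&gt;</code> and should output <code>the</code>, and step two takes <code>the</code> and should output <code>cat</code>. The input at every step comes from the ground-truth target sequence, not from what the model generated.</p>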
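<p>The shift between decoder input and training labels can be sketched with a plain list. The token ids below are hypothetical and chosen only for this illustration; the real script works on batched tensors.</p>
<pre tabindex="0"><code class="language-python" data-lang="python"># Hypothetical token ids, chosen only for this illustration.
SOS, EOS = 1, 2
THE, CAT = 11, 12

tgt = [SOS, THE, CAT, EOS]   # full ground-truth target sequence

decoder_input = tgt[:-1]     # [SOS, THE, CAT]: what the decoder is fed
labels = tgt[1:]             # [THE, CAT, EOS]: what it should predict

# Each step sees the true previous token and is scored on the true next one.
pairs = list(zip(decoder_input, labels))
print(pairs)
</code></pre>
<p>Because every input token comes from the ground truth, all positions can be computed in one parallel pass instead of waiting for the model's own output step by step.</p>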
<p>Code (<code>train_wmt14.py:182</code>):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>logits <span style="color:#f92672">=</span> model(src, tgt[:, :<span style="color:#f92672">-</span><span style="color:#ae81ff">1</span>], src_len)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># tgt[:, :-1] = &#34;&lt;sos&gt; the cat&#34;  →  decoder input; expected output = &#34;the cat &lt;/s&gt;&#34;</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># tgt[:, 1:]  = &#34;the cat &lt;/s&gt;&#34;   →  target for the loss</span>
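</span></span></code></pre></div><p><strong>Label Smoothing</strong>: do not let the model be 100% certain of the correct answer. With a smoothing factor of 0.1, the target probability of the correct token drops from 1.0 to about 0.9 and the remaining 0.1 is spread across the vocabulary. This discourages overconfidence and tends to improve generalization.</p>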
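<p>As a rough sketch of what the smoothed target looks like, the snippet below builds it by hand for a toy vocabulary of 5 tokens. It assumes the uniform-mixture formulation, in which the smoothing mass is shared by all classes including the correct one, so the correct class ends up at 0.9 + 0.1/K rather than exactly 0.9. The helper names <code>smoothed_targets</code> and <code>cross_entropy</code> are made up for this example.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import math

def smoothed_targets(true_index, num_classes, eps=0.1):
    """Mix the one-hot target with a uniform distribution over all classes."""
    uniform = eps / num_classes
    target = [uniform] * num_classes
    target[true_index] += 1.0 - eps
    return target

def cross_entropy(probs, target):
    """Cross-entropy between a predicted distribution and a soft target."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))

target = smoothed_targets(true_index=2, num_classes=5, eps=0.1)
print(target)

# A prediction that matches the soft target versus an overconfident one.
matched = cross_entropy(target, target)
overconfident = cross_entropy([0.001, 0.001, 0.996, 0.001, 0.001], target)
print(matched, overconfident)
</code></pre>
<p>The overconfident prediction receives the larger loss, so the optimum is no longer &quot;push the correct logit toward infinity&quot;.</p>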
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>criterion <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>nn<span style="color:#f92672">.</span>CrossEntropyLoss(ignore_index<span style="color:#f92672">=</span><span style="color:#ae81ff">0</span>, label_smoothing<span style="color:#f92672">=</span><span style="color:#ae81ff">0.1</span>)
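</span></span></code></pre></div><p><strong>Warmup Scheduler</strong>: at the start of training the learning rate grows linearly from 0 to its peak, then decays in proportion to the inverse square root of the step number. Too large a rate in the first steps can blow up the gradients; too small a rate throughout makes convergence slow. Warmup provides a &quot;slow start&quot; safety buffer.</p>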
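<p>A minimal sketch of the schedule as a plain function, to show the rise-then-decay shape. The function name <code>noam_lr</code> and the default <code>warmup=4000</code> are assumptions for illustration; <code>d_model=512</code> matches the model dimensions used in this article.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">def noam_lr(step_num, d_model=512, warmup=4000):
    """Learning rate at a given step (step_num starts at 1, not 0)."""
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup ** -1.5)

# The rate rises linearly during warmup, peaks at step == warmup, then decays.
for step in (1, 1000, 4000, 16000, 64000):
    print(step, noam_lr(step))
</code></pre>
<p>Before <code>warmup</code> the second argument of <code>min</code> is the smaller one, which gives the linear ramp; after it the first argument takes over, which gives the inverse-square-root decay.</p>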
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># step_num starts at 1; warmup is the number of warmup steps</span>
</span></span><span style="display:flex;"><span>lr <span style="color:#f92672">=</span> d_model<span style="color:#f92672">**</span>(<span style="color:#f92672">-</span><span style="color:#ae81ff">0.5</span>) <span style="color:#f92672">*</span> min(step_num<span style="color:#f92672">**</span>(<span style="color:#f92672">-</span><span style="color:#ae81ff">0.5</span>), step_num <span style="color:#f92672">*</span> warmup<span style="color:#f92672">**</span>(<span style="color:#f92672">-</span><span style="color:#ae81ff">1.5</span>))
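</span></span></code></pre></div><h2 id="8-代码全图modelpy-逐块标注">8. The Full Code Map: model.py Annotated Block by Block</h2>
<p>The modules have now been covered one by one. This is the <strong>condensed quick reference</strong>, ordered by execution flow, with the line numbers and key tensor flow for each module. It does not replace the earlier sections; it is there so you can locate the code at a glance when you come back to it.</p>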
<pre tabindex="0"><code>model.py overview (191 lines)
═══════════════════════════════════════

[L16-35]  PositionalEncoding
  pe[pos, 2i]   = sin(pos / 10000^(2i/d))
  pe[pos, 2i+1] = cos(pos / 10000^(2i/d))
  → register_buffer, not learned

[L41-65]  MultiHeadAttention
  W_q, W_k, W_v, W_o: four independent linear projections
  view+transpose: (B, T, 512) → (B, 8, T, 64)
  Q·K^T / √64 → softmax → ×V → concat → W_o

[L72-80]  FFN
  fc1: 512→2048 (ReLU) → dropout → fc2: 2048→512
  applied to each position independently, shared parameters

[L87-100] EncoderLayer
  norm1 → Self-Attention → +residual (dropout)
  norm2 → FFN          → +residual (dropout)

[L103-120] Encoder
  Embedding → ×√d_model → +PositionalEncoding
  → EncoderLayer × N → Final LayerNorm
  → return (enc_out, src_mask)

[L127-147] DecoderLayer
  norm1 → Masked Self-Attention      → +residual
  norm2 → Cross-Attention(Q=dec,KV=enc) → +residual
  norm3 → FFN                         → +residual

[L150-176] Decoder
  Embedding → ×√d_model → +PositionalEncoding
  → DecoderLayer × N → Final LayerNorm
  → Linear(512→32000) → logits

[L183-191] Transformer
  enc_out, src_mask = encoder(src, src_len)
  return decoder(tgt, enc_out, src_mask)
</code></pre><h2 id="9-从-nmt-到-gpt">9. From NMT to GPT</h2>
<p>The Encoder-Decoder architecture suits machine translation, which has a clear &quot;source&quot; and &quot;target&quot;. LLMs in the GPT family, however, use only the Decoder.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>Encoder-Decoder (translation):
</span></span><span style="display:flex;"><span>src → [Encoder: fully connected Attention] → enc_out
</span></span><span style="display:flex;"><span>tgt → [Decoder: masked Self-Attn + Cross-Attn(enc)] → logits
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>Decoder-only (GPT):
</span></span><span style="display:flex;"><span>input → [Decoder: masked Self-Attention] → logits
</span></span><span style="display:flex;"><span>        ↑
</span></span><span style="display:flex;"><span>        the causal mask is all there is
</span></span><span style="display:flex;"><span>        (Cross-Attention is removed, since there is no encoder)
</span></span></code></pre></div><p>Delete Cross-Attention, drop the Encoder, and the Transformer reduces to GPT. Generation proceeds one token at a time, and the <code>causal mask</code> hides every token after the current position.</p>
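<p>A causal mask is just a lower-triangular matrix. The dependency-free sketch below builds one and applies a single row of it before the softmax; the helper names <code>causal_mask</code> and <code>masked_softmax</code> are hypothetical, and a real implementation would do this on tensors for all rows at once.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import math

def causal_mask(size):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[1] * (i + 1) + [0] * (size - i - 1) for i in range(size)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores; masked positions get zero weight."""
    kept = [s if m else float("-inf") for s, m in zip(scores, mask_row)]
    top = max(kept)  # finite, because the diagonal is never masked
    exps = [math.exp(s - top) for s in kept]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
for row in mask:
    print(row)

# Position 1 may only see positions 0 and 1, however large later scores are.
weights = masked_softmax([0.5, 2.0, 9.0, 9.0], mask[1])
print(weights)
</code></pre>
<p>The mask rows are <code>[1, 0, 0, 0]</code>, <code>[1, 1, 0, 0]</code>, <code>[1, 1, 1, 0]</code>, <code>[1, 1, 1, 1]</code>. Setting the masked scores to negative infinity before the softmax gives the future positions exactly zero attention weight.</p>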
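<p>That is the Transformer story: from an Encoder-Decoder design aimed at translation to the shared backbone of mainstream LLMs. The reason is not that &quot;translation&quot; is a special task. It is that the combination of <strong>Self-Attention + Residual + LayerNorm</strong> happens to form a general-purpose sequence processor that can be stacked stably to great depth.</p>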
<blockquote>
<p><strong>Three takeaways:</strong></p>
<ol>
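<li>An RNN is a single-tube dropper: the gradient path is O(n) and information squeezes through one hidden-state tube token by token. A Transformer is a multichannel pipette: O(1) path length, drawing in a whole row fully in parallel.</li>
<li>The engine is <strong>Self-Attention</strong> (softmax(Q·K^T/√d)·V), assembled with <strong>Multi-Head</strong>, <strong>Pre-LN Residual</strong>, and <strong>sin/cos Positional Encoding</strong>.</li>
<li>Encoder-Decoder is the translation-specialized variant: remove Cross-Attn and the Encoder and you have GPT.</li>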
</ol>
</blockquote>
<hr>
<blockquote>
<p><strong>License: GPLv3</strong>
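The 《Transformer 从零到一》 (Transformer from Zero to One) series is released and distributed as open source under the GNU General Public License v3.0. Copying, modification, and distribution in any form are permitted, provided the same open-source license is carried forward, acknowledging every iteration and mutation in the compute universe.</p>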
</blockquote>
]]></content:encoded>
    </item>
  </channel>
</rss>
