Phase 3 L2 Architecture Decision: Robust-TF vs RWKV-Lite

While training on the Massive L2 14M dataset, we hit catastrophic model divergence by Epoch 3/4. The model produced repetitive garbage output, and the run crashed with loss=nan.

Experiment A: Robust-TF (Standard Transformer with AMP fixes)

We tested how PyTorch's AMP handles nan and inf gradients. scaler.step() already protects against them: it skips optimizer.step() whenever the unscaled gradients contain inf or nan. However, calling clip_grad_norm_ on gradients that contain inf turns them into nan, because the total norm is inf, the clip coefficient becomes 0, and inf * 0 = nan. scaler.step() still skips such a step, but we added a custom check in spr_massive_l2_train.py that explicitly detects and logs these anomalies before clipping and skips the batch entirely.
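The check described above can be sketched as follows. This is a minimal illustration, not the code in spr_massive_l2_train.py: the helper names grads_are_finite and guarded_clip are hypothetical, and the toy nn.Linear model stands in for the real network. In an AMP loop, scaler.unscale_(optimizer) would need to run before this check so that the gradients are inspected at their true scale.

```python
import torch
from torch import nn


def grads_are_finite(parameters):
    """Return True only if no existing gradient contains inf or nan."""
    return all(
        torch.isfinite(p.grad).all().item()
        for p in parameters
        if p.grad is not None
    )


def guarded_clip(model, max_norm=1.0):
    """Clip gradients only when they are all finite.

    Returns (ok, total_norm). When ok is False, the caller should log the
    anomaly, zero the gradients, and skip the batch instead of clipping.
    """
    params = [p for p in model.parameters() if p.grad is not None]
    if not grads_are_finite(params):
        return False, float("nan")
    total_norm = nn.utils.clip_grad_norm_(params, max_norm)
    return True, float(total_norm)


# Toy demonstration with a hand-written inf gradient (hypothetical values).
model = nn.Linear(2, 1)
model.weight.grad = torch.tensor([[float("inf"), 1.0]])
model.bias.grad = torch.tensor([0.5])

ok, _ = guarded_clip(model)  # the guard refuses to clip

# Unguarded clipping on the same gradients: the inf entry becomes nan.
nn.utils.clip_grad_norm_(model.parameters(), 1.0)
has_nan_after_clip = bool(torch.isnan(model.weight.grad).any())
```

Checking before clipping keeps the logged anomaly accurate: once clip_grad_norm_ has run, an inf gradient is indistinguishable from one that was nan to begin with.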

The divergence itself was likely caused not by raw NaN injection but by finite, pathologically large gradients from long-tail data, combined with a high learning rate (3e-4) and no warmup. Because these gradients are finite, the inf/nan skip in scaler.step() never triggers for them.

Experiment B: RWKV-Lite

We implemented a minimal, pure PyTorch RWKV-Lite decoder ($O(N)$) and successfully wired the frozen L1 manifold into its initial state ($h_0$). In our poisoning tests, RWKV remained stable and avoided the $O(N^2)$ softmax blowout.
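To illustrate the idea of seeding a linear-time recurrent decoder with an external initial state, here is a deliberately simplified sketch. It is not the RWKV time-mixing formulation and not our RWKV-Lite code: TinyRecurrentDecoder, its layer names, and the random tensor standing in for the frozen L1 manifold output are all assumptions made for the example.

```python
import torch
from torch import nn


class TinyRecurrentDecoder(nn.Module):
    """Simplified per-channel decaying recurrence, O(N) in sequence length."""

    def __init__(self, d_model):
        super().__init__()
        self.key = nn.Linear(d_model, d_model, bias=False)
        self.value = nn.Linear(d_model, d_model, bias=False)
        self.receptance = nn.Linear(d_model, d_model, bias=False)
        self.decay_logit = nn.Parameter(torch.zeros(d_model))

    def forward(self, x, h0):
        # x: (batch, seq_len, d_model); h0: (batch, d_model)
        decay = torch.sigmoid(self.decay_logit)  # per-channel value in (0, 1)
        h = h0
        outputs = []
        for t in range(x.size(1)):
            xt = x[:, t]
            # One state update per token: no attention matrix, no softmax.
            h = decay * h + (1.0 - decay) * self.key(xt) * self.value(xt)
            outputs.append(torch.sigmoid(self.receptance(xt)) * h)
        return torch.stack(outputs, dim=1), h


torch.manual_seed(0)
decoder = TinyRecurrentDecoder(d_model=8)

# Stand-in for the frozen L1 output; detach() keeps gradients out of it.
l1_state = torch.randn(2, 8)
h0 = l1_state.detach()

x = torch.randn(2, 5, 8)
out, h_last = decoder(x, h0)
```

Each token costs a fixed amount of work, so the whole sequence is linear in its length, and the conditioning signal enters only through h0 rather than through cross-attention.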

Decision

We select Robust-TF (Exp-A2) for the final 14M Massive run. Why? The standard Transformer is already fully integrated into our inference (Beam Search) and training pipelines. With the newly patched spr_massive_l2_train.py, which adds explicit anomaly detection, and adjusted training hyperparameters (e.g., a lower learning rate and warmup), we can tame the divergence. RWKV, while elegant, would require a complete rewrite of the inference engine.
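The warmup adjustment mentioned above can be sketched as a simple linear ramp. The function names and the numbers used here (peak_lr=1e-4, warmup_steps=1000) are placeholders for illustration, not the values chosen for the final run.

```python
def warmup_factor(step, warmup_steps):
    """Linear warmup multiplier: ramps up to 1.0, then stays at 1.0."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, (step + 1) / warmup_steps)


def learning_rate(step, peak_lr=1e-4, warmup_steps=1000):
    """Effective learning rate at a given optimizer step (0-indexed)."""
    return peak_lr * warmup_factor(step, warmup_steps)


# Print the schedule at a few steps to inspect the ramp.
for step in (0, 249, 499, 999, 5000):
    print(step, learning_rate(step))
```

In a PyTorch training loop, a multiplier of this shape can be passed as lr_lambda to torch.optim.lr_scheduler.LambdaLR, with the optimizer constructed at the peak learning rate.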
