Phase 3 L2 Architecture Decision: Robust-TF vs RWKV-Lite
In our attempt to train on the Massive L2 14M dataset, we hit catastrophic model divergence by Epoch 3/4. The model produced repetitive garbage, and training crashed with loss=nan.
Experiment A: Robust-TF (Standard Transformer with AMP fixes)
We tested how PyTorch’s AMP handles nan and inf gradients. scaler.step() natively protects against them: if the inf check performed while unscaling finds any inf/nan gradient, it skips optimizer.step(). However, calling clip_grad_norm_ on gradients that contain inf turns those entries into nan, because the total norm is inf, the clip coefficient becomes 0, and inf × 0 = nan. scaler.step() still skips such a step, but we also added a custom check to spr_massive_l2_train.py that explicitly detects and logs these anomalies before clipping and skips the batch entirely.
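The behavior described above can be sketched as follows. This is a minimal illustration, not the code in spr_massive_l2_train.py: the names grads_are_finite, train_step, and demo_clip_on_inf, the logger, and the max_norm default are all hypothetical, and the loss is assumed to have been computed by the caller under autocast.

```python
import logging

import torch
from torch import nn

log = logging.getLogger("amp_guard")  # hypothetical logger name


def demo_clip_on_inf() -> torch.Tensor:
    """Show that clipping a gradient containing inf turns that entry into nan."""
    w = nn.Parameter(torch.ones(3))
    w.grad = torch.tensor([1.0, float("inf"), 2.0])
    # Total norm is inf, so the clip coefficient is 0 and inf * 0 = nan.
    torch.nn.utils.clip_grad_norm_([w], max_norm=1.0)
    return w.grad


def grads_are_finite(model: nn.Module) -> bool:
    """Return True if every existing gradient holds only finite values."""
    return all(
        torch.isfinite(p.grad).all().item()
        for p in model.parameters()
        if p.grad is not None
    )


def train_step(model, optimizer, scaler, loss, max_norm: float = 1.0) -> bool:
    """Run one AMP step. Returns False if the batch was skipped.

    `scaler` is expected to behave like a PyTorch GradScaler.
    """
    optimizer.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()
    # Unscale first so the check and the clipping see true gradient values.
    scaler.unscale_(optimizer)
    if not grads_are_finite(model):
        log.warning("non-finite gradient detected before clipping; skipping batch")
        optimizer.zero_grad(set_to_none=True)
        scaler.update()  # lets the scaler back off its scale factor
        return False
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    scaler.step(optimizer)
    scaler.update()
    return True
```

Checking before clipping keeps the log message meaningful: after clipping, an inf gradient has already become nan, so the original cause is no longer visible.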
The divergence was likely caused not by raw NaN injection but by finite yet pathologically large gradients from long-tail data, combined with a high learning rate (3e-4) and no warmup. Because such gradients are finite, the inf/nan skip in scaler.step() never triggers for them.
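One common way to add warmup is a linear ramp on the learning rate, sketched below with PyTorch's LambdaLR. The peak learning rate (1e-4), the warmup length (1000 steps), the AdamW optimizer, and the stand-in model are illustrative assumptions, not the values chosen for the final run.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR


def linear_warmup(warmup_steps: int):
    """Return a multiplier that ramps linearly from 1/warmup_steps up to 1.0."""
    def factor(step: int) -> float:
        return min(1.0, (step + 1) / warmup_steps)
    return factor


model = nn.Linear(8, 8)  # stand-in for the real model
# Assumed peak LR, lower than the 3e-4 that diverged.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = LambdaLR(optimizer, lr_lambda=linear_warmup(1000))

# In the training loop, call scheduler.step() once after each optimizer.step().
```

The first updates then use a small fraction of the peak rate, which limits how far a single outlier batch can move the weights early in training.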
Experiment B: RWKV-Lite
We implemented a minimal, pure-PyTorch RWKV-Lite decoder ($O(N)$) and successfully wired the frozen L1 manifold into its initial state ($h_0$). In our poisoning tests, RWKV remained stable and avoided the $O(N^2)$ softmax blowout.
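To illustrate what seeding the initial state means, here is a deliberately simplified linear-time recurrence. It is not the RWKV time-mixing formulation and not our RWKV-Lite implementation: the class name TinyRecurrentMixer, the decay parameterization, and the tensor shapes are assumptions made for the sketch.

```python
import torch
from torch import nn


class TinyRecurrentMixer(nn.Module):
    """Simplified O(N) recurrence with an externally supplied initial state."""

    def __init__(self, dim: int):
        super().__init__()
        self.decay_logit = nn.Parameter(torch.zeros(dim))
        self.in_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, h0: torch.Tensor):
        # x: (batch, seq, dim); h0: (batch, dim), e.g. a frozen L1 embedding.
        decay = torch.sigmoid(self.decay_logit)  # per-channel value in (0, 1)
        u = self.in_proj(x)
        h = h0
        outputs = []
        for t in range(x.size(1)):  # one pass over the sequence
            # Convex combination of the old state and the new input.
            h = decay * h + (1.0 - decay) * u[:, t]
            outputs.append(self.out_proj(h))
        return torch.stack(outputs, dim=1), h


# Usage: detach the (hypothetical) L1 output so no gradient reaches the frozen L1.
mixer = TinyRecurrentMixer(dim=4)
l1_state = torch.randn(2, 4).detach()
y, h_last = mixer(torch.randn(2, 5, 4), l1_state)
```

Each step touches only the current token and a fixed-size state, so cost grows linearly with sequence length and no softmax over pairwise scores is computed.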
Decision
We select Robust-TF (Exp-A2) for the final 14M Massive run.
Why? The standard Transformer is already fully integrated into our inference (Beam Search) and training pipelines. With the newly patched spr_massive_l2_train.py (explicit anomaly detection) and adjusted training hyperparameters (e.g., a lower learning rate and warmup), we expect to tame the divergence. RWKV, while elegant, would require a complete rewrite of the inference engine.