Technical Notes

Notes on distributed training, RL systems, engineering practice, and papers.

Making Long-Context MoE RL Training Easier to Tune: Optimization Practice in Megatron-Lite / bumblebee

Long-context training is no longer a pretraining-only problem. In RL, prompts, rollouts, rewards, tool calls, and multi-turn interactions all push context length upward. The question many teams care about has also changed: can they start with a modest number of GPUs and reliably run 128K, 256K, or even longer-context algorithm experiments? RL uses resources differently from pretraining. Pretraining usually optimizes for large-scale, stable, long-running throughput. RL often needs many experiments in parallel. When users have more GPUs, they usually want to run several reward, rollout, data-mixture, or hyperparameter experiments at the same time, instead of putting every GPU into one training job. So a long-context RL training system first needs a low-friction, reusable starting point that requires little tuning. ...

FSDP, PP, CP, and EP: Four Parallel Dimensions in Large-Scale Training

This is a sample English post for testing bilingual navigation and comments. Large-scale model training is usually not solved by one parallelism strategy alone. Instead, several dimensions are composed: FSDP shards parameters, gradients, and optimizer states to reduce model-state memory. Pipeline Parallelism (PP) splits model depth across devices. Context Parallelism (CP) shards long sequences across devices and mainly reduces attention activation memory. Expert Parallelism (EP) distributes MoE experts across devices, which is useful for MoE models such as Mixtral, DeepSeek, and the Qwen MoE variants. You can replace this sample with a real article later. ...
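
To make the four dimensions concrete, here is a minimal Python sketch of what each one divides. It is illustrative arithmetic only, not the API of any framework. The names `ParallelPlan` and `per_rank_share` are hypothetical, and the layout rule (FSDP, PP, and CP tile the whole job, while EP reuses ranks inside one pipeline stage) is a simplifying assumption of this sketch. Real systems compose these groups in different ways.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelPlan:
    """Hypothetical container for the degree of each parallel dimension."""
    fsdp: int  # ranks that share one sharded copy of the model state
    pp: int    # pipeline stages (model depth is split)
    cp: int    # context-parallel ranks (the sequence is split)
    ep: int    # expert-parallel ranks (MoE experts are split)


def per_rank_share(plan, world_size, num_layers, seq_len, num_experts):
    """Return what a single rank holds under the assumed layout."""
    # Assumption of this sketch: FSDP, PP, and CP together tile the job.
    if plan.fsdp * plan.pp * plan.cp != world_size:
        raise ValueError("fsdp * pp * cp must equal world_size")
    # Assumption of this sketch: EP reuses the ranks of one pipeline stage.
    if (plan.fsdp * plan.cp) % plan.ep != 0:
        raise ValueError("ep must divide the ranks within one pipeline stage")
    for name, total, degree in (
        ("num_layers", num_layers, plan.pp),
        ("seq_len", seq_len, plan.cp),
        ("num_experts", num_experts, plan.ep),
    ):
        if total % degree != 0:
            raise ValueError(f"{name} must be divisible by its parallel degree")
    return {
        "layers_per_stage": num_layers // plan.pp,    # PP splits depth
        "tokens_per_rank": seq_len // plan.cp,        # CP splits the sequence
        "experts_per_rank": num_experts // plan.ep,   # EP splits the experts
        # Fraction of a stage's params, grads, and optimizer state per rank.
        "stage_state_fraction": 1.0 / plan.fsdp,      # FSDP splits model state
    }


# Example values are made up for illustration.
plan = ParallelPlan(fsdp=4, pp=2, cp=4, ep=8)
share = per_rank_share(plan, world_size=32, num_layers=48,
                       seq_len=131072, num_experts=64)
print(share)
```

With these made-up numbers, a 128K-token sequence on 32 ranks leaves each rank with 24 layers, 32,768 tokens, 8 experts, and a quarter of its stage's model state. Each dimension relieves a different kind of memory pressure, which is why they are combined rather than chosen one at a time.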