<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Distributed Training on Yan Bai</title><link>https://iseekyan.github.io/categories/distributed-training/</link><description>Recent content in Distributed Training on Yan Bai</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Mon, 18 May 2026 21:50:00 +0800</lastBuildDate><atom:link href="https://iseekyan.github.io/categories/distributed-training/index.xml" rel="self" type="application/rss+xml"/><item><title>Making Long-Context MoE RL Training Easier to Tune: Optimization Practice in Megatron-Lite / bumblebee</title><link>https://iseekyan.github.io/posts/qwen35-long-sequence-moe-rl/</link><pubDate>Mon, 18 May 2026 21:50:00 +0800</pubDate><guid>https://iseekyan.github.io/posts/qwen35-long-sequence-moe-rl/</guid><description>&lt;p>Long-context training is no longer a pretraining-only problem. In RL, prompts, rollouts, rewards, tool calls, and multi-turn interactions all push context length upward. The question many teams care about has also changed: can they start with a modest number of GPUs and reliably run 128K, 256K, or even longer-context algorithm experiments?&lt;/p>
&lt;p>RL uses resources differently from pretraining. Pretraining usually optimizes for large-scale, stable, long-running throughput. RL often needs many experiments in parallel. When users have more GPUs, they usually want to run several reward, rollout, data-mixture, or hyperparameter experiments at the same time, instead of putting every GPU into one training job. A long-context RL training system first needs a low-friction, reusable starting point with few tuning requirements.&lt;/p></description></item><item><title>FSDP, PP, CP, and EP: Four Parallel Dimensions in Large-Scale Training</title><link>https://iseekyan.github.io/posts/fsdp-pp-cp-ep/</link><pubDate>Sun, 17 May 2026 01:10:00 +0800</pubDate><guid>https://iseekyan.github.io/posts/fsdp-pp-cp-ep/</guid><description>&lt;p>This is a sample English post for testing bilingual navigation and comments.&lt;/p>
&lt;p>Large-scale model training is usually not solved by one parallelism strategy alone. Instead, several dimensions are composed:&lt;/p>
&lt;ul>
&lt;li>FSDP shards parameters, gradients, and optimizer states to reduce model-state memory.&lt;/li>
&lt;li>Pipeline Parallelism splits the model's layers into stages placed on different devices.&lt;/li>
&lt;li>Context Parallelism shards long sequences across devices, mainly reducing activation memory pressure from attention.&lt;/li>
&lt;li>Expert Parallelism distributes MoE experts across devices, which is useful for models such as Qwen, DeepSeek, and Mixtral.&lt;/li>
&lt;/ul>
&lt;p>You can replace this sample with a real article later.&lt;/p></description></item></channel></rss>