<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Moe on Yan Bai</title><link>https://iseekyan.github.io/tags/moe/</link><description>Recent content in Moe on Yan Bai</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Mon, 18 May 2026 21:50:00 +0800</lastBuildDate><atom:link href="https://iseekyan.github.io/tags/moe/index.xml" rel="self" type="application/rss+xml"/><item><title>Making Long-Context MoE RL Training Easier to Tune: Optimization Practice in Megatron-Lite / bumblebee</title><link>https://iseekyan.github.io/posts/qwen35-long-sequence-moe-rl/</link><pubDate>Mon, 18 May 2026 21:50:00 +0800</pubDate><guid>https://iseekyan.github.io/posts/qwen35-long-sequence-moe-rl/</guid><description>&lt;p>Long-context training is no longer a pretraining-only problem. In RL, prompts, rollouts, rewards, tool calls, and multi-turn interactions all push context length upward. The question many teams care about has also changed: can they start with a modest number of GPUs and reliably run 128K, 256K, or even longer-context algorithm experiments?&lt;/p>
&lt;p>RL uses resources differently from pretraining. Pretraining usually optimizes for large-scale, stable, long-running throughput, whereas RL often needs many experiments running in parallel. When users have more GPUs, they usually want to run several reward, rollout, data-mixture, or hyperparameter experiments at the same time instead of putting every GPU into one training job. A long-context RL training system therefore first needs a low-friction, reusable starting point that requires little tuning.&lt;/p></description></item></channel></rss>