[{"content":"Long-context training is no longer a pretraining-only problem. In RL, prompts, rollouts, rewards, tool calls, and multi-turn interactions all push context length upward. The question many teams care about has also changed: can they start with a modest number of GPUs and reliably run 128K, 256K, or even longer-context algorithm experiments?\nRL uses resources differently from pretraining. Pretraining usually optimizes for large-scale, stable, long-running throughput. RL often needs many experiments in parallel. When users have more GPUs, they usually want to run several reward, rollout, data-mixture, or hyperparameter experiments at the same time, instead of putting every GPU into one training job. A long-context RL training system first needs a low-friction, reusable starting point with few tuning requirements.\nThis article uses Qwen3.5-35B-A3B / 32 x H100 / 128K as the running case. It represents a common request from long-context MoE users: can a 35B-class model run quickly on a few dozen GPUs, and can the resulting setup be simple enough for later algorithm work?\nThe article first walks through how this case moves from a Megatron-style baseline to a smaller recipe. Then it steps back to the engineering side. These optimizations can also be implemented in Megatron, but integrating 5D parallelism, FSDP2, full recompute, and chunked EP overlap has a large implementation surface. bumblebee’s primitive + Skill design breaks the work into smaller pieces that are easier for humans and agents to collaborate on.\n1. The problem: long-context MoE RL is tightly coupled 1.1 RL long-context training has different resource goals 35B / 32 x H100 / 128K looks like one training configuration. 
In practice, it pulls in a set of tightly coupled tuning choices:\nTuning choice Example values Main coupling points PP 1 / 2 / 4 / 8 Pipeline bubble, stage partitioning, batch shape TP 1 / 2 / 4 / 8 Sequence parallelism, collective communication, vocab partitioning EP 1 / 4 / 8 / 16 / 32 Number of experts, token dispatch, expert GEMM shape CP 1 / 2 / 4 / 8 Per-GPU sequence length, attention communication, KV head / group constraints recompute none / selective / full Activation memory and repeated compute offload off / selective Optimizer throughput and host-device bandwidth This article does not treat local sequence length as an independent tuning choice. It is the result of global sequence length, CP partitioning, TP/EP layout, and batch shape. Whether it lands at 16K, 32K, or 64K directly affects activations, attention, MoE buffers, kernel shapes, and CPU overhead.\nlocal seq is closely tied to compute efficiency. When the per-GPU sequence is too short, compute is fragmented, and kernel launch, runtime submit, MoE dispatch scheduling, and CPU-side overhead become more visible. With a longer per-GPU sequence, each GPU receives a larger contiguous block of work. Many fixed scheduling costs are amortized, and expert GEMM and attention are more likely to run in a stable regime. 
Long-context RL tuning is often about finding a reusable balance between memory pressure, local seq, exposed communication, and compute efficiency.\n1.2 Failed configurations usually come from several factors at once When a configuration OOMs, several low-level factors are usually pushed up together:\nStatic memory: parameters, gradients, optimizer states, all-gather buffers; Dynamic memory: activations, MoE buffers, logits / loss intermediates; Compute efficiency: local seq, expert token count, batch shape; Exposed communication: attention / CP communication, MoE dispatch / combine, parameter all-gather / reshard; CPU overhead: small kernels, small batch slices, frequent launches and scheduling. A configuration can also run but be slow. That often means dynamic memory has been addressed, while compute efficiency or exposed communication has become the bottleneck. This is the hard part in long-context RL: changing one parallelism option also changes memory, local seq, communication patterns, and kernel shapes. The optimizations below split the problem along these underlying factors.\n1.3 The real cost of a Megatron-style all-in-one system Megatron is a capable training system. It supports many models, parallel strategies, optimizers, checkpointing paths, recompute modes, kernel fusions, and legacy configuration paths. Many of the optimizations discussed here can be implemented in Megatron, and the lower-level capabilities often come from the Megatron and Transformer Engine ecosystem.\nThe cost mainly comes from the cognitive load of the all-in-one design. Model structure, parallel state, optimizer/checkpoint semantics, recompute scheduling, performance flags, and legacy compatibility paths are intertwined in one large system. When adding a feature or debugging a performance problem, a developer needs to understand a long call chain and a wide configuration surface. 
For an agent, that large context also makes task decomposition and validation harder.\nTwo examples from this article:\nIntegrating FSDP2 into Megatron 5D parallelism is far more than swapping an optimizer. It has to handle parameter sharding, grad reduce-scatter, optimizer state, PP/TP/EP/CP process groups, checkpoint resharding, parameter materialization under full recompute, and semantic alignment with the existing distributed optimizer. chunked EP overlap is also much more than splitting tokens into a few chunks. Under full recompute, the dependencies among forward, recompute forward, backward, delayed wgrad, dispatch, and combine all need to be redesigned. Too little change may produce no overlap; too much can break numerics or scheduling semantics. This work can be done in Megatron, but the implementation and validation surface is large. The next section walks through the optimization path first. The final section returns to bumblebee’s design and explains why primitive boundaries make these complex changes easier to split into incremental development and agent collaboration tasks.\n2. From baseline to a smaller recipe 2.1 Baseline: the usable point is already expensive In a conventional Megatron-style tuning space, a representative usable high point for 35B / 32 x H100 / 128K is:\nModel Baseline config TFLOPs/GPU Peak memory Qwen3.5-35B-A3B TP=2, PP=4, EP=8, CP=2, full recompute 127.53 42.91 GB This point already went through PP, EP, CP, and recompute filtering. The conventional path works, but continuing to move within the same high-dimensional space has poor marginal returns and high trial cost. To make the recipe easier to reuse, one important step is to decouple static memory from PP/EP-style parallel layouts.\n2.2 Path choice: accept full recompute and increase local seq The first problem in long-context MoE training is memory, but memory immediately pulls in compute efficiency, exposed communication, and CPU overhead. 
There are at least two reasonable ways to fight OOM.\nPath A: less recompute, smaller per-GPU sequence The first path avoids recompute as much as possible, or uses only a small amount of selective recompute. The benefit is direct: less repeated computation per step.\nThe cost is just as clear: activations remain in memory, so the memory budget is tighter. To fit a 128K global sequence, the per-GPU sequence often has to shrink, which means increasing CP. CP can partition the sequence dimension, but it introduces attention-side collective communication and is constrained by model structure such as KV heads / groups. For models like Qwen3.5, CP cannot grow without limit.\nThis path can still be optimized. For example, pipeline windows from adjacent microbatches can cover part of the CP communication. But it creates a chain reaction: fitting activations requires smaller local seq; smaller local seq makes compute efficiency and CPU overhead more sensitive; larger CP adds more attention-side communication and scheduling constraints.\nPath B: accept recompute and increase per-GPU sequence The second path accepts more aggressive recompute. It reduces resident activation memory and increases local seq.\nA larger local seq helps in several ways:\nEach GPU processes a longer contiguous block of work, so CPU launch / dispatch overhead is easier to amortize; CP has more headroom and does not have to hit model-structure limits immediately; expert GEMM, attention, and runtime submit are more likely to run in a stable regime; small overheads that would otherwise need careful tuning are covered by larger compute blocks. This path also has a cost. full recompute repeats part of the forward computation. In our training accounting, that is roughly 30% extra compute. Scheduling cannot remove that cost; it can only be offset by larger compute blocks and less exposed system overhead. 
MoE communication also does not disappear just because recompute is enabled.\nThis article chooses Path B. Long-context RL needs a good default starting point. If dynamic memory and CPU overhead are handled first, users do not have to blindly sweep PP/TP/EP/CP/recompute/offload from the beginning. For peak performance, the trade-off between less recompute and more recompute can still be tuned later.\nLow-level factor Path A: smaller local seq Path B: larger local seq Dynamic memory More activations kept resident Lower by default Static memory Still needs handling Still needs handling Compute efficiency Small local seq fragments work Overheads are easier to amortize Exposed communication Higher CP / attention pressure MoE communication still needs handling CPU overhead More visible Less visible Following Path B still leaves two tasks: static memory needs to be decoupled from PP/EP-style parallel choices, and MoE communication needs to be hidden inside compute windows as much as possible.\n2.3 Memory path: handle activations, logits, and static state separately Step 0: conventional parallelism and recompute reach 120+ Along Path B, conventional parallelism plus full recompute first gets 35B / 32 x H100 / 128K running:\nTP=2, PP=4, EP=8, CP=2, full recompute 127.53 TFLOPs/GPU, peak 42.91 GB This point shows that the basic path is viable, and also exposes the problem: even with distributed optimizer, static memory still stacks with the dynamic peak. If memory is handled only through PP/EP/CP, tuning falls back into the same coupled space.\nStep 1: use linear CE to remove the logits peak For long sequence length and large vocab, materializing full logits before cross-entropy can make the loss-side temporary tensors a major part of peak memory.\nTake the Qwen3.5 text vocab size of 248,320 as an example. In the baseline with TP=2, each vocab-parallel rank has about 124,160 vocab columns. With CP=2, each rank has 131,072 / 2 = 65,536 local tokens. 
If fp32 logits are materialized before loss:\nlocal_tokens * partition_vocab * 4 bytes = 65,536 * 124,160 * 4 = 32,547,799,040 bytes ~= 32.55 GB decimal ~= 30.31 GiB\nWithout vocab partitioning by TP, the same local token count would produce about 65.10 GB decimal. This scale makes logits / loss intermediates a primary peak-memory term.\nThe fix is linear CE: stream along the vocab dimension to compute loss and backward without keeping full logits in memory. This optimization has a cost: the loss layer itself becomes slower. In our measurements, loss-layer time is about 1.3x the original. The benefit is a much lower long-context memory peak. With this type of loss-side optimization, a representative 35B training point moves into the roughly 160 TFLOPs/GPU range:\nConfig TFLOPs/GPU Peak memory TP=1, PP=2, EP=8, CP=4, full recompute 162.07 55.91 GB The point here is to explain why the logits peak around loss must be handled separately. The numbers should not be read as a strict single-toggle on/off ablation. Without this fix, long-context dynamic memory repeatedly pushes configurations back to the edge of OOM.\nStep 2: use FSDP2 for static state linear CE handles the loss-side dynamic peak. Parameters, gradients, and optimizer states remain a harder static-memory problem.\nThe distributed optimizer in the baseline already helps, but it does not remove static memory pressure. It is closer to ZeRO-1 semantics: optimizer states are sharded, while model parameters and gradients still reside according to PP/TP/EP parallel layouts. 
At 128K sequence length, that static memory stacks with activations, logits, MoE buffers, and all-gather buffers at peak.\nFrom a training-state sharding perspective, Megatron distributed optimizer and FSDP2 can be understood through ZeRO semantics:\ndistributed optimizer is closer to ZeRO-1: optimizer states are sharded along the data-parallel dimension, but parameters and gradients still follow model-parallel layouts; FSDP2 is closer to ZeRO-3: parameters, gradients, and optimizer states are globally sharded, and parameters are all-gathered when the current FSDP unit computes. If nGPU is the total number of GPUs participating in training and n_param is the total number of model parameters, optimizer state under distributed optimizer can be roughly estimated as:\noptimizer state per GPU ~= n_param * 12 / nGPU The part that still needs to shrink is parameters and gradients. Since Qwen3.5 MoE parameters dominate, parameter and gradient memory under distributed optimizer is mainly tied to PP and EP:\ndist_opt parameter + grad per GPU ~= n_param * 6 / (PP * EP) FSDP2 aims to push parameters and gradients toward global sharding as well:\nFSDP2 parameter + grad per GPU ~= n_param * 6 / nGPU Static memory item distributed optimizer / ZeRO-1 view FSDP2 / ZeRO-3 target view optimizer state n_param * 12 / nGPU n_param * 12 / nGPU parameter + grad n_param * 6 / (PP * EP) n_param * 6 / nGPU Main difference Parameters and gradients remain tied to PP/EP-style model-parallel partitioning Parameters and gradients are also globally sharded Cost No extra parameter all-gather / grad reduce-scatter Requires parameter all-gather and grad reduce-scatter In the same main parallel setting, 35B, 32 x H100, 128K, PP=2, EP=8, CP=4, full recompute, FSDP2 lowers peak by about 8.88 GB with essentially no throughput loss:\nbackend optimizer semantics TFLOPs/GPU Peak memory distributed optimizer approx. ZeRO-1 162.07 55.91 GB FSDP2 approx. 
ZeRO-3 163.06 47.03 GB After parameters and gradients are sharded further, static-memory headroom increases noticeably. PP and EP no longer need to act as the default static-memory-saving tools. Their roles become cleaner:\nTuning choice Main role after FSDP2 CP Tune local seq so the per-GPU sequence lands in a good memory and compute-efficiency range EP Tune MoE compute and communication efficiency, avoiding cross-node all-to-all degradation PP Use for peak-performance or peak-memory tuning when needed, not as the default static-memory tool With this in mind, PP=1, EP=8, CP=4 becomes worth trying. In pure FSDP2 no-chunk mode, the historical high point for PP=1 is about 180.18 TFLOPs/GPU with 60.54 GB peak. After adding chunked EP overlap later, the same main configuration moves into the 37-38 GB peak range and approaches 190 TFLOPs/GPU.\nAt this point, the memory cost along Path B has been split into three parts: activations through full recompute, logits through linear CE, and static state through FSDP2. The remaining major problem is exposed MoE all-to-all communication.\n2.4 Communication path: chunked EP overlap In long-context MoE, compute is already dense, but expert-parallel token dispatch / combine all-to-all can still be exposed on the timeline.\nThe direct idea is 1F1B overlap: use compute windows from adjacent pipeline microbatches to cover dispatch / combine. But the memory path in this article uses full recompute by default, and the old 1F1B overlap path does not directly support it. Falling back to a low-recompute path just to reuse it would bring back the dynamic-memory, local seq, and CPU-overhead issues avoided earlier.\nSo we redesigned MoE EP A2A overlap to support full recompute. The mechanism is straightforward: split tokens into multiple chunks, execute communication and compute streams with an offset, and place dispatch / combine for neighboring chunks inside expert-compute windows when possible. 
Communication that was serialized and exposed can then be covered by adjacent chunk compute.\nTwo terms will appear repeatedly:\nno-chunk: one MoE layer dispatches all tokens in the current microbatch to experts at once, runs grouped GEMM, then combines everything back at once. The timeline can be simplified as dispatch -\u003e grouped GEMM -\u003e combine. chunk2: split the same MoE layer input into two token chunks. Each chunk has its own dispatch, expert compute, and combine. The scheduler offsets communication and compute streams so that combine from the previous chunk or dispatch from the next chunk can land inside a neighboring expert-compute window. Larger chunk counts, such as chunk3 or chunk4, can also be tried. More chunks are not automatically better. Too many chunks increase kernel inflation, runtime submit overhead, and scheduling overhead. They may also make expert GEMMs too small and reduce compute efficiency. This article focuses on the basic mechanism from no-chunk to chunk2, and why overlap becomes more valuable at longer sequence lengths.\nKernel schedule: the gain comes from real windows The forward no-chunk schedule can be simplified as:\ndispatch -\u003e grouped GEMM -\u003e combine\nWith chunk2, the communication stream executes c0 dispatch / c1 dispatch / c0 combine / c1 combine in order, while the compute stream executes c0 GEMM / c1 GEMM in order. The dependencies are:\nc0 dispatch -\u003e c0 GEMM -\u003e c0 combine\nc1 dispatch -\u003e c1 GEMM -\u003e c1 combine\nOn the key rank for 16K forward, the no-chunk measured total is 10.36 ms; chunk2 is 7.88 ms. This should not be interpreted as “splitting into two chunks hides half the communication.” In the actual trace, both c0 dispatch / c1 dispatch happen before the expert window. 
The visible forward overlap is c0 combine being pushed into c1 GEMM; c1 combine still forms the tail.\nBackward runs in the opposite semantic order: first combine-bwd, then expert backward compute, then dispatch-bwd. In the corresponding 16K backward window, no-chunk is 13.18 ms and chunk2 is 11.77 ms. The compute cannot be drawn as one large “expert backward” block. It needs to be split into dgrad and delayed wgrad. delayed wgrad must run after the last chunk’s dgrad, and it can overlap with the tail dispatch-bwd.\nWhen recompute forward and backward are fused, the gain also comes from less repeated compute. The fused path lets backward directly consume recompute expert output. Part of combine in recompute forward can be removed, and the recompute phase no longer needs to run the fc2 that would normally produce the MoE layer output. This removes a tail section of recompute forward that existed only to prepare intermediate results for backward.\nThis requires reorganizing forward / backward dependencies. The change is about where data is produced and consumed: recompute forward keeps only the expert intermediates backward actually needs, while combine and fc2 are either consumed by the fused backward path or skipped. This improves the alignment between communication and compute windows and reduces part of the repeated work in the full-recompute path.\nLonger sequence, larger overlap gain To isolate the local trend of MoE EP A2A overlap, we first look at a 35B-A3B sparse MoE single-layer proxy. This proxy covers forward + backward for one sparse MoE layer. It does not include attention, dense MLP, optimizer, or other transformer layers, so it should not be read as end-to-end train-step speedup. 
It is mainly used to explain how overlap changes with sequence length.\nseq no-chunk step ms optimized step ms speedup 4K 11.396 10.503 +7.84% 8K 15.988 14.625 +8.52% 16K 28.457 24.798 +12.86% 32K 54.412 44.375 +18.45% 64K 107.988 82.105 +23.97% Here speedup is the relative step-time reduction, (no-chunk - optimized) / no-chunk; for example, (11.396 - 10.503) / 11.396 = 7.84% at 4K. The trend is stable: the longer the sequence, the larger the communication window and the larger the gain from chunked EP overlap. Chunk count, EP size, and asymmetric chunk weights still need care. Too many chunks add kernel inflation and runtime submit overhead; too large an EP size may lose the gain to cross-node communication. The practical conclusion is specific: in long-context MoE, EP A2A needs scheduling-level treatment, and the chunk policy still depends on sequence length, EP size, and kernel shape.\n2.5 Combined result: from 4D search to a recipe After the memory path and communication path are both in place, the setup becomes viable:\nOOM risk is reduced: memory is handled separately by full recompute, linear CE, and FSDP2; Performance is high enough: exposed MoE all-to-all is handled by chunked EP overlap. The memory path answers whether the model can run. The communication path decides whether it runs fast enough. 
Together, they give a usable default point for long-context RL.\nThe 35B path can be read in the actual tuning order:\nStage Main config What changed Problem addressed Representative result Megatron baseline TP=2, PP=4, EP=8, CP=2 Conventional parallelism and full recompute tuning First usable high point 127.53 TFLOPs/GPU, 42.91 GB linear CE TP=1, PP=2, EP=8, CP=4 Avoid materializing full logits; loss layer about 1.3x slower Loss-side dynamic peak 162.07 TFLOPs/GPU, 55.91 GB FSDP2 TP=1, PP=2, EP=8, CP=4 Keep main parallelism unchanged; replace optimizer primitive Globally shard static state 163.06 TFLOPs/GPU, 47.03 GB better parallel TP=1, PP=1, EP=8, CP=4 Use static-memory headroom to reduce PP bubble Make PP=1 a usable high point 180.18 TFLOPs/GPU, 60.54 GB chunked EP TP=1, PP=1, EP=8, CP=4 Shorten MoE communication and buffer lifetime Handle exposed EP all-to-all 185.96-187.10 TFLOPs/GPU, 37-38 GB The table is mainly about tuning order: pin down the baseline; use linear CE for logits; show that FSDP2 lowers static memory under the same main parallelism; use the static-memory headroom from FSDP2 to choose a better PP=1, EP=8, CP=4; then add chunked EP overlap to address the exposed communication in nsys traces.\nOne detail matters: the 37-38 GB peak comes from FSDP2 plus chunked EP. It is not the memory result of pure FSDP2 no-chunk.\nIn the old space, users have to tune PP/TP/EP/CP together, and every choice affects memory, communication, and kernel shapes. 
The goal of this path is to reduce the default tuning space into a smaller recipe:\nTuning choice Default When to tune EP Use 8 for models \u003c=100B; use 32 for \u003e100B Choose by model size first CP Keep per-GPU seq around 16K-32K Derive from global sequence length PP Optional 2 Use for peak performance, lower peak memory, or higher local seq limit TP Remove by default Use only with a clear model or hardware reason The change can be summarized in three points:\nTP is removed from the default recipe; PP and EP are no longer treated as the main static-memory tools; FSDP2 handles static memory independently; The default recipe first chooses EP by model size, then chooses CP by target per-GPU sequence length; PP is tuned only when chasing peak performance. This figure is a summary, not a measured cell. It maps back to the low-level factors from the beginning: dynamic memory, CPU overhead, static memory, compute efficiency, and exposed communication each need a default treatment. Once those are covered separately, the recipe does not become another large set of tuning choices.\n3. Why bumblebee fits this kind of optimization and agent collaboration 3.1 Megatron can do it too; the integration cost differs linear CE, FSDP2, and chunked EP overlap can all be moved into heavier training frameworks. The underlying kernels and parallel semantics also rely heavily on the Megatron / Transformer Engine ecosystem.\nThe main difference is integration and validation cost.\nIntegrating FSDP2 into Megatron 5D parallelism means dealing with PP/TP/EP/CP/DP process groups, parameter all-gather, gradient reduce-scatter, optimizer state, distributed checkpointing, recompute, MoE expert parameters, embedding/head sharing paths, and existing distributed optimizer semantics. Any mismatch can show up as a loss, grad norm, peak memory, or step-time anomaly.\nchunked EP overlap is similar. It changes the timing relationship among MoE dispatch/combine and expert compute. 
Under full recompute, it also has to account for recompute forward / backward dependencies, delayed wgrad placement, buffer lifetime for different chunks, and synchronization points between communication and compute streams. It may look like a local schedule optimization, but it touches the full forward/backward semantics of the MoE primitive.\nThese complexities do not disappear under a new project name. bumblebee reduces the change boundary so developers and agents can focus on the current primitive instead of loading the whole all-in-one framework at once.\n3.2 Primitive boundaries reduce the change surface bumblebee is the internal codename for Megatron-Lite. It can also be seen as the library form that grew out of this lightweight training implementation. A model is composed from primitives rather than inheriting from a constantly expanding monolithic model class. Runtime owns the training protocol, Primitive owns replaceable capabilities, and Model chooses and composes them.\nFor the three optimizations in this article, the mapping is clear:\nlinear CE replaces a loss primitive; FSDP2 replaces an optimizer / state-sharding primitive; chunked EP overlap replaces a MoE communication primitive. Each replacement has relatively clear inputs, outputs, and validation criteria. Models gain capabilities by composing primitives, and optimizations can land along primitive boundaries.\n3.3 Paired baselines make local optimization easier to validate bumblebee keeps paired baselines. The old primitive and new primitive are interchangeable. With the same model, same input, and same parallel configuration, we can compare peak memory, step time, loss, logits, grad norm, and key intermediate tensors.\nThat matters for complex optimizations. FSDP2 changes the lifetime of parameters and optimizer state. chunked EP changes the schedule of communication and compute. Both can affect performance, memory, and numerics at the same time. 
A paired baseline gives each replacement a clear reference point. Developers do not have to guess where a change came from inside the full training stack, and agents can work in a small loop: read the Skill, modify the primitive, run paired tests, report the delta.\n3.4 Agent-native: small boundaries, Skills, and incremental tasks An agent has a hard time understanding a 30K-line all-in-one framework in one shot. It is much better at tasks with clear boundaries, explicit inputs and outputs, and fixed validation methods.\nbumblebee’s agent-native design maps to a concrete workflow:\nEach primitive can have a Skill explaining what it does, how to compose it, how to validate it, and what trade-offs it has; Each development task only needs to handle one local replacement, then validate it against a paired baseline; Models can grow incrementally instead of supporting every feature on day one. This is where bumblebee works well with agents. It does not make FSDP2 or chunked EP theoretically easier. It makes the engineering work smaller, more testable, and easier to roll back.\n4. Summary and next steps Long-context MoE RL training is hard to tune for concrete reasons: PP/TP/EP/CP/recompute/offload all change memory, communication, compute efficiency, and CPU overhead at the same time. The path in this article first accepts full recompute and increases local seq, using larger compute blocks to reduce the complexity of dynamic memory and host overhead. It then uses linear CE for the logits peak, FSDP2 for static state, and chunked EP overlap for exposed MoE all-to-all.\nThe default space users face is reduced from a 4D PP/TP/EP/CP search to a smaller recipe: choose EP by model size, choose CP by target per-GPU sequence length, and tune PP only when needed. 
This recipe does not remove all tuning work, but it splits the problem into clearer categories: activations, logits, static state, MoE communication, local seq, and compute efficiency.\nOn the engineering side, this article is about how complex optimizations land. FSDP2 integrated with 5D parallelism and chunked EP overlap under full recompute can both be implemented in Megatron, but the engineering complexity is high. bumblebee places these changes on primitive boundaries and pairs them with baselines and Skill-style knowledge, so humans and agents can understand, modify, validate, and compose them faster.\nSeveral directions remain:\nExpand the memory estimator. The current analysis covers activations, logits, static state, and MoE buffers. It should also include FSDP all-gather buffers, current-unit parameters, workspace, fragmentation, and rank-wise peaks. Refine the boundary between FSDP2 and distributed optimizer. FSDP2 saves more static memory but introduces parameter all-gather / gradient reduce-scatter. distributed optimizer is simpler and has less communication, but parameters and gradients remain tied to PP/EP. Turn FSDP2 + 5D parallel + chunked EP into a more general long-context pretraining / RL tuning guide, especially for 100B and 1T+ models where PP, EP, CP, FSDP, and TP boundaries change. Continue optimizing chunked EP, including reducing memory fragmentation, trying more fused expert kernels, and pushing dispatch / expert compute / combine scheduling toward a training-time mega MoE kernel. Continue tuning RL performance. The current default keeps local seq around 16K-32K to amortize CPU overhead and kernel launch cost, but RL workloads can still tune the trade-off among throughput, memory, and recompute for their specific task. 
Acknowledgements Thanks to Zijie Yan, Hongbin Liu, and Kunlun Li for their support, discussions, and review.\n","permalink":"https://iseekyan.github.io/posts/qwen35-long-sequence-moe-rl/","summary":"\u003cp\u003eLong-context training is no longer a pretraining-only problem. In RL, prompts, rollouts, rewards, tool calls, and multi-turn interactions all push context length upward. The question many teams care about has also changed: can they start with a modest number of GPUs and reliably run 128K, 256K, or even longer-context algorithm experiments?\u003c/p\u003e\n\u003cp\u003eRL uses resources differently from pretraining. Pretraining usually optimizes for large-scale, stable, long-running throughput. RL often needs many experiments in parallel. When users have more GPUs, they usually want to run several reward, rollout, data-mixture, or hyperparameter experiments at the same time, instead of putting every GPU into one training job. A long-context RL training system first needs a low-friction, reusable starting point with few tuning requirements.\u003c/p\u003e","title":"Making Long-Context MoE RL Training Easier to Tune: Optimization Practice in Megatron-Lite / bumblebee"},{"content":"Hi, I am Yan Bai.\nI currently work at NVIDIA, where I focus on large-scale training systems and RL infrastructure, especially Megatron-Core, distributed parallel training, MoE, long-context training, and system problems in reinforcement learning training frameworks.\nI contributed full Megatron-Core support to veRL, making it the first public RL framework with DeepSeek V3 support. I also distilled my Megatron-Core experience into mbridge, an open-source bridge project on GitHub.\nI built a Megatron memory estimator for quickly estimating training memory under different parallel configurations. 
A related introduction is available on the NVIDIA technical blog: Explore using the Megatron-Core training framework to improve GPU memory efficiency in large model training.\nThis blog collects notes on distributed training, RL infrastructure, model systems, experiments, and lessons learned from building and debugging real training stacks.\n","permalink":"https://iseekyan.github.io/about/","summary":"\u003cp\u003eHi, I am Yan Bai.\u003c/p\u003e\n\u003cp\u003eI currently work at NVIDIA, where I focus on large-scale training systems and RL infrastructure, especially Megatron-Core, distributed parallel training, MoE, long-context training, and system problems in reinforcement learning training frameworks.\u003c/p\u003e\n\u003cp\u003eI contributed full Megatron-Core support to \u003ca href=\"https://github.com/volcengine/verl\"\u003eveRL\u003c/a\u003e, making it the first public RL framework with DeepSeek V3 support. I also distilled my Megatron-Core experience into \u003ca href=\"https://github.com/ISEEKYAN/mbridge\"\u003embridge\u003c/a\u003e, an open-source bridge project on GitHub.\u003c/p\u003e\n\u003cp\u003eI built a \u003ca href=\"https://huggingface.co/spaces/ISEEKYAN/megatron_memory_estimator\"\u003eMegatron memory estimator\u003c/a\u003e for quickly estimating training memory under different parallel configurations. A related introduction is available on the NVIDIA technical blog: \u003ca href=\"https://developer.nvidia.cn/blog/explore-using-the-megatron-core-training-framework-to-improve-gpu-memory-efficiency-in-large-model-training/\"\u003eExplore using the Megatron-Core training framework to improve GPU memory efficiency in large model training\u003c/a\u003e.\u003c/p\u003e","title":"About Me"},{"content":"This is a sample English post for testing bilingual navigation and comments.\nLarge-scale model training is usually not solved by one parallelism strategy alone. 
Instead, several dimensions are composed:\nFSDP shards parameters, gradients, and optimizer states to reduce model-state memory. Pipeline Parallelism splits model depth across devices. Context Parallelism shards long sequences and mainly reduces attention activation pressure. Expert Parallelism distributes MoE experts across devices, which is useful for models such as Qwen, DeepSeek, and Mixtral. You can replace this sample with a real article later.\n","permalink":"https://iseekyan.github.io/posts/fsdp-pp-cp-ep/","summary":"\u003cp\u003eThis is a sample English post for testing bilingual navigation and comments.\u003c/p\u003e\n\u003cp\u003eLarge-scale model training is usually not solved by one parallelism strategy alone. Instead, several dimensions are composed:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eFSDP shards parameters, gradients, and optimizer states to reduce model-state memory.\u003c/li\u003e\n\u003cli\u003ePipeline Parallelism splits model depth across devices.\u003c/li\u003e\n\u003cli\u003eContext Parallelism shards long sequences and mainly reduces attention activation pressure.\u003c/li\u003e\n\u003cli\u003eExpert Parallelism distributes MoE experts across devices, which is useful for models such as Qwen, DeepSeek, and Mixtral.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eYou can replace this sample with a real article later.\u003c/p\u003e","title":"FSDP, PP, CP, and EP: Four Parallel Dimensions in Large-Scale Training"}]