Making Long-Context MoE RL Training Easier to Tune: Optimization Practice in Megatron-Lite / bumblebee

Long-context training is no longer a pretraining-only problem. In RL, prompts, rollouts, rewards, tool calls, and multi-turn interactions all push context length upward. The question many teams care about has also changed: can they start with a modest number of GPUs and reliably run 128K, 256K, or even longer-context algorithm experiments?

RL uses resources differently from pretraining. Pretraining usually optimizes for large-scale, stable, long-running throughput. RL often needs many experiments in parallel. When users have more GPUs, they usually want to run several reward, rollout, data-mixture, or hyperparameter experiments at the same time, instead of putting every GPU into one training job. A long-context RL training system first needs a low-friction, reusable starting point with few tuning requirements.

This article uses Qwen3.5-35B-A3B / 32 x H100 / 128K as the running case. It represents a common request from long-context MoE users: can a 35B-class model run quickly on a few dozen GPUs, and can the resulting setup be simple enough for later algorithm work?

The article first walks through how this case moves from a Megatron-style baseline to a smaller recipe. Then it steps back to the engineering side. These optimizations can also be implemented in Megatron, but integrating 5D parallelism, FSDP2, full recompute, and chunked EP overlap has a large implementation surface. bumblebee’s primitive + Skill design breaks the work into smaller pieces that are easier for humans and agents to collaborate on.

1. The problem: long-context MoE RL is tightly coupled

1.1 RL long-context training has different resource goals

35B / 32 x H100 / 128K looks like one training configuration. In practice, it pulls in a set of tightly coupled tuning choices:

Tuning choice	Example values	Main coupling points
PP	1 / 2 / 4 / 8	Pipeline bubble, stage partitioning, batch shape
TP	1 / 2 / 4 / 8	Sequence parallelism, collective communication, vocab partitioning
EP	1 / 4 / 8 / 16 / 32	Number of experts, token dispatch, expert GEMM shape
CP	1 / 2 / 4 / 8	Per-GPU sequence length, attention communication, KV head / group constraints
recompute	none / selective / full	Activation memory and repeated compute
offload	off / selective	Optimizer throughput and host-device bandwidth

This article does not treat local sequence length as an independent tuning choice. It is the result of global sequence length, CP partitioning, TP/EP layout, and batch shape. Whether it lands at 16K, 32K, or 64K directly affects activations, attention, MoE buffers, kernel shapes, and CPU overhead.

[Figure: Config to factor map]

local seq is closely tied to compute efficiency. When the per-GPU sequence is too short, compute is fragmented, and kernel launch, runtime submit, MoE dispatch scheduling, and CPU-side overhead become more visible. With a longer per-GPU sequence, each GPU receives a larger contiguous block of work. Many fixed scheduling costs are amortized, and expert GEMM and attention are more likely to run in a stable regime. Long-context RL tuning is often about finding a reusable balance between memory pressure, local seq, exposed communication, and compute efficiency.

1.2 Failed configurations usually come from several factors at once

When a configuration OOMs, several low-level factors are usually pushed up together:

Static memory: parameters, gradients, optimizer states, all-gather buffers;
Dynamic memory: activations, MoE buffers, logits / loss intermediates;
Compute efficiency: local seq, expert token count, batch shape;
Exposed communication: attention / CP communication, MoE dispatch / combine, parameter all-gather / reshard;
CPU overhead: small kernels, small batch slices, frequent launches and scheduling.

A configuration can also run but be slow. That often means dynamic memory has been addressed, while compute efficiency or exposed communication has become the bottleneck. This is the hard part in long-context RL: changing one parallelism option also changes memory, local seq, communication patterns, and kernel shapes. The optimizations below split the problem along these underlying factors.

1.3 The real cost of a Megatron-style all-in-one system

Megatron is a capable training system. It supports many models, parallel strategies, optimizers, checkpointing paths, recompute modes, kernel fusions, and legacy configuration paths. Many of the optimizations discussed here can be implemented in Megatron, and the lower-level capabilities often come from the Megatron and Transformer Engine ecosystem.

The cost mainly comes from the cognitive load of the all-in-one design. Model structure, parallel state, optimizer/checkpoint semantics, recompute scheduling, performance flags, and legacy compatibility paths are intertwined in one large system. When adding a feature or debugging a performance problem, a developer needs to understand a long call chain and a wide configuration surface. For an agent, that large context also makes task decomposition and validation harder.

Two examples from this article:

Integrating FSDP2 into Megatron 5D parallelism is far more than swapping an optimizer. It has to handle parameter sharding, grad reduce-scatter, optimizer state, PP/TP/EP/CP process groups, checkpoint resharding, parameter materialization under full recompute, and semantic alignment with the existing distributed optimizer.
chunked EP overlap is also much more than splitting tokens into a few chunks. Under full recompute, the dependencies among forward, recompute forward, backward, delayed wgrad, dispatch, and combine all need to be redesigned. Too little change may produce no overlap; too much can break numerics or scheduling semantics.

This work can be done in Megatron, but the implementation and validation surface is large. The next section walks through the optimization path first. The final section returns to bumblebee’s design and explains why primitive boundaries make these complex changes easier to split into incremental development and agent collaboration tasks.

2. From baseline to a smaller recipe

2.1 Baseline: the usable point is already expensive

In a conventional Megatron-style tuning space, a representative usable high point for 35B / 32 x H100 / 128K is:

Model	Baseline config	TFLOPs/GPU	Peak memory
Qwen3.5-35B-A3B	`TP=2, PP=4, EP=8, CP=2, full recompute`	127.53	42.91 GB

[Figure: Megatron baseline snapshot]

This point already went through PP, EP, CP, and recompute filtering. The conventional path works, but continuing to move within the same high-dimensional space has poor marginal returns and high trial cost. To make the recipe easier to reuse, one important step is to decouple static memory from PP/EP-style parallel layouts.

2.2 Path choice: accept full recompute and increase local seq

The first problem in long-context MoE training is memory, but memory immediately pulls in compute efficiency, exposed communication, and CPU overhead. There are at least two reasonable ways to fight OOM.

Path A: less recompute, smaller per-GPU sequence

[Figure: Path A pressure]

The first path avoids recompute as much as possible, or uses only a small amount of selective recompute. The benefit is direct: less repeated computation per step.

The cost is just as clear: activations remain in memory, so the memory budget is tighter. To fit a 128K global sequence, the per-GPU sequence often has to shrink, which means increasing CP. CP can partition the sequence dimension, but it introduces attention-side collective communication and is constrained by model structure such as KV heads / groups. For models like Qwen3.5, CP cannot grow without limit.

This path can still be optimized. For example, pipeline windows from adjacent microbatches can cover part of the CP communication. But it creates a chain reaction: fitting activations requires smaller local seq; smaller local seq makes compute efficiency and CPU overhead more sensitive; larger CP adds more attention-side communication and scheduling constraints.

Path B: accept recompute and increase per-GPU sequence

[Figure: Path B pressure]

The second path accepts more aggressive recompute. It reduces resident activation memory and increases local seq.

A larger local seq helps in several ways:

Each GPU processes a longer contiguous block of work, so CPU launch / dispatch overhead is easier to amortize;
CP has more headroom and does not have to hit model-structure limits immediately;
expert GEMM, attention, and runtime submit are more likely to run in a stable regime;
small overheads that would otherwise need careful tuning are covered by larger compute blocks.

This path also has a cost. full recompute repeats part of the forward computation. In our training accounting, that is roughly 30% extra compute. Scheduling cannot remove that cost; it can only be offset by larger compute blocks and less exposed system overhead. MoE communication also does not disappear just because recompute is enabled.

This article chooses Path B. Long-context RL needs a good default starting point. If dynamic memory and CPU overhead are handled first, users do not have to blindly sweep PP/TP/EP/CP/recompute/offload from the beginning. For peak performance, the trade-off between less recompute and more recompute can still be tuned later.

Low-level factor	Path A: smaller local seq	Path B: larger local seq
Dynamic memory	More activations kept resident	Lower by default
Static memory	Still needs handling	Still needs handling
Compute efficiency	Small local seq fragments work	Overheads are easier to amortize
Exposed communication	Higher CP / attention pressure	MoE communication still needs handling
CPU overhead	More visible	Less visible

Following Path B still leaves two tasks: static memory needs to be decoupled from PP/EP-style parallel choices, and MoE communication needs to be hidden inside compute windows as much as possible.

2.3 Memory path: handle activations, logits, and static state separately

Step 0: conventional parallelism and recompute reach 120+ TFLOPs/GPU

Along Path B, conventional parallelism plus full recompute first gets 35B / 32 x H100 / 128K running:

TP=2, PP=4, EP=8, CP=2, full recompute
127.53 TFLOPs/GPU, peak 42.91 GB

This point shows that the basic path is viable, and also exposes the problem: even with distributed optimizer, static memory still stacks with the dynamic peak. If memory is handled only through PP/EP/CP, tuning falls back into the same coupled space.

Step 1: use linear CE to remove the logits peak

For long sequence length and large vocab, materializing full logits before cross-entropy can make the loss-side temporary tensors a major part of peak memory.

Take the Qwen3.5 text vocab size of 248,320 as an example. In the baseline with TP=2, each vocab-parallel rank has about 124,160 vocab columns. With CP=2, each rank has 131,072 / 2 = 65,536 local tokens. If fp32 logits are materialized before loss:

local_tokens * partition_vocab * 4 bytes
= 65,536 * 124,160 * 4
= 32.55 GB decimal ~= 30.31 GiB

Without vocab partitioning by TP, the same local token count would produce about 65.10 GB decimal. This scale makes logits / loss intermediates a primary peak-memory term.
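
The arithmetic above can be reproduced with a short helper. This is a minimal sketch: the function name `logits_bytes` and its parameters are illustrative and not part of any library, and it assumes fp32 logits, an even vocab split across TP ranks, and an even sequence split across CP ranks.

```python
def logits_bytes(global_seq, cp, vocab, tp, bytes_per_elem=4):
    """Bytes needed to materialize the full logits tensor on one rank."""
    local_tokens = global_seq // cp   # sequence dimension split by CP
    vocab_per_rank = vocab // tp      # vocab dimension split by TP
    return local_tokens * vocab_per_rank * bytes_per_elem


# Baseline from the text: 128K global sequence, CP=2, Qwen3.5 text vocab, TP=2.
with_tp = logits_bytes(131_072, cp=2, vocab=248_320, tp=2)
# Same local token count, but no vocab partitioning.
without_tp = logits_bytes(131_072, cp=2, vocab=248_320, tp=1)

print(f"TP=2: {with_tp / 1e9:.2f} GB decimal, {with_tp / 2**30:.2f} GiB")
print(f"TP=1: {without_tp / 1e9:.2f} GB decimal")
```

Changing `cp` or `tp` in the call shows how quickly this single tensor grows when local sequence length increases or vocab partitioning is removed.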

The fix is linear CE: stream along the vocab dimension to compute loss and backward without keeping full logits in memory. This optimization has a cost: the loss layer itself becomes slower. In our measurements, loss-layer time is about 1.3x the original. The benefit is a much lower long-context memory peak. With this type of loss-side optimization, a representative 35B training point moves into the roughly 160 TFLOPs/GPU range:

Config	TFLOPs/GPU	Peak memory
`TP=1, PP=2, EP=8, CP=4, full recompute`	162.07	55.91 GB

The point here is to explain why the logits peak around loss must be handled separately. The numbers should not be read as a strict single-toggle on/off ablation. Without this fix, long-context dynamic memory repeatedly pushes configurations back to the edge of OOM.
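
The streaming idea can be illustrated with a small pure-Python sketch. It is not the kernel used in this work: it covers only the forward loss for a single token, and the names `full_ce` and `chunked_ce` are hypothetical. It shows the core trick, a running log-sum-exp over vocab chunks, so that only one chunk of logits exists at a time. A real implementation would also produce gradients chunk by chunk.

```python
import math
import random


def full_ce(hidden, weight, target):
    """Reference: materialize all logits, then compute cross-entropy."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in weight]
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]


def chunked_ce(hidden, weight, target, chunk_size):
    """Stream over vocab chunks while keeping a running max and running sum."""
    running_max = -math.inf
    running_sum = 0.0
    target_logit = None
    for start in range(0, len(weight), chunk_size):
        chunk = weight[start:start + chunk_size]
        logits = [sum(h * w for h, w in zip(hidden, row)) for row in chunk]
        new_max = max(running_max, max(logits))
        # Rescale the previous partial sum to the new max before adding this chunk.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in logits)
        running_max = new_max
        if start <= target < start + len(chunk):
            target_logit = logits[target - start]
    return running_max + math.log(running_sum) - target_logit


# Toy data: hidden size 8, vocab size 50 (made-up sizes for illustration).
rng = random.Random(0)
hidden = [rng.uniform(-1, 1) for _ in range(8)]
weight = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(50)]
print(full_ce(hidden, weight, 17), chunked_ce(hidden, weight, 17, chunk_size=7))
```

The two values agree up to floating-point rounding, while the chunked version never holds more than `chunk_size` logits at once. The extra rescaling per chunk is one reason a streaming loss layer can be slower than the materialized version.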

Step 2: use FSDP2 for static state

linear CE handles the loss-side dynamic peak. Parameters, gradients, and optimizer states remain a harder static-memory problem.

The distributed optimizer in the baseline already helps, but it does not remove static memory pressure. It is closer to ZeRO-1 semantics: optimizer states are sharded, while model parameters and gradients still reside according to PP/TP/EP parallel layouts. At 128K sequence length, that static memory stacks with activations, logits, MoE buffers, and all-gather buffers at peak.

From a training-state sharding perspective, Megatron distributed optimizer and FSDP2 can be understood through ZeRO semantics:

distributed optimizer is closer to ZeRO-1: optimizer states are sharded along the data-parallel dimension, but parameters and gradients still follow model-parallel layouts;
FSDP2 is closer to ZeRO-3: parameters, gradients, and optimizer states are globally sharded, and parameters are all-gathered when the current FSDP unit computes.

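If nGPU is the total number of GPUs participating in training and n_param is the total number of model parameters, optimizer state under distributed optimizer can be roughly estimated, in bytes, as follows. The factor 12 is bytes of optimizer state per parameter, consistent with fp32 master weights plus two fp32 Adam moments: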

optimizer state per GPU ~= n_param * 12 / nGPU

The part that still needs to shrink is parameters and gradients, estimated here at about 6 bytes per parameter. Since MoE parameters dominate in Qwen3.5, parameter and gradient memory under distributed optimizer is mainly tied to PP and EP:

dist_opt parameter + grad per GPU ~= n_param * 6 / (PP * EP)

FSDP2 aims to push parameters and gradients toward global sharding as well:

FSDP2 parameter + grad per GPU ~= n_param * 6 / nGPU

Static memory item	distributed optimizer / ZeRO-1 view	FSDP2 / ZeRO-3 target view
optimizer state	`n_param * 12 / nGPU`	`n_param * 12 / nGPU`
parameter + grad	`n_param * 6 / (PP * EP)`	`n_param * 6 / nGPU`
Main difference	Parameters and gradients remain tied to PP/EP-style model-parallel partitioning	Parameters and gradients are also globally sharded
Cost	No extra parameter all-gather / grad reduce-scatter	Requires parameter all-gather and grad reduce-scatter
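
The estimates in the table can be evaluated with a few lines of Python. This is a rough sketch under stated assumptions: `n_param = 35e9` is taken from the nominal "35B" model size rather than an exact parameter count, and the function name `static_memory_gb` is illustrative. It is a first-order estimate and is not expected to match measured peak deltas, which also include buffers and workspace.

```python
def static_memory_gb(n_param, n_gpu, pp, ep):
    """First-order static memory per GPU (decimal GB), following the table above."""
    gb = 1e9
    return {
        # Optimizer state is sharded across all GPUs in both views.
        "optimizer_state": n_param * 12 / n_gpu / gb,
        # ZeRO-1 view: parameters and gradients follow the PP/EP layout.
        "param_grad_dist_opt": n_param * 6 / (pp * ep) / gb,
        # ZeRO-3 target view: parameters and gradients are globally sharded.
        "param_grad_fsdp2": n_param * 6 / n_gpu / gb,
    }


# Assumed nominal size of 35e9 parameters; 32 GPUs, PP=2, EP=8 as in the text.
estimate = static_memory_gb(n_param=35e9, n_gpu=32, pp=2, ep=8)
for name, value in estimate.items():
    print(f"{name}: {value:.2f} GB")
```

Because the FSDP2 term divides by nGPU instead of PP * EP, lowering PP (for example to 1) raises the distributed-optimizer estimate but leaves the FSDP2 estimate unchanged, which is the decoupling the section describes.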

In the same main parallel setting (35B, 32 x H100, 128K, TP=1, PP=2, EP=8, CP=4, full recompute), FSDP2 lowers peak memory by about 8.88 GB, from 55.91 GB to 47.03 GB, with essentially no throughput loss:

backend	optimizer semantics	TFLOPs/GPU	Peak memory
distributed optimizer	approx. ZeRO-1	162.07	55.91 GB
FSDP2	approx. ZeRO-3	163.06	47.03 GB

After parameters and gradients are sharded further, static-memory headroom increases noticeably. PP and EP no longer need to act as the default static-memory-saving tools. Their roles become cleaner:

Tuning choice	Main role after FSDP2
CP	Tune local seq so the per-GPU sequence lands in a good memory and compute-efficiency range
EP	Tune MoE compute and communication efficiency, avoiding cross-node all-to-all degradation
PP	Use for peak-performance or peak-memory tuning when needed, not as the default static-memory tool

With this in mind, PP=1, EP=8, CP=4 becomes worth trying. In pure FSDP2 no-chunk mode, the historical high point for PP=1 is 180.18 TFLOPs/GPU with 60.54 GB peak. After adding chunked EP overlap later, the same main configuration moves into the 37-38 GB peak range and reaches 185.96-187.10 TFLOPs/GPU.

At this point, the memory cost along Path B has been split into three parts: activations through full recompute, logits through linear CE, and static state through FSDP2. The remaining major problem is exposed MoE all-to-all communication.

2.4 Communication path: chunked EP overlap

In long-context MoE, compute is already dense, but expert-parallel token dispatch / combine all-to-all can still be exposed on the timeline.

The direct idea is 1F1B overlap: use compute windows from adjacent pipeline microbatches to cover dispatch / combine. But the memory path in this article uses full recompute by default, and the old 1F1B overlap path does not directly support it. Falling back to a low-recompute path just to reuse it would bring back the dynamic-memory, local seq, and CPU-overhead issues avoided earlier.

So we redesigned MoE EP A2A overlap to support full recompute. The mechanism is straightforward: split tokens into multiple chunks, execute communication and compute streams with an offset, and place dispatch / combine for neighboring chunks inside expert-compute windows when possible. Communication that was serialized and exposed can then be covered by adjacent chunk compute.

Two terms will appear repeatedly:

no-chunk: one MoE layer dispatches all tokens in the current microbatch to experts at once, runs grouped GEMM, then combines everything back at once. The timeline can be simplified as dispatch -> grouped GEMM -> combine.
chunk2: split the same MoE layer input into two token chunks. Each chunk has its own dispatch, expert compute, and combine. The scheduler offsets communication and compute streams so that combine from the previous chunk or dispatch from the next chunk can land inside a neighboring expert-compute window.

Larger chunk counts, such as chunk3 or chunk4, can also be tried. More chunks are not automatically better. Too many chunks increase kernel inflation, runtime submit overhead, and scheduling overhead. They may also make expert GEMMs too small and reduce compute efficiency. This article focuses on the basic mechanism from no-chunk to chunk2, and why overlap becomes more valuable at longer sequence lengths.

Kernel schedule: the gain comes from real windows

[Figure: Forward EP chunk kernel schedule]

The forward no-chunk schedule can be simplified as:

dispatch -> grouped GEMM -> combine

With chunk2, the communication stream executes c0 dispatch / c1 dispatch / c0 combine / c1 combine in order, while the compute stream executes c0 GEMM / c1 GEMM in order. The dependencies are:

c0 dispatch -> c0 GEMM -> c0 combine
c1 dispatch -> c1 GEMM -> c1 combine

On the key rank for 16K forward, the no-chunk measured total is 10.36 ms; chunk2 is 7.88 ms. This should not be interpreted as “splitting into two chunks hides half the communication.” In the actual trace, both c0 dispatch / c1 dispatch happen before the expert window. The visible forward overlap is c0 combine being pushed into c1 GEMM; c1 combine still forms the tail.
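
The forward schedule above can be explored with a toy two-stream model. This is a sketch under stated assumptions, not a reproduction of the measured trace: all durations are made-up numbers, `launch_ms` is a hypothetical fixed cost added to every kernel, and the function name is illustrative. The communication stream runs all dispatches and then all combines in order, the compute stream runs the chunk GEMMs in order, and each op waits for its dependency. Unlike the measured trace, this idealized model lets a later dispatch overlap an earlier GEMM.

```python
def moe_forward_makespan(n_chunks, dispatch_ms, gemm_ms, combine_ms, launch_ms=0.0):
    """Finish time of one MoE layer forward under a simple two-stream schedule."""
    d = dispatch_ms / n_chunks + launch_ms
    g = gemm_ms / n_chunks + launch_ms
    c = combine_ms / n_chunks + launch_ms

    # Communication stream, part 1: c0 dispatch, c1 dispatch, ...
    comm_free = 0.0
    dispatch_done = []
    for _ in range(n_chunks):
        comm_free += d
        dispatch_done.append(comm_free)

    # Compute stream: chunk i GEMM waits for chunk i dispatch.
    compute_free = 0.0
    gemm_done = []
    for i in range(n_chunks):
        compute_free = max(compute_free, dispatch_done[i]) + g
        gemm_done.append(compute_free)

    # Communication stream, part 2: chunk i combine waits for chunk i GEMM.
    for i in range(n_chunks):
        comm_free = max(comm_free, gemm_done[i]) + c
    return comm_free  # the last combine always forms the tail


# Hypothetical durations: 2 ms dispatch, 6 ms grouped GEMM, 2 ms combine.
for n in (1, 2, 4, 8):
    ideal = moe_forward_makespan(n, 2.0, 6.0, 2.0)
    with_overhead = moe_forward_makespan(n, 2.0, 6.0, 2.0, launch_ms=0.5)
    print(n, ideal, with_overhead)
```

With these made-up numbers, two chunks finish earlier than no-chunk, and eight chunks become slower than no-chunk once the per-kernel overhead is included. That matches the qualitative points in the text: the gain comes from the real overlap windows, the last combine stays exposed, and more chunks are not automatically better.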

[Figure: Backward EP chunk kernel schedule]

Backward runs in the opposite semantic order: first combine-bwd, then expert backward compute, then dispatch-bwd. In the corresponding 16K backward window, no-chunk is 13.18 ms and chunk2 is 11.77 ms. The compute cannot be drawn as one large “expert backward” block. It needs to be split into dgrad and delayed wgrad. delayed wgrad must run after the last chunk’s dgrad, and it can overlap with the tail dispatch-bwd.
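
The backward ordering can be sketched the same way. The durations below are hypothetical, the function name is illustrative, and the communication-stream order (all combine-bwd first, then all dispatch-bwd) is an assumption chosen to mirror the forward description. The only behavior taken from the text is that delayed wgrad starts after the last chunk's dgrad and can overlap the tail dispatch-bwd.

```python
def chunked_backward_timeline(n_chunks, combine_bwd_ms, dgrad_ms, wgrad_ms, dispatch_bwd_ms):
    """Toy two-stream timeline for one MoE layer backward with delayed wgrad."""
    cb = combine_bwd_ms / n_chunks
    dg = dgrad_ms / n_chunks
    db = dispatch_bwd_ms / n_chunks

    # Communication stream, part 1: combine-bwd for every chunk.
    comm_free = 0.0
    combine_bwd_done = []
    for _ in range(n_chunks):
        comm_free += cb
        combine_bwd_done.append(comm_free)

    # Compute stream: per-chunk dgrad, each waiting for its combine-bwd.
    compute_free = 0.0
    dgrad_done = []
    for i in range(n_chunks):
        compute_free = max(compute_free, combine_bwd_done[i]) + dg
        dgrad_done.append(compute_free)

    # Delayed wgrad runs once, after the last chunk's dgrad.
    wgrad = (compute_free, compute_free + wgrad_ms)

    # Communication stream, part 2: dispatch-bwd waits for the matching dgrad.
    tail = (0.0, 0.0)
    for i in range(n_chunks):
        start = max(comm_free, dgrad_done[i])
        comm_free = start + db
        tail = (start, comm_free)

    return {"wgrad": wgrad, "tail_dispatch_bwd": tail, "makespan": max(comm_free, wgrad[1])}


# Hypothetical durations in ms: combine-bwd 2, dgrad 4, wgrad 2, dispatch-bwd 2.
timeline = chunked_backward_timeline(2, 2.0, 4.0, 2.0, 2.0)
print(timeline)
```

In this toy timeline the tail dispatch-bwd runs entirely inside the delayed wgrad window, which is why wgrad is drawn as a separate block rather than folded into one large expert-backward block.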

[Figure: Fused forward-backward EP chunk kernel schedule]

When recompute forward and backward are fused, the gain also comes from less repeated compute. The fused path lets backward directly consume recompute expert output. Part of combine in recompute forward can be removed, and the recompute phase no longer needs to run the fc2 that would normally produce the MoE layer output. This removes a tail section of recompute forward that existed only to prepare intermediate results for backward.

This requires reorganizing forward / backward dependencies. The change is about where data is produced and consumed: recompute forward keeps only the expert intermediates backward actually needs, while combine and fc2 are either consumed by the fused backward path or skipped. This improves the alignment between communication and compute windows and reduces part of the repeated work in the full-recompute path.

Longer sequence, larger overlap gain

To isolate the local trend of MoE EP A2A overlap, we first look at a 35B-A3B sparse MoE single-layer proxy. This proxy covers forward + backward for one sparse MoE layer. It does not include attention, dense MLP, optimizer, or other transformer layers, so it should not be read as end-to-end train-step speedup. It is mainly used to explain how overlap changes with sequence length.

[Figures: 35B sequence step time; 35B sequence peak memory; 35B sequence speedup]

seq	no-chunk step ms	optimized step ms	step-time reduction vs no-chunk
4K	11.396	10.503	7.84%
8K	15.988	14.625	8.52%
16K	28.457	24.798	12.86%
32K	54.412	44.375	18.45%
64K	107.988	82.105	23.97%

The trend is stable: the longer the sequence, the larger the communication window and the larger the gain from chunked EP overlap. Chunk count, EP size, and asymmetric chunk weights still need care. Too many chunks add kernel inflation and runtime submit overhead; too large an EP size may lose the gain to cross-node communication. The practical conclusion is specific: in long-context MoE, EP A2A needs scheduling-level treatment, and the chunk policy still depends on sequence length, EP size, and kernel shape.
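
The percentage column in the table above is consistent with a relative step-time reduction, (no-chunk - optimized) / no-chunk, rather than a throughput ratio. The short script below recomputes both quantities from the published step times. The helper names are illustrative, and small differences in the last digit are expected because the table's step times are already rounded.

```python
# (sequence, no-chunk step ms, optimized step ms) copied from the table above.
rows = [
    ("4K", 11.396, 10.503),
    ("8K", 15.988, 14.625),
    ("16K", 28.457, 24.798),
    ("32K", 54.412, 44.375),
    ("64K", 107.988, 82.105),
]


def step_time_reduction(base_ms, opt_ms):
    """Fraction of step time removed relative to the no-chunk baseline."""
    return (base_ms - opt_ms) / base_ms


def throughput_ratio(base_ms, opt_ms):
    """How many times faster the optimized step is."""
    return base_ms / opt_ms


for seq, base_ms, opt_ms in rows:
    reduction = 100 * step_time_reduction(base_ms, opt_ms)
    ratio = throughput_ratio(base_ms, opt_ms)
    print(f"{seq}: {reduction:.2f}% less step time, {ratio:.3f}x throughput")
```

Read as a throughput ratio, the 64K row corresponds to roughly 1.3x for the single-layer proxy, which is a larger number than the percentage in the table even though both describe the same measurement.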

2.5 Combined result: from 4D search to a recipe

After the memory path and communication path are both in place, the setup becomes viable:

OOM risk is reduced: memory is handled separately by full recompute, linear CE, and FSDP2;
Performance is high enough: exposed MoE all-to-all is handled by chunked EP overlap.

The memory path answers whether the model can run. The communication path decides whether it runs fast enough. Together, they give a usable default point for long-context RL.

[Figure: 35B tuning path]

The 35B path can be read in the actual tuning order:

Stage	Main config	What changed	Problem addressed	Representative result
Megatron baseline	`TP=2, PP=4, EP=8, CP=2`	Conventional parallelism and full recompute tuning	First usable high point	127.53 TFLOPs/GPU, 42.91 GB
linear CE	`TP=1, PP=2, EP=8, CP=4`	Avoid materializing full logits; loss layer about 1.3x slower	Loss-side dynamic peak	162.07 TFLOPs/GPU, 55.91 GB
FSDP2	`TP=1, PP=2, EP=8, CP=4`	Keep main parallelism unchanged; replace optimizer primitive	Globally shard static state	163.06 TFLOPs/GPU, 47.03 GB
better parallel	`TP=1, PP=1, EP=8, CP=4`	Use static-memory headroom to reduce PP bubble	Make PP=1 a usable high point	180.18 TFLOPs/GPU, 60.54 GB
chunked EP	`TP=1, PP=1, EP=8, CP=4`	Shorten MoE communication and buffer lifetime	Handle exposed EP all-to-all	185.96-187.10 TFLOPs/GPU, 37-38 GB

The table is mainly about tuning order: pin down the baseline; use linear CE for logits; show that FSDP2 lowers static memory under the same main parallelism; use the static-memory headroom from FSDP2 to choose a better PP=1, EP=8, CP=4; then add chunked EP overlap to address the exposed communication in nsys traces.

One detail matters: the 37-38 GB peak comes from FSDP2 plus chunked EP. It is not the memory result of pure FSDP2 no-chunk.

In the old space, users have to tune PP/TP/EP/CP together, and every choice affects memory, communication, and kernel shapes. The goal of this path is to reduce the default tuning space into a smaller recipe:

[Figure: 4D to 2D recipe]

Tuning choice	Default	When to tune
EP	Use 8 for models <=100B; use 32 for >100B	Choose by model size first
CP	Keep per-GPU seq around 16K-32K	Derive from global sequence length
PP	Optional 2	Use for peak performance, lower peak memory, or higher local seq limit
TP	Remove by default	Use only with a clear model or hardware reason

The change can be summarized in three points:

TP is removed from the default recipe;
PP and EP are no longer treated as the main static-memory tools; FSDP2 handles static memory independently;
The default recipe first chooses EP by model size, then chooses CP by target per-GPU sequence length; PP is tuned only when chasing peak performance.
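
The default recipe can be written down as two small helpers. This is a sketch of the selection rule only: the function names are illustrative, the CP options are the example values from the table in section 1.1, and model-structure limits on CP such as KV heads / groups are not modeled.

```python
def choose_ep(model_params_billion):
    """Default EP from the recipe table: 8 up to 100B, 32 above."""
    return 8 if model_params_billion <= 100 else 32


def cp_candidates(global_seq, min_local=16 * 1024, max_local=32 * 1024,
                  cp_options=(1, 2, 4, 8)):
    """CP values that keep the per-GPU sequence in the 16K-32K target range."""
    return [
        cp for cp in cp_options
        if global_seq % cp == 0 and min_local <= global_seq // cp <= max_local
    ]


# Running case: a 35B-class model at 128K global sequence length.
print(choose_ep(35), cp_candidates(128 * 1024))
# A 256K global sequence narrows the choice within the same CP options.
print(cp_candidates(256 * 1024))
```

For the running case this leaves CP=4 (32K per GPU) and CP=8 (16K per GPU) as candidates, and the tuning path in this article uses CP=4. PP and TP do not appear because the default recipe leaves them out.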

[Figure: Recipe overlay]

This figure is a summary, not a measured cell. It maps back to the low-level factors from the beginning: dynamic memory, CPU overhead, static memory, compute efficiency, and exposed communication each need a default treatment. Once those are covered separately, the recipe does not become another large set of tuning choices.

3. Why bumblebee fits this kind of optimization and agent collaboration

3.1 Megatron can do it too; the integration cost differs

linear CE, FSDP2, and chunked EP overlap can all be moved into heavier training frameworks. The underlying kernels and parallel semantics also rely heavily on the Megatron / Transformer Engine ecosystem.

The main difference is integration and validation cost.

Integrating FSDP2 into Megatron 5D parallelism means dealing with PP/TP/EP/CP/DP process groups, parameter all-gather, gradient reduce-scatter, optimizer state, distributed checkpointing, recompute, MoE expert parameters, embedding/head sharing paths, and existing distributed optimizer semantics. Any mismatch can show up as a loss, grad norm, peak memory, or step-time anomaly.

chunked EP overlap is similar. It changes the timing relationship among MoE dispatch/combine and expert compute. Under full recompute, it also has to account for recompute forward / backward dependencies, delayed wgrad placement, buffer lifetime for different chunks, and synchronization points between communication and compute streams. It may look like a local schedule optimization, but it touches the full forward/backward semantics of the MoE primitive.

These complexities do not disappear under a new project name. bumblebee reduces the change boundary so developers and agents can focus on the current primitive instead of loading the whole all-in-one framework at once.

3.2 Primitive boundaries reduce the change surface

bumblebee is the internal codename for Megatron-Lite. It can also be seen as the library form that grew out of this lightweight training implementation. A model is composed from primitives rather than inheriting from a constantly expanding monolithic model class. Runtime owns the training protocol, Primitive owns replaceable capabilities, and Model chooses and composes them.

For the three optimizations in this article, the mapping is clear:

linear CE replaces a loss primitive;
FSDP2 replaces an optimizer / state-sharding primitive;
chunked EP overlap replaces a MoE communication primitive.

Each replacement has relatively clear inputs, outputs, and validation criteria. Models gain capabilities by composing primitives, and optimizations can land along primitive boundaries.
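
The idea of landing one optimization per primitive boundary can be illustrated with a toy configuration object. Everything here is hypothetical: `PrimitiveChoice`, its field names, and the string values are invented for illustration and are not bumblebee's actual API. The sketch only shows the shape of the workflow, where each step of the tuning path swaps exactly one primitive.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class PrimitiveChoice:
    """Hypothetical record of which primitive a model composes in each slot."""
    loss: str
    state_sharding: str
    moe_comm: str


def changed_fields(old, new):
    """Names of the slots whose primitive differs between two choices."""
    return [f.name for f in fields(old) if getattr(old, f.name) != getattr(new, f.name)]


baseline = PrimitiveChoice("full_logits_ce", "distributed_optimizer", "no_chunk")
path = [baseline]
path.append(replace(path[-1], loss="linear_ce"))
path.append(replace(path[-1], state_sharding="fsdp2"))
path.append(replace(path[-1], moe_comm="chunked_ep_overlap"))

for old, new in zip(path, path[1:]):
    print(changed_fields(old, new))
```

Each step changes a single slot, so a regression in loss, memory, or step time can be attributed to one replacement instead of a combined change.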

3.3 Paired baselines make local optimization easier to validate

bumblebee keeps paired baselines. The old primitive and new primitive are interchangeable. With the same model, same input, and same parallel configuration, we can compare peak memory, step time, loss, logits, grad norm, and key intermediate tensors.

That matters for complex optimizations. FSDP2 changes the lifetime of parameters and optimizer state. chunked EP changes the schedule of communication and compute. Both can affect performance, memory, and numerics at the same time. A paired baseline gives each replacement a clear reference point. Developers do not have to guess where a change came from inside the full training stack, and agents can work in a small loop: read the Skill, modify the primitive, run paired tests, report the delta.
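
The paired-baseline loop can be sketched as a small comparison harness. The names `paired_check`, `mse_baseline`, and `mse_chunked` are hypothetical and stand in for an old and a new primitive. A real check would also compare peak memory, step time, grad norm, and intermediate tensors, which this toy does not measure.

```python
import math


def paired_check(baseline_fn, candidate_fn, cases, rel_tol=1e-9):
    """Run both implementations on the same inputs and report the delta."""
    report = []
    for name, args in cases.items():
        ref = baseline_fn(*args)
        new = candidate_fn(*args)
        report.append({
            "case": name,
            "baseline": ref,
            "candidate": new,
            "delta": new - ref,
            "ok": math.isclose(ref, new, rel_tol=rel_tol, abs_tol=1e-12),
        })
    return report


def mse_baseline(pred, target):
    """Old primitive: one pass over the whole input."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mse_chunked(pred, target, chunk=2):
    """New primitive: same result, accumulated chunk by chunk."""
    total = 0.0
    for i in range(0, len(pred), chunk):
        total += sum((p - t) ** 2 for p, t in zip(pred[i:i + chunk], target[i:i + chunk]))
    return total / len(pred)


cases = {
    "short": ([0.1, 0.4, 0.9], [0.0, 0.5, 1.0]),
    "longer": ([float(i) for i in range(8)], [float(i) + 0.25 for i in range(8)]),
}
report = paired_check(mse_baseline, mse_chunked, cases)
for row in report:
    print(row["case"], row["ok"], row["delta"])
```

The report is the "delta" an agent or developer hands back after a local replacement: one row per case, with the reference value, the new value, and a pass/fail flag against a fixed tolerance.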

3.4 Agent-native: small boundaries, Skills, and incremental tasks

An agent has a hard time understanding a 30K-line all-in-one framework in one shot. It is much better at tasks with clear boundaries, explicit inputs and outputs, and fixed validation methods.

bumblebee’s agent-native design maps to a concrete workflow:

Each primitive can have a Skill explaining what it does, how to compose it, how to validate it, and what trade-offs it has;
Each development task only needs to handle one local replacement, then validate it against a paired baseline;
Models can grow incrementally instead of supporting every feature on day one.

This is where bumblebee works well with agents. It does not make FSDP2 or chunked EP theoretically easier. It makes the engineering work smaller, more testable, and easier to roll back.

4. Summary and next steps

Long-context MoE RL training is hard to tune for concrete reasons: PP/TP/EP/CP/recompute/offload all change memory, communication, compute efficiency, and CPU overhead at the same time. The path in this article first accepts full recompute and increases local seq, using larger compute blocks to reduce the complexity of dynamic memory and host overhead. It then uses linear CE for the logits peak, FSDP2 for static state, and chunked EP overlap for exposed MoE all-to-all.

The default space users face is reduced from a 4D PP/TP/EP/CP search to a smaller recipe: choose EP by model size, choose CP by target per-GPU sequence length, and tune PP only when needed. This recipe does not remove all tuning work, but it splits the problem into clearer categories: activations, logits, static state, MoE communication, local seq, and compute efficiency.

On the engineering side, this article is about how complex optimizations land. FSDP2 integrated with 5D parallelism and chunked EP overlap under full recompute can both be implemented in Megatron, but the engineering complexity is high. bumblebee places these changes on primitive boundaries and pairs them with baselines and Skill-style knowledge, so humans and agents can understand, modify, validate, and compose them faster.

Several directions remain:

Expand the memory estimator. The current analysis covers activations, logits, static state, and MoE buffers. It should also include FSDP all-gather buffers, current-unit parameters, workspace, fragmentation, and rank-wise peaks.
Refine the boundary between FSDP2 and distributed optimizer. FSDP2 saves more static memory but introduces parameter all-gather / gradient reduce-scatter. distributed optimizer is simpler and has less communication, but parameters and gradients remain tied to PP/EP.
Turn FSDP2 + 5D parallel + chunked EP into a more general long-context pretraining / RL tuning guide, especially for 100B and 1T+ models where PP, EP, CP, FSDP, and TP boundaries change.
Continue optimizing chunked EP, including reducing memory fragmentation, trying more fused expert kernels, and pushing dispatch / expert compute / combine scheduling toward a training-time mega MoE kernel.
Continue tuning RL performance. The current default keeps local seq around 16K-32K to amortize CPU overhead and kernel launch cost, but RL workloads can still tune the trade-off among throughput, memory, and recompute for their specific task.

Acknowledgements

Thanks to Zijie Yan, Hongbin Liu, and Kunlun Li for their support, discussions, and review.
