<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Context-Parallel on Yan Bai</title><link>https://iseekyan.github.io/tags/context-parallel/</link><description>Recent content in Context-Parallel on Yan Bai</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sun, 17 May 2026 01:10:00 +0800</lastBuildDate><atom:link href="https://iseekyan.github.io/tags/context-parallel/index.xml" rel="self" type="application/rss+xml"/><item><title>FSDP, PP, CP, and EP: Four Parallel Dimensions in Large-Scale Training</title><link>https://iseekyan.github.io/posts/fsdp-pp-cp-ep/</link><pubDate>Sun, 17 May 2026 01:10:00 +0800</pubDate><guid>https://iseekyan.github.io/posts/fsdp-pp-cp-ep/</guid><description>&lt;p>This is a sample English post for testing bilingual navigation and comments.&lt;/p>
&lt;p>Large-scale model training rarely relies on a single parallelism strategy. Instead, several dimensions are composed:&lt;/p>
&lt;ul>
&lt;li>FSDP shards parameters, gradients, and optimizer states to reduce model-state memory.&lt;/li>
&lt;li>Pipeline Parallelism splits the model by depth, placing consecutive groups of layers on different devices.&lt;/li>
&lt;li>Context Parallelism shards long sequences across devices along the sequence dimension, which mainly reduces activation memory, especially in attention.&lt;/li>
&lt;li>Expert Parallelism distributes the experts of a mixture-of-experts (MoE) model across devices, which is useful for MoE models such as those in the Qwen, DeepSeek, and Mixtral families.&lt;/li>
&lt;/ul>
&lt;p>You can replace this sample with a real article later.&lt;/p></description></item></channel></rss>