Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling
#Semantic Parallelism #MoE Inference #Expert Parallelism #Model-Data Co-Scheduling #LLM Serving #SGLANG #Inter-device Communication
📌 Key Takeaways
- Semantic Parallelism addresses communication bottlenecks in expert parallelism
- Sem-MoE implements three key scheduling techniques to optimize model-data co-location
- Research shows reduced all-to-all communication volume in EP
- Implementation in SGLANG engine demonstrates superior inference throughput
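The communication cost behind these takeaways can be made concrete with a small counting sketch (the data structures and names here are illustrative, not from the paper): under expert parallelism, every token routed to an expert hosted on a different device contributes one unit to the all-to-all dispatch volume, so co-locating experts with the tokens that activate them directly shrinks this count.

```python
def all_to_all_volume(token_routes, device_of_expert, device_of_token):
    """Count token->expert assignments that cross a device boundary.

    token_routes: token id -> list of expert ids the router selected.
    device_of_expert / device_of_token: id -> hosting device.
    Each cross-device assignment is one unit of all-to-all traffic.
    """
    return sum(
        1
        for tok, experts in token_routes.items()
        for e in experts
        if device_of_expert[e] != device_of_token[tok]
    )
```

With a placement that keeps a token on the same device as its experts, the same routing decisions produce strictly less traffic.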
📖 Full Retelling
Researchers Yan Li, Zhenyu Zhang, Zhengang Wang, Pengfei Chen, and Pengfei Zheng introduced Semantic Parallelism, a novel approach to optimizing Mixture of Experts (MoE) inference in large language models, in a paper submitted to arXiv on March 6, 2025 (with v4 released on February 24, 2026).

The research addresses critical inefficiencies in current expert parallelism (EP) implementations, which suffer from expensive inter-device communication when routing tokens to remote experts that are not collocated on the same GPU/NPU device. The paper argues that state-of-the-art schemes treat expert device-placement and token scheduling as separate concerns, triggering excessive communication and compromising inference efficiency.

To close this gap, the researchers developed Sem-MoE, a framework that implements Semantic Parallelism through three key techniques: offline model scheduling that clusters experts based on their co-activation tendencies, online inter-request data scheduling for Attention-DP setups, and online intra-request data scheduling for Attention-TP setups. By integrating Sem-MoE into the SGLANG LLM serving engine, the team demonstrated significant reductions in all-to-all communication volume and achieved superior inference throughput compared to existing solutions, marking a substantial advance in distributed machine learning systems.
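The offline model-scheduling step described above can be sketched as a greedy placement over a profiled co-activation matrix. This is an illustrative heuristic under assumed inputs (pairwise `coactivation` counts and a per-device expert `capacity`), not the paper's actual algorithm:

```python
def cluster_experts(coactivation, num_experts, num_devices, capacity):
    """Greedy sketch: co-locate strongly co-activated experts.

    coactivation: (expert_i, expert_j) -> how often the pair fires for
    the same inputs (assumed profiled offline on representative traffic).
    capacity: maximum number of experts per device.
    Returns a dict mapping each expert id to a device id.
    """
    pairs = sorted(coactivation.items(), key=lambda kv: kv[1], reverse=True)
    device_of = {}
    load = [0] * num_devices

    def least_loaded():
        return min(range(num_devices), key=lambda d: load[d])

    # Walk pairs from strongest to weakest co-activation, trying to put
    # each unplaced expert on its partner's device while capacity allows.
    for (i, j), _count in pairs:
        for e, partner in ((i, j), (j, i)):
            if e in device_of:
                continue
            d = device_of.get(partner)
            if d is None or load[d] >= capacity:
                d = least_loaded()
            device_of[e] = d
            load[d] += 1
    # Experts with no observed co-activations fill the lightest devices.
    for e in range(num_experts):
        if e not in device_of:
            d = least_loaded()
            device_of[e] = d
            load[d] += 1
    return device_of
```

On a toy profile where experts 0/1 and 2/3 tend to fire together, the two strong pairs land on separate devices, so tokens activating either pair need no cross-device dispatch for those experts.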
🏷️ Themes
Machine Learning Optimization, Distributed Computing, LLM Efficiency
Original Source
Computer Science > Machine Learning
arXiv:2503.04398 [Submitted on 6 Mar 2025 (v1), last revised 24 Feb 2026 (this version, v4)]
Title: Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling
Authors: Yan Li, Zhenyu Zhang, Zhengang Wang, Pengfei Chen, Pengfei Zheng
Abstract: Prevailing LLM serving engines employ expert parallelism to implement multi-device inference of massive MoE models. However, the efficiency of expert-parallel inference is largely bounded by inter-device communication, as EP embraces expensive all-to-all collectives to route tokens to remote experts that are not collocated on the same GPU/NPU device. Nevertheless, state-of-the-art schemes treat expert device-placement and request (or token) device-scheduling as separate concerns, triggering excessive communication between them and compromising inference efficiency. This paper proposes Semantic Parallelism, a novel parallelism paradigm that minimizes the steep communication costs in EP-centric MoE serving via model-data collaborative scheduling. We implement Semantic Parallelism in a framework called Sem-MoE. Sem-MoE maximally collocates experts and their activating tokens onto the same device using proactively modeled activation likelihood between them and introduces three key techniques: (1) Offline model scheduling, which preliminarily clusters and collocates experts onto devices based on their co-activation tendencies for certain classes of input. (2) Online inter-request data scheduling for Attention-DP setups, which proactively rebatches incoming requests onto the device that hosts the experts most likely and frequently activated by the corresponding requests. (3) Online intra-request data scheduling for Attention-TP setups, which seamlessly fuses a token reshuffling procedure into the original inf...
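The second technique, online inter-request data scheduling, can be sketched as picking, for each incoming request, the device that hosts the largest share of the experts the request is predicted to activate, subject to a simple load cap. The prediction input, the cap, and the tie-break are all assumed here for illustration; they are not Sem-MoE's actual policy:

```python
from collections import Counter

def schedule_request(predicted_experts, device_of, num_devices,
                     load, max_load):
    """Sketch of inter-request scheduling: route a request toward the
    device hosting most of the experts it is predicted to activate.

    predicted_experts: expert ids the router is expected to pick for
    this request (estimated upstream, e.g. from profiled traffic).
    device_of: expert id -> device id (from the offline placement).
    load: mutable per-device count of queued requests.
    """
    hits = Counter(device_of[e] for e in predicted_experts)
    # Prefer high expert coverage; break ties toward lighter load.
    candidates = sorted(range(num_devices),
                        key=lambda d: (-hits.get(d, 0), load[d]))
    for d in candidates:
        if load[d] < max_load:
            load[d] += 1
            return d
    # Everything is at capacity: fall back to the least-loaded device.
    d = min(range(num_devices), key=lambda k: load[k])
    load[d] += 1
    return d
```

Combined with the offline placement, most of a request's expert activations then resolve locally, which is the source of the reduced all-to-all volume the paper reports.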
Read full article at source