veScale-FSDP: Flexible and High-Performance FSDP at Scale
#FSDP #ZeRO #veScale-FSDP #RaggedShard #Large-scale AI training #Distributed computing #GPU scaling #Model parallelism
📌 Key Takeaways
- veScale-FSDP addresses limitations in current FSDP systems for advanced AI model training
- The system introduces RaggedShard flexible sharding format and structure-aware planning algorithm
- veScale-FSDP achieves 5-66% higher throughput and 16-30% lower memory usage than existing systems
- The solution scales efficiently to tens of thousands of GPUs, enabling larger distributed training
📖 Full Retelling
On February 25, 2026, researchers led by Zezhou Wang and 11 collaborators posted veScale-FSDP to arXiv: a redesigned Fully Sharded Data Parallel (FSDP) system that addresses critical limitations hindering efficient training of cutting-edge AI models such as Gemini and Kimi K2.

Existing FSDP implementations struggle with structure-aware training methods such as block-wise quantized training, and with non-element-wise optimizers such as Shampoo and Muon. These limitations stem from FSDP's fixed element- or row-wise sharding formats, which conflict with the block-structured computations those methods require. Current implementations also fall short in communication and memory efficiency, limiting scaling to tens of thousands of GPUs.

veScale-FSDP couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm, delivering both flexibility and high performance at unprecedented scale. The system natively supports the efficient data placement FSDP requires while enabling block-wise quantization and non-element-wise optimizers that earlier FSDP systems could not accommodate.
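To make the sharding conflict concrete, here is a minimal sketch (not the paper's implementation; the exact RaggedShard format is not public in this summary, so the block-aligned layout below is an assumption in its spirit). Block-wise quantization computes one scale per fixed-size block of elements; a flat element-wise FSDP shard can cut a block across two ranks, whereas a block-aligned "ragged" sharding gives each rank a whole number of blocks at the cost of uneven shard sizes:

```python
# Sketch: fixed element-wise FSDP sharding vs. a block-aligned ragged
# sharding, for a parameter quantized in blocks of 128 elements.
# All sizes and the layout policy are illustrative assumptions.

BLOCK = 128   # quantization block size: one scale per 128 elements
NUMEL = 1000  # parameter size, deliberately not a multiple of BLOCK
WORLD = 3     # number of data-parallel ranks

def flat_shards(numel, world):
    """Fixed element-wise sharding: equal-size contiguous chunks
    (the last one may be shorter), as in flat-parameter FSDP."""
    per = -(-numel // world)  # ceiling division
    return [(r * per, min((r + 1) * per, numel)) for r in range(world)]

def ragged_shards(numel, world, block):
    """Block-aligned ragged sharding: every rank owns whole quantization
    blocks, so shard sizes differ ('ragged') but no block straddles ranks."""
    nblocks = -(-numel // block)
    per_blocks = -(-nblocks // world)
    return [(min(r * per_blocks * block, numel),
             min((r + 1) * per_blocks * block, numel))
            for r in range(world)]

def blocks_split(bounds, block):
    """Count quantization blocks whose elements land on more than one rank:
    a non-empty shard starting mid-block splits the block it starts in."""
    return sum(1 for start, end in bounds if start % block != 0 and start < end)

flat = flat_shards(NUMEL, WORLD)
ragged = ragged_shards(NUMEL, WORLD, BLOCK)
print("flat  :", flat, "split blocks:", blocks_split(flat, BLOCK))
print("ragged:", ragged, "split blocks:", blocks_split(ragged, BLOCK))
```

With these numbers, flat sharding yields chunks `(0, 334), (334, 668), (668, 1000)` and splits two quantization blocks across ranks, forcing cross-rank coordination just to compute a block's scale; the ragged layout yields `(0, 384), (384, 768), (768, 1000)` with zero split blocks. The trade-off the paper's planning algorithm presumably manages is keeping such uneven shards balanced for communication and memory.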
🏷️ Themes
Distributed Computing, AI Training Optimization, System Architecture
📚 Related People & Topics
Distributed computing
Original Source
Computer Science > Distributed, Parallel, and Cluster Computing
arXiv:2602.22437 [cs.DC] (submitted on 25 Feb 2026)
Title: veScale-FSDP: Flexible and High-Performance FSDP at Scale
Authors: Zezhou Wang, Youjie Li, Zhiqi Lin, Jiacheng Yang, Cong Xie, Guanyu Feng, Zheng Zhong, Ziyue Huang, Hongyu Zhu, Zhi Zhang, Yanghua Peng, Xin Liu
Abstract: Fully Sharded Data Parallel (FSDP), also known as ZeRO, is widely used for training large-scale models, featuring its flexibility and minimal intrusion on model code. However, current FSDP systems struggle with structure-aware training methods (e.g., block-wise quantized training) and with non-element-wise optimizers (e.g., Shampoo and Muon) used in cutting-edge models (e.g., Gemini, Kimi K2). FSDP's fixed element- or row-wise sharding formats conflict with the block-structured computations. In addition, today's implementations fall short in communication and memory efficiency, limiting scaling to tens of thousands of GPUs. We introduce veScale-FSDP, a redesigned FSDP system that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm to deliver both flexibility and performance at scale. veScale-FSDP natively supports efficient data placement required by FSDP, empowering block-wise quantization and non-element-wise optimizers. As a result, veScale-FSDP achieves 5~66% higher throughput and 16~30% lower memory usage than existing FSDP systems, while scaling efficiently to tens of thousands of GPUs.
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
DOI: https://doi.org/10.48550/arXiv.2602.22437