veScale-FSDP: Flexible and High-Performance FSDP at Scale


#FSDP #ZeRO #veScale-FSDP #RaggedShard #Large-scale AI training #Distributed computing #GPU scaling #Model parallelism

📌 Key Takeaways

  • veScale-FSDP addresses limitations in current FSDP systems for advanced AI model training
  • The system introduces RaggedShard flexible sharding format and structure-aware planning algorithm
  • veScale-FSDP achieves 5-66% higher throughput and 16-30% lower memory usage than existing systems
  • The solution scales efficiently to tens of thousands of GPUs, enabling larger distributed training

📖 Full Retelling

On February 25, 2026, researchers led by Zezhou Wang and 11 collaborators introduced veScale-FSDP on arXiv: a redesigned Fully Sharded Data Parallel (FSDP) system that targets limitations hindering efficient training of cutting-edge AI models such as Gemini and Kimi K2. Existing FSDP implementations struggle with structure-aware training methods such as block-wise quantized training, and with non-element-wise optimizers such as Shampoo and Muon. These limitations stem from FSDP's fixed element- or row-wise sharding formats, which conflict with the block-structured computations those methods require.

The researchers built veScale-FSDP by coupling a flexible sharding format, RaggedShard, with a structure-aware planning algorithm, delivering both flexibility and high performance at scale. The system natively supports the efficient data placement FSDP requires while enabling the block-wise quantization and non-element-wise optimizers that prior FSDP systems could not support efficiently.
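The core tension described above can be sketched in a few lines. The article does not detail RaggedShard's actual layout, so the block size, function names, and alignment rule below are illustrative assumptions: a fixed element-wise split cuts a flat parameter at arbitrary offsets, slicing through quantization blocks, whereas a block-aligned "ragged" split snaps shard boundaries to block boundaries, so each rank holds whole blocks even though shard sizes become unequal.

```python
# Illustrative sketch only -- RaggedShard's real format is not described in
# this article. Block size (128) and function names are assumptions.

def even_shards(numel, world_size):
    """Fixed element-wise sharding: split a flat parameter evenly.
    Boundaries fall at arbitrary offsets, possibly mid-block."""
    base, rem = divmod(numel, world_size)
    return [base + (1 if r < rem else 0) for r in range(world_size)]

def block_aligned_shards(numel, world_size, block=128):
    """Ragged sharding: boundaries snap to quantization-block boundaries,
    so every rank owns whole blocks; shard sizes may differ (be 'ragged')."""
    nblocks = -(-numel // block)               # ceil(numel / block)
    base, rem = divmod(nblocks, world_size)    # distribute whole blocks
    sizes, remaining = [], numel
    for r in range(world_size):
        blks = base + (1 if r < rem else 0)
        sz = min(blks * block, remaining)      # last shard absorbs the tail
        sizes.append(sz)
        remaining -= sz
    return sizes

if __name__ == "__main__":
    # A 1000-element parameter, 4 ranks, 128-element quantization blocks.
    print(even_shards(1000, 4))            # [250, 250, 250, 250]
    # Even boundaries at 250/500/750 cut through 128-element blocks.
    print(block_aligned_shards(1000, 4))   # [256, 256, 256, 232]
    # Every boundary here is a multiple of 128: blocks stay intact.
```

Keeping blocks intact on a single rank is what lets block-wise quantization compute per-block scales locally, without the cross-rank reconstruction a mid-block cut would force.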

🏷️ Themes

Distributed Computing, AI Training Optimization, System Architecture

📚 Related People & Topics

Distributed computing

System with multiple networked computers

Distributed computing is a field of computer science that studies distributed systems, defined as computer systems whose inter-communicating components are located on different networked computers. The components of a distributed system communicate and coordinate their actions by passing messages to one another.



Original Source
Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2602.22437 [cs.DC] (Submitted on 25 Feb 2026)

Title: veScale-FSDP: Flexible and High-Performance FSDP at Scale

Authors: Zezhou Wang, Youjie Li, Zhiqi Lin, Jiacheng Yang, Cong Xie, Guanyu Feng, Zheng Zhong, Ziyue Huang, Hongyu Zhu, Zhi Zhang, Yanghua Peng, Xin Liu

Abstract: Fully Sharded Data Parallel (FSDP), also known as ZeRO, is widely used for training large-scale models, featuring its flexibility and minimal intrusion on model code. However, current FSDP systems struggle with structure-aware training methods (e.g., block-wise quantized training) and with non-element-wise optimizers (e.g., Shampoo and Muon) used in cutting-edge models (e.g., Gemini, Kimi K2). FSDP's fixed element- or row-wise sharding formats conflict with the block-structured computations. In addition, today's implementations fall short in communication and memory efficiency, limiting scaling to tens of thousands of GPUs. We introduce veScale-FSDP, a redesigned FSDP system that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm to deliver both flexibility and performance at scale. veScale-FSDP natively supports efficient data placement required by FSDP, empowering block-wise quantization and non-element-wise optimizers. As a result, veScale-FSDP achieves 5-66% higher throughput and 16-30% lower memory usage than existing FSDP systems, while scaling efficiently to tens of thousands of GPUs.

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

DOI: https://doi.org/10.48550/arXiv.2602.22437 (arXiv-issued DOI via DataCite, pending registration)

Source

arxiv.org
