
veScale-FSDP: Flexible and High-Performance FSDP at Scale

#FSDP #ZeRO #veScale-FSDP #RaggedShard #Large-scale AI training #Distributed computing #GPU scaling #Model parallelism

πŸ“Œ Key Takeaways

  • veScale-FSDP addresses limitations in current FSDP systems for advanced AI model training
  • The system introduces RaggedShard flexible sharding format and structure-aware planning algorithm
  • veScale-FSDP achieves 5-66% higher throughput and 16-30% lower memory usage than existing systems
  • The solution scales efficiently to tens of thousands of GPUs, enabling distributed training at larger scale

πŸ“– Full Retelling

On February 25, 2026, researchers led by Zezhou Wang and 11 collaborators introduced veScale-FSDP on arXiv, a redesigned Fully Sharded Data Parallel (FSDP) system that targets critical limitations in current FSDP implementations hindering efficient training of cutting-edge AI models such as Gemini and Kimi K2.

Existing FSDP architectures struggle with structure-aware training methods such as block-wise quantized training and with non-element-wise optimizers such as Shampoo and Muon. These limitations stem from FSDP's fixed element- or row-wise sharding formats, which conflict with the block-structured computations these methods require. Today's implementations also fall short in communication and memory efficiency, limiting scaling to tens of thousands of GPUs.

The researchers built veScale-FSDP by coupling a flexible sharding format called RaggedShard with a structure-aware planning algorithm, delivering both flexibility and high performance at scale. The system natively supports the efficient data placement FSDP requires while enabling the block-wise quantization and non-element-wise optimizers that existing FSDP systems handle poorly.
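To make the sharding conflict concrete, here is a minimal, hypothetical Python sketch (not the paper's actual RaggedShard implementation) contrasting a fixed equal-size flat shard plan with a plan whose boundaries snap to quantization-block boundaries. The function names and the `world_size` and `block` values are illustrative assumptions, not anything from the paper.

```python
def flat_equal_shards(numel: int, world_size: int) -> list[tuple[int, int]]:
    """Fixed element-wise sharding: split a flattened parameter into
    equal-size contiguous ranges, ignoring any block structure."""
    per_rank = (numel + world_size - 1) // world_size
    return [(r * per_rank, min((r + 1) * per_rank, numel))
            for r in range(world_size)]


def block_aligned_shards(numel: int, world_size: int, block: int) -> list[tuple[int, int]]:
    """Block-aligned ("ragged") sharding: shard boundaries land only on
    multiples of the quantization block size, so every rank holds whole
    blocks and can compute per-block scales locally. Ranks may receive
    slightly unequal amounts of data, hence "ragged"."""
    n_blocks = (numel + block - 1) // block
    blocks_per_rank = (n_blocks + world_size - 1) // world_size
    return [(min(r * blocks_per_rank * block, numel),
             min((r + 1) * blocks_per_rank * block, numel))
            for r in range(world_size)]


# A parameter whose size is not a multiple of world_size * block:
numel, world_size, block = 10_000, 8, 128
print(flat_equal_shards(numel, world_size))            # boundaries cut through 128-element blocks
print(block_aligned_shards(numel, world_size, block))  # boundaries land on multiples of 128
```

With equal flat shards, a 128-element quantization block can straddle two ranks, forcing extra communication just to compute that block's scale; block-aligned shards avoid this at the cost of slightly uneven shard sizes, which is exactly the kind of trade-off a structure-aware planner has to balance against communication and memory efficiency.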

🏷️ Themes

Distributed Computing, AI Training Optimization, System Architecture

πŸ“š Related People & Topics

Distributed computing

System with multiple networked computers

Distributed computing is a field of computer science that studies distributed systems, defined as computer systems whose inter-communicating components are located on different networked computers. The components of a distributed system communicate and coordinate their actions by passing messages to one another.



Original Source
Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2602.22437 [cs.DC] (Submitted on 25 Feb 2026)

Title: veScale-FSDP: Flexible and High-Performance FSDP at Scale

Authors: Zezhou Wang, Youjie Li, Zhiqi Lin, Jiacheng Yang, Cong Xie, Guanyu Feng, Zheng Zhong, Ziyue Huang, Hongyu Zhu, Zhi Zhang, Yanghua Peng, Xin Liu

Abstract: Fully Sharded Data Parallel (FSDP), also known as ZeRO, is widely used for training large-scale models, featuring its flexibility and minimal intrusion on model code. However, current FSDP systems struggle with structure-aware training methods (e.g., block-wise quantized training) and with non-element-wise optimizers (e.g., Shampoo and Muon) used in cutting-edge models (e.g., Gemini, Kimi K2). FSDP's fixed element- or row-wise sharding formats conflict with the block-structured computations. In addition, today's implementations fall short in communication and memory efficiency, limiting scaling to tens of thousands of GPUs. We introduce veScale-FSDP, a redesigned FSDP system that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm to deliver both flexibility and performance at scale. veScale-FSDP natively supports efficient data placement required by FSDP, empowering block-wise quantization and non-element-wise optimizers. As a result, veScale-FSDP achieves 5~66% higher throughput and 16~30% lower memory usage than existing FSDP systems, while scaling efficiently to tens of thousands of GPUs.

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

DOI: https://doi.org/10.48550/arXiv.2602.22437
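The abstract's contrast between element-wise and non-element-wise optimizers can also be made concrete with a short, hedged PyTorch sketch. The Adam-style update below touches each element independently, so any rank can apply it to its own flat shard; the Muon-style update orthogonalizes the whole gradient matrix, so it needs the full 2-D parameter materialized somewhere. This is an illustrative simplification: the cubic Newton-Schulz iteration here is not Muon's actual tuned iteration, and none of this is veScale-FSDP's code.

```python
import torch

def adam_style_update(p_shard, g_shard, m, v,
                      lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Element-wise update: works on any flat shard, regardless of how
    the parameter was partitioned across ranks."""
    m.mul_(b1).add_(g_shard, alpha=1 - b1)
    v.mul_(b2).addcmul_(g_shard, g_shard, value=1 - b2)
    p_shard.add_(m / (v.sqrt() + eps), alpha=-lr)

def muon_style_update(p_matrix, g_matrix, lr=0.02, steps=5):
    """Non-element-wise update: approximately orthogonalizes the *whole*
    gradient matrix via a simplified cubic Newton-Schulz iteration.
    A flat element-wise shard of the matrix is not enough input."""
    x = g_matrix / (g_matrix.norm() + 1e-7)  # scale singular values into [0, 1]
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x    # push singular values toward 1
    p_matrix.add_(x, alpha=-lr)

# Element-wise: fine on a flat shard of any shape.
p, g = torch.randn(1000), torch.randn(1000)
adam_style_update(p, g, torch.zeros_like(p), torch.zeros_like(p))

# Non-element-wise: needs the full 2-D matrix on some rank.
W, G = torch.randn(64, 64), torch.randn(64, 64)
muon_style_update(W, G)
```

Under a fixed element- or row-wise sharding format, the full-matrix view required by the second update has to be reassembled with extra communication on every step; a flexible format like RaggedShard lets the shard layout match the structure the optimizer actually consumes.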

