Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation

#Mixture of Experts #Depth-Width Transformation #Neural Networks #AI Scaling #Computational Efficiency #Model Capacity #Deep Learning

📌 Key Takeaways

  • Researchers propose a method to scale neural network width virtually by transforming depth into width.
  • The approach reuses a universal, layer-agnostic pool of experts across layers, raising effective capacity while keeping the per-token activation budget fixed.
  • This technique aims to improve computational efficiency and performance in large-scale AI models.
  • The method demonstrates potential for more flexible and scalable deep learning architectures.

📖 Full Retelling

arXiv:2603.04971v1 Abstract: Mixture-of-Experts (MoE) decouples model capacity from per-token computation, yet its scalability remains limited by the physical dimensions of depth and width. To overcome this, we propose Mixture of Universal Experts (MoUE), a MoE generalization introducing a novel scaling dimension: Virtual Width. In general, MoUE aims to reuse a universal, layer-agnostic expert pool across layers, converting depth into virtual width under a fixed per-token activation budget.
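To make the depth-to-width conversion concrete, here is a minimal sketch, assuming a standard top-k token router and simple feed-forward experts (neither is specified by the abstract): a single layer-agnostic expert pool is shared by every layer, so each layer can draw on the full pool's width while per-token compute stays at top_k experts per layer.

```python
# Minimal sketch (PyTorch) of a universal expert pool reused across layers.
# The router and expert shapes are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A plain feed-forward expert (an assumed form; the paper does not specify it here)."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

class SharedPoolMoELayer(nn.Module):
    """Routes each token to top_k experts drawn from a pool shared with every other layer."""
    def __init__(self, shared_experts, d_model, top_k=2):
        super().__init__()
        self.experts = shared_experts              # the same ModuleList object in every layer
        self.router = nn.Linear(d_model, len(shared_experts))
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return x + out

# One universal, layer-agnostic pool reused by 4 layers: depth reuses the pool's
# width, while each token still activates only top_k experts per layer.
pool = nn.ModuleList([Expert(d_model=64, d_ff=256) for _ in range(8)])
layers = nn.ModuleList([SharedPoolMoELayer(pool, d_model=64) for _ in range(4)])
h = torch.randn(10, 64)
for layer in layers:
    h = layer(h)
print(h.shape)  # torch.Size([10, 64])
```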

🏷️ Themes

AI Scaling, Neural Networks

📚 Related People & Topics

Mixture of experts

Machine learning technique

Mixture of experts (MoE) is a machine learning technique where multiple expert networks (learners) are used to divide a problem space into homogeneous regions. MoE represents a form of ensemble learning. They were also called committee machines.
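As a minimal illustration of that idea (not tied to the paper), the sketch below implements a soft mixture of experts in NumPy: a gating network scores the experts for each input, and the prediction is the gate-weighted combination of the experts' outputs. The linear experts and the shapes are illustrative choices.

```python
# A minimal soft mixture-of-experts: the gate divides the input space by
# deciding how much each expert contributes to a given input.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_experts = 4, 2, 3

W_experts = rng.normal(size=(n_experts, d_in, d_out))  # each expert is a linear map
W_gate = rng.normal(size=(d_in, n_experts))            # the gate scores experts per input

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(x):                                     # x: (batch, d_in)
    gate = softmax(x @ W_gate)                          # (batch, n_experts)
    expert_out = np.einsum('bi,eio->beo', x, W_experts) # (batch, n_experts, d_out)
    return np.einsum('be,beo->bo', gate, expert_out)    # gate-weighted combination

x = rng.normal(size=(5, d_in))
print(moe_forward(x).shape)                             # (5, 2)
```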


Neural network

Structure in biology and artificial intelligence

A neural network is a group of interconnected units called neurons that send signals to one another. Neurons can be either biological cells or mathematical models. While individual neurons are simple, many of them together in a network can perform complex tasks.
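For concreteness, a toy feed-forward network (layer sizes chosen arbitrarily) shows how simple units wired together compute an output none of them could alone:

```python
# A tiny feed-forward network: each unit computes a weighted sum of its inputs
# followed by a nonlinearity, and the units are wired layer to layer.
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)   # 3 inputs -> 5 hidden units
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)   # 5 hidden units -> 1 output

def forward(x):
    hidden = np.tanh(x @ W1 + b1)   # hidden units fire on weighted sums of the inputs
    return hidden @ W2 + b2         # the output unit combines the hidden signals

print(forward(rng.normal(size=(4, 3))).shape)   # (4, 1)
```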



Original Source
Computer Science > Machine Learning — arXiv:2603.04971 [Submitted on 5 Mar 2026]

Title: Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation

Authors: Yilong Chen, Naibin Gu, Junyuan Shang, Zhenyu Zhang, Yuchen Feng, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang

Abstract: Mixture-of-Experts decouples model capacity from per-token computation, yet its scalability remains limited by the physical dimensions of depth and width. To overcome this, we propose Mixture of Universal Experts (MoUE), a MoE generalization introducing a novel scaling dimension: Virtual Width. In general, MoUE aims to reuse a universal layer-agnostic expert pool across layers, converting depth into virtual width under a fixed per-token activation budget. However, two challenges remain: a routing path explosion from recursive expert reuse, and a mismatch between the exposure induced by reuse and the conventional load-balancing objectives. We address these with three core components: a Staggered Rotational Topology for structured expert sharing, a Universal Expert Load Balance for depth-aware exposure correction, and a Universal Router with lightweight trajectory state for coherent multi-step routing. Empirically, MoUE consistently outperforms matched MoE baselines by up to 1.3% across scaling regimes, enables progressive conversion of existing MoE checkpoints with up to 4.2% gains, and reveals a new scaling dimension for MoE architectures.

Comments: 19 pages, 10 figures

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as: arXiv:2603.04971 [cs.LG] (arXiv:2603.04971v1 [cs.LG] for this version), https://doi.org/10.48550/arXiv.2603.04971
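The abstract names the three components without detailing them. One possible reading of the "Staggered Rotational Topology" is that each layer routes only over a window of the universal pool whose offset rotates with depth, so experts are reused across layers while each layer's routing choice stays small; the sketch below is that interpretation only, with hypothetical parameters (window_size, stride), not the paper's construction.

```python
# Hedged sketch of one possible "staggered rotational" sharing scheme: layer l
# sees a window of the universal pool whose start index rotates with depth.
# This is an interpretation of the abstract, not the paper's method.
def rotated_window(layer_idx, pool_size, window_size, stride=1):
    """Indices of the experts visible to layer `layer_idx`."""
    start = (layer_idx * stride) % pool_size
    return [(start + j) % pool_size for j in range(window_size)]

pool_size = 8
for layer_idx in range(6):
    print(layer_idx, rotated_window(layer_idx, pool_size, window_size=4, stride=2))
# Each layer sees a different, overlapping slice of the same pool, so depth
# cycles through the pool's width instead of exploding the routing space.
```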

Source

arxiv.org
