LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration
Tags: Diffusion models, LESA framework, Model acceleration, Feature caching, Computer vision, AI efficiency, Generative AI
Key Takeaways
LESA framework accelerates diffusion models through learnable stage-aware predictors
Method uses two-stage training with Kolmogorov-Arnold Network for temporal feature mapping
Multi-stage, multi-expert architecture assigns specialized predictors to different noise-level stages
Achieves significant acceleration (5x-6.25x) across multiple models with minimal or improved quality
Full Retelling
In a paper submitted to arXiv on February 24, 2026, Peiliang Cai and five collaborators introduced LESA, a Learnable Stage-Aware predictor framework for accelerating diffusion models. The work targets the heavy computational cost that hinders practical deployment of Diffusion Transformers in image and video generation. Existing acceleration strategies struggle with the complex, stage-dependent dynamics of the diffusion process and often degrade output quality when speeding up these computationally intensive models; LESA was designed to overcome these limitations.
LESA combines a two-stage training procedure, which uses a Kolmogorov-Arnold Network to learn temporal feature mappings directly from data, with a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages. This design enables more precise and robust feature forecasting throughout the diffusion process, addressing the central challenge of reducing computation without sacrificing generation quality. The framework's effectiveness is demonstrated through comprehensive testing across multiple state-of-the-art diffusion models.
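To make the stage-aware idea concrete, here is a minimal illustrative sketch, not the paper's code: timesteps are partitioned into noise-level stages, each stage is routed to its own small predictor, and the predictor forecasts the next step's features from cached ones so the full Diffusion Transformer can be skipped on those steps. The expert here is a toy first-order extrapolator with a learned per-dimension scale; LESA instead trains a Kolmogorov-Arnold Network. All names and numbers below are assumptions for illustration.

```python
import random

NUM_STAGES = 3   # number of noise-level stages (assumed)
TOTAL_STEPS = 50  # sampling steps (assumed)

def stage_of(t, total=TOTAL_STEPS, stages=NUM_STAGES):
    """Map a timestep index (0 = noisiest) to its noise-level stage."""
    return min(stages - 1, t * stages // total)

class ToyExpert:
    """Per-stage predictor: first-order extrapolation with a learned
    per-dimension scale. LESA uses a Kolmogorov-Arnold Network here."""
    def __init__(self, dim, seed):
        rng = random.Random(seed)
        # Stand-in for trained parameters: near-identity scaling.
        self.scale = [1.0 + 0.01 * rng.gauss(0, 1) for _ in range(dim)]

    def predict(self, feat_prev, feat_curr):
        # Extrapolate f_next ~ f_curr + (f_curr - f_prev), then rescale.
        return [s * (2 * c - p)
                for s, c, p in zip(self.scale, feat_curr, feat_prev)]

DIM = 8
experts = [ToyExpert(DIM, seed) for seed in range(NUM_STAGES)]

feat_prev = [0.0] * DIM   # cached features from two steps ago
feat_curr = [1.0] * DIM   # cached features from the last full pass
t = 40  # a late, low-noise timestep, routed to the last expert
forecast = experts[stage_of(t)].predict(feat_prev, feat_curr)
print(stage_of(t), len(forecast))
```

The routing function is the key design choice: because feature dynamics differ between noisy early steps and near-clean late steps, each expert only has to model one regime.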
The experimental results are strong: LESA achieves 5.00x acceleration on FLUX.1-dev with only a 1.0% quality drop, a 6.25x speedup on Qwen-Image with a 20.2% quality improvement over the previous state of the art, and 5.00x acceleration on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer. The researchers note that state-of-the-art performance across both text-to-image and text-to-video synthesis validates the effectiveness and generalization capability of their training-based framework. Implementation details are included in the supplementary materials, with a full release planned on GitHub to facilitate further research.
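The headline speedups follow directly from how many steps still run the full model. A back-of-the-envelope sketch with illustrative numbers (the step counts are assumptions, not taken from the paper): a 5x speedup corresponds to running the full Diffusion Transformer on one step in five, provided the learned predictor's cost is negligible.

```python
# Feature-caching speedup arithmetic with illustrative numbers.
total_steps = 50       # sampling steps (assumed)
full_evals = 10        # steps that run the full Diffusion Transformer
predictor_cost = 0.0   # predictor cost per skipped step, relative to
                       # one full forward pass (assumed negligible)

cost = full_evals + predictor_cost * (total_steps - full_evals)
speedup = total_steps / cost
print(speedup)  # 5.0
```

With a nonzero `predictor_cost` the speedup drops below the skip ratio, which is why keeping the per-stage predictors small matters.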
Themes
AI acceleration, Diffusion models, Computer vision
Source: arXiv:2602.20497 [cs.CV], submitted 24 Feb 2026.
Title: LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration
Authors: Peiliang Cai, Jiacheng Liu, Haowen Xu, Xinyu Wang, Chang Zou, Linfeng Zhang
Abstract: Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reusing or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often resulting in quality degradation and failing to maintain consistency with the standard denoising process. To address this, we propose a LEarnable Stage-Aware predictor framework based on two-stage training. Our approach leverages a Kolmogorov-Arnold Network to accurately learn temporal feature mappings from data. We further introduce a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting. Extensive experiments show our method achieves significant acceleration while maintaining high-fidelity generation. Experiments demonstrate 5.00x acceleration on FLUX.1-dev with minimal quality degradation (1.0% drop), 6.25x speedup on Qwen-Image with a 20.2% quality improvement over the previous SOTA, and 5.00x acceleration on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer. State-of-the-art performance on both text-to-image and text-to-video synthesis validates the effectiveness and generalization capability of our training-based framework across different models. Our code is included in the supplementary...