
One Size Does Not Fit All: Token-Wise Adaptive Compression for KV Cache

#KV cache #adaptive compression #large language models #memory optimization #token-wise #AI scalability #model efficiency

📌 Key Takeaways

  • Researchers propose DynaKV, a post-training framework for token-wise adaptive low-rank compression of the KV cache, targeting the memory bottleneck of large language model inference.
  • The method dynamically allocates a compression rate to each token according to its semantic importance, preserving fidelity far better than uniform approaches at aggressive compression ratios (see the sketch after this list).
  • It consistently outperforms existing state-of-the-art compression techniques; combined with SnapKV, it retains only 6% of the KV cache while keeping 94% of baseline performance on LongBench.
  • Because it requires no pre-training from scratch, the approach eases deployment of LLMs on resource-constrained devices.
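
The second bullet is the heart of the idea, so here is a minimal sketch of what a per-token rate allocator could look like. The norm-based importance proxy and the quantile-to-rank mapping are illustrative assumptions, not the paper's actual scoring rule.

```python
import torch

def allocate_ranks(keys: torch.Tensor,
                   ranks=(16, 32, 64)) -> torch.Tensor:
    """Pick a compression rank for every token in the cache.

    keys: [seq_len, d] key vectors for one attention head.
    Importance is proxied by key norm (a hypothetical stand-in for
    DynaKV's semantic scoring): the least important third of tokens
    gets the smallest rank, the most important third the largest.
    """
    importance = keys.norm(dim=-1)                        # [seq_len]
    cuts = torch.quantile(importance, torch.tensor([1/3, 2/3]))
    out = torch.full((keys.shape[0],), ranks[1])          # middle rank
    out[importance < cuts[0]] = ranks[0]                  # compress hard
    out[importance >= cuts[1]] = ranks[2]                 # keep detail
    return out                                            # [seq_len]
```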

📖 Full Retelling

arXiv:2603.04411v1 Announce Type: cross Abstract: Despite the remarkable progress of Large Language Models (LLMs), the escalating memory footprint of the Key-Value (KV) cache remains a critical bottleneck for efficient inference. While dimensionality reduction offers a promising compression avenue, existing approaches typically either necessitate prohibitively expensive pre-training from scratch or suffer from severe performance deterioration under high compression regimes. In this work, we propose DynaKV, a novel post-training framework for low-rank KV cache compression. To the best of our knowledge, DynaKV is the first method to dynamically allocate compression rates to individual tokens according to their semantic meaning, which allows it to achieve better fidelity at aggressive compression ratios. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art compression techniques, achieving significant memory reduction while maintaining competitive generation quality. Furthermore, our approach is orthogonal to sequence-level pruning methods. When integrated with SnapKV, DynaKV retains only 6% of the KV cache while maintaining 94% of the baseline performance on the LongBench benchmark.
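
Low-rank compression here means each cached key or value vector is stored in a smaller subspace and expanded back on use. Below is a minimal sketch of that projection step, assuming a shared basis calibrated offline by SVD and reusing the hypothetical `allocate_ranks` helper above; none of these names or design choices come from the paper.

```python
import torch

def calibrate_basis(kv_calib: torch.Tensor) -> torch.Tensor:
    """Shared orthonormal basis from calibration key/value vectors.

    kv_calib: [n_samples, d] with n_samples >= d. Rows of the result
    are right singular vectors, ordered by captured energy, so the
    first r rows span the best rank-r subspace for this data.
    """
    _, _, vh = torch.linalg.svd(kv_calib, full_matrices=False)
    return vh                                              # [d, d]

def compress(vec: torch.Tensor, r: int, basis: torch.Tensor):
    """Keep only the first r coordinates in the shared basis."""
    return vec @ basis[:r].T                               # [r]

def decompress(code: torch.Tensor, r: int, basis: torch.Tensor):
    """Approximate reconstruction back to full dimension d."""
    return code @ basis[:r]                                # [d]

# Usage (hypothetical): compress every cached key with its own rank.
# keys: [seq_len, d]; calib: [n_samples, d] offline calibration data.
# basis = calibrate_basis(calib)
# codes = [compress(k, int(r), basis)
#          for k, r in zip(keys, allocate_ranks(keys))]
```

Average memory then scales with the mean assigned rank divided by the head dimension rather than with one global ratio, which is what lets aggressive average compression coexist with high fidelity on the few tokens that matter.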

🏷️ Themes

AI Efficiency, Model Compression


Original Source
Computer Science > Computation and Language

arXiv:2603.04411 [Submitted on 3 Feb 2026]

Title: One Size Does Not Fit All: Token-Wise Adaptive Compression for KV Cache

Authors: Liming Lu, Kaixi Qiu, Jiayu Zhou, Jushi Kai, Haoyan Zhang, Huanyu Wang, Jingwen Leng, Ziwei He, Zhouhan Lin

Abstract: Despite the remarkable progress of Large Language Models (LLMs), the escalating memory footprint of the Key-Value (KV) cache remains a critical bottleneck for efficient inference. While dimensionality reduction offers a promising compression avenue, existing approaches typically either necessitate prohibitively expensive pre-training from scratch or suffer from severe performance deterioration under high compression regimes. In this work, we propose DynaKV, a novel post-training framework for low-rank KV cache compression. To the best of our knowledge, DynaKV is the first method to dynamically allocate compression rates to individual tokens according to their semantic meaning, which allows it to achieve better fidelity at aggressive compression ratios. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art compression techniques, achieving significant memory reduction while maintaining competitive generation quality. Furthermore, our approach is orthogonal to sequence-level pruning methods. When integrated with SnapKV, DynaKV retains only 6% of the KV cache while maintaining 94% of the baseline performance on the LongBench benchmark.

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Cite as: arXiv:2603.04411 [cs.CL] (arXiv:2603.04411v1 [cs.CL] for this version)

DOI: https://doi.org/10.48550/arXiv.2603.04411

Submission history: From Liming Lu, [v1] Tue, 3 Feb 2026 13:20:36...
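
The abstract's closing claim, that DynaKV is orthogonal to sequence-level pruning, means the two budgets multiply: pruning decides which tokens stay, rank allocation decides how many dimensions each survivor keeps. The sketch below composes the two under stated assumptions: the attention-mass pruning signal stands in for SnapKV's observation-window scoring, and `calibrate_basis` is the hypothetical helper from the earlier sketch.

```python
import torch

def prune_then_compress(keys, values, attn_scores, basis,
                        keep_frac=0.25, ranks=(16, 64)):
    """Toy pipeline: SnapKV-style sequence pruning followed by
    token-wise low-rank compression of the surviving entries.

    attn_scores: [seq_len] attention mass each token received
    (SnapKV derives this from an observation window; here it is
    simply assumed to be given).
    """
    seq_len, d = keys.shape
    n_keep = max(1, int(seq_len * keep_frac))
    # Sequence level: drop everything but the most-attended tokens.
    keep = attn_scores.topk(n_keep).indices.sort().values
    k_kept, v_kept = keys[keep], values[keep]
    # Token level: higher rank for the more salient survivors
    # (median split on key norm -- an illustrative proxy).
    imp = k_kept.norm(dim=-1)
    median = imp.median()
    entries = []
    for i in range(n_keep):
        r = ranks[1] if imp[i] >= median else ranks[0]
        entries.append((k_kept[i] @ basis[:r].T,     # compressed key
                        v_kept[i] @ basis[:r].T,     # compressed value
                        r))
    return keep, entries
```

With these toy numbers, keeping 25% of tokens at an average rank of 40 against a head dimension of 128 stores roughly 8% of the original cache, in the same ballpark as the paper's reported 6%, though the learned allocation behind that figure is presumably much sharper than a norm-based proxy.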
Read full article at source

Source

arxiv.org
