Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection
#KV cache #attention selection #transformer models #memory reduction #large language models
Key Takeaways
- Researchers propose a method to reduce KV cache size in transformer models.
- The approach uses low-dimensional attention selection to compress key vectors.
- This reduces memory usage while maintaining model performance.
- The technique aims to make large language models more efficient for deployment.
Full Retelling
arXiv:2603.04427v1 Announce Type: cross
Abstract: Standard transformer attention uses identical dimensionality for queries, keys, and values ($d_q = d_k = d_v = d_{\text{model}}$). Our insight is that these components serve fundamentally different roles, and this symmetry is unnecessary. Queries and keys produce scalar attention weights ("selection"), while values carry rich semantic representations ("value transfer"). We argue that selection is an inherently lower-dimensional operation than value transfer, requiring only $O(\log N)$ dimensions to distinguish among $N$ relevant patterns.
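The asymmetry the abstract describes can be sketched concretely: project queries and keys into a small selection space of dimension $d_{\text{select}} < d_{\text{model}}$, while values keep full width. A minimal single-head NumPy sketch (dimensions and initialization are illustrative, not taken from the paper):

```python
import numpy as np

def asymmetric_attention(x, Wq, Wk, Wv, Wo):
    """Single-head causal attention with thin queries/keys and full values.

    x:  (N, d_model) token embeddings
    Wq: (d_model, d_select) thin query projection
    Wk: (d_model, d_select) thin key projection -- only d_select dims are cached
    Wv: (d_model, d_model)  full value projection
    Wo: (d_model, d_model)  output projection
    """
    d_select = Wq.shape[1]
    Q = x @ Wq                               # (N, d_select)
    K = x @ Wk                               # (N, d_select)
    V = x @ Wv                               # (N, d_model)
    scores = Q @ K.T / np.sqrt(d_select)     # scale by the selection dim
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)  # causal mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)        # row-wise softmax
    return (w @ V) @ Wo

rng = np.random.default_rng(0)
d_model, d_select, N = 64, 16, 8             # d_select = d_model / 4
x  = rng.standard_normal((N, d_model))
Wq = rng.standard_normal((d_model, d_select)) / np.sqrt(d_model)
Wk = rng.standard_normal((d_model, d_select)) / np.sqrt(d_model)
Wv = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
Wo = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
out = asymmetric_attention(x, Wq, Wk, Wv, Wo)
```

Note that only the $(N, d_{\text{select}})$ key matrix enters the per-token cache, which is where the memory saving comes from.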
๐ท๏ธ Themes
AI Efficiency, Model Optimization
Original Source
Computer Science > Machine Learning
arXiv:2603.04427 [Submitted on 16 Feb 2026]
Title: Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection
Authors: Hengshuai Yao, Guan Wang
Abstract: Standard transformer attention uses identical dimensionality for queries, keys, and values ($d_q = d_k = d_v = d_{\text{model}}$). Our insight is that these components serve fundamentally different roles, and this symmetry is unnecessary. Queries and keys produce scalar attention weights ("selection"), while values carry rich semantic representations ("value transfer"). We argue that selection is an inherently lower-dimensional operation than value transfer, requiring only $O(\log N)$ dimensions to distinguish among $N$ relevant patterns. We validate this hypothesis across seven experiments: (1) positional selection tasks requiring just 1 dimension per head, (2) content-based retrieval requiring $\approx \log_2 N$ dimensions, (3-4) WikiText-2 and WikiText-103 language modeling where $d_{\text{select}} = d_{\text{model}}/4$ incurs only a 4.3% perplexity increase while reducing QK parameters by 75%, (5) post-training SVD compression of GPT-2, revealing keys to be far more compressible than queries, with lightweight QK fine-tuning recovering nearly all quality loss, (6) a 125M-parameter LLaMA model confirming identical degradation ratios across architectures, and (7) Mistral-7B (7.2B parameters), where SVD compression followed by QK fine-tuning achieves 75% key cache savings at just 2.0% residual quality cost. For existing models, SVD compression followed by QK fine-tuning (3 epochs on a small fraction of pretraining data) achieves 75% key cache savings at <2% residual quality cost.
For a 7B-parameter model serving 128K context, asymmetric attention saves 25 GB of KV cache per user, enabling approximately 60% more concurrent ...