Data-Aware Random Feature Kernel for Transformers


Original Source

Computer Science > Machine Learning
arXiv:2603.04127 [cs.LG] (Submitted on 4 Mar 2026)

Title: Data-Aware Random Feature Kernel for Transformers
Authors: Amirhossein Farzam, Hossein Mobahi, Nolan Andrew Miller, Luke Sernau

Abstract: Transformers excel across domains, yet their quadratic attention complexity poses a barrier to scaling. Random-feature attention, as in Performers, can reduce this cost to linear in the sequence length by approximating the softmax kernel with positive random features drawn from an isotropic distribution. In pretrained models, however, queries and keys are typically anisotropic. This induces high Monte Carlo variance in isotropic sampling schemes unless one retrains the model or uses a large feature budget. Importance sampling can address this by adapting the sampling distribution to the input geometry, but complex data-dependent proposal distributions are often intractable. We show that by data-aligning the softmax kernel, we obtain an attention mechanism that both admits a tractable minimal-variance proposal distribution for importance sampling and exhibits better training stability. Motivated by this finding, we introduce DARKFormer, a Data-Aware Random-feature Kernel transformer built on a data-aligned kernel geometry. DARKFormer learns the random-projection covariance, efficiently realizing an importance-sampled positive random-feature estimator for its data-aligned kernel. Empirically, DARKFormer narrows the performance gap with exact softmax attention, particularly in finetuning regimes where pretrained representations are anisotropic. By combining random-feature efficiency with data-aware kernels, DARKFormer advances kernel-based attention in resource-constrained settings.

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.04127 [cs.LG] (or arXiv:2603.04127v1 [cs.LG] for this version)
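
To make the mechanism concrete, below is a minimal NumPy sketch of the two ingredients the abstract refers to: Performer-style positive random features that approximate the softmax kernel with isotropic Gaussian projections, and a generic importance-sampled variant that draws projections from an anisotropic Gaussian proposal and reweights them so the kernel estimate stays unbiased. The proposal covariance used here (the empirical covariance of the queries and keys), the helper names, and the toy dimensions are illustrative assumptions; DARKFormer itself learns the projection covariance for its data-aligned kernel, which this sketch does not reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)

def prf_features(x, omega, log_is_weight=None):
    """Positive random features for the softmax kernel exp(q.k).

    x:             (n, d) queries or keys
    omega:         (m, d) random projections
    log_is_weight: optional (m,) log importance weights; None reproduces the
                   plain isotropic (Performer-style) estimator
    """
    m = omega.shape[0]
    logits = x @ omega.T - 0.5 * np.sum(x**2, axis=-1, keepdims=True)  # (n, m)
    if log_is_weight is not None:
        logits = logits + 0.5 * log_is_weight  # split each weight across q and k
    return np.exp(logits) / np.sqrt(m)

def linear_attention(phi_q, phi_k, V):
    """Attention linear in sequence length: phi(Q) (phi(K)^T V), row-normalized."""
    kv = phi_k.T @ V                # (m, d_v)
    z = phi_k.sum(axis=0)           # (m,)
    return (phi_q @ kv) / (phi_q @ z)[:, None]

# Toy shapes (illustrative only)
n, d, m, d_v = 128, 16, 256, 8
Q = rng.normal(size=(n, d)) / d**0.25   # scaling both q and k by d**-0.25
K = rng.normal(size=(n, d)) / d**0.25   # gives the usual 1/sqrt(d) on q.k
V = rng.normal(size=(n, d_v))

# 1) Isotropic sampling, as in Performers: omega ~ N(0, I)
omega_iso = rng.normal(size=(m, d))
out_iso = linear_attention(prf_features(Q, omega_iso),
                           prf_features(K, omega_iso), V)

# 2) Anisotropic proposal omega ~ N(0, Sigma) with importance weights
#    log[N(omega; 0, I) / N(omega; 0, Sigma)].  Sigma is taken here as the
#    empirical covariance of Q and K (an assumption); DARKFormer instead
#    learns this covariance.
Sigma = np.cov(np.vstack([Q, K]).T) + 1e-3 * np.eye(d)
L = np.linalg.cholesky(Sigma)
omega_aniso = rng.normal(size=(m, d)) @ L.T
Sigma_inv = np.linalg.inv(Sigma)
_, logdet = np.linalg.slogdet(Sigma)
log_w = 0.5 * (logdet + np.einsum('md,de,me->m', omega_aniso,
                                  Sigma_inv - np.eye(d), omega_aniso))
out_aniso = linear_attention(prf_features(Q, omega_aniso, log_w),
                             prf_features(K, omega_aniso, log_w), V)
```

Under the isotropic scheme, anisotropic queries and keys make the Monte Carlo estimate of exp(q.k) noisy for a fixed feature budget m; the importance weights let the proposal follow the data geometry while the estimator remains unbiased, which is the variance-reduction lever the paper builds on.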

Source

arxiv.org
