BravenNow
S2O: Early Stopping for Sparse Attention via Online Permutation
USA | Technology | Verified source: arxiv.org


#Sparse Attention #Early Stopping #Online Permutation #FlashAttention #Long-Context Inference #Sequence Length #Computational Efficiency #Llama-3.1-8B

📌 Key Takeaways

  • S2O enables early stopping for sparse attention through online permutation
  • The method addresses quadratic scaling limitations of attention mechanisms
  • S2O achieves significant speedups while preserving accuracy
  • The approach breaks through previous sparsity ceilings in attention mechanisms

📖 Full Retelling

Researchers Yu Zhang, Songwei Liu, Chenqian Yan, Sheng Lin, Beichen Ning, Fangmin Chen, and Xing Wang introduced S2O, a method for early stopping in sparse attention via online permutation, in a paper submitted to arXiv on February 26, 2026. The work addresses a fundamental limit on long-context inference: attention cost scales quadratically with sequence length, creating computational bottlenecks when large language models process long inputs.

Existing block-granularity sparsification techniques have hit an intrinsic sparsity ceiling, making further gains difficult even with carefully engineered designs. S2O breaks through this limitation by drawing inspiration from virtual-to-physical address mapping in memory systems: it refactors FlashAttention execution so that inference can load non-contiguous tokens rather than a contiguous span in the original order. By transforming explicit permutation into an online, index-guided loading policy, the method concentrates computation on a small set of high-priority blocks while incurring only lightweight preprocessing and index-remapping overhead. Computation then proceeds from high to low importance; once a block's score falls below a threshold, S2O terminates early and skips the remaining low-contribution blocks.

On Llama-3.1-8B with a 128K context, the researchers report that S2O reduces single-operator mean squared error by 3.82× at matched sparsity and prefill compute density by 3.31× at matched MSE. It preserves end-to-end accuracy while delivering a 7.51× attention speedup and a 3.81× end-to-end speedup, a significant advance in efficient long-context inference.
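The two ideas described above, importance-guided online permutation and threshold-based early stopping, can be pictured with a minimal sketch. This is a hypothetical illustration, not the paper's implementation; the function name, scores, and threshold are assumptions made for clarity.

```python
def s2o_early_stop_order(block_scores, threshold):
    """Return the indices of blocks to process, visited in descending
    importance, stopping once a score falls below the threshold.

    Illustrative sketch of S2O's control flow (not the paper's code):
    block_scores stands in for per-block importance estimates produced
    by lightweight preprocessing.
    """
    # Online permutation: visit blocks by importance, not by position.
    order = sorted(range(len(block_scores)),
                   key=lambda i: block_scores[i], reverse=True)
    kept = []
    for i in order:
        if block_scores[i] < threshold:
            break  # early stop: every remaining block scores even lower
        kept.append(i)
    return kept
```

Because the blocks are visited in sorted order, a single threshold comparison suffices to discard all remaining low-contribution blocks at once, which is what raises the effective sparsity under a controlled error budget. For example, `s2o_early_stop_order([0.05, 0.9, 0.3, 0.01], 0.1)` returns `[1, 2]`.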

🏷️ Themes

Machine Learning Optimization, Attention Mechanisms, Computational Efficiency

📚 Related People & Topics

Transformer (deep learning)

Algorithm for modelling sequential data

In deep learning, the transformer is an artificial neural network architecture based on the multi-head attention mechanism, in which text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table.


Early stopping

Method in machine learning

In machine learning, early stopping is a form of regularization used to avoid overfitting when training a model with an iterative method, such as gradient descent. Such methods update the model to make it better fit the training data with each iteration.


Original Source
Computer Science > Machine Learning | arXiv:2602.22575 [Submitted on 26 Feb 2026]

Title: S2O: Early Stopping for Sparse Attention via Online Permutation
Authors: Yu Zhang, Songwei Liu, Chenqian Yan, Sheng Lin, Beichen Ning, Fangmin Chen, Xing Wang

Abstract: Attention scales quadratically with sequence length, fundamentally limiting long-context inference. Existing block-granularity sparsification can reduce latency, but coarse blocks impose an intrinsic sparsity ceiling, making further improvements difficult even with carefully engineered designs. We present S2O, which performs early stopping for sparse attention via online permutation. Inspired by virtual-to-physical address mapping in memory systems, S2O revisits and factorizes FlashAttention execution, enabling inference to load non-contiguous tokens rather than a contiguous span in the original order. Motivated by fine-grained structures in attention heatmaps, we transform explicit permutation into an online, index-guided, discrete loading policy; with extremely lightweight preprocessing and index-remapping overhead, it concentrates importance on a small set of high-priority blocks. Building on this importance-guided online permutation for loading, S2O further introduces an early-stopping rule: computation proceeds from high to low importance; once the current block score falls below a threshold, S2O terminates early and skips the remaining low-contribution blocks, thereby increasing effective sparsity and reducing computation under a controlled error budget. As a result, S2O substantially raises the practical sparsity ceiling.
On Llama-3.1-8B under a 128K context, S2O reduces single-operator MSE by 3.82× at matched sparsity, and reduces prefill compute density by 3.31× at matched MSE; meanwhile, it preserves end-to-end accuracy and achieves a 7.51× attention speedup and a 3.81× end-to-end speedup.
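The virtual-to-physical analogy in the abstract can be pictured as a small gather step: rather than reading KV blocks in their stored, contiguous order, inference follows an index table built during preprocessing. The sketch below is a hedged illustration under assumed names (`kv_cache`, `index_table`); it is not the paper's API.

```python
def gather_blocks(kv_cache, index_table):
    """Load non-contiguous KV blocks in the order given by index_table,
    analogous to translating virtual block indices to physical ones.

    Illustrative only: kv_cache is a list of block payloads, and
    index_table is the importance-sorted visit order produced upstream.
    """
    # The index table plays the role of a page table: the logical visit
    # order (by importance) maps onto scattered physical positions.
    return [kv_cache[i] for i in index_table]
```

For instance, `gather_blocks(["b0", "b1", "b2", "b3"], [2, 0, 3])` yields `["b2", "b0", "b3"]`, loading blocks in importance order while skipping the unlisted one entirely.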