- KnapSpec reformulates draft model selection as a knapsack problem to maximize throughput.
- The framework dynamically adapts to computational overhead in long-context scenarios.
- It achieves up to 1.47x speedup over existing self-speculative decoding (SSD) methods.
- KnapSpec requires no additional training and maintains output distribution fidelity.
- The method provides a theoretical analysis for predicting the token acceptance rate.
📖 Full Retelling
Researchers Seongjin Cha, Gyuwan Kim, Dongsu Han, Tao Yang, and Insu Han introduced KnapSpec, a training-free framework for accelerating large language model (LLM) inference, in a paper submitted to arXiv on February 23, 2026. The work addresses a limitation of existing self-speculative decoding methods: they fail to account for the dynamic computational overhead that arises in long-context scenarios.

Self-speculative decoding (SSD) accelerates LLM inference by skipping layers of the target model to create an efficient draft model, yet current approaches often rely on static heuristics that ignore the varying computational demands of attention when processing long texts. This becomes particularly problematic as language models increasingly handle extended contexts, where attention calculations grow expensive and variable.

KnapSpec reformulates draft model selection as a knapsack problem, aiming to maximize tokens-per-time throughput. By decoupling Attention and MLP layers and modeling their hardware-specific latencies as functions of context length, the framework identifies optimal draft configurations on the fly via a parallel dynamic programming algorithm. The authors also provide a theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate.

Experiments on Qwen3 and Llama3 models showed that KnapSpec consistently outperforms state-of-the-art SSD baselines, achieving up to 1.47x wall-clock speedup across various benchmarks, without requiring additional training and without altering the target model's output distribution.
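The layer-selection idea can be sketched as a 0/1 knapsack over layers: each layer costs some latency at the current context length, and keeping it in the draft contributes some importance score (e.g. one derived from cosine similarity of hidden states). The sketch below is illustrative only; the budget, latencies, and importance values are invented, and the paper's actual formulation decouples Attention and MLP sub-layers and optimizes tokens-per-time rather than a raw importance sum.

```python
def select_draft_layers(latencies, importances, latency_budget):
    """Hypothetical sketch: choose which layers the draft model keeps.

    Each layer is a knapsack item whose "weight" is its measured
    latency (integer units) and whose "value" is an importance score.
    Standard 0/1 knapsack DP with backtracking to recover the set.
    """
    budget = int(latency_budget)
    n = len(latencies)
    dp = [0.0] * (budget + 1)          # dp[b]: best importance within budget b
    keep = [[False] * (budget + 1) for _ in range(n)]
    for i, (lat, imp) in enumerate(zip(latencies, importances)):
        # Iterate budgets downward so each layer is selected at most once.
        for b in range(budget, int(lat) - 1, -1):
            if dp[b - int(lat)] + imp > dp[b]:
                dp[b] = dp[b - int(lat)] + imp
                keep[i][b] = True
    # Backtrack to recover the chosen layer indices.
    chosen, b = [], budget
    for i in range(n - 1, -1, -1):
        if keep[i][b]:
            chosen.append(i)
            b -= int(latencies[i])
    return sorted(chosen)

# Toy run: four layers, latency budget of 6 units.
print(select_draft_layers([3, 2, 4, 1], [0.9, 0.4, 0.8, 0.2], 6))  # → [0, 1, 3]
```

In this toy run the cheap-but-useful layers 0, 1, and 3 fit the budget together (total latency 6, importance 1.5), beating the single heavier layer 2.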
🏷️ Themes
AI acceleration, Computational efficiency, Model optimization, Hardware adaptation
The knapsack problem is the following problem in combinatorial optimization:
Given a set of items, each with a weight and a value, determine which items to include in the collection so that the total weight is less than or equal to a given limit and the total value is as large as possible.
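Under these definitions, the 0/1 variant (each item included at most once) has a standard dynamic-programming solution, sketched here in Python with made-up example items:

```python
def knapsack(weights, values, capacity):
    """Classic 0/1 knapsack via dynamic programming.

    dp[w] holds the best total value achievable with weight limit w
    using only the items considered so far.
    """
    dp = [0] * (capacity + 1)
    for weight, value in zip(weights, values):
        # Iterate capacities downward so each item is used at most once.
        for w in range(capacity, weight - 1, -1):
            dp[w] = max(dp[w], dp[w - weight] + value)
    return dp[capacity]

# Items (weight, value): (2, 3), (3, 4), (4, 5); weight limit 5.
print(knapsack([2, 3, 4], [3, 4, 5], 5))  # → 7 (take the first two items)
```

The table update runs in O(n · capacity) time, which is why the problem is called weakly NP-hard: it is tractable when the capacity is polynomially bounded.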
It derives its name from the problem faced by someone who is constrained by a fixed-size knapsack and must fill it with the most valuable items.
Dynamic programming is both a mathematical optimization method and an algorithmic paradigm. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, such as aerospace engineering and economics.
In both contexts it refers to simplifying a complicated problem by breaking it down into simpler sub-problems in a recursive manner.
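A minimal illustration of the paradigm, unrelated to the paper's specific algorithm: memoizing recursive Fibonacci caches each overlapping subproblem so it is solved once, turning exponential recomputation into linear work.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    """Naive recursion recomputes fib(k) exponentially often;
    memoization stores each result so every subproblem is solved once."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(50))  # → 12586269025 (infeasible without memoization)
```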
Original Source
Computer Science > Machine Learning
arXiv:2602.20217 [Submitted on 23 Feb 2026]
Title: KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem
Authors: Seongjin Cha, Gyuwan Kim, Dongsu Han, Tao Yang, Insu Han
Abstract: Self-speculative decoding accelerates LLM inference by skipping layers to create an efficient draft model, yet existing methods often rely on static heuristics that ignore the dynamic computational overhead of attention in long-context scenarios. We propose KnapSpec, a training-free framework that reformulates draft model selection as a knapsack problem to maximize tokens-per-time throughput. By decoupling Attention and MLP layers and modeling their hardware-specific latencies as functions of context length, KnapSpec adaptively identifies optimal draft configurations on the fly via a parallel dynamic programming algorithm. Furthermore, we provide the first rigorous theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate. This foundation allows our method to maintain high drafting faithfulness while navigating the shifting bottlenecks of real-world hardware. Our experiments on Qwen3 and Llama3 demonstrate that KnapSpec consistently outperforms state-of-the-art SSD baselines, achieving up to 1.47x wall-clock speedup across various benchmarks. Our plug-and-play approach ensures high-speed inference for long sequences without requiring additional training or compromising the target model's output distribution.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.20217 [cs.LG] (or arXiv:2602.20217v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2602.20217