# AI Efficiency
Latest news articles tagged with "AI Efficiency". Follow the timeline of events, related topics, and entities.
Articles (30)
🇺🇸 Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models
[USA]
arXiv:2604.06871v1 Announce Type: cross Abstract: Large Speech Language Models (LSLMs) typically operate at high token rates (tokens/s) to ensure acoustic fidelity, yet this results in sequence lengt...
Related: #Model Architecture, #Computational Cost

🇺🇸 The Detection--Extraction Gap: Models Know the Answer Before They Can Say It
[USA]
arXiv:2604.06613v1 Announce Type: cross Abstract: Modern reasoning models continue generating long after the answer is already determined. Across five model configurations, two families, and three be...
Related: #Reasoning Models, #Computational Waste

🇺🇸 Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models
[USA]
arXiv:2603.20161v1 Announce Type: cross Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks. However, the truthfulness of their outputs is not guaran...
Related: #Uncertainty Quantification
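The item above concerns grouping sampled answers by meaning before quantifying uncertainty. As a hedged illustration of that general idea only (not the paper's algorithm; `cluster_entropy`, `norm`, and `same` are invented names, and the string-normalizing equivalence test is an assumption), one can cluster samples and take the entropy of the cluster distribution:

```python
import math

# Toy sketch: cluster sampled answers by an equivalence test, then score
# uncertainty as the Shannon entropy of the cluster probabilities.
def cluster_entropy(samples, same_meaning):
    clusters = []
    for s in samples:
        for c in clusters:
            if same_meaning(s, c[0]):   # join the first matching cluster
                c.append(s)
                break
        else:                           # no cluster matched: start a new one
            clusters.append([s])
    n = len(samples)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

# Crude "same meaning" stand-in: case- and punctuation-insensitive match.
norm = lambda s: s.lower().strip(".!? ")
same = lambda a, b: norm(a) == norm(b)
```

Identical answers give entropy 0; the more the samples disagree, the higher the score — the intuition behind using clustering to make uncertainty quantification cheaper.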
🇺🇸 Utility-Guided Agent Orchestration for Efficient LLM Tool Use
[USA]
arXiv:2603.19896v1 Announce Type: new Abstract: Tool-using large language model (LLM) agents often face a fundamental tension between answer quality and execution cost. Fixed workflows are stable but...
Related: #Tool Orchestration

🇺🇸 MineDraft: A Framework for Batch Parallel Speculative Decoding
[USA]
arXiv:2603.18016v1 Announce Type: cross Abstract: Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently ver...
Related: #Parallel Computing
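Speculative decoding, named in the MineDraft abstract above, is easy to sketch in miniature. This is the generic greedy variant, not MineDraft's batch-parallel framework; the integer "models" and `speculative_step` helper are invented for illustration:

```python
# Generic greedy speculative decoding in miniature: a cheap draft model
# proposes k tokens, the target model checks each position, and the longest
# agreeing prefix (plus one correction) is accepted per step.
def speculative_step(target, draft, prefix, k=4):
    proposed, ctx = [], list(prefix)
    for _ in range(k):                  # draft rolls out k tokens
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in proposed:                  # sequential stand-in for the target's
        want = target(ctx)              # single parallel verification pass
        if want == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(want)       # take the target's token and stop
            break
    return accepted

# Deterministic toy "models" over integer tokens.
target = lambda ctx: (sum(ctx) + 1) % 7
draft = lambda ctx: (sum(ctx) + 1) % 7 if len(ctx) % 3 else sum(ctx) % 7
```

Whenever the draft agrees with the target, several tokens are committed for a single verification pass, which is where the speedup comes from.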
🇺🇸 LuMamba: Latent Unified Mamba for Electrode Topology-Invariant and Efficient EEG Modeling
[USA]
arXiv:2603.19100v1 Announce Type: new Abstract: Electroencephalography (EEG) enables non-invasive monitoring of brain activity across clinical and neurotechnology applications, yet building foundatio...
Related: #EEG Analysis

🇺🇸 CAFlow: Adaptive-Depth Single-Step Flow Matching for Efficient Histopathology Super-Resolution
[USA]
arXiv:2603.18513v1 Announce Type: cross Abstract: In digital pathology, whole-slide images routinely exceed gigapixel resolution, making computationally intensive generative super-resolution (SR) imp...
Related: #Medical Imaging

🇺🇸 HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering
[USA]
arXiv:2603.18558v1 Announce Type: cross Abstract: Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language mode...
Related: #Video Analysis

🇺🇸 RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference
[USA]
arXiv:2603.17891v1 Announce Type: cross Abstract: Post training quantization is essential for deploying large language models (LLMs) on resource constrained hardware, yet state of the art methods enf...
Related: #Model Optimization
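Mixed-precision post-training quantization, as in the RAMP abstract above, assigns different bit-widths per layer. A minimal sketch of the generic error-budget idea follows (RAMP itself learns the assignment with reinforcement learning; `quantize` and `pick_bits` are invented names, not the paper's API):

```python
# Generic mixed-precision idea: per layer, pick the smallest bit-width whose
# round-trip quantization error stays under an accuracy budget.
def quantize(w, bits):
    """Symmetric uniform quantization of a list of floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in w) / qmax or 1.0   # avoid a zero scale
    return [round(x / scale) * scale for x in w]

def pick_bits(w, budget, choices=(2, 4, 8)):
    for b in choices:                              # try low precision first
        q = quantize(w, b)
        mse = sum((a - c) ** 2 for a, c in zip(w, q)) / len(w)
        if mse <= budget:
            return b
    return choices[-1]
```

Loosening the budget pushes more layers down to 2-bit; tightening it forces higher precision, which is the knob such methods tune per layer.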
🇺🇸 InfoDensity: Rewarding Information-Dense Traces for Efficient Reasoning
[USA]
arXiv:2603.17310v1 Announce Type: new Abstract: Large Language Models (LLMs) with extended reasoning capabilities often generate verbose and redundant reasoning traces, incurring unnecessary computat...
Related: #Reasoning Optimization

🇺🇸 DANCE: Dynamic 3D CNN Pruning: Joint Frame, Channel, and Feature Adaptation for Energy Efficiency on the Edge
[USA]
arXiv:2603.17275v1 Announce Type: cross Abstract: Modern convolutional neural networks (CNNs) are workhorses for video and image processing, but fail to adapt to the computational complexity of input...
Related: #Edge Computing

🇺🇸 Empirical Recipes for Efficient and Compact Vision-Language Models
[USA]
arXiv:2603.16987v1 Announce Type: cross Abstract: Deploying vision-language models (VLMs) in resource-constrained settings demands low latency and high throughput, yet existing compact VLMs often fal...
Related: #Model Optimization

🇺🇸 Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs
[USA]
arXiv:2603.16932v1 Announce Type: cross Abstract: Vision-language models (VLMs) typically process images at a native high-resolution, forcing a trade-off between accuracy and computational efficiency...
Related: #Computer Vision

🇺🇸 Did You Check the Right Pocket? Cost-Sensitive Store Routing for Memory-Augmented Agents
[USA]
arXiv:2603.15658v1 Announce Type: new Abstract: Memory-augmented agents maintain multiple specialized stores, yet most systems retrieve from all stores for every query, increasing cost and introducin...
Related: #Memory Management
🇺🇸 100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models
[USA]
arXiv:2603.15970v1 Announce Type: cross Abstract: Several data warehouse and database providers have recently introduced extensions to SQL called AI Queries, enabling users to specify functions and c...
Related: #Cost Reduction

🇺🇸 Parallel In-context Learning for Large Vision Language Models
[USA]
arXiv:2603.16092v1 Announce Type: cross Abstract: Large vision-language models (LVLMs) employ multi-modal in-context learning (MM-ICL) to adapt to new tasks by leveraging demonstration examples. Whil...
Related: #Multimodal Learning

🇺🇸 FastODT: A tree-based framework for efficient continual learning
[USA]
arXiv:2603.13276v1 Announce Type: cross Abstract: Machine learning models deployed in real-world settings must operate under evolving data distributions and constrained computational resources. This ...
Related: #Machine Learning

🇺🇸 RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse
[USA]
arXiv:2603.13289v1 Announce Type: cross Abstract: The increasing complexity of AI tasks has shifted the paradigm from monolithic models toward multi-agent large language model (LLM) systems. However,...
Related: #LLM Collaboration
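KV cache reuse, the mechanism named in the RelayCaching abstract above, can be illustrated with a toy prefix cache (this is the generic prefix-sharing idea, not RelayCaching's decoding-cache transfer; the `PrefixCache` class and its "KV state" are invented for this sketch):

```python
# Toy prefix KV cache shared across models/agents: encoding a prompt reuses
# the longest cached prefix and only "computes" the uncached suffix.
class PrefixCache:
    def __init__(self):
        self.store = {}      # prefix tuple -> cached KV state
        self.computed = 0    # tokens actually computed (the cost metric)

    def encode(self, tokens):
        """Return the KV state for tokens, reusing cached work."""
        tokens = tuple(tokens)
        hit = 0
        for i in range(len(tokens), 0, -1):   # longest cached prefix
            if tokens[:i] in self.store:
                hit = i
                break
        state = list(self.store.get(tokens[:hit], []))
        for j in range(hit, len(tokens)):     # pay only for the suffix
            state.append(("kv", tokens[j]))
            self.computed += 1
            self.store[tokens[: j + 1]] = list(state)
        return state
```

When a second agent's prompt shares a prefix with the first (say, the same system prompt and question), only the differing tail costs anything — the saving multi-agent cache-reuse systems exploit.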
🇺🇸 Outcome-Aware Tool Selection for Semantic Routers: Latency-Constrained Learning Without LLM Inference
[USA]
arXiv:2603.13426v1 Announce Type: cross Abstract: Semantic routers in LLM inference gateways select tools in the critical request path, where every millisecond of added latency compounds across milli...
Related: #Tool Selection

🇺🇸 ICaRus: Identical Cache Reuse for Efficient Multi Model Inference
[USA]
arXiv:2603.13281v1 Announce Type: cross Abstract: Multi model inference has recently emerged as a prominent paradigm, particularly in the development of agentic AI systems. However, in such scenarios...
Related: #Cache Optimization

🇺🇸 LightMoE: Reducing Mixture-of-Experts Redundancy through Expert Replacing
[USA]
arXiv:2603.12645v1 Announce Type: cross Abstract: Mixture-of-Experts (MoE) based Large Language Models (LLMs) have demonstrated impressive performance and computational efficiency. However, their dep...
Related: #Neural Networks
🇺🇸 Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation
[USA]
arXiv:2603.13017v1 Announce Type: new Abstract: Long conversations with an AI agent create a simple problem for one user: the history is useful, but carrying it verbatim is expensive. We study person...
Related: #Memory Compression

🇺🇸 ARL-Tangram: Unleash the Resource Efficiency in Agentic Reinforcement Learning
[USA]
arXiv:2603.13019v1 Announce Type: cross Abstract: Agentic reinforcement learning (RL) has emerged as a transformative workload in cloud clusters, enabling large language models (LLMs) to solve comple...
Related: #Reinforcement Learning

🇺🇸 TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning
[USA]
arXiv:2603.12529v1 Announce Type: cross Abstract: Large Reasoning Models (LRMs) achieve impressive performance on complex reasoning tasks via Chain-of-Thought (CoT) reasoning, which enables them to g...
Related: #Reasoning Optimization
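Early stopping in chain-of-thought, the subject of the TERMINATOR abstract above, reduces to a simple loop in its crudest form (TERMINATOR learns where to exit; the fixed-threshold `early_exit` below and the toy confidence trace are assumptions made for illustration):

```python
# Generic early exit for chain-of-thought: stop emitting reasoning steps as
# soon as answer confidence clears a threshold; return answer and steps paid.
def early_exit(steps, threshold=0.9):
    cost, answer = 0, None
    for answer, conf in steps:   # steps yield (current answer, confidence)
        cost += 1
        if conf >= threshold:
            break                # confident enough: stop reasoning here
    return answer, cost

# Toy trace: the answer stabilizes long before the trace would end.
trace = [("?", 0.2), ("42", 0.6), ("42", 0.93), ("42", 0.97)]
```

The saving is the tail of steps never generated; the risk, which learned exit policies try to manage, is exiting before the answer has actually settled.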
🇺🇸 Spend Less, Reason Better: Budget-Aware Value Tree Search for LLM Agents
[USA]
arXiv:2603.12634v1 Announce Type: cross Abstract: Test-time scaling has become a dominant paradigm for improving LLM agent reliability, yet current approaches treat compute as an abundant resource, a...
Related: #LLM Optimization
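Budget-aware tree search, as in the abstract above, caps how many nodes get expanded. A generic best-first sketch (not the paper's value model or algorithm; `budgeted_search` and the bit-string toy tree are invented):

```python
import heapq

# Budget-aware best-first search over a toy value tree: expand the most
# promising node until the expansion budget is spent, return the best found.
def budgeted_search(root, children, value, budget):
    best, best_v = root, value(root)
    frontier = [(-best_v, root)]       # max-heap via negated values
    spent = 0
    while frontier and spent < budget:
        _, node = heapq.heappop(frontier)
        for child in children(node):
            spent += 1                 # one unit of compute per expansion
            v = value(child)
            if v > best_v:
                best, best_v = child, v
            heapq.heappush(frontier, (-v, child))
            if spent >= budget:
                break
    return best, spent

# Toy binary tree over bit-strings; value rewards 1s, penalizes depth.
kids = lambda n: [n + "0", n + "1"] if len(n) < 3 else []
val = lambda n: n.count("1") - 0.1 * len(n)
```

With a larger budget the search reaches deeper, better nodes; with a tight one it returns the best shallow candidate, which is the accuracy-for-compute trade such methods make explicit.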
🇺🇸 MXNorm: Reusing MXFP block scales for efficient tensor normalisation
[USA]
arXiv:2603.13180v1 Announce Type: cross Abstract: Matrix multiplication performance has long been the major bottleneck to scaling deep learning workloads, which has stimulated the design of new accel...
Related: #Tensor Normalization

🇺🇸 Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity
[USA]
arXiv:2603.12707v1 Announce Type: cross Abstract: Multimodal large language model (MLLM) inference splits into two phases with opposing hardware demands: vision encoding is compute-bound, while langu...
Related: #GPU Optimization

🇺🇸 When Drafts Evolve: Speculative Decoding Meets Online Learning
[USA]
arXiv:2603.12617v1 Announce Type: cross Abstract: Speculative decoding has emerged as a widely adopted paradigm for accelerating large language model inference, where a lightweight draft model rapidl...
Related: #Machine Learning

🇺🇸 Test-Time Strategies for More Efficient and Accurate Agentic RAG
[USA]
arXiv:2603.12396v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) systems face challenges with complex, multihop questions, and agentic frameworks such as Search-R1 (Jin et al., ...
Related: #RAG Optimization

🇺🇸 Few-for-Many Personalized Federated Learning
[USA]
arXiv:2603.11992v1 Announce Type: new Abstract: Personalized Federated Learning (PFL) aims to train customized models for clients with highly heterogeneous data distributions while preserving data pr...
Related: #Machine Learning, #Data Privacy
Key Entities (8)
- Generative engine optimization (2 news)
- Artificial intelligence (2 news)
- Ramp (disambiguation) (1 news)
- Large language model (1 news)
- D.A.N.C.E. (1 news)
- Energy efficiency (1 news)
- Electroencephalography (1 news)
- Mamba (1 news)
About the topic: AI Efficiency
The topic "AI Efficiency" aggregates 30+ news articles from various countries.