# AI Efficiency
Latest news articles tagged with "AI Efficiency". Follow the timeline of events, related topics, and entities.
Articles (30)
🇺🇸 Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models
[USA]
arXiv:2604.06871v1 Announce Type: cross Abstract: Large Speech Language Models (LSLMs) typically operate at high token rates (tokens/s) to ensure acoustic fidelity, yet this results in sequence lengt...
Related: #Model Architecture, #Computational Cost

🇺🇸 The Detection--Extraction Gap: Models Know the Answer Before They Can Say It
[USA]
arXiv:2604.06613v1 Announce Type: cross Abstract: Modern reasoning models continue generating long after the answer is already determined. Across five model configurations, two families, and three be...
Related: #Reasoning Models, #Computational Waste

🇺🇸 Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models
[USA]
arXiv:2603.20161v1 Announce Type: cross Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks. However, the truthfulness of their outputs is not guaran...
Related: #Uncertainty Quantification
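The item above concerns grouping sampled answers by meaning before quantifying uncertainty. As a hedged illustration of that general idea only (not the paper's algorithm; `cluster_entropy`, `norm`, and `same` are invented names, and the string-normalizing equivalence test is an assumption), one can cluster samples and take the entropy of the cluster distribution:

```python
import math

# Toy sketch: cluster sampled answers by an equivalence test, then score
# uncertainty as the Shannon entropy of the cluster probabilities.
def cluster_entropy(samples, same_meaning):
    clusters = []
    for s in samples:
        for c in clusters:
            if same_meaning(s, c[0]):   # join the first matching cluster
                c.append(s)
                break
        else:                           # no cluster matched: start a new one
            clusters.append([s])
    n = len(samples)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

# Crude "same meaning" stand-in: case- and punctuation-insensitive match.
norm = lambda s: s.lower().strip(".!? ")
same = lambda a, b: norm(a) == norm(b)
```

Identical answers give entropy 0; the more the samples disagree, the higher the score — the intuition behind using clustering to make uncertainty quantification cheaper.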
🇺🇸 Utility-Guided Agent Orchestration for Efficient LLM Tool Use
[USA]
arXiv:2603.19896v1 Announce Type: new Abstract: Tool-using large language model (LLM) agents often face a fundamental tension between answer quality and execution cost. Fixed workflows are stable but...
Related: #Tool Orchestration

🇺🇸 MineDraft: A Framework for Batch Parallel Speculative Decoding
[USA]
arXiv:2603.18016v1 Announce Type: cross Abstract: Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently ver...
Related: #Parallel Computing
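Speculative decoding, named in the MineDraft abstract above, is easy to sketch in miniature. This is the generic greedy variant, not MineDraft's batch-parallel framework; the integer "models" and `speculative_step` helper are invented for illustration:

```python
# Generic greedy speculative decoding in miniature: a cheap draft model
# proposes k tokens, the target model checks each position, and the longest
# agreeing prefix (plus one correction) is accepted per step.
def speculative_step(target, draft, prefix, k=4):
    proposed, ctx = [], list(prefix)
    for _ in range(k):                  # draft rolls out k tokens
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in proposed:                  # sequential stand-in for the target's
        want = target(ctx)              # single parallel verification pass
        if want == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(want)       # take the target's token and stop
            break
    return accepted

# Deterministic toy "models" over integer tokens.
target = lambda ctx: (sum(ctx) + 1) % 7
draft = lambda ctx: (sum(ctx) + 1) % 7 if len(ctx) % 3 else sum(ctx) % 7
```

Whenever the draft agrees with the target, several tokens are committed for a single verification pass, which is where the speedup comes from.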
🇺🇸 LuMamba: Latent Unified Mamba for Electrode Topology-Invariant and Efficient EEG Modeling
[USA]
arXiv:2603.19100v1 Announce Type: new Abstract: Electroencephalography (EEG) enables non-invasive monitoring of brain activity across clinical and neurotechnology applications, yet building foundatio...
Related: #EEG Analysis

🇺🇸 CAFlow: Adaptive-Depth Single-Step Flow Matching for Efficient Histopathology Super-Resolution
[USA]
arXiv:2603.18513v1 Announce Type: cross Abstract: In digital pathology, whole-slide images routinely exceed gigapixel resolution, making computationally intensive generative super-resolution (SR) imp...
Related: #Medical Imaging

🇺🇸 HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering
[USA]
arXiv:2603.18558v1 Announce Type: cross Abstract: Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language mode...
Related: #Video Analysis

🇺🇸 RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference
[USA]
arXiv:2603.17891v1 Announce Type: cross Abstract: Post training quantization is essential for deploying large language models (LLMs) on resource constrained hardware, yet state of the art methods enf...
Related: #Model Optimization
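Mixed-precision post-training quantization, as in the RAMP abstract above, assigns different bit-widths per layer. A minimal sketch of the generic error-budget idea follows (RAMP itself learns the assignment with reinforcement learning; `quantize` and `pick_bits` are invented names, not the paper's API):

```python
# Generic mixed-precision idea: per layer, pick the smallest bit-width whose
# round-trip quantization error stays under an accuracy budget.
def quantize(w, bits):
    """Symmetric uniform quantization of a list of floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in w) / qmax or 1.0   # avoid a zero scale
    return [round(x / scale) * scale for x in w]

def pick_bits(w, budget, choices=(2, 4, 8)):
    for b in choices:                              # try low precision first
        q = quantize(w, b)
        mse = sum((a - c) ** 2 for a, c in zip(w, q)) / len(w)
        if mse <= budget:
            return b
    return choices[-1]
```

Loosening the budget pushes more layers down to 2-bit; tightening it forces higher precision, which is the knob such methods tune per layer.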
🇺🇸 InfoDensity: Rewarding Information-Dense Traces for Efficient Reasoning
[USA]
arXiv:2603.17310v1 Announce Type: new Abstract: Large Language Models (LLMs) with extended reasoning capabilities often generate verbose and redundant reasoning traces, incurring unnecessary computat...
Related: #Reasoning Optimization

🇺🇸 DANCE: Dynamic 3D CNN Pruning: Joint Frame, Channel, and Feature Adaptation for Energy Efficiency on the Edge
[USA]
arXiv:2603.17275v1 Announce Type: cross Abstract: Modern convolutional neural networks (CNNs) are workhorses for video and image processing, but fail to adapt to the computational complexity of input...
Related: #Edge Computing

🇺🇸 Empirical Recipes for Efficient and Compact Vision-Language Models
[USA]
arXiv:2603.16987v1 Announce Type: cross Abstract: Deploying vision-language models (VLMs) in resource-constrained settings demands low latency and high throughput, yet existing compact VLMs often fal...
Related: #Model Optimization

🇺🇸 Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs
[USA]
arXiv:2603.16932v1 Announce Type: cross Abstract: Vision-language models (VLMs) typically process images at a native high-resolution, forcing a trade-off between accuracy and computational efficiency...
Related: #Computer Vision

🇺🇸 Did You Check the Right Pocket? Cost-Sensitive Store Routing for Memory-Augmented Agents
[USA]
arXiv:2603.15658v1 Announce Type: new Abstract: Memory-augmented agents maintain multiple specialized stores, yet most systems retrieve from all stores for every query, increasing cost and introducin...
Related: #Memory Management
🇺🇸 100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models
[USA]
arXiv:2603.15970v1 Announce Type: cross Abstract: Several data warehouse and database providers have recently introduced extensions to SQL called AI Queries, enabling users to specify functions and c...
Related: #Cost Reduction

🇺🇸 Parallel In-context Learning for Large Vision Language Models
[USA]
arXiv:2603.16092v1 Announce Type: cross Abstract: Large vision-language models (LVLMs) employ multi-modal in-context learning (MM-ICL) to adapt to new tasks by leveraging demonstration examples. Whil...
Related: #Multimodal Learning

🇺🇸 FastODT: A tree-based framework for efficient continual learning
[USA]
arXiv:2603.13276v1 Announce Type: cross Abstract: Machine learning models deployed in real-world settings must operate under evolving data distributions and constrained computational resources. This ...
Related: #Machine Learning

🇺🇸 RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse
[USA]
arXiv:2603.13289v1 Announce Type: cross Abstract: The increasing complexity of AI tasks has shifted the paradigm from monolithic models toward multi-agent large language model (LLM) systems. However,...
Related: #LLM Collaboration
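KV cache reuse, the mechanism named in the RelayCaching abstract above, can be illustrated with a toy prefix cache (this is the generic prefix-sharing idea, not RelayCaching's decoding-cache transfer; the `PrefixCache` class and its "KV state" are invented for this sketch):

```python
# Toy prefix KV cache shared across models/agents: encoding a prompt reuses
# the longest cached prefix and only "computes" the uncached suffix.
class PrefixCache:
    def __init__(self):
        self.store = {}      # prefix tuple -> cached KV state
        self.computed = 0    # tokens actually computed (the cost metric)

    def encode(self, tokens):
        """Return the KV state for tokens, reusing cached work."""
        tokens = tuple(tokens)
        hit = 0
        for i in range(len(tokens), 0, -1):   # longest cached prefix
            if tokens[:i] in self.store:
                hit = i
                break
        state = list(self.store.get(tokens[:hit], []))
        for j in range(hit, len(tokens)):     # pay only for the suffix
            state.append(("kv", tokens[j]))
            self.computed += 1
            self.store[tokens[: j + 1]] = list(state)
        return state
```

When a second agent's prompt shares a prefix with the first (say, the same system prompt and question), only the differing tail costs anything — the saving multi-agent cache-reuse systems exploit.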
🇺🇸 Outcome-Aware Tool Selection for Semantic Routers: Latency-Constrained Learning Without LLM Inference
[USA]
arXiv:2603.13426v1 Announce Type: cross Abstract: Semantic routers in LLM inference gateways select tools in the critical request path, where every millisecond of added latency compounds across milli...
Related: #Tool Selection

🇺🇸 ICaRus: Identical Cache Reuse for Efficient Multi Model Inference
[USA]
arXiv:2603.13281v1 Announce Type: cross Abstract: Multi model inference has recently emerged as a prominent paradigm, particularly in the development of agentic AI systems. However, in such scenarios...
Related: #Cache Optimization

🇺🇸 LightMoE: Reducing Mixture-of-Experts Redundancy through Expert Replacing
[USA]
arXiv:2603.12645v1 Announce Type: cross Abstract: Mixture-of-Experts (MoE) based Large Language Models (LLMs) have demonstrated impressive performance and computational efficiency. However, their dep...
Related: #Neural Networks
🇺🇸 Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation
[USA]
arXiv:2603.13017v1 Announce Type: new Abstract: Long conversations with an AI agent create a simple problem for one user: the history is useful, but carrying it verbatim is expensive. We study person...
Related: #Memory Compression

🇺🇸 ARL-Tangram: Unleash the Resource Efficiency in Agentic Reinforcement Learning
[USA]
arXiv:2603.13019v1 Announce Type: cross Abstract: Agentic reinforcement learning (RL) has emerged as a transformative workload in cloud clusters, enabling large language models (LLMs) to solve comple...
Related: #Reinforcement Learning

🇺🇸 TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning
[USA]
arXiv:2603.12529v1 Announce Type: cross Abstract: Large Reasoning Models (LRMs) achieve impressive performance on complex reasoning tasks via Chain-of-Thought (CoT) reasoning, which enables them to g...
Related: #Reasoning Optimization
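Early stopping in chain-of-thought, the subject of the TERMINATOR abstract above, reduces to a simple loop in its crudest form (TERMINATOR learns where to exit; the fixed-threshold `early_exit` below and the toy confidence trace are assumptions made for illustration):

```python
# Generic early exit for chain-of-thought: stop emitting reasoning steps as
# soon as answer confidence clears a threshold; return answer and steps paid.
def early_exit(steps, threshold=0.9):
    cost, answer = 0, None
    for answer, conf in steps:   # steps yield (current answer, confidence)
        cost += 1
        if conf >= threshold:
            break                # confident enough: stop reasoning here
    return answer, cost

# Toy trace: the answer stabilizes long before the trace would end.
trace = [("?", 0.2), ("42", 0.6), ("42", 0.93), ("42", 0.97)]
```

The saving is the tail of steps never generated; the risk, which learned exit policies try to manage, is exiting before the answer has actually settled.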
🇺🇸 Spend Less, Reason Better: Budget-Aware Value Tree Search for LLM Agents
[USA]
arXiv:2603.12634v1 Announce Type: cross Abstract: Test-time scaling has become a dominant paradigm for improving LLM agent reliability, yet current approaches treat compute as an abundant resource, a...
Related: #LLM Optimization
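Budget-aware tree search, as in the abstract above, caps how many nodes get expanded. A generic best-first sketch (not the paper's value model or algorithm; `budgeted_search` and the bit-string toy tree are invented):

```python
import heapq

# Budget-aware best-first search over a toy value tree: expand the most
# promising node until the expansion budget is spent, return the best found.
def budgeted_search(root, children, value, budget):
    best, best_v = root, value(root)
    frontier = [(-best_v, root)]       # max-heap via negated values
    spent = 0
    while frontier and spent < budget:
        _, node = heapq.heappop(frontier)
        for child in children(node):
            spent += 1                 # one unit of compute per expansion
            v = value(child)
            if v > best_v:
                best, best_v = child, v
            heapq.heappush(frontier, (-v, child))
            if spent >= budget:
                break
    return best, spent

# Toy binary tree over bit-strings; value rewards 1s, penalizes depth.
kids = lambda n: [n + "0", n + "1"] if len(n) < 3 else []
val = lambda n: n.count("1") - 0.1 * len(n)
```

With a larger budget the search reaches deeper, better nodes; with a tight one it returns the best shallow candidate, which is the accuracy-for-compute trade such methods make explicit.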
🇺🇸 MXNorm: Reusing MXFP block scales for efficient tensor normalisation
[USA]
arXiv:2603.13180v1 Announce Type: cross Abstract: Matrix multiplication performance has long been the major bottleneck to scaling deep learning workloads, which has stimulated the design of new accel...
Related: #Tensor Normalization

🇺🇸 Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity
[USA]
arXiv:2603.12707v1 Announce Type: cross Abstract: Multimodal large language model (MLLM) inference splits into two phases with opposing hardware demands: vision encoding is compute-bound, while langu...
Related: #GPU Optimization

🇺🇸 When Drafts Evolve: Speculative Decoding Meets Online Learning
[USA]
arXiv:2603.12617v1 Announce Type: cross Abstract: Speculative decoding has emerged as a widely adopted paradigm for accelerating large language model inference, where a lightweight draft model rapidl...
Related: #Machine Learning

🇺🇸 Test-Time Strategies for More Efficient and Accurate Agentic RAG
[USA]
arXiv:2603.12396v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) systems face challenges with complex, multihop questions, and agentic frameworks such as Search-R1 (Jin et al., ...
Related: #RAG Optimization

🇺🇸 Few-for-Many Personalized Federated Learning
[USA]
arXiv:2603.11992v1 Announce Type: new Abstract: Personalized Federated Learning (PFL) aims to train customized models for clients with highly heterogeneous data distributions while preserving data pr...
Related: #Machine Learning, #Data Privacy
Key Entities (8)
- Generative engine optimization (2 news)
- Artificial intelligence (2 news)
- Ramp (disambiguation) (1 news)
- Large language model (1 news)
- D.A.N.C.E. (1 news)
- Energy efficiency (1 news)
- Electroencephalography (1 news)
- Mamba (1 news)
About the topic: AI Efficiency
The topic "AI Efficiency" aggregates 30+ news articles from various countries.