BravenNow
AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications
| USA | technology | ✓ Verified - arxiv.org


#AMA-Bench #Long-horizon memory #AI agents #Large Language Models #Memory evaluation #Causality graph #Tool-augmented retrieval #Agent memory

📌 Key Takeaways

  • Researchers introduced AMA-Bench, a new benchmark for evaluating long-horizon memory in AI agents
  • Existing benchmarks focus on dialogue-centric interactions rather than real-world agent-environment interactions
  • AMA-Bench includes both real-world and synthetic agentic trajectories, paired with expert-curated and rule-based QA respectively
  • The proposed AMA-Agent system achieves 57.22% average accuracy, surpassing the strongest memory-system baselines by 11.16%

📖 Full Retelling

A team of researchers led by Yujie Zhao, with 11 co-authors, introduced AMA-Bench (Agent Memory with Any length), a new benchmark for evaluating long-horizon memory in AI agents, in a paper submitted to arXiv on February 26, 2026. The work addresses a significant gap between practical applications and current evaluation standards for agent memory: Large Language Models (LLMs) are increasingly deployed as autonomous agents in complex applications where long-horizon memory is critical for performance, yet existing benchmarks focus primarily on dialogue-centric, human-agent interactions rather than the continuous streams of agent-environment interactions that characterize real-world scenarios.

AMA-Bench features two key components: a set of real-world agentic trajectories across representative applications, paired with expert-curated QA, and a set of synthetic agentic trajectories that scale to arbitrary horizons, paired with rule-based QA, providing a more comprehensive evaluation framework. Through their study, the researchers found that existing memory systems underperform on AMA-Bench primarily because they lack causality and objective information, and because they are constrained by the lossy nature of the similarity-based retrieval that many of them employ. To address these limitations, the authors propose AMA-Agent, a memory system built around a causality graph and tool-augmented retrieval, which achieves 57.22% average accuracy on AMA-Bench and surpasses the strongest memory-system baselines by 11.16%.
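The lossiness of similarity-based retrieval that the researchers identify can be illustrated with a toy sketch. Everything below is invented for illustration (the bag-of-words "embedding", the event strings, and the `retrieve` helper are not from the paper): a top-k similarity search over a trajectory can drop a causally relevant step simply because it shares no surface vocabulary with the query.

```python
import math

def embed(text):
    # Toy bag-of-words "embedding": word -> count. Real systems use
    # dense vectors, but the failure mode sketched here is the same.
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, memory, k=2):
    # One-shot top-k similarity retrieval: only the k most similar
    # entries survive; everything else is discarded.
    q = embed(query)
    scored = sorted(memory, key=lambda m: cosine(q, embed(m)), reverse=True)
    return scored[:k]

trajectory = [
    "agent opened config.yaml",
    "agent edited the timeout field",          # the actual cause
    "agent ran the deployment script",
    "deployment failed with a timeout error",  # the observed effect
]

# The causal step "agent edited the timeout field" scores 0 against this
# query (no shared words) and is dropped from the retrieved context.
print(retrieve("deployment failed", trajectory))
# -> ['deployment failed with a timeout error', 'agent ran the deployment script']
```

Because the cause and the effect are linked causally but not lexically, a memory system that relies only on this kind of retrieval cannot reconstruct why the failure happened.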

🏷️ Themes

Artificial Intelligence, Memory Systems, Evaluation Benchmarks

📚 Related People & Topics

AI agent

Systems that perform tasks without human intervention

In the context of generative artificial intelligence, AI agents (also referred to as compound AI systems or agentic AI) are a class of intelligent agents distinguished by their ability to operate autonomously in complex environments. Agentic AI tools prioritize decision-making over content creation ...


Large language model

Type of machine learning model

A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation. The largest and most capable LLMs are generative pre-trained transformers (GPTs) that provide the c...


Original Source
Computer Science > Artificial Intelligence
arXiv:2602.22769 [Submitted on 26 Feb 2026]

Title: AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

Authors: Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhilash Shankarampeta, Zimeng Huang, Wentao Ni, Yuandong Tian, Jishen Zhao

Abstract: Large Language Models are deployed as autonomous agents in increasingly complex applications, where enabling long-horizon memory is critical for achieving strong performance. However, a significant gap exists between practical applications and current evaluation standards for agent memory: existing benchmarks primarily focus on dialogue-centric, human-agent interactions. In reality, agent memory consists of a continuous stream of agent-environment interactions that are primarily composed of machine-generated representations. To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length), which evaluates long-horizon memory for LLMs in real agentic applications. It features two key components: (1) a set of real-world agentic trajectories across representative agentic applications, paired with expert-curated QA, and (2) a set of synthetic agentic trajectories that scale to arbitrary horizons, paired with rule-based QA. Our comprehensive study shows that existing memory systems underperform on AMA-Bench primarily because they lack causality and objective information and are constrained by the lossy nature of similarity-based retrieval employed by many memory systems. To address these limitations, we propose AMA-Agent, an effective memory system featuring a causality graph and tool-augmented retrieval. Our results demonstrate that AMA-Agent achieves 57.22% average accuracy on AMA-Bench, surpassing the strongest memory system baselines by 11.16%.
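The abstract names two ingredients of AMA-Agent, a causality graph and tool-augmented retrieval, without detailing how they are implemented. The sketch below is a hypothetical illustration of those two ideas only (the `CausalMemory` class, its methods, and the example events are invented here, not taken from the paper): events are stored with explicit cause links, and retrieval is exposed as a callable tool that walks those links rather than a one-shot similarity search.

```python
class CausalMemory:
    """Toy causality-graph memory: each recorded event may point at the
    earlier events that caused it, forming a directed acyclic graph."""

    def __init__(self):
        self.events = []   # event id -> description
        self.causes = {}   # event id -> list of cause event ids

    def record(self, description, caused_by=()):
        eid = len(self.events)
        self.events.append(description)
        self.causes[eid] = list(caused_by)
        return eid

    def trace_causes(self, eid):
        # A retrieval "tool" the agent can invoke: follow cause edges
        # transitively to collect the full causal chain behind an event.
        chain, stack = [], list(self.causes.get(eid, []))
        while stack:
            cur = stack.pop()
            chain.append(self.events[cur])
            stack.extend(self.causes.get(cur, []))
        return chain

mem = CausalMemory()
edit = mem.record("edited the timeout field in config.yaml")
run = mem.record("ran the deployment script", caused_by=[edit])
fail = mem.record("deployment failed with a timeout error", caused_by=[run])

print(mem.trace_causes(fail))
# -> ['ran the deployment script', 'edited the timeout field in config.yaml']
```

Unlike similarity search, the causal chain here is recovered by graph traversal, so a cause is retrieved even when it shares no vocabulary with the question being asked.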

Source

arxiv.org
