From Prompt to Product: A Human-Centered Benchmark of Agentic App Generation Systems
#agentic AI systems #prompt-to-app #benchmark evaluation #full-stack web applications #human-centered criteria #AI software development #arXiv research
📌 Key Takeaways
- Researchers developed a new benchmark for evaluating AI systems that generate web applications from natural language prompts
- The benchmark addresses the frequent misalignment between visual polish, functional correctness, and user trust
- The human-centered approach focuses on realistic evaluation criteria rather than just technical metrics
- This work provides clarity for comparing different AI-powered development tools in the market
📖 Full Retelling
In December 2025, researchers introduced a new benchmark for evaluating agentic AI systems that generate full-stack web applications from natural language prompts, addressing the challenge of assessing these emerging technologies when visual polish, functional correctness, and user trust are often misaligned. The paper, titled 'From Prompt to Product: A Human-Centered Benchmark of Agentic App Generation Systems' and published on arXiv as 2512.18080v2, contributes to the field of AI-assisted software development. As agentic AI systems continue to evolve, the ability to generate complete web applications from simple prompts represents a paradigm shift in how software is created and maintained; yet the current lack of standardized evaluation metrics makes it difficult for researchers and developers to compare different approaches objectively. The benchmark therefore focuses on human-centered evaluation criteria, recognizing that traditional technical metrics alone may not capture the full user experience of AI-generated applications, and establishes a framework that clarifies how existing prompt-to-app tools perform under realistic conditions.
🏷️ Themes
AI evaluation, Software development, Human-centered design
Original Source
arXiv:2512.18080v2 Announce Type: replace-cross
Abstract: Agentic AI systems capable of generating full-stack web applications from natural language prompts ("prompt-to-app") represent a significant shift in software development. However, evaluating these systems remains challenging, as visual polish, functional correctness, and user trust are often misaligned. As a result, it is unclear how existing prompt-to-app tools compare under realistic, human-centered evaluation criteria. In this paper…