Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
#Vibe Code Bench #AI models #web application development #benchmark #software development #coding #evaluation #end-to-end
📌 Key Takeaways
- Vibe Code Bench is a new benchmark for evaluating AI models on end-to-end web application development.
- It assesses AI capabilities in generating functional web applications from high-level descriptions.
- The benchmark aims to measure progress in AI-driven software development and coding tasks.
- It provides a standardized framework for comparing different AI models' performance in real-world web development scenarios.
📖 Full Retelling
Abstract: Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 public validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous browser agent. Across 16 frontier models, the best achieves only 58.0% accuracy on the test split, revealing that reliable end-to-end application development remains a frontier challenge. We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8-93.6% pairwise step-level agreement). Our contributions include (1) a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, (2) a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and (3) an evaluator alignment protocol with both cross-model and human annotation results.
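To make the evaluation setup concrete, here is a minimal sketch of how a step-level score of this kind could be computed. The data model and function names (Substep, Workflow, score_app, run_substep) are hypothetical and are not taken from the released benchmark; the sketch only assumes that each application specification comes with workflows broken into substeps, and that an autonomous browser agent reports pass/fail for each substep against the deployed app.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical data model -- names are illustrative, not from the benchmark release.
@dataclass
class Substep:
    instruction: str        # e.g. "click the 'New task' button"
    expected_outcome: str   # what the browser agent should observe afterwards

@dataclass
class Workflow:
    name: str
    substeps: List[Substep]

def score_app(app_url: str,
              workflows: List[Workflow],
              run_substep: Callable[[str, Substep], bool]) -> float:
    """Step-level accuracy: fraction of substeps judged as passed.

    `run_substep` stands in for an autonomous browser agent that drives the
    deployed application at `app_url` and returns pass/fail for one substep.
    """
    passed = total = 0
    for workflow in workflows:
        for step in workflow.substeps:
            total += 1
            if run_substep(app_url, step):
                passed += 1
    return passed / total if total else 0.0
```

The reported link between self-testing and performance is an ordinary Pearson correlation across models. The toy snippet below shows that computation with placeholder numbers (not values from the paper), assuming one self-testing rate and one accuracy figure per evaluated model.

```python
import numpy as np

# Placeholder values, NOT results from the paper: one entry per model.
self_test_rate = np.array([0.10, 0.30, 0.55, 0.80])  # how often the model tests its own app
step_accuracy  = np.array([0.20, 0.35, 0.45, 0.55])  # that model's step-level accuracy

r = np.corrcoef(self_test_rate, step_accuracy)[0, 1]  # Pearson correlation coefficient
print(f"Pearson r = {r:.2f}")
```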
🏷️ Themes
AI Evaluation, Web Development
Original Source
Computer Science > Software Engineering
arXiv:2603.04601 [cs.SE] (Submitted on 4 Mar 2026)
Title: Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
Authors: Hung Tran, Langston Nashold, Rayan Krishnan, Antoine Bigeard, Alex Gu
Comments: Live leaderboard hosted here: this https URL. Preprint, currently under review. Benchmark first released Nov 2025
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
ACM classes: I.2.7
Cite as: arXiv:2603.04601 [cs.SE] (or arXiv:2603.04601v1 [cs.SE] for this version), https://doi.org/10.48550/arXiv.2603....
Read full article at source