Simple Baselines are Competitive with Code Evolution
#code evolution #large language models #program synthesis #search space #agentic scaffolds #mathematical bounds #machine‑learning competitions #evaluation stochasticity #baseline comparison #domain knowledge #research best practices
📌 Key Takeaways
- Simple baselines achieve performance that matches or surpasses sophisticated code‑evolution pipelines in tasks such as finding mathematical bounds, designing agentic scaffolds, and competing in machine‑learning challenges.
- For mathematical‑bound problems, the primary factors governing success are the size of the search space and the domain knowledge embedded in the prompt; the search algorithm itself plays a secondary role.
- In agentic‑scaffold design, high output variance coupled with small datasets leads to the selection of suboptimal scaffolds, whereas hand‑designed majority‑vote scaffolds outperform evolved ones.
- The study exposes shortcomings in current code‑evolution literature, notably a lack of proper baseline comparison, excessive stochasticity in evaluation, and insufficient domain‑knowledge integration.
- Authors recommend more robust evaluation protocols that reduce stochasticity while remaining economically viable, and outline best‑practice guidelines to advance rigorous code‑evolution research.
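The scaffold-selection failure described above can be illustrated with a small simulation. This is a hypothetical sketch, not code from the paper: two scaffolds with known true accuracies are each evaluated on a small noisy dataset, and the apparent winner is selected. With few evaluation examples, the worse scaffold frequently wins by luck.

```python
import random

random.seed(0)

# Hypothetical setup: the better scaffold has true accuracy 0.70, the
# worse one 0.65. We estimate each on n noisy examples and pick the
# apparent winner, showing how high variance plus a small dataset
# leads to selecting the suboptimal scaffold.
TRUE_ACC = {"better": 0.70, "worse": 0.65}

def noisy_estimate(true_acc: float, n_examples: int) -> float:
    """Estimated accuracy from n Bernoulli trials (one per eval example)."""
    return sum(random.random() < true_acc for _ in range(n_examples)) / n_examples

def selection_error_rate(n_examples: int, trials: int = 2000) -> float:
    """Fraction of trials in which the worse scaffold scores at least as
    well as the better one, i.e. the selection step picks wrongly."""
    wrong = 0
    for _ in range(trials):
        worse = noisy_estimate(TRUE_ACC["worse"], n_examples)
        better = noisy_estimate(TRUE_ACC["better"], n_examples)
        if worse >= better:
            wrong += 1
    return wrong / trials

small = selection_error_rate(n_examples=20)
large = selection_error_rate(n_examples=500)
print(f"selection error with 20 examples:  {small:.2f}")
print(f"selection error with 500 examples: {large:.2f}")
```

The same mechanism explains why a hand-designed majority-vote scaffold can be hard to beat: averaging several samples reduces exactly the per-evaluation variance that misleads the selection step.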

🏷️ Themes
Evaluation methodology, Search‑space and domain‑knowledge design, Baseline versus advanced technique comparison, Research practices in code evolution, Variance and dataset size effects
Deep Analysis
Why It Matters
The study shows that straightforward baseline methods can rival advanced code evolution techniques, challenging the assumption that complex pipelines are always superior. This finding encourages researchers to focus on search space design and evaluation rigor rather than solely on algorithmic complexity. It also highlights potential cost savings and faster deployment for practical applications.
Context & Background
- Code evolution uses large language models to mutate code for optimization
- Previous studies often omitted comparison to simple baselines
- The paper evaluates baselines across mathematical bounds, agentic scaffolds, and ML competitions
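The comparison at the heart of the paper can be sketched in a few lines. This is a toy illustration, not the authors' code: the LLM mutation step is replaced by a stand-in `mutate` function, and "programs" are numeric vectors scored by a toy objective. Both methods draw from the same proposal distribution, which mirrors the paper's point that the proposer (prompt, domain knowledge) can matter more than the search loop wrapped around it.

```python
import random

random.seed(1)

def score(program: list[float]) -> float:
    # Toy objective: higher is better, maximized at the all-zeros vector.
    return -sum(x * x for x in program)

def mutate(program: list[float]) -> list[float]:
    # Stand-in for "ask an LLM to edit the code": perturb one coordinate.
    child = list(program)
    i = random.randrange(len(child))
    child[i] += random.gauss(0, 0.5)
    return child

def evolve(init: list[float], steps: int = 200) -> list[float]:
    """Code-evolution loop: mutate the incumbent, keep only improvements."""
    best = init
    for _ in range(steps):
        child = mutate(best)
        if score(child) > score(best):
            best = child
    return best

def baseline_best_of_n(init: list[float], n: int = 200) -> list[float]:
    """Simple baseline: n independent proposals from the same proposer,
    keep the best (the starting point is included as a candidate)."""
    candidates = [mutate(init) for _ in range(n)] + [init]
    return max(candidates, key=score)

init = [random.uniform(-1, 1) for _ in range(5)]
print("evolved  score:", round(score(evolve(init)), 3))
print("baseline score:", round(score(baseline_best_of_n(init)), 3))
```

Under the same evaluation budget, how much the loop helps depends on the landscape and the proposer; the paper's finding is that on the studied tasks the gap over such simple baselines is small or absent.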
What Happens Next
Future work will likely refine evaluation protocols to reduce stochasticity and improve reproducibility. Researchers may also explore hybrid approaches that combine simple baseline strengths with selective evolutionary steps for greater efficiency.
Frequently Asked Questions
What are code-evolution systems?
They are systems that use large language models to generate and mutate computer programs in search of better solutions.
Why do simple baselines compete with sophisticated code-evolution pipelines?
Because the quality of the search space and the domain knowledge embedded in prompts largely determine performance, outweighing the search algorithm itself.
How should researchers approach code-evolution projects?
By prioritizing the design of effective search spaces and robust evaluation metrics before investing in elaborate evolutionary frameworks.