AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games
#AI, Machine Intelligence, General Intelligence, AI GameStore, Human Games, Large Language Models, Vision‑Language Models, Open‑Ended Benchmark, Apple App Store, Steam, World‑Model Learning, Memory, Planning, Benchmark Saturation
📌 Key Takeaways
Authors propose a new evaluation paradigm: assessing AI systems by their ability to learn and play all human‑designed games.
Definition of a *human game* as a game designed by humans for humans.
Creation of AI GameStore, a platform that automatically synthesizes new game instances using LLMs and human‑in‑the‑loop validation.
Proof‑of‑concept generation of 100 games sourced from the top charts of popular digital marketplaces (Apple App Store, Steam).
Evaluation of seven leading vision‑language models on short play episodes, revealing scores below 10% of human averages (see the normalization sketch after this list).
Results highlight challenges for AI in world‑model learning, memory, and planning within complex game environments.
Outline of future extensions to broaden the game universe, improve automation, and make the benchmark a ubiquitous tool for AI progress.
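The headline comparison in the takeaways above ("below 10% of human averages") reduces to a human-normalized score per game. Below is a minimal sketch of that normalization; the per-game values and the exact aggregation the authors use are not specified in this summary, so the numbers and names here are purely illustrative.

```python
from statistics import mean


def human_normalized_score(agent_score: float, human_average: float) -> float:
    """Express an agent's raw game score as a fraction of the human average.

    This mirrors the paper's headline comparison ("below 10% of human
    averages"); the authors' exact normalization may differ.
    """
    if human_average == 0:
        raise ValueError("human average score must be non-zero")
    return agent_score / human_average


# Illustrative aggregate over a handful of games (all values are made up).
results = {
    "puzzle_game": (12.0, 450.0),     # (agent raw score, human average)
    "arcade_game": (30.0, 1200.0),
    "strategy_game": (1.0, 55.0),
}
per_game = {g: human_normalized_score(a, h) for g, (a, h) in results.items()}
print(f"mean normalized score: {mean(per_game.values()):.3f}")
```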
📖 Full Retelling
The paper “AI Gamestore: Scalable, Open‑Ended Evaluation of Machine General Intelligence with Human Games” was written by Lance Ying, Ryan Truong, Prafull Sharma, Kaiya Ivy Zhao, Nathan Cloos, Kelsey R. Allen, Thomas L. Griffiths, Katherine M. Collins, José Hernández‑Orallo, Phillip Isola, Samuel J. Gershman, and Joshua B. Tenenbaum, and was submitted to the arXiv preprint server on 19 February 2026. The authors present a new framework for measuring whether artificial systems can attain human‑like general intelligence by playing the widest possible set of games designed for people. Their motivation is that traditional AI benchmarks are narrow, static, and quickly saturated, so a more comprehensive test would involve agents learning to play *all conceivable human games*. To achieve this, the authors build the AI GameStore, an open‑ended platform that uses large language models and human experts to generate novel game instances sourced from platforms such as the Apple App Store and Steam. As a proof of concept, they automatically produced 100 games and evaluated seven state‑of‑the‑art vision‑language models on short play episodes. The evaluation showed that even the best models scored less than ten percent of the human benchmark on most games, especially those demanding world‑model learning, memory, and planning. The paper concludes by outlining next steps to expand the GameStore into a practical, scalable benchmark for machine general intelligence.
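The retelling above describes evaluating vision‑language models on short play episodes of containerized games. The sketch below shows one plausible shape of such an episode loop: the model receives a screenshot and instructions, returns an action, and the environment advances until the step budget runs out or the game ends. The `GameEnv` and `VLMAgent` interfaces are assumptions for illustration, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Protocol


class GameEnv(Protocol):
    """Assumed interface for a containerized game environment."""
    def reset(self) -> bytes: ...                                   # initial screenshot (PNG bytes)
    def step(self, action: str) -> tuple[bytes, float, bool]: ...   # (frame, reward, done)


class VLMAgent(Protocol):
    """Assumed interface for a vision-language-model player."""
    def act(self, frame: bytes, instructions: str) -> str: ...


@dataclass
class EpisodeResult:
    total_score: float
    steps_taken: int


def play_episode(env: GameEnv, agent: VLMAgent, instructions: str,
                 max_steps: int = 200) -> EpisodeResult:
    """Run one short play episode and accumulate the game's native score."""
    frame = env.reset()
    total, steps = 0.0, 0
    for steps in range(1, max_steps + 1):
        action = agent.act(frame, instructions)   # e.g. "LEFT" or "CLICK 120 340"
        frame, reward, done = env.step(action)
        total += reward
        if done:
            break
    return EpisodeResult(total_score=total, steps_taken=steps)
```

The resulting `total_score` would then be divided by the corresponding human average, as in the normalization sketch earlier, to produce the per-game comparison the paper reports.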
🏷️ Themes
Artificial Intelligence evaluation, Generalized game-playing, Open‑ended benchmarking, Human‑in‑the‑loop system design, Large language models, AI‑generated entertainment content, Benchmark saturation & scalability
Deep Analysis
Why It Matters
The AI GameStore offers a scalable, open-ended benchmark that tests AI systems on a wide range of human-designed games, addressing the narrowness of current benchmarks. By comparing AI performance to human players, it provides a more realistic measure of general intelligence progress.
Context & Background
Traditional AI benchmarks are narrow and quickly saturate
There is a growing need to evaluate general intelligence across diverse tasks
The AI GameStore uses LLMs with human-in-the-loop validation to generate and evaluate a broad set of human games (a schematic of this pipeline is sketched after this list)
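The generation pipeline described above (LLM synthesis of game variants sourced from store top charts, gated by human validation) might be structured roughly as follows. This is a schematic sketch: `GameSpec`, `synthesize_game`, and the reviewer callable are hypothetical stand-ins, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GameSpec:
    title: str             # e.g. an entry from a store top chart
    description: str       # natural-language rules for the adapted variant
    container_image: str   # tag of the standardized, containerized build


def synthesize_game(chart_entry: str,
                    llm: Callable[[str], str],
                    human_reviewer: Callable[[GameSpec], bool]) -> Optional[GameSpec]:
    """Hypothetical LLM + human-in-the-loop synthesis of one GameStore entry.

    `llm` maps a prompt to text; `human_reviewer` returns True only if a human
    approves the generated variant. Neither reflects the paper's actual code.
    """
    prompt = (
        f"Adapt the game '{chart_entry}' into a simplified, self-contained "
        "variant with explicit rules, a scoring function, and a win condition."
    )
    description = llm(prompt)
    spec = GameSpec(
        title=chart_entry,
        description=description,
        container_image=f"gamestore/{chart_entry.lower().replace(' ', '-')}:v1",
    )
    # Human-in-the-loop gate: only validated games enter the benchmark.
    return spec if human_reviewer(spec) else None
```

In use, such a function would be called in a loop over top-chart titles, with rejected specs regenerated or discarded, so the benchmark grows only with human-approved games.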
What Happens Next
Future work will expand the game library, refine the evaluation protocol, and encourage the community to contribute new games and models. The platform aims to become a standard for measuring progress toward human-like general intelligence.
Frequently Asked Questions
What is the AI GameStore?
It is a platform that uses large language models and human input to synthesize new human-designed games and evaluate AI models against human players.
How does it differ from existing benchmarks?
Unlike static benchmarks, it targets the open-ended space of all conceivable human games, is designed to scale, and includes human-in-the-loop generation and validation.
Original Source
Computer Science > Artificial Intelligence · arXiv:2602.17594 [Submitted on 19 Feb 2026]
Title: AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games
Authors: Lance Ying, Ryan Truong, Prafull Sharma, Kaiya Ivy Zhao, Nathan Cloos, Kelsey R. Allen, Thomas L. Griffiths, Katherine M. Collins, José Hernández-Orallo, Phillip Isola, Samuel J. Gershman, Joshua B. Tenenbaum
Abstract: Rigorously evaluating machine intelligence against the broad spectrum of human general intelligence has become increasingly important and challenging in this era of rapid technological advance. Conventional AI benchmarks typically assess only narrow capabilities in a limited range of human activity. Most are also static, quickly saturating as developers explicitly or implicitly optimize for them. We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how and how well they play and learn to play **all conceivable human games**, in comparison to human players with the same level of experience, time, or other resources. We define a "human game" to be a game designed by humans for humans, and argue for the evaluative suitability of this space of all such games people can imagine and enjoy -- the "Multiverse of Human Games". Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to synthesize new representative human games, by automatically sourcing and adapting standardized and containerized variants of game environments from popular human digital gaming platforms. As a proof of concept, we generated 100 such games based on the top charts of Apple App Store and Steam, and ev...