ManiBench: A Benchmark for Testing Visual-Logic Drift and Syntactic Hallucinations in Manim Code Generation
#ManiBench #VisualLogicDrift #SyntacticHallucinations #Manim #CodeGeneration #Benchmark #AIEvaluation
📌 Key Takeaways
- ManiBench is a specialized benchmark for evaluating LLM-generated Manim CE code, a setting where temporal fidelity and version-aware API correctness are critical.
- It targets syntactic hallucinations: valid Python that references non-existent or deprecated Manim APIs.
- It also measures visual-logic drift, the gap between what generated code renders and the visual outcome the prompt intended.
- Existing benchmarks such as HumanEval and MBPP cover logic and syntax but not dynamic, pedagogical visuals; ManiBench aims to make AI-generated animation code more reliable.
📖 Full Retelling
Traditional code-generation benchmarks such as HumanEval and MBPP test logic and syntax effectively, but they fall short when code must produce dynamic, pedagogical visuals. ManiBench is a specialized benchmark that evaluates LLM performance in generating Manim CE code, a setting where temporal fidelity and version-aware API correctness are critical. It targets two key failure modes: syntactic hallucinations, in which valid Python references non-existent or deprecated Manim APIs, and visual-logic drift, in which code executes cleanly but the rendered animation diverges from the intended visual outcome.
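To make the first failure mode concrete, here is a minimal sketch (ours, not drawn from the benchmark itself) of a syntactic hallucination: the code is valid Python, but it references ShowCreation, an animation class that current Manim CE releases no longer export (the community edition renamed it to Create).

```python
from manim import Scene, Circle, Create


class CorrectScene(Scene):
    """Uses the current Manim CE API: Create."""

    def construct(self):
        self.play(Create(Circle()))


class HallucinatedScene(Scene):
    """Syntactically valid Python, but ShowCreation was removed from
    Manim CE (it survives only in the legacy 3b1b codebase), so this
    scene fails with a NameError when rendered."""

    def construct(self):
        self.play(ShowCreation(Circle()))  # hallucinated / deprecated API
```

Note that a plain syntax check or even a successful import passes the hallucinated scene; the error only surfaces when the scene is actually rendered, which is why version-aware evaluation matters.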
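Visual-logic drift is subtler. In the hypothetical sketch below (again ours, not a ManiBench task), suppose the prompt asked for the circle to morph into a square: both scenes run and render without error, but the second silently substitutes a fade-out/fade-in, so the output diverges from the requested visual.

```python
from manim import Scene, Circle, Square, Transform, FadeOut, FadeIn


class IntendedMorph(Scene):
    """Matches a prompt like 'morph the circle into a square'."""

    def construct(self):
        circle, square = Circle(), Square()
        self.add(circle)
        self.play(Transform(circle, square))  # continuous shape morph


class DriftedMorph(Scene):
    """Executes cleanly, but no morph occurs: the circle vanishes and a
    square pops in, so the visual diverges from the stated intent."""

    def construct(self):
        circle, square = Circle(), Square()
        self.add(circle)
        self.play(FadeOut(circle), FadeIn(square))
```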
🏷️ Themes
AI Benchmarking, Code Generation
Original Source
arXiv:2603.13251v1
Read full article at source