CodeHacker: Automated Test Case Generation for Detecting Vulnerabilities in Competitive Programming Solutions
#CodeHacker#Automated Testing#Adversarial Test Cases#Competitive Programming#Large Language Models#Vulnerability Detection#arXiv#Software Engineering
📌 Key Takeaways
CodeHacker is an automated framework for generating adversarial test cases to detect vulnerabilities in code
It uses a multi-strategy approach including stress testing and logic-specific targeting
The framework includes a Calibration Phase to ensure reliability through self-generated probes
CodeHacker improves True Negative Rate and provides superior training data for AI models
📖 Full Retelling
Jingwei Shi, together with co-authors Xinxiang Yin, Jing Huang, Jinman Zhao, and Shengyu Tao, developed CodeHacker, an automated agent framework for generating adversarial test cases, which they submitted to arXiv on February 23, 2026, to address critical gaps in evaluating Large Language Models for code generation. The researchers identified that existing benchmarks often fail to cover subtle corner cases, allowing incorrect solutions to pass undetected and potentially compromising the reliability of automated code evaluation systems.
CodeHacker mimics the "hack" mechanism commonly used in competitive programming environments, employing a multi-strategy approach that includes stress testing, anti-hash attacks, and logic-specific targeting to deliberately break program submissions. The framework introduces an innovative "Calibration Phase" where the agent iteratively refines its own Validator and Checker through self-generated adversarial probes before evaluating contestant solutions. This self-improvement mechanism ensures the validity and reliability of the attacks generated by the system.
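The stress-testing strategy mentioned above is the classic competitive-programming "hack" loop: pit a fast but possibly buggy solution against a slow but trusted reference on many small random inputs until they disagree. The sketch below illustrates the idea only; the function names (naive_solve, fast_solve, gen_input, find_hack) are illustrative placeholders, not CodeHacker's actual API.

```python
import random

def naive_solve(xs):
    # Trusted but slow reference: maximum pairwise sum, O(n^2).
    best = None
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            s = xs[i] + xs[j]
            if best is None or s > best:
                best = s
    return best

def fast_solve(xs):
    # "Contestant" solution with a planted bug: it assumes the two
    # largest values are both positive, so it mishandles inputs
    # dominated by negative numbers.
    top = sorted(x for x in xs if x > 0)
    if len(top) >= 2:
        return top[-1] + top[-2]
    return max(xs) + min(xs)  # wrong fallback

def gen_input(rng):
    # Small random instances make counterexamples easy to minimize.
    n = rng.randint(2, 6)
    return [rng.randint(-10, 10) for _ in range(n)]

def find_hack(seed=0, rounds=10_000):
    # Search for an input on which the fast solution disagrees
    # with the reference; that input is the "hack" test case.
    rng = random.Random(seed)
    for _ in range(rounds):
        xs = gen_input(rng)
        if fast_solve(xs) != naive_solve(xs):
            return xs
    return None
```

CodeHacker's other strategies (anti-hash attacks, logic-specific targeting) replace the random generator here with inputs crafted against a specific weakness, such as keys engineered to collide in a hash table.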
The researchers demonstrated that CodeHacker significantly improves the True Negative Rate of existing datasets, effectively filtering out incorrect solutions that were previously accepted by conventional testing methods. Furthermore, they discovered that the adversarial cases generated by CodeHacker serve as superior training data, boosting the performance of reinforcement learning-trained models on benchmarks like LiveCodeBench. This dual benefit—improved detection of vulnerabilities and enhanced training data—positions CodeHacker as a valuable tool for both competitive programming platforms and AI research focused on code generation.
A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation.
Competitive programming, or sport programming, is a mind sport in which participants program according to provided specifications. The contests are usually held over the Internet or a local network.
Computer Science > Software Engineering
arXiv:2602.20213 [Submitted on 23 Feb 2026]
Title: CodeHacker: Automated Test Case Generation for Detecting Vulnerabilities in Competitive Programming Solutions
Authors: Jingwei Shi, Xinxiang Yin, Jing Huang, Jinman Zhao, Shengyu Tao
Abstract: The evaluation of Large Language Models for code generation relies heavily on the quality and robustness of test cases. However, existing benchmarks often lack coverage for subtle corner cases, allowing incorrect solutions to pass. To bridge this gap, we propose CodeHacker, an automated agent framework dedicated to generating targeted adversarial test cases that expose latent vulnerabilities in program submissions. Mimicking the hack mechanism in competitive programming, CodeHacker employs a multi-strategy approach, including stress testing, anti-hash attacks, and logic-specific targeting to break specific code submissions. To ensure the validity and reliability of these attacks, we introduce a Calibration Phase, where the agent iteratively refines its own Validator and Checker via self-generated adversarial probes before evaluating contestant solutions. We demonstrate that CodeHacker significantly improves the True Negative Rate of existing datasets, effectively filtering out incorrect solutions that were previously accepted. Furthermore, generated adversarial cases prove to be superior training data, boosting the performance of RL-trained models on benchmarks like LiveCodeBench.
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Cite as: arXiv:2602.20213 [cs.SE] (or arXiv:2602.20213v1 [cs.SE] for this version), https://doi.org/10.48550/arXiv.2602.20213