INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs


#INFACT #DiagnosticBenchmark #Faithfulness #Factuality #Hallucinations #VideoLLMs #InducedHallucinations #ModelEvaluation

πŸ“Œ Key Takeaways

  • INFACT is a diagnostic benchmark for evaluating video-LLMs on faithfulness and factuality hallucinations.
  • It specifically targets induced hallucinations, where models generate incorrect information not present in the video.
  • The benchmark aims to improve model reliability by identifying and addressing these hallucination issues.
  • It provides a standardized tool for assessing video-LLM performance in factual accuracy.

πŸ“– Full Retelling

arXiv:2603.11481v1 Announce Type: cross

Abstract: Despite rapid progress, Video Large Language Models (Video-LLMs) remain unreliable due to hallucinations, which are outputs that contradict either video evidence (faithfulness) or verifiable world knowledge (factuality). Existing benchmarks provide limited coverage of factuality hallucinations and predominantly evaluate models only in clean settings. We introduce INFACT, a diagnostic benchmark comprising 9,800 QA instances with fine-grained …
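To make the evaluation setup concrete, here is a minimal sketch of how a QA benchmark of this shape might be scored, with accuracy broken down by hallucination type. The `QAInstance` fields and the `score` helper are assumptions for illustration, not the paper's actual data schema or evaluation code.

```python
from dataclasses import dataclass

@dataclass
class QAInstance:
    question: str
    gold_answer: str
    hallucination_type: str  # e.g. "faithfulness" or "factuality"

def score(instances, predict):
    """Compute per-type accuracy for a model's predictions.

    `predict` is any callable mapping a question string to an answer
    string (a stand-in for a real Video-LLM inference call).
    """
    totals, correct = {}, {}
    for inst in instances:
        t = inst.hallucination_type
        totals[t] = totals.get(t, 0) + 1
        if predict(inst.question).strip().lower() == inst.gold_answer.strip().lower():
            correct[t] = correct.get(t, 0) + 1
    # Accuracy per hallucination category
    return {t: correct.get(t, 0) / totals[t] for t in totals}
```

A per-category breakdown like this is what lets a diagnostic benchmark say *which kind* of error a model makes, rather than reporting a single aggregate number.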

🏷️ Themes

AI Evaluation, Video-LLMs, Hallucination Detection


Deep Analysis

Why It Matters

This research matters because it addresses a critical reliability issue in AI systems that process video content. Video-LLMs are increasingly used in applications like content moderation, educational tools, and automated video analysis, where factual accuracy is essential. The benchmark helps developers identify and reduce hallucinations where models generate plausible but incorrect information, which could lead to misinformation if deployed without proper safeguards. This affects AI researchers, technology companies implementing video AI, and end-users who rely on these systems for accurate information.

Context & Background

  • Video-LLMs (Large Language Models for video) combine visual understanding with language generation to describe and analyze video content
  • Hallucinations in AI refer to models generating confident but incorrect or fabricated information that isn't supported by input data
  • Previous benchmarks have focused on text or image models, but video presents unique challenges with temporal and spatial reasoning
  • The 'faithfulness' metric measures how well model outputs align with actual video content versus invented details
  • Factuality hallucinations specifically concern incorrect factual claims about objects, actions, or events in videos

What Happens Next

Researchers will likely use INFACT to evaluate existing Video-LLMs and develop improved training techniques to reduce hallucinations. We can expect research papers comparing model performance on this benchmark within 3-6 months, followed by new model architectures specifically designed to address these faithfulness issues. Technology companies may incorporate INFACT testing into their development pipelines before deploying video AI systems to production environments.

Frequently Asked Questions

What exactly are 'induced faithfulness hallucinations' in Video-LLMs?

Induced faithfulness hallucinations occur when Video-LLMs generate descriptions or analyses that contradict what's actually shown in the video. These are 'induced' because they're triggered by specific challenging video scenarios designed to test model limitations, such as complex actions, subtle details, or ambiguous visual information.
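One simple way to probe for induced hallucinations along these lines is to ask the same question twice, once cleanly and once with a misleading premise injected, and check whether the answer flips. The `induced_flip` helper and its callable `model` interface are assumptions for illustration, not the benchmark's actual protocol.

```python
def induced_flip(model, question: str, false_premise: str) -> bool:
    """Ask the same question in a clean and a premise-injected setting.

    Returns True when injecting the (false) premise changes the answer,
    flagging a candidate induced hallucination. `model` is any callable
    mapping a prompt string to an answer string.
    """
    clean_answer = model(question)
    induced_answer = model(f"{false_premise} {question}")
    return clean_answer != induced_answer
```

A model that accepts the planted premise instead of trusting the video evidence will flip its answer, which is exactly the failure mode an induced-hallucination setting is designed to expose.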

How does INFACT differ from existing AI evaluation benchmarks?

INFACT specifically targets video understanding systems rather than text or image models. It focuses on temporal consistency across video frames and spatial relationships within scenes, creating controlled scenarios that reveal when models invent details not present in the visual content. The benchmark includes diverse video types and difficulty levels to comprehensively test model reliability.

Why are hallucinations particularly dangerous in video analysis systems?

Video hallucinations are especially problematic because videos are often treated as objective evidence. If an AI system confidently describes events that didn't occur or misidentifies people/actions, it could lead to false accusations, incorrect educational content, or flawed security analysis. The visual nature makes incorrect outputs more convincing to users who trust video as reliable documentation.

Who developed the INFACT benchmark and what's their goal?

While the specific researchers aren't named in this summary, such benchmarks typically come from academic AI research groups or industry labs focused on multimodal AI. Their goal is to establish standardized testing that pushes Video-LLM development toward greater reliability, similar to how benchmarks have driven progress in other AI domains like natural language processing.

Can this benchmark help improve real-world video AI applications?

Yes, INFACT provides concrete metrics that developers can use to compare models and track improvement. Applications like automated video captioning for accessibility, content moderation for platforms, surveillance analysis, and educational video tools would all benefit from models that score well on faithfulness and factuality metrics, reducing errors in production systems.


Source

arxiv.org
