Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild
#Multimodal LLMs #Surveillance #Anomaly Detection #Zero-Shot Learning #Real-World Applications
Key Takeaways
- Multimodal LLMs show potential for surveillance but face significant real-world limitations.
- Zero-shot anomaly detection in uncontrolled environments is not yet reliable for practical use.
- Current models struggle with contextual understanding and high-stakes accuracy requirements.
- The study highlights a gap between academic benchmarks and operational surveillance needs.
Full Retelling
arXiv:2603.04727v1 Announce Type: cross
Abstract: Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1s-3s) influence performance, focusing on the precision-recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the 'normal' class, resulting in high precision but a recall collapse that limits practical utility. We demonstrate that class-specific instructions can significantly shift this decision boundary, improving the peak F1-score on ShanghaiTech from 0.09 to 0.64, yet recall remains a critical bottleneck. These results highlight a significant performance gap for MLLMs in noisy environments and provide a foundation for future work in recall-oriented prompting and model calibration for open-world surveillance, which demands complex video understanding and reasoning.
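The paper's central finding is a conservative bias: models almost always answer "normal", which keeps precision high while recall collapses, capping the F1-score. A toy computation (our own illustration, not code from the paper) makes the arithmetic of that collapse concrete:

```python
# Illustration (not from the paper): a conservative classifier that rarely
# predicts "anomaly" (label 1) gets perfect precision but tiny recall,
# and the harmonic-mean F1 is dragged down by the weaker of the two.
def precision_recall_f1(y_true, y_pred):
    """Binary metrics with 1 = anomaly, 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 10 anomalous windows out of 100; the model flags only 1, correctly.
y_true = [1] * 10 + [0] * 90
y_pred = [1] + [0] * 99
p, r, f1 = precision_recall_f1(y_true, y_pred)
print(p, r, round(f1, 3))  # 1.0 0.1 0.182
```

Even with precision at 1.0, missing nine of ten anomalies leaves F1 near 0.18, which is why the paper frames recall, not precision, as the bottleneck for practical surveillance use.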
Themes
AI Surveillance, Technology Limitations
Related People & Topics
Surveillance
Monitoring something for the purposes of influencing, protecting, or suppressing it
Surveillance is the systematic observation and monitoring of a person, population, or location, with the purpose of information-gathering, influencing, managing, or directing. It is widely used by governments for a variety of reasons, such as law enforcement, national security, and information awareness.
Entity Intersection Graph
Connections for Surveillance:
- Cellebrite (1 shared)
- Human rights (1 shared)
- Phone hacking (1 shared)
- Citizen Lab (1 shared)
- Illegal drug trade (1 shared)
Original Source
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.04727 [Submitted on 5 Mar 2026]
Title: Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild
Authors: Shanle Yao, Armin Danesh Pazho, Narges Rashvand, Hamed Tabkhi
Abstract: Multimodal large language models have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1s-3s) influence performance, focusing on the precision-recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the 'normal' class, resulting in high precision but a recall collapse that limits practical utility. We demonstrate that class-specific instructions can significantly shift this decision boundary, improving the peak F1-score on ShanghaiTech from 0.09 to 0.64, yet recall remains a critical bottleneck. These results highlight a significant performance gap for MLLMs in noisy environments and provide a foundation for future work in recall-oriented prompting and model calibration for open-world surveillance, which demands complex video understanding and reasoning.
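The evaluation reformulates VAD as binary classification over 1-3 second temporal windows under weak supervision: a window counts as anomalous if it overlaps any anomalous frames. A minimal sketch of that windowing step (the function name and the any-frame labeling rule are our assumptions, not the paper's code):

```python
# Hypothetical sketch of weak temporal supervision: slice a frame-level
# label stream into non-overlapping fixed-length windows and give each
# window label 1 ("anomaly") if ANY frame inside it is anomalous.
def window_labels(frame_labels, fps, window_seconds):
    """Return one binary label per non-overlapping window."""
    step = int(fps * window_seconds)
    return [
        int(any(frame_labels[i:i + step]))
        for i in range(0, len(frame_labels), step)
    ]

# 6 seconds of video at 10 fps; frames 25-34 (a 1 s burst) are anomalous.
labels = [0] * 25 + [1] * 10 + [0] * 25
print(window_labels(labels, fps=10, window_seconds=2))  # [0, 1, 0]
print(window_labels(labels, fps=10, window_seconds=1))  # [0, 0, 1, 1, 0, 0]
```

The two calls also hint at why the paper sweeps window length: shorter windows localize the anomaly more tightly but give the model less temporal context per decision, while longer windows dilute a brief anomaly across mostly normal frames.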
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.04727 [cs.CV]