2/27/2026 | USA | technology | ✓ Verified - arxiv.org

HubScan: Detecting Hubness Poisoning in Retrieval-Augmented Generation Systems

#HubScan #RAG systems #Hubness poisoning #Vector similarity search #AI security #Adversarial attacks #Content filtering #Open-source security

📌 Key Takeaways

HubScan is an open-source security scanner for detecting hubness poisoning in RAG systems
It uses a multi-detector architecture with statistical analysis, cluster spread analysis, stability testing, and domain-aware detection
HubScan achieves 90% recall at 0.2% alert budget and 100% recall at 0.4%
Domain-scoped scanning recovers 100% of targeted attacks that evade global detection
The tool has been validated on 1 million real web documents from MS MARCO
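
To make the "alert budget" figures concrete: recall at a given budget can be read as the share of adversarial items caught when only the top-scoring fraction of documents is flagged for review. The sketch below illustrates that reading on toy data. The function name `recall_at_budget`, the scores, and the labels are all hypothetical and are not taken from HubScan's code or the paper.

```python
import numpy as np

def recall_at_budget(scores, is_adversarial, budget):
    """Share of adversarial items found when only the top `budget`
    fraction of documents (ranked by hubness score) is flagged."""
    scores = np.asarray(scores, dtype=float)
    is_adversarial = np.asarray(is_adversarial, dtype=bool)
    # Number of alerts allowed; the small epsilon guards against
    # floating-point overshoot such as 3.0000000000000004.
    n_alerts = max(1, int(np.ceil(budget * len(scores) - 1e-9)))
    # Indices of the highest-scoring documents, i.e. the alerts raised.
    flagged = np.argsort(-scores)[:n_alerts]
    n_adversarial = is_adversarial.sum()
    if n_adversarial == 0:
        return float("nan")  # recall is undefined without positives
    return float(is_adversarial[flagged].sum() / n_adversarial)

# Toy data (assumed): 10 documents, two of them adversarial (index 2 and 9).
scores = [0.1, 0.3, 9.0, 0.2, 0.5, 7.5, 0.4, 0.0, 0.6, 0.7]
labels = [False, False, True, False, False, False, False, False, False, True]

print(recall_at_budget(scores, labels, 0.2))  # top 2 flagged
print(recall_at_budget(scores, labels, 0.3))  # top 3 flagged
```

In this toy setup, a naturally popular clean document (index 5) consumes one alert at the 20% budget, so only half of the adversarial items are caught; widening the budget to 30% catches both.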

📖 Full Retelling

Researchers Idan Habler, Vineeth Sai Narajala, Stav Koren, Amy Chang, and Tiffany Saade have introduced HubScan, an open-source security scanner designed to detect hubness poisoning in Retrieval-Augmented Generation (RAG) systems. Their paper was submitted to arXiv on February 25, 2026.

RAG systems are essential to contemporary AI applications because they let large language models obtain external knowledge through vector similarity search. The paper highlights a significant security flaw in these systems known as "hubness": some items appear in the top-k retrieval results for a disproportionately high number of varied queries. Attackers can exploit such hubs to introduce harmful content, alter search rankings, bypass content filtering, and degrade system performance.

HubScan employs a multi-detector architecture that combines four approaches to identify malicious hubs: robust statistical hubness detection using median/MAD-based z-scores, cluster spread analysis to assess cross-cluster retrieval patterns, stability testing under query perturbations, and domain-aware and modality-aware detection for category-specific and cross-modal attacks. The tool supports multiple vector databases (FAISS, Pinecone, Qdrant, and Weaviate) and several retrieval techniques, including vector similarity, hybrid search, and lexical matching with reranking.

The researchers evaluated HubScan on Food-101, MS-COCO, and FiQA adversarial hubness benchmarks constructed using state-of-the-art gradient-optimized and centroid-based hub generation methods. HubScan achieved 90% recall at a 0.2% alert budget and 100% recall at 0.4%, with adversarial hubs ranking above the 99.8th percentile. Notably, domain-scoped scanning recovered 100% of targeted attacks that evade global detection. In a production-scale validation on 1 million real web documents from MS MARCO, the tool showed significant score separation between clean documents and adversarial content. The result is a practical, extensible framework for detecting hubness threats in production RAG systems.
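
The robust statistical detector mentioned above can be illustrated with a minimal sketch: count how often each document lands in the top-k results across a set of queries, then score those counts with a median/MAD-based z-score. This is an assumption-laden illustration, not HubScan's implementation. The function names `k_occurrence` and `robust_z`, the use of cosine similarity, the 1.4826 scaling constant (the standard factor that makes the MAD comparable to a standard deviation under normality), and the 2-D toy vectors are all choices made here for clarity.

```python
import numpy as np

def k_occurrence(doc_vecs, query_vecs, k=10):
    """Count how often each document appears in the top-k results
    (ranked by cosine similarity) across all queries."""
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    sims = q @ d.T  # shape: (n_queries, n_docs)
    # Unordered indices of the k most similar documents per query.
    topk = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    return np.bincount(topk.ravel(), minlength=len(doc_vecs))

def robust_z(counts):
    """Median/MAD-based z-score of the hit counts."""
    counts = np.asarray(counts, dtype=float)
    med = np.median(counts)
    mad = np.median(np.abs(counts - med))
    # Epsilon avoids division by zero when most counts are identical.
    return (counts - med) / (1.4826 * mad + 1e-9)

def unit_vectors(degrees):
    """2-D unit vectors at the given angles (toy stand-in for embeddings)."""
    rad = np.radians(np.asarray(list(degrees), dtype=float))
    return np.column_stack([np.cos(rad), np.sin(rad)])

# Toy corpus (assumed): five documents; the one at 45 degrees sits
# "between" all queries and therefore behaves like a hub.
docs = unit_vectors([0, 90, 45, 180, 270])
queries = unit_vectors(range(10, 90, 10))  # queries at 10, 20, ..., 80 degrees

counts = k_occurrence(docs, queries, k=2)
z = robust_z(counts)
print(counts)              # the 45-degree document is in every top-2 list
print(int(np.argmax(z)))   # index of the most hub-like document
```

Using the median and MAD rather than the mean and standard deviation keeps a few extreme hubs from inflating the baseline they are measured against, which is presumably why a robust statistic suits this kind of detection.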

🏷️ Themes

AI Security, Retrieval-Augmented Generation, Cybersecurity Threats, Vector Database Protection


Original Source

Computer Science > Cryptography and Security
arXiv:2602.22427 [Submitted on 25 Feb 2026]

Title: HubScan: Detecting Hubness Poisoning in Retrieval-Augmented Generation Systems
Authors: Idan Habler, Vineeth Sai Narajala, Stav Koren, Amy Chang, Tiffany Saade

Abstract: Retrieval-Augmented Generation systems are essential to contemporary AI applications, allowing large language models to obtain external knowledge via vector similarity search. Nevertheless, these systems encounter a significant security flaw: hubness - items that frequently appear in the top-k retrieval results for a disproportionately high number of varied queries. These hubs can be exploited to introduce harmful content, alter search rankings, bypass content filtering, and decrease system performance. We introduce hubscan, an open-source security scanner that evaluates vector indices and embeddings to identify hubs in RAG systems. Hubscan presents a multi-detector architecture that integrates: (1) robust statistical hubness detection utilizing median/MAD-based z-scores, (2) cluster spread analysis to assess cross-cluster retrieval patterns, (3) stability testing under query perturbations, and (4) domain-aware and modality-aware detection for category-specific and cross-modal attacks. Our solution accommodates several vector databases (FAISS, Pinecone, Qdrant, Weaviate) and offers versatile retrieval techniques, including vector similarity, hybrid search, and lexical matching with reranking capabilities. We evaluate hubscan on Food-101, MS-COCO, and FiQA adversarial hubness benchmarks constructed using state-of-the-art gradient-optimized and centroid-based hub generation methods. hubscan achieves 90% recall at a 0.2% alert budget and 100% recall at 0.4%, with adversarial hubs ranking above the 99.8th percentile. Domain-scoped scanning reco...

Source

arxiv.org
