SourceBench: Can AI Answers Reference Quality Web Sources?
#Large Language Models #AI-generated answers #Web citations #Source quality #Content relevance #Factual accuracy #Objectivity #Freshness #Authority #Clarity #Benchmark #Human-labeled dataset #Calibrated LLM evaluator #GenAI #Search tools #arXiv #cs.AI
📌 Key Takeaways
- SourceBench evaluates cited web source quality across 100 real‑world queries of varied intent (informational, factual, argumentative, social, shopping).
- The benchmark uses an eight‑metric framework covering content relevance, factual accuracy, objectivity, freshness, authority/accountability, clarity, and other page‑level signals.
- A human‑labeled dataset and a calibrated LLM‑based evaluator are provided, showing close alignment with expert judgments.
- Eight LLMs, Google Search, and three AI search tools were evaluated over 3,996 cited sources, yielding four new insights that can inform future GenAI and web‑search research.
📖 Full Retelling
🏷️ Themes
Artificial intelligence evaluation, Web search and information retrieval, Source quality assessment, Benchmark development, Human‑in‑the‑loop evaluation
Deep Analysis
Why It Matters
SourceBench offers a systematic way to evaluate the quality of web sources cited by AI models, addressing a gap left by prior benchmarks, which focused only on answer correctness. By measuring relevance, accuracy, objectivity, and page-level signals, it helps developers build more trustworthy AI systems that rely on credible evidence.
Context & Background
- Large language models now cite web sources in answers
- Existing benchmarks prioritize correctness over source quality
- SourceBench introduces an eight-metric framework for source evaluation
- The benchmark covers 100 real-world queries across diverse intents
- Human-labeled dataset aligns with expert judgments
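The eight-metric framework described above can be sketched as a simple per-source scoring function. This is a minimal illustration, not the paper's implementation: the article names only six metrics (content relevance, factual accuracy, objectivity, freshness, authority, clarity), so the remaining two metric names below are placeholders, and the 1-to-5 rating scale and mean aggregation are assumptions.

```python
from statistics import mean

# Six metric names come from the article; the last two are placeholders,
# since the article does not name all eight metrics.
METRICS = [
    "content_relevance", "factual_accuracy", "objectivity",
    "freshness", "authority", "clarity",
    "page_signal_a", "page_signal_b",
]

def score_source(ratings: dict[str, float]) -> float:
    """Aggregate per-metric ratings (assumed 1-5 scale) into one quality score."""
    missing = [m for m in METRICS if m not in ratings]
    if missing:
        raise ValueError(f"missing metric ratings: {missing}")
    return mean(ratings[m] for m in METRICS)

# Example: a source that scores well everywhere except freshness.
ratings = {m: 4.0 for m in METRICS}
ratings["freshness"] = 2.0
print(score_source(ratings))  # prints 3.75
```

In the benchmark itself, these per-metric ratings would come from human labelers or the calibrated LLM-based evaluator; the aggregation step could equally be a weighted mean if some metrics matter more for a given query intent.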
What Happens Next
Future research will likely integrate SourceBench into training pipelines to encourage AI models to select higher-quality sources. Developers may also refine the eight metrics or expand the query set to cover emerging domains.
Frequently Asked Questions
What does SourceBench evaluate?
It evaluates cited web sources on content relevance, factual accuracy, objectivity, and page-level signals such as freshness, authority, and clarity.
How reliable is the benchmark's evaluation?
The authors created a human-labeled dataset and a calibrated LLM-based evaluator that closely matches expert judgments, ensuring reliable assessment.