SourceBench: Can AI Answers Reference Quality Web Sources?

#Large Language Models #AI-generated answers #Web citations #Source quality #Content relevance #Factual accuracy #Objectivity #Freshness #Authority #Clarity #Benchmark #Human-labeled dataset #Calibrated LLM evaluator #GenAI #Search tools #arXiv #cs.AI

📌 Key Takeaways

  • SourceBench evaluates cited web source quality across 100 real‑world queries of varied intent (informational, factual, argumentative, social, shopping).
  • The benchmark uses an eight‑metric framework covering content quality (content relevance, factual accuracy, objectivity) and page‑level signals (e.g., freshness, authority/accountability, clarity); a schematic sketch of such a rating record follows this list.
  • A human‑labeled dataset and a calibrated LLM‑based evaluator are provided, showing close alignment with expert judgments.
  • Eight LLMs, Google Search, and three AI search tools were evaluated over 3,996 cited sources, yielding four new insights that can inform future GenAI and web‑search research.
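
To make the eight‑metric framework concrete, here is a minimal Python sketch of what a per‑source rating record could look like. The metric names come from the paper's abstract, which names six of the eight metrics explicitly; the 1–5 ordinal scale, the field names, and the unweighted aggregation are illustrative assumptions, not the paper's actual design.

```python
from dataclasses import dataclass, field
from statistics import mean

# Metric names taken from the paper's abstract, which names six of the
# eight metrics explicitly. The 1-5 ordinal scale and the unweighted
# aggregation below are illustrative assumptions, not the paper's design.
NAMED_METRICS = [
    "content_relevance", "factual_accuracy", "objectivity",  # content quality
    "freshness", "authority_accountability", "clarity",      # page-level signals
]

@dataclass
class SourceRating:
    """One cited web source, scored per metric for a given query."""
    query: str
    url: str
    scores: dict[str, int] = field(default_factory=dict)  # metric -> 1..5

    def overall(self) -> float:
        # Unweighted mean over whatever metrics were scored; the paper may
        # aggregate differently (or report the metrics separately).
        return mean(self.scores.values())

# Hypothetical usage with made-up values:
rating = SourceRating(
    query="Is intermittent fasting effective for weight loss?",
    url="https://example.org/fasting-review",
    scores={m: 4 for m in NAMED_METRICS},
)
print(f"{rating.url}: overall {rating.overall():.2f}")
```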

📖 Full Retelling

The paper "SourceBench: Can AI Answers Reference Quality Web Sources?" is authored by Hexi Jin, Stephen Liu, Yuheng Li, Simran Malik, and Yiying Zhang, submitted to arXiv’s Computer Science – Artificial Intelligence (cs.AI) archive on 18 February 2026. It introduces a new benchmark, SourceBench, designed to evaluate the quality of web sources cited by large language models (LLMs) and search tools, aiming to move beyond answer correctness toward assessing evidence quality in AI-generated responses.

🏷️ Themes

Artificial intelligence evaluation, Web search and information retrieval, Source quality assessment, Benchmark development, Human‑in‑the‑loop evaluation

Deep Analysis

Why It Matters

SourceBench offers a systematic way to evaluate the quality of web sources cited by AI models, filling a gap left by prior evaluations, which measured answer correctness but not evidence quality. By scoring relevance, accuracy, objectivity, and page‑level signals, it helps developers build AI systems that ground their answers in credible evidence.

Context & Background

  • Large language models now cite web sources in answers
  • Existing benchmarks prioritize correctness over source quality
  • SourceBench introduces an eight-metric framework for source evaluation
  • The benchmark covers 100 real-world queries across diverse intents
  • A calibrated LLM-based evaluator closely matches expert judgments on the human-labeled dataset

What Happens Next

Future research will likely integrate SourceBench into training pipelines to encourage AI models to select higher-quality sources. Developers may also refine the eight metrics or expand the query set to cover emerging domains.

Frequently Asked Questions

What does SourceBench measure?

It evaluates cited web sources on content relevance, factual accuracy, objectivity, and page-level signals such as freshness, authority, and clarity.

How was the benchmark validated?

The authors created a human-labeled dataset and a calibrated LLM-based evaluator that closely matches expert judgments, ensuring reliable assessment.
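
As an illustration of what "closely matches" could mean operationally, here is a minimal Python sketch that compares paired human and LLM‑evaluator ratings. The agreement statistics (exact‑match rate and mean absolute error) and the 1–5 scale are assumptions for illustration; this summary does not describe the paper's actual calibration procedure.

```python
from statistics import mean

def alignment(human: list[int], model: list[int]) -> dict[str, float]:
    """Agreement statistics between paired human and LLM-evaluator ratings.

    Assumes ordinal scores (e.g., 1-5) for the same cited sources. Exact-match
    rate and mean absolute error are common, illustrative choices; the paper's
    actual calibration procedure is not described in this summary.
    """
    assert human and len(human) == len(model), "need paired, non-empty ratings"
    exact = mean(1.0 if h == m else 0.0 for h, m in zip(human, model))
    mae = mean(abs(h - m) for h, m in zip(human, model))
    return {"exact_agreement": exact, "mean_abs_error": mae}

# Hypothetical ratings for six cited sources:
print(alignment(human=[4, 3, 5, 2, 4, 1], model=[4, 3, 4, 2, 5, 1]))
```

One plausible use of such statistics is as a calibration target: adjust the evaluator's prompt or scoring rubric until its agreement with the human‑labeled dataset stops improving.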

Original Source
Computer Science > Artificial Intelligence
arXiv:2602.16942 [cs.AI] (v1, submitted 18 Feb 2026)

SourceBench: Can AI Answers Reference Quality Web Sources?
Hexi Jin, Stephen Liu, Yuheng Li, Simran Malik, Yiying Zhang

Abstract: Large language models increasingly answer queries by citing web sources, but existing evaluations emphasize answer correctness rather than evidence quality. We introduce SourceBench, a benchmark for measuring the quality of cited web sources across 100 real-world queries spanning informational, factual, argumentative, social, and shopping intents. SourceBench uses an eight-metric framework covering content quality (content relevance, factual accuracy, objectivity) and page-level signals (e.g., freshness, authority/accountability, clarity), and includes a human-labeled dataset with a calibrated LLM-based evaluator that matches expert judgments closely. We evaluate eight LLMs, Google Search, and three AI search tools over 3,996 cited sources using SourceBench and conduct further experiments to understand the evaluation results. Overall, our work reveals four key new insights that can guide future research in the direction of GenAI and web search.

Subjects: Artificial Intelligence (cs.AI)
DOI: https://doi.org/10.48550/arXiv.2602.16942
Submission history: [v1] Wed, 18 Feb 2026 23:15:32 UTC (187 KB), submitted by Yiying Zhang

Source

arxiv.org
