Scaling In, Not Up? Testing Thick Citation Context Analysis with GPT-5 and Fragile Prompts
#GPT-5 #Large Language Models #Citation Context Analysis #Prompt Sensitivity #Interpretative Analysis #Academic Research #ArXiv #Text-Grounded Readings
Key Takeaways
- GPT-5 shows promise for interpretative citation context analysis through 'thick' readings rather than typological labels
- Prompt scaffolding and framing significantly influence the model's interpretative outputs
- The study identified 21 recurring interpretative moves in GPT-5's reconstructions
- GPT-5 consistently classified the citation as 'supplementary' in its surface pass, but its interpretative reconstructions diverged from Gilbert's human analysis
Full Retelling
Arno Simons published a research paper on arXiv on February 25, 2026, that tests whether large language models like GPT-5 can support interpretative citation context analysis through 'thick', text-grounded readings of a single case rather than relying on typological labels. The study foregrounds prompt sensitivity as a methodological issue, varying prompt scaffolding and framing in a balanced 2x3 design.

Using footnote 6 from Chubin and Moitra's 1975 work and Gilbert's 1977 reconstruction as a probe case, Simons implemented a two-stage GPT-5 pipeline: first, a citation-text-only surface classification and expectation pass; second, a cross-document interpretative reconstruction using both the citing and cited full texts. Across 90 reconstructions, the model produced 450 distinct hypotheses. Through close reading and inductive coding, Simons identified 21 recurring interpretative moves and used linear probability models to estimate how prompt choices shifted their frequencies and lexical repertoire.

The study found GPT-5's surface pass to be highly stable, consistently classifying the citation as 'supplementary'. In the reconstruction stage, however, the model generated a structured space of plausible alternatives, with scaffolding and examples redistributing its attention and vocabulary, sometimes toward strained readings. Compared with Gilbert's analysis, GPT-5 detected the same textual hinges but more often resolved them as lineage and positioning rather than as admonishment. Simons concludes by outlining both the opportunities and the risks of using LLMs as guided co-analysts for inspectable, contestable interpretative citation context analysis, showing that prompt scaffolding and framing systematically tilt which plausible readings the model foregrounds.
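To make the quantitative step concrete: a linear probability model is simply ordinary least squares fit to a binary outcome, so each coefficient reads as a shift in the probability that a given interpretative move appears under a prompt condition. The sketch below is a hypothetical illustration with simulated data, not the paper's code; the 2x3 condition labels, effect sizes, and move name are invented for the example.

```python
# Hypothetical sketch of a linear probability model (LPM) of the kind the
# paper describes: regressing a binary "interpretative move appears" outcome
# on prompt-design dummies. All data here is simulated for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 90  # matching the paper's count of reconstructions

# Balanced-ish 2x3 design: 2 scaffolding levels x 3 framing variants.
scaffolding = rng.integers(0, 2, size=n)  # 0 = minimal, 1 = rich (invented)
framing = rng.integers(0, 3, size=n)      # three framing variants (invented)

# Binary outcome: does a given move (say, "lineage") appear in a reconstruction?
# Simulated so that richer scaffolding raises the probability.
p = 0.3 + 0.25 * scaffolding + 0.05 * (framing == 2)
y = (rng.random(n) < p).astype(float)

# Design matrix: intercept, scaffolding dummy, two framing dummies
# (framing variant A is the reference category).
X = np.column_stack([
    np.ones(n),
    scaffolding.astype(float),
    (framing == 1).astype(float),
    (framing == 2).astype(float),
])

# OLS on a binary outcome = linear probability model. Coefficients estimate
# how each prompt choice shifts the probability of the move appearing.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print({name: round(b, 3) for name, b in zip(
    ["intercept", "scaffolding", "framing_B", "framing_C"], beta)})
```

The appeal of the LPM in this setting is interpretability: unlike logistic regression, the coefficients are directly in probability units, which suits reporting how often each of the 21 coded moves surfaces under each prompt condition.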
Themes
Artificial Intelligence, Academic Research, Prompt Engineering, Citation Analysis
Related People & Topics
Large language model
Type of machine learning model
A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation. The largest and most capable LLMs are generative pre-trained transformers (GPTs) that provide the c...
Entity Intersection Graph
Connections for Large language model: Educational technology (4 shared), Reinforcement learning (3 shared), Machine learning (2 shared), Artificial intelligence (2 shared), Benchmark (2 shared)
Original Source
Computer Science > Computation and Language
arXiv:2602.22359 [Submitted on 25 Feb 2026]
Title: Scaling In, Not Up? Testing Thick Citation Context Analysis with GPT-5 and Fragile Prompts
Authors: Arno Simons
Abstract: This paper tests whether large language models can support interpretative citation context analysis by scaling in thick, text-grounded readings of a single hard case rather than scaling up typological labels. It foregrounds prompt-sensitivity analysis as a methodological issue by varying prompt scaffolding and framing in a balanced 2x3 design. Using footnote 6 in Chubin and Moitra (1975) and Gilbert's (1977) reconstruction as a probe, I implement a two-stage GPT-5 pipeline: a citation-text-only surface classification and expectation pass, followed by cross-document interpretative reconstruction using the citing and cited full texts. Across 90 reconstructions, the model produces 450 distinct hypotheses. Close reading and inductive coding identify 21 recurring interpretative moves, and linear probability models estimate how prompt choices shift their frequencies and lexical repertoire. GPT-5's surface pass is highly stable, consistently classifying the citation as "supplementary". In reconstruction, the model generates a structured space of plausible alternatives, but scaffolding and examples redistribute attention and vocabulary, sometimes toward strained readings. Relative to Gilbert, GPT-5 detects the same textual hinges yet more often resolves them as lineage and positioning than as admonishment. The study outlines opportunities and risks of using LLMs as guided co-analysts for inspectable, contestable interpretative CCA, and it shows that prompt scaffolding and framing systematically tilt which plausible readings and vocabularies the model foregrounds.
Comments: 26 pages, 1 figure, 3 tables (plus 17 pa...