Personal Information Parroting in Language Models


#language models #personal information #privacy risks #data memorization #Pythia model #regex detection #data anonymization #AI ethics

📌 Key Takeaways

  • Language models memorize personal information from training data
  • Researchers developed a superior detector suite for personal information
  • 13.6% of personal information instances are parroted verbatim by the largest model tested (Pythia-6.9b)
  • Both model size and pretraining amount correlate with memorization
  • Aggressive filtering of datasets is recommended to reduce privacy risks

📖 Full Retelling

Researchers Nishant Subramani, Kshitish Ghate, and Mona Diab found that language models memorize personal information from their training data, publishing their findings on arXiv on February 24, 2026. Motivated by the privacy risks of models parroting personal information, the team developed a regexes and rules (R&R) detector suite to identify email addresses, phone numbers, and IP addresses; it outperforms existing regex-based personal information detectors.

On a manually curated set of 483 personal information instances, 13.6% were parroted verbatim by the Pythia-6.9b model when it was prompted with the tokens that precede the personal information in the original document. The study then analyzed models of varying sizes (160M-6.9B parameters) and pretraining checkpoints (70k-143k iterations) in the Pythia model suite, finding that both model size and amount of pretraining correlate positively with memorization. Even the smallest model tested, Pythia-160m, parroted 2.7% of instances exactly. The researchers therefore strongly recommend that pretraining datasets be aggressively filtered and anonymized to minimize personal information parroting.
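The paper's R&R detector suite is not reproduced in this article, but the idea of regex-based detection for the three PI categories it targets can be sketched as follows. These patterns are simplified, hypothetical stand-ins for illustration only, not the actual R&R rules:

```python
import re

# Illustrative sketch only: simplified patterns for the three PI
# categories the R&R suite targets (emails, phone numbers, IPv4
# addresses). The paper's actual detectors combine regexes with
# additional rules and outperform patterns this simple.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "phone": re.compile(
        r"(?:\+?\d{1,3}[-.\s]?)?(?:\(\d{3}\)\s?|\d{3}[-.\s]?)\d{3}[-.\s]?\d{4}\b"
    ),
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d{1,2})\.){3}(?:25[0-5]|2[0-4]\d|1?\d{1,2})\b"
    ),
}

def detect_pi(text: str) -> dict[str, list[str]]:
    """Return all matches per PI category found in `text`."""
    return {name: pat.findall(text) for name, pat in PATTERNS.items()}
```

All groups are non-capturing so `findall` returns the full matched spans, which is what a filtering pipeline would need in order to redact or anonymize them.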

🏷️ Themes

AI Privacy, Data Security, Machine Learning Ethics

📚 Related People & Topics

Ethics of artificial intelligence

The ethics of artificial intelligence covers a broad range of topics within AI that are considered to have particular ethical stakes. This includes algorithmic biases, fairness, accountability, transparency, privacy, and regulation, particularly where systems influence or automate human decision-making.


Entity Intersection Graph

Connections for Ethics of artificial intelligence:

🏢 Anthropic 10 shared
🌐 Pentagon 10 shared
🏢 OpenAI 7 shared
👤 Dario Amodei 4 shared
🌐 National security 3 shared
Original Source
Computer Science > Computation and Language

arXiv:2602.20580 [Submitted on 24 Feb 2026]

Title: Personal Information Parroting in Language Models
Authors: Nishant Subramani, Kshitish Ghate, Mona Diab

Abstract: Modern language models are trained on large scrapes of the Web, containing millions of personal information instances, many of which LMs memorize, increasing privacy risks. In this work, we develop the regexes and rules (R&R) detector suite to detect email addresses, phone numbers, and IP addresses, which outperforms the best regex-based PI detectors. On a manually curated set of 483 instances of PI, we measure memorization: finding that 13.6% are parroted verbatim by the Pythia-6.9b model, i.e., when the model is prompted with the tokens that precede the PI in the original document, greedy decoding generates the entire PI span exactly. We expand this analysis to study models of varying sizes (160M-6.9B) and pretraining time steps (70k-143k iterations) in the Pythia model suite and find that both model size and amount of pretraining are positively correlated with memorization. Even the smallest model, Pythia-160m, parrots 2.7% of the instances exactly. Consequently, we strongly recommend that pretraining datasets be aggressively filtered and anonymized to minimize PI parroting.

Comments: EACL Findings 2026
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Cite as: arXiv:2602.20580 [cs.CL]
DOI: https://doi.org/10.48550/arXiv.2602.20580
Submitted: Tue, 24 Feb 2026 06:02:03 UTC, by Nishant Subramani
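The abstract's memorization criterion (prompt with the tokens preceding the PI, then check whether greedy decoding emits the PI span exactly) can be sketched in a model-agnostic way. The `generate_greedy` callable and the character-level prefix window below are hypothetical simplifications; the paper works with Pythia models at the token level:

```python
def is_parroted(document: str, pi_span: str, generate_greedy,
                prefix_chars: int = 200) -> bool:
    """Sketch of the paper's parroting criterion: prompt the model with
    the text preceding the PI span in the original document and check
    whether greedy decoding reproduces the span verbatim.

    `generate_greedy(prompt, max_new_chars)` is a hypothetical stand-in
    for a model's greedy-decoding call; this sketch operates on
    characters rather than tokens for simplicity.
    """
    start = document.index(pi_span)
    prompt = document[max(0, start - prefix_chars):start]
    continuation = generate_greedy(prompt, max_new_chars=len(pi_span))
    return continuation.startswith(pi_span)
```

Run against all 483 curated PI instances, the fraction of `True` results would correspond to the parroting rates the paper reports (13.6% for Pythia-6.9b, 2.7% for Pythia-160m).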

Source

arxiv.org
