PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data
#PVminer #Patient Voice #Natural Language Processing #Healthcare Communication #Social Determinants of Health #Machine Learning #BERT Models #Patient-Generated Data
📌 Key Takeaways
- PVminer is a new NLP framework specifically designed to detect patient voice in healthcare communications
- The tool integrates patient-specific BERT encoders and topic modeling to improve analysis accuracy
- Research shows PVminer outperforms existing biomedical and clinical pre-trained models
- The research team will publicly release models and documentation to advance healthcare communication research
📖 Full Retelling
A team of nine researchers led by Samah Fodeh introduced PVminer on February 24, 2026, a new domain-specific tool designed to detect and analyze the patient voice in patient-generated data, addressing limitations in traditional qualitative coding methods and existing machine learning approaches that fail to fully capture patient-centered communication and social determinants of health. PVminer represents a significant advancement in natural language processing specifically tailored for healthcare communications, addressing the challenge of analyzing vast amounts of patient-generated text from secure messages, surveys, and interviews that contains valuable insights into patient experiences. Traditional methods of analyzing this data through manual qualitative coding are labor-intensive and cannot scale to the large volumes of patient-authored messages across health systems, while existing ML and NLP approaches have provided only partial solutions, often treating patient-centered communication and social determinants of health as separate tasks or relying on models not well-suited to patient-facing language. The PVminer framework formulates patient voice detection as a multi-label, multi-class prediction task, integrating patient-specific BERT encoders, unsupervised topic modeling for thematic augmentation, and fine-tuned classifiers for hierarchical labels, with testing showing strong performance that outperforms biomedical and clinical pre-trained baselines and demonstrating that both author identity and topic-based augmentation contribute meaningful performance gains.
🏷️ Themes
Healthcare Technology, Natural Language Processing, Patient-Centered Care
📚 Related People & Topics
Natural language processing
Processing of natural language by a computer
Natural language processing (NLP) is the processing of natural language information by a computer. NLP is a subfield of computer science and is closely associated with artificial intelligence. NLP is also related to information retrieval, knowledge representation, computational linguistics, and ling...
Entity Intersection Graph
Connections for Natural language processing:
🌐
Languages of Africa
1 shared
🌐
Systematic review
1 shared
🌐
Machine learning
1 shared
🌐
Curriculum learning
1 shared
🌐
Artificial intelligence
1 shared
Original Source
--> Computer Science > Computation and Language arXiv:2602.21165 [Submitted on 24 Feb 2026] Title: PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data Authors: Samah Fodeh , Linhai Ma , Yan Wang , Srivani Talakokkul , Ganesh Puthiaraju , Afshan Khan , Ashley Hagaman , Sarah Lowe , Aimee Roundtree View a PDF of the paper titled PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data, by Samah Fodeh and 8 other authors View PDF HTML Abstract: Patient-generated text such as secure messages, surveys, and interviews contains rich expressions of the patient voice , reflecting communicative behaviors and social determinants of health . Traditional qualitative coding frameworks are labor intensive and do not scale to large volumes of patient-authored messages across health systems. Existing machine learning and natural language processing approaches provide partial solutions but often treat patient-centered communication and SDoH as separate tasks or rely on models not well suited to patient-facing language. We introduce PVminer, a domain-adapted NLP framework for structuring patient voice in secure patient-provider communication. PVminer formulates PV detection as a multi-label, multi-class prediction task integrating patient-specific BERT encoders (PV-BERT-base and PV-BERT-large), unsupervised topic modeling for thematic augmentation (PV-Topic-BERT), and fine-tuned classifiers for Code, Subcode, and Combo-level labels. Topic representations are incorporated during fine-tuning and inference to enrich semantic inputs. PVminer achieves strong performance across hierarchical tasks and outperforms biomedical and clinical pre-trained baselines, achieving F1 scores of 82.25% , 80.14% , and up to 77.87% . An ablation study further shows that author identity and topic-based augmentation each contribute meaningful gains. Pre-trained models, source code, and documentation will be publicly released, with annotated datase...
Read full article at source