PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration
#PersianPunc #dataset #BERT #punctuation restoration #Persian #NLP #text processing
📌 Key Takeaways
- PersianPunc is a new large-scale dataset for Persian punctuation restoration.
- It introduces a BERT-based model to improve punctuation accuracy in Persian text.
- The approach addresses a gap in resources for Persian natural language processing.
- The dataset and model aim to enhance text readability and downstream NLP tasks.
📖 Full Retelling
arXiv:2603.05314v1 Announce Type: cross
Abstract: Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance.
🏷️ Themes
NLP, Persian Language
📚 Related People & Topics
Connections for NLP:
- Urdu (1 shared)
- Ethics of artificial intelligence (1 shared)
Original Source
Computer Science > Computation and Language
arXiv:2603.05314 [Submitted on 5 Mar 2026]
Title: PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration
Authors: Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery
Abstract: Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from critical limitations: over-correction tendencies that introduce undesired edits beyond punctuation insertion (particularly problematic for speech-to-text pipelines) and substantially higher computational requirements. Our lightweight BERT-based approach achieves a macro-averaged F1 score of 91.33% on our test set while maintaining efficiency suitable for real-time applications. We make our dataset ( this https URL ) and model ( this https URL ) publicly available to facilitate future research in Persian NLP and provide a scalable framework applicable to other morphologically rich, low-resource languages.
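The abstract frames punctuation restoration as token-level sequence labeling: each word in the unpunctuated input receives a label naming the mark (if any) that should follow it. A minimal sketch of how training pairs might be derived from punctuated Persian text is shown below; the label names and the set of marks are assumptions for illustration, not the paper's actual scheme.

```python
# Hypothetical sketch: turn a punctuated Persian sentence into
# (token, label) pairs for token-level sequence labeling.
# Label names and the punctuation set are illustrative assumptions.
PUNCT_LABELS = {
    "،": "COMMA",      # Arabic comma (U+060C)
    "؟": "QUESTION",   # Arabic question mark (U+061F)
    "؛": "SEMICOLON",  # Arabic semicolon (U+061B)
    ".": "PERIOD",
}

def to_labeled_pairs(sentence: str):
    """Strip trailing punctuation from each token and attach it as a label;
    tokens without a following mark get the 'O' (no-punctuation) label."""
    pairs = []
    for token in sentence.split():
        mark = token[-1] if token and token[-1] in PUNCT_LABELS else None
        word = token[:-1] if mark else token
        if word:  # skip tokens that consisted of punctuation only
            pairs.append((word, PUNCT_LABELS[mark] if mark else "O"))
    return pairs

if __name__ == "__main__":
    # e.g. "سلام، حال شما چطور است؟" ("Hello, how are you?")
    for word, label in to_labeled_pairs("سلام، حال شما چطور است؟"):
        print(word, label)
```

A token classifier (the paper fine-tunes ParsBERT) is then trained to predict these labels from the punctuation-free word sequence; restoring punctuation at inference time is just re-inserting each predicted mark after its token.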
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.05314 [cs.CL] (or arXiv:2603.05314v1 [cs.CL] for this version), https://doi.org/10.48550/arXiv.2603.05314
Read full article at source