BravenNow
PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration
| USA | technology | ✓ Verified - arxiv.org

#PersianPunc #dataset #BERT #punctuation restoration #Persian #NLP #text processing

📌 Key Takeaways

  • PersianPunc is a new large-scale dataset of 17 million samples for Persian punctuation restoration.
  • The authors fine-tune ParsBERT, a BERT-based model, as a token-level sequence labeler and report a macro-averaged F1 score of 91.33% on their test set.
  • The work addresses a gap in resources for Persian natural language processing.
  • The dataset and model are publicly released and aim to improve the readability of ASR output and support downstream NLP tasks.

📖 Full Retelling

arXiv:2603.05314v1. Abstract: Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance.

🏷️ Themes

NLP, Persian Language

Original Source
arXiv:2603.05314 [cs.CL], submitted on 5 Mar 2026

Title: PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration

Authors: Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

Abstract: Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from critical limitations: over-correction tendencies that introduce undesired edits beyond punctuation insertion (particularly problematic for speech-to-text pipelines) and substantially higher computational requirements. Our lightweight BERT-based approach achieves a macro-averaged F1 score of 91.33% on our test set while maintaining efficiency suitable for real-time applications. We make our dataset (this https URL) and model (this https URL) publicly available to facilitate future research in Persian NLP and provide a scalable framework applicable to other morphologically rich, low-resource languages.

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.05314 [cs.CL] (or arXiv:2603.05314v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2603.05314
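
To make the abstract's "token-level sequence labeling" framing concrete, here is a minimal Python sketch of how punctuated text could be turned into per-token labels, restored from them, and scored with macro-averaged F1. Everything in it is an illustrative assumption rather than the authors' pipeline: the label names, the set of punctuation marks, whitespace tokenization, and the helper names `to_labels`, `restore`, and `macro_f1` are hypothetical, and the paper may average F1 over a different set of classes.

```python
# Hypothetical label inventory: the marks and label names are assumptions,
# not the label set used by PersianPunc.
PUNCT = {
    "،": "COMMA",       # Persian comma (U+060C)
    ".": "PERIOD",
    "؟": "QUESTION",    # Persian question mark (U+061F)
    "!": "EXCLAMATION",
    "؛": "SEMICOLON",   # Persian semicolon (U+061B)
    ":": "COLON",
}
MARKS = {label: mark for mark, label in PUNCT.items()}


def to_labels(text):
    """Split text on whitespace and label each word with the mark that follows it ("O" if none)."""
    tokens, labels = [], []
    for raw in text.split():
        word, label = raw, "O"
        # Strip trailing marks; the mark closest to the word decides the label.
        while word and word[-1] in PUNCT:
            label = PUNCT[word[-1]]
            word = word[:-1]
        if word:
            tokens.append(word)
            labels.append(label)
        elif tokens and label != "O":
            # A free-standing mark attaches to the previous word.
            labels[-1] = label
    return tokens, labels


def restore(tokens, labels):
    """Rebuild punctuated text from words and their predicted labels."""
    return " ".join(tok + MARKS.get(lab, "") for tok, lab in zip(tokens, labels))


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every class seen in gold or pred (including "O")."""
    pairs = list(zip(gold, pred))
    scores = []
    for c in sorted(set(gold) | set(pred)):
        tp = sum(g == c and p == c for g, p in pairs)
        fp = sum(g != c and p == c for g, p in pairs)
        fn = sum(g == c and p != c for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    sentence = "سلام، حال شما چطور است؟"
    tokens, gold = to_labels(sentence)
    print(list(zip(tokens, gold)))
    print(restore(tokens, gold) == sentence)
    # A made-up prediction that misses the comma, to show how the score reacts.
    print(macro_f1(gold, ["O", "O", "O", "O", "QUESTION"]))
```

In a setup like the one the abstract describes, a fine-tuned token classifier would supply the predicted labels in place of the hand-written list above. Because the model only assigns a label to each existing token, the words themselves are never rewritten, which is consistent with the abstract's contrast against the over-correction it reports for large language models.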

Source

arxiv.org
