BravenNow
Predictive Batch Scheduling: Accelerating Language Model Training Through Loss-Aware Sample Prioritization

#Predictive Batch Scheduling #Loss‑Aware Scheduling #Transformer #Token Frequency #Sequence Length #Vocabulary Diversity #Rare Token Ratio #Fast Convergence #Linear Predictor #arXiv

📌 Key Takeaways

  • PBS prioritizes high‑loss samples using a linear predictor trained online.
  • The predictor relies on four token‑level features: token frequency, sequence length, vocabulary diversity, and rare‑token ratio.
  • The predictor achieves a 0.44 correlation with true loss, improving from 0.14 over 10,000 training steps.
  • Demonstrates 6–13% faster convergence on a 130M‑parameter transformer.
  • Offers negligible computational overhead compared to hard‑example mining methods.
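The four features listed above can be computed from token statistics alone, with no forward pass. The sketch below picks one plausible formula for each; the paper names the features but this summary does not give their exact definitions, so every formula here (and the `rare_cutoff` threshold) is an assumption.

```python
from collections import Counter

def difficulty_features(tokens, token_counts, total_tokens, rare_cutoff=100):
    """Compute the four token-level features PBS-style predictors use.

    Formulas are illustrative assumptions: mean corpus frequency,
    raw sequence length, type/token ratio, and rare-token fraction.
    """
    n = len(tokens)
    # Mean corpus frequency of the sample's tokens (common tokens -> easier).
    mean_freq = sum(token_counts[t] for t in tokens) / (n * total_tokens)
    # Raw sequence length.
    seq_len = float(n)
    # Vocabulary diversity as the fraction of distinct tokens.
    diversity = len(set(tokens)) / n
    # Fraction of tokens seen fewer than `rare_cutoff` times in the corpus.
    rare_ratio = sum(1 for t in tokens if token_counts[t] < rare_cutoff) / n
    return [mean_freq, seq_len, diversity, rare_ratio]

# Tiny toy corpus to exercise the feature extractor.
corpus = [["the", "cat", "sat"], ["the", "the", "dog"], ["quark", "gluon", "the"]]
counts = Counter(t for sample in corpus for t in sample)
total = sum(counts.values())
feats = difficulty_features(corpus[2], counts, total, rare_cutoff=2)
```

Because the features are static, they can be precomputed once per sample and reused across epochs, which is where the negligible overhead comes from.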

📖 Full Retelling

Sumedh Rasal, in his February 19, 2026 arXiv submission (cs.AI – Artificial Intelligence), introduces Predictive Batch Scheduling (PBS), a lightweight training‑time optimization that accelerates transformer language‑model convergence by prioritizing high‑loss samples during batch construction. Published on arXiv under DOI 10.48550/arXiv.2602.17066, the work demonstrates how static token‑level features can predict sample difficulty and improve training speed by 6–13% without the cost of per‑sample loss tracking.

🏷️ Themes

Machine Learning Optimization, Curriculum Learning, Transformer Training, Compute‑Efficiency, Feature‑Based Prediction


Deep Analysis

Why It Matters

Predictive Batch Scheduling speeds up language model training by prioritizing difficult samples, reducing training time by up to 13% without heavy computational cost.

Context & Background

  • Training large language models is costly and time consuming
  • Existing curriculum learning methods require predefined difficulty metrics
  • Hard example mining demands expensive per-sample loss tracking
  • PBS uses a lightweight predictor with only four token-level features
  • Experiments show 6–13% faster convergence on a 130M transformer
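The contrast drawn above can be made concrete: instead of tracking a running loss per sample (hard‑example mining), PBS scores candidates with a cheap linear predictor and fills each batch with the highest predicted losses. The helper below is a hypothetical sketch of that batch‑construction step, not the paper's exact procedure; the weights are made‑up illustrative values.

```python
import heapq

def schedule_batch(samples, features, weights, bias, batch_size):
    """Return the batch_size samples with the highest predicted loss.

    `features[i]` is the precomputed feature vector for samples[i];
    the predictor is a plain linear model: score = w . x + b.
    """
    def predict(x):
        return sum(w * xi for w, xi in zip(weights, x)) + bias

    scored = [(predict(x), i) for i, x in enumerate(features)]
    top = heapq.nlargest(batch_size, scored)   # no full sort needed
    return [samples[i] for _, i in top]

samples = ["easy", "medium", "hard"]
# Feature order: [mean token frequency, length, diversity, rare-token ratio]
features = [[0.9, 10, 0.3, 0.0], [0.5, 20, 0.5, 0.1], [0.1, 40, 0.9, 0.6]]
weights = [-1.0, 0.01, 0.5, 2.0]  # rarity and diversity raise predicted loss
batch = schedule_batch(samples, features, weights, 0.0, batch_size=2)
```

Scoring is a dot product per sample, so the scheduling cost stays negligible next to a forward‑backward pass through the model.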

What Happens Next

Researchers may extend PBS to larger models and other domains, integrate it into mainstream training pipelines, and explore additional features to improve predictor accuracy.

Frequently Asked Questions

What is Predictive Batch Scheduling?

It is a training optimization that dynamically selects high-loss samples for each batch using a lightweight online predictor.

How does PBS differ from traditional curriculum learning?

PBS does not need predefined difficulty scores; it learns a simple linear predictor from token statistics online, during training itself.
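Learning the predictor online can be as simple as one stochastic‑gradient step on the squared error between the predicted and the actually observed loss each time a sample is trained on. This summary does not specify the paper's update rule, so the plain least‑squares SGD step below is an assumption:

```python
def sgd_update(weights, bias, x, observed_loss, lr=0.01):
    """One online least-squares step: nudge the linear predictor toward
    the loss actually measured for this sample during training."""
    pred = sum(w * xi for w, xi in zip(weights, x)) + bias
    err = pred - observed_loss
    new_w = [w - lr * err * xi for w, xi in zip(weights, x)]
    new_b = bias - lr * err
    return new_w, new_b

# Toy run with a single constant feature: the predictor should converge
# to outputting the observed loss of 2.0.
w, b = [0.0], 0.0
for _ in range(200):
    w, b = sgd_update(w, b, [1.0], observed_loss=2.0, lr=0.1)
```

An update this cheap, applied only to samples whose true loss was just computed anyway, is consistent with the reported correlation climbing from 0.14 to 0.44 over 10,000 steps at negligible extra cost.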

What benefits does PBS provide?

It achieves up to 13% faster convergence with negligible extra computation, lowering training cost.

Can PBS be used with other model architectures?

Yes, the approach is generic and can be applied to other transformer-based or sequence models.

Original Source
Computer Science > Artificial Intelligence
arXiv:2602.17066 [Submitted on 19 Feb 2026]
Title: Predictive Batch Scheduling: Accelerating Language Model Training Through Loss-Aware Sample Prioritization
Authors: Sumedh Rasal
Abstract: We introduce Predictive Batch Scheduling (PBS), a novel training optimization technique that accelerates language model convergence by dynamically prioritizing high-loss samples during batch construction. Unlike curriculum learning approaches that require predefined difficulty metrics or hard example mining methods that demand expensive per-sample loss tracking, PBS employs a lightweight linear predictor trained online to estimate sample difficulty from static token-level features. Our predictor achieves 0.44 correlation with actual loss using only four simple features: token frequency, sequence length, vocabulary diversity, and rare token ratio. Experiments on a 130M parameter transformer demonstrate that PBS achieves 6–13% faster convergence measured by evaluation loss across training checkpoints, with the predictor's correlation improving from 0.14 to 0.44 over 10,000 training steps. These results validate that token frequency statistics encode meaningful information about sample difficulty, enabling effective curriculum learning with negligible computational overhead.
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.17066 [cs.AI] (arXiv:2602.17066v1 for this version), https://doi.org/10.48550/arXiv.2602.17066 (arXiv-issued DOI via DataCite, pending registration)
Submission history: [v1] Thu, 19 Feb 2026 04:15:39 UTC (11 KB), from Sumedh Rasal

Source

arxiv.org
