ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization

#ReplaceMe #DepthPruning #TransformerBlocks #Linearization #TrainingFree #CalibrationDataset #OpenSource #NeurIPS #ModelSimplification #PerformanceRetention #ComputationalOverhead #arXiv #csCL

📌 Key Takeaways

  • ReplaceMe is a training‑free depth pruning method that linearizes transformer blocks, requiring only a small calibration dataset.
  • The approach prunes up to 25% of the model’s layers while retaining roughly 90% of performance on open benchmarks, without additional training.
  • ReplaceMe consistently outperforms other training‑free pruning techniques and competes with state‑of‑the‑art methods that involve retraining or architectural changes.
  • The technique eliminates extra model parameters by merging the linear mapping directly into remaining transformer blocks.
  • An open‑source library and the code repository were released to support wide adoption and further research.
  • The work was submitted to arXiv and accepted for presentation at NeurIPS 2025.
  • The latest arXiv version includes significant updates (v4) released on 19 February 2026.
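The core idea in the takeaways above — estimating a linear transformation from a small calibration set so that it approximates a span of pruned transformer blocks — can be illustrated with a minimal numpy sketch. This is not the authors' implementation; the activations `X` and `Y` and the closed-form least-squares fit are illustrative stand-ins for whatever estimator the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical calibration activations: X holds hidden states entering the
# pruned span of blocks, Y holds the hidden states leaving it.
# Shapes: (n_samples, hidden_dim). In practice these would come from running
# a small calibration dataset through the original model.
n, d = 256, 64
X = rng.standard_normal((n, d))
true_map = np.eye(d) + 0.1 * rng.standard_normal((d, d))
Y = X @ true_map  # toy stand-in for the pruned blocks' effect

# Closed-form least-squares estimate of the replacement linear transform T,
# chosen so that X @ T approximates Y.
T, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The linear replacement should reproduce the pruned blocks' output closely
# on the calibration data.
err = np.linalg.norm(X @ T - Y) / np.linalg.norm(Y)
print(f"relative error: {err:.2e}")
```

Because the fit is closed-form, no gradient-based training is needed — which is the sense in which the method is "training-free".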

📖 Full Retelling

The paper "ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization" was authored by Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, and Sergey Zagoruyko. It was submitted to the Computer Science > Computation and Language arXiv repository on 5 May 2025 (v1) and last revised on 19 February 2026 (v4). The paper introduces a training‑free depth pruning technique that replaces transformer blocks with a linear operation estimated from a small calibration dataset, simplifying transformer models while retaining high performance. The method was presented at NeurIPS 2025, and the authors released an open‑source library implementing ReplaceMe alongside several state‑of‑the‑art depth pruning techniques.

🏷️ Themes

Model compression, Transformer architecture, Depth pruning, Training‑free methods, Performance retention, Open‑source software, NeurIPS conference contributions


Deep Analysis

Why It Matters

ReplaceMe offers a training-free way to prune transformer depth, removing up to 25% of a model's layers while retaining roughly 90% of its benchmark performance. This reduces inference cost and deployment barriers without any extra training steps.

Context & Background

  • Transformer models are large and computationally heavy
  • Traditional pruning requires retraining or fine-tuning
  • ReplaceMe replaces blocks with linear operations using a small calibration set
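A key consequence of the bullet points above is that the estimated linear map adds no parameters: because composing two linear maps is itself a linear map, the replacement can be folded into an adjacent weight matrix ahead of time. The sketch below illustrates this folding with hypothetical matrices `T` (the estimated replacement) and `W` (the next surviving block's first projection); the `h @ W` convention is a toy simplification, not the paper's exact merge procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64

# Hypothetical estimated linear replacement T for the pruned blocks,
# and W: the first projection matrix of the next surviving block
# (applied as h @ W in this toy convention).
T = np.eye(d) + 0.05 * rng.standard_normal((d, d))
W = rng.standard_normal((d, d))

# Folding: instead of computing (h @ T) @ W at inference time,
# precompute W_merged = T @ W once. The pruned blocks' replacement then
# costs nothing extra: no new layer, no new parameters.
W_merged = T @ W

h = rng.standard_normal((8, d))  # a batch of hidden states
out_two_step = (h @ T) @ W
out_merged = h @ W_merged
print(np.allclose(out_two_step, out_merged))
```

Matrix-multiplication associativity guarantees the merged and two-step computations agree, which is why the mapping "can be seamlessly merged with the remaining transformer blocks".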

What Happens Next

The library is expected to be integrated into popular NLP frameworks, enabling faster model deployment. Researchers may extend the technique to other architectures and explore deeper pruning ratios.

Frequently Asked Questions

How does ReplaceMe differ from other pruning methods?

It replaces transformer blocks with linear operations without any retraining, using only a small calibration dataset.

What size calibration dataset is needed?

The paper requires only a small calibration dataset to estimate the linear transformation; no full training corpus or fine-tuning run is needed.

Can ReplaceMe be applied to all transformer models?

It works on many large language models but may require adaptation for specific architectures.

Original Source
Computer Science > Computation and Language — arXiv:2505.02819
[Submitted on 5 May 2025 (v1), last revised 19 Feb 2026 (this version, v4)]

Title: ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization

Authors: Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko

Abstract: We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation, which approximates the pruned blocks. The estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models, ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks — without any training or healing steps, resulting in minimal computational overhead. We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at this https URL

Comments: This work was accepted and presented at NeurIPS 2025. Code is available at this https URL. Reviews at OpenReview: this https URL. NeurIPS 2025 Proceedings: this https URL.
Subjects: Computati...

Source

arxiv.org
