When Denoising Hinders: Revisiting Zero-Shot ASR with SAM-Audio and Whisper
#SAM-Audio #Whisper #zero-shot ASR #denoising #audio preprocessing #speech recognition #noise reduction
Key Takeaways
- SAM-Audio denoising consistently degrades Whisper's zero-shot ASR performance, raising both word and character error rates relative to raw noisy speech.
- The study highlights a trade-off between noise reduction and speech recognition accuracy in audio processing.
- Researchers recommend selective use of denoising based on audio quality to optimize ASR results.
- The findings challenge assumptions that preprocessing always benefits automatic speech recognition systems.
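The "selective use of denoising" recommendation above can be sketched as a simple gate: estimate the input's signal-to-noise ratio and only invoke the enhancement model when the audio actually looks noisy. Everything below is illustrative, not from the paper — the crude percentile-based SNR estimator, the threshold, and the `denoise_fn` callable are stand-ins for a real VAD/noise tracker and a real enhancement model such as SAM-Audio.

```python
import numpy as np

def estimate_snr_db(signal: np.ndarray, frame: int = 512) -> float:
    """Crude SNR estimate: treat the quietest frames as the noise floor.
    Illustrative only -- real systems use proper VAD / noise tracking."""
    frames = [signal[i:i + frame] for i in range(0, len(signal) - frame + 1, frame)]
    energies = np.array([float(np.mean(f ** 2)) for f in frames])
    noise_floor = np.percentile(energies, 10) + 1e-12   # quietest 10% ~ noise
    speech_level = np.percentile(energies, 90) + 1e-12  # loudest 10% ~ speech
    return 10.0 * float(np.log10(speech_level / noise_floor))

def maybe_denoise(signal: np.ndarray, denoise_fn, snr_threshold_db: float = 15.0):
    """Run denoise_fn only on noisy-looking audio; pass clean audio through untouched."""
    if estimate_snr_db(signal) < snr_threshold_db:
        return denoise_fn(signal)
    return signal
```

The threshold (15 dB here) is an arbitrary placeholder; per the paper's findings, it would need to be tuned per dataset, and for Whisper the safest setting may simply be to skip enhancement entirely.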
Full Retelling
arXiv:2603.04710v1 Announce Type: cross
Abstract: Recent advances in automatic speech recognition (ASR) and speech enhancement have led to a widespread assumption that improving perceptual audio quality should directly benefit recognition accuracy. In this work, we rigorously examine whether this assumption holds for modern zero-shot ASR systems. We present a systematic empirical study on the impact of Segment Anything Model Audio (SAM-Audio) by Meta AI, a recent foundation-scale speech enhancement model, when used as a preprocessing step for zero-shot transcription with Whisper.
Themes
Audio Processing, Speech Recognition
Original Source
Computer Science > Sound — arXiv:2603.04710 [Submitted on 5 Mar 2026]
Title: When Denoising Hinders: Revisiting Zero-Shot ASR with SAM-Audio and Whisper
Authors: Akif Islam, Raufun Nahar, Md. Ekramul Hamid
Abstract: Recent advances in automatic speech recognition and speech enhancement have led to a widespread assumption that improving perceptual audio quality should directly benefit recognition accuracy. In this work, we rigorously examine whether this assumption holds for modern zero-shot ASR systems. We present a systematic empirical study on the impact of Segment Anything Model Audio (SAM-Audio), a recent foundation-scale speech enhancement model proposed by Meta AI, when used as a preprocessing step for zero-shot transcription with Whisper. Experiments are conducted across multiple Whisper model variants and two linguistically distinct noisy speech datasets: a real-world Bengali YouTube corpus and a publicly available English noisy dataset. Contrary to common intuition, our results show that SAM-Audio preprocessing consistently degrades ASR performance, increasing both Word Error Rate and Character Error Rate compared to raw noisy speech, despite substantial improvements in signal-level quality. Objective Peak Signal-to-Noise Ratio analysis on the English dataset confirms that SAM-Audio produces acoustically cleaner signals, yet this improvement fails to translate into recognition gains. We therefore conducted a detailed utterance-level analysis to understand this counterintuitive result. We found that the recognition degradation is a systematic issue affecting the majority of the audio, not just isolated outliers, and that the errors worsen as the Whisper model size increases.
These findings expose a fundamental mismatch: audio that is perceptually cleaner to human listeners is not necessarily more robust for machine recognition.
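The degradation reported above is quantified with Word Error Rate (WER) and Character Error Rate (CER). Both reduce to a Levenshtein edit distance over a token sequence, normalized by the reference length. A minimal sketch for intuition — this is not the paper's evaluation code, which presumably uses a standard scoring toolkit:

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two token sequences (words or characters)."""
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # match / substitution
    return dp[-1][-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference word count."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / max(len(ref), 1)

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance (spaces ignored here)."""
    ref = reference.replace(" ", "")
    return edit_distance(ref, hypothesis.replace(" ", "")) / max(len(ref), 1)
```

Scoring Whisper's transcripts of the raw and the SAM-Audio-denoised audio against the same reference would surface the paper's finding as a higher WER/CER for the denoised hypotheses.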