BravenNow
When Denoising Hinders: Revisiting Zero-Shot ASR with SAM-Audio and Whisper
| USA | technology | ✓ Verified - arxiv.org

#SAM-Audio #Whisper #zero-shot ASR #denoising #audio preprocessing #speech recognition #noise reduction

📌 Key Takeaways

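  • SAM-Audio preprocessing consistently degraded Whisper's zero-shot ASR performance on noisy speech, raising both Word Error Rate and Character Error Rate relative to the raw audio.
  • The study highlights a mismatch between noise reduction and recognition accuracy: signal-level quality improved, yet transcription got worse.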
  • Researchers recommend selective use of denoising based on audio quality to optimize ASR results.
  • The findings challenge assumptions that preprocessing always benefits automatic speech recognition systems.

📖 Full Retelling

arXiv:2603.04710v1 Announce Type: cross Abstract: Recent advances in automatic speech recognition (ASR) and speech enhancement have led to a widespread assumption that improving perceptual audio quality should directly benefit recognition accuracy. In this work, we rigorously examine whether this assumption holds for modern zero-shot ASR systems. We present a systematic empirical study on the impact of Segment Anything Model Audio by Meta AI, a recent foundation-scale speech enhancement model p

🏷️ Themes

Audio Processing, Speech Recognition

Original Source
Computer Science > Sound

arXiv:2603.04710 [Submitted on 5 Mar 2026]

Title: When Denoising Hinders: Revisiting Zero-Shot ASR with SAM-Audio and Whisper

Authors: Akif Islam, Raufun Nahar, Md. Ekramul Hamid

Abstract: Recent advances in automatic speech recognition and speech enhancement have led to a widespread assumption that improving perceptual audio quality should directly benefit recognition accuracy. In this work, we rigorously examine whether this assumption holds for modern zero-shot ASR systems. We present a systematic empirical study on the impact of Segment Anything Model Audio by Meta AI, a recent foundation-scale speech enhancement model proposed by Meta, when used as a preprocessing step for zero-shot transcription with Whisper. Experiments are conducted across multiple Whisper model variants and two linguistically distinct noisy speech datasets: a real-world Bengali YouTube corpus and a publicly available English noisy dataset. Contrary to common intuition, our results show that SAM-Audio preprocessing consistently degrades ASR performance, increasing both Word Error Rate and Character Error Rate compared to raw noisy speech, despite substantial improvements in signal-level quality. Objective Peak Signal-to-Noise Ratio analysis on the English dataset confirms that SAM-Audio produces acoustically cleaner signals, yet this improvement fails to translate into recognition gains. Therefore, we conducted a detailed utterance-level analysis to understand this counterintuitive result. We found that the recognition degradation is a systematic issue affecting the majority of the audio, not just isolated outliers, and that the errors worsen as the Whisper model size increases.
These findings expose a fundamental mismatch: audio that is perceptually cleaner to human listeners is not necessarily robust for m...
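
For readers unfamiliar with the metrics named in the abstract, the following is a minimal Python sketch of how Word Error Rate, Character Error Rate, Peak Signal-to-Noise Ratio, and an utterance-level comparison are commonly computed. It is not the authors' evaluation code. The function names, the whitespace tokenization, the default peak value of 1.0, and the `fraction_degraded` helper are all assumptions made for illustration; the paper's text normalization and scoring tools are not described in the excerpt above.

```python
import math
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: fewest substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()  # assumption: plain whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def psnr(clean: Sequence[float], test: Sequence[float], peak: float = 1.0) -> float:
    """Peak Signal-to-Noise Ratio in dB against a clean reference signal.

    Assumes equal-length signals normalized so the maximum amplitude is `peak`.
    """
    if len(clean) != len(test) or not clean:
        raise ValueError("signals must be non-empty and of equal length")
    mse = sum((c - t) ** 2 for c, t in zip(clean, test)) / len(clean)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


def fraction_degraded(refs, raw_hyps, enhanced_hyps) -> float:
    """Share of utterances whose WER is higher after enhancement (hypothetical helper)."""
    worse = sum(wer(r, e) > wer(r, n)
                for r, n, e in zip(refs, raw_hyps, enhanced_hyps))
    return worse / len(refs)
```

In a setup like the one the abstract describes, `raw_hyps` would hold transcripts of the noisy audio and `enhanced_hyps` transcripts of the same audio after enhancement. A `fraction_degraded` value above 0.5 would correspond to the reported pattern of degradation affecting the majority of utterances instead of a few outliers.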

Source

arxiv.org
