
Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

#MMHNet #Video-to-Audio Generation #Length Generalization #Multimodal AI #Hierarchical Networks #Long-form Audio #CVPR 2026

📌 Key Takeaways

  • Researchers developed MMHNet, a multimodal hierarchical network for video-to-audio generation
  • The model can generate audio content exceeding 5 minutes, surpassing previous limitations
  • Models trained only on short clips can generalize to much longer ones at test time, with no training on long durations
  • The approach integrates hierarchical methods and non-causal Mamba for long-form audio generation
  • The research was accepted to CVPR 2026 and represents a significant advancement in multimodal AI

📖 Full Retelling

Researchers led by Christian Simon, together with ten collaborators, introduced MMHNet, a multimodal hierarchical network for video-to-audio generation, in a paper submitted to arXiv on February 24, 2026. The work addresses the difficulty of scaling multimodal alignment between video and audio, and asks specifically whether models trained on short instances can generalize to longer ones at test time. The paper, titled 'Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models,' was accepted to CVPR 2026 and presents MMHNet as an enhanced extension of state-of-the-art video-to-audio models.

The authors identify two obstacles to scaling video-audio alignment: limited training data and the mismatch between text descriptions and frame-level video information. Their approach combines a hierarchical method with non-causal Mamba to support long-form audio generation, significantly improving long audio generation to more than 5 minutes. They further demonstrate that training on short clips and testing on longer ones is feasible in video-to-audio generation, without ever training on the longer durations.

In experiments, MMHNet achieved strong results on long-video-to-audio benchmarks, outperforming prior video-to-audio methods, which struggle at long durations; the model can generate audio content exceeding 5 minutes. This has clear implications for applications that require synchronized audio-visual content over extended periods, such as film production, virtual reality environments, and multimedia content creation.
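
💡 Illustrative Sketch (not from the paper)

The authors have not released code here, so the following is only a minimal, hypothetical sketch of the two ideas the retelling describes: a non-causal (bidirectional) state-space-style mixer that is agnostic to sequence length, and a two-level hierarchy that pools frame features into coarse segments before refining them frame by frame. It is not the authors' MMHNet implementation; all class names, dimensions, and the fixed-decay scan standing in for real selective Mamba layers are assumptions made purely for illustration.

```python
# Illustrative sketch only -- NOT the authors' MMHNet implementation.
# It shows, under stated assumptions, how a non-causal (bidirectional)
# linear scan plus a two-level hierarchy can process sequences of any
# length, even if training only ever saw short clips.
# All class and parameter names here are hypothetical.

import torch
import torch.nn as nn


class BidirectionalScanBlock(nn.Module):
    """Toy non-causal mixer: a decaying linear scan run in both directions.

    Real Mamba uses input-dependent (selective) state-space parameters;
    this fixed-decay recurrence is only a length-agnostic stand-in.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.log_decay = nn.Parameter(torch.zeros(dim))  # per-channel decay

    def _scan(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); first-order recurrence h_t = a * h_{t-1} + x_t
        a = torch.sigmoid(self.log_decay)
        h = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.size(1)):
            h = a * h + x[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u = self.in_proj(x)
        fwd = self._scan(u)                   # left-to-right context
        bwd = self._scan(u.flip(1)).flip(1)   # right-to-left context
        return x + self.out_proj(fwd + bwd)   # non-causal: sees both sides


class HierarchicalV2ASketch(nn.Module):
    """Two-level hierarchy: coarse segment context + frame-level refinement."""

    def __init__(self, video_dim: int = 512, audio_dim: int = 128,
                 hidden: int = 256, segment: int = 25):
        super().__init__()
        self.segment = segment                       # frames per coarse segment
        self.embed = nn.Linear(video_dim, hidden)
        self.coarse = BidirectionalScanBlock(hidden) # segment-level pass
        self.fine = BidirectionalScanBlock(hidden)   # frame-level pass
        self.head = nn.Linear(hidden, audio_dim)     # audio latent per frame

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, frames, video_dim); frames may exceed training length
        x = self.embed(video_feats)
        b, t, d = x.shape
        pad = (-t) % self.segment
        xp = nn.functional.pad(x, (0, 0, 0, pad))
        seg = xp.view(b, -1, self.segment, d).mean(dim=2)        # pool to segments
        seg = self.coarse(seg)                                    # long-range context
        ctx = seg.repeat_interleave(self.segment, dim=1)[:, :t]   # broadcast back
        x = self.fine(x + ctx)                                    # local refinement
        return self.head(x)                                       # (batch, frames, audio_dim)


if __name__ == "__main__":
    model = HierarchicalV2ASketch()
    short_clip = torch.randn(1, 200, 512)    # e.g. a training-length clip
    long_clip = torch.randn(1, 2000, 512)    # a much longer clip at test time
    print(model(short_clip).shape, model(long_clip).shape)
```

Because both the coarse and fine passes are simple scans over whatever length they receive, the same weights run unchanged on a 2,000-frame input even if training only ever used 200-frame clips, which is the train-short, test-long behavior the paper highlights.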

🏷️ Themes

Multimodal AI, Length Generalization, Video-to-Audio Generation

📚 Related People & Topics

Multimodal learning

Machine learning methods using multiple input modalities

Multimodal learning is a type of deep learning that integrates and processes multiple types of data, referred to as modalities, such as text, audio, images, or video. This integration allows for a more holistic understanding of complex data, improving model performance in tasks like visual question...


Original Source
Computer Science > Computer Vision and Pattern Recognition

arXiv:2602.20981 [Submitted on 24 Feb 2026]

Title: Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

Authors: Christian Simon, Masato Ishii, Wei-Yao Wang, Koichi Saito, Akio Hayakawa, Dongseok Shim, Zhi Zhong, Shuyang Cui, Shusuke Takahashi, Takashi Shibuya, Yuki Mitsufuji

Abstract: Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones during testing. To tackle this challenge, we present multimodal hierarchical networks, so-called MMHNet, an enhanced extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation. Our proposed method significantly improves long audio generation up to more than 5 minutes. We also prove that training short and testing long is possible in the video-to-audio generation tasks without training on the longer durations. We show in our experiments that our proposed method could achieve remarkable results on long-video to audio benchmarks, beating prior works in video-to-audio tasks. Moreover, we showcase our model capability in generating more than 5 minutes, while prior video-to-audio methods fall short in generating with long durations.

Comments: Accepted to CVPR 2026
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.20981 [cs.CV] (or arXiv:2602.20981v1 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2602.20981

Source

arxiv.org
