#Multimodal AI
Latest news articles tagged with "Multimodal AI". Follow the timeline of events, related topics, and entities.
Articles (7)
- 🇺🇸 HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models
[USA]
arXiv:2506.03922v2 Announce Type: replace-cross Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant potential to advance a broad range of domains. However, current benchm...
Related: #AI Benchmarking, #Interdisciplinary Research
- 🇺🇸 Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
[USA]
arXiv:2602.20981v1 Announce Type: cross Abstract: Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and ...
Related: #Length Generalization, #Video-to-Audio Generation
- 🇺🇸 Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence
[USA]
arXiv:2502.17028v3 Announce Type: replace-cross Abstract: Vision-language alignment is crucial for various downstream tasks such as cross-modal generation and retrieval. Previous multimodal approache...
Related: #Machine Learning, #Vision-Language Alignment
- 🇺🇸 Improving MLLMs in Embodied Exploration and Question Answering with Human-Inspired Memory Modeling
[USA]
arXiv:2602.15513v1 Announce Type: cross Abstract: Deploying Multimodal Large Language Models as the brain of embodied agents remains challenging, particularly under long-horizon observations and limi...
Related: #Embodied Agents, #Memory Modeling, #Natural Language Processing, #Computer Vision
- 🇺🇸 Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models
[USA]
arXiv:2602.15772v1 Announce Type: cross Abstract: Current research in multimodal models faces a key challenge where enhancing generative capabilities often comes at the expense of understanding, and ...
Related: #Generation vs. Understanding, #Model Optimization, #Reasoning and Reflection, #Trade-off Analysis
- 🇺🇸 Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models
[USA]
arXiv:2602.12618v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all LLM layers. Prior pr...
Related: #Computational efficiency, #Model optimization
- 🇺🇸 Artic: AI-oriented Real-time Communication for MLLM Video Assistant
[USA]
arXiv:2602.12641v1 Announce Type: cross Abstract: AI Video Assistant emerges as a new paradigm for Real-time Communication (RTC), where one peer is a Multimodal Large Language Model (MLLM) deployed i...
Related: #AI Communication, #Real-time Systems