Efficient Encoder-Free Fourier-based 3D Large Multimodal Model



Original Source
arXiv:2602.23153 [Submitted on 26 Feb 2026]

Title: Efficient Encoder-Free Fourier-based 3D Large Multimodal Model

Authors: Guofeng Mei, Wei Lin, Luigi Riz, Yujiao Wu, Yiming Wang, Fabio Poiesi

Abstract: Large Multimodal Models that process 3D data typically rely on heavy, pre-trained visual encoders to extract geometric features. While recent 2D LMMs have begun to eliminate such encoders for efficiency and scalability, extending this paradigm to 3D remains challenging due to the unordered and large-scale nature of point clouds. This leaves a critical unanswered question: How can we design an LMM that tokenizes unordered 3D data effectively and efficiently without a cumbersome encoder? We propose Fase3D, the first efficient encoder-free Fourier-based 3D scene LMM. Fase3D tackles the challenges of scalability and permutation invariance with a novel tokenizer that combines point cloud serialization and the Fast Fourier Transform to approximate self-attention. This design enables an effective and computationally minimal architecture, built upon three key innovations: First, we represent large scenes compactly via structured superpoints. Second, our space-filling curve serialization followed by an FFT enables efficient global context modeling and graph-based token merging. Lastly, our Fourier-augmented LoRA adapters inject global frequency-aware interactions into the LLMs at a negligible cost. Fase3D achieves performance comparable to encoder-based 3D LMMs while being significantly more efficient in computation and parameters. Project website: this https URL.

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Cite as: arXiv:2602.23153 [cs.CV] (or arXiv:2602.23153v1 [cs....
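
The serialize-then-FFT idea in the abstract can be illustrated with a minimal sketch in Python with NumPy. This is not the Fase3D implementation. The abstract does not say which space-filling curve is used, so a Morton (Z-order) curve is assumed here, and the mixing step borrows the FNet recipe (2D FFT over sequence and feature axes, keeping the real part) as a stand-in for the paper's attention approximation. The names morton_codes, serialize, and fourier_mix are hypothetical; superpoints, token merging, and the LoRA adapters are not covered.

```python
import numpy as np


def morton_codes(points, bits=10):
    """Z-order (Morton) code per point: interleave bits of quantized x, y, z.

    Assumption: Morton order stands in for the unspecified space-filling curve.
    """
    mins = points.min(axis=0)
    span = np.maximum(points.max(axis=0) - mins, 1e-9)
    # Quantize each coordinate to an integer grid cell in [0, 2**bits - 1].
    cells = ((points - mins) / span * (2**bits - 1)).astype(np.uint64)
    codes = np.zeros(len(points), dtype=np.uint64)
    for b in range(bits):
        for axis in range(3):
            bit = (cells[:, axis] >> np.uint64(b)) & np.uint64(1)
            codes |= bit << np.uint64(3 * b + axis)
    return codes


def serialize(points, feats, bits=10):
    """Sort points (and their features) along the curve.

    The resulting sequence does not depend on the input order,
    as long as no two points share a grid cell.
    """
    order = np.argsort(morton_codes(points, bits), kind="stable")
    return points[order], feats[order]


def fourier_mix(tokens):
    """FNet-style global token mixing: 2D FFT, real part only.

    Costs O(N log N) in sequence length N, versus O(N^2) for self-attention.
    """
    return np.fft.fft2(tokens, axes=(0, 1)).real


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.random((1024, 3))          # toy point cloud
    feats = rng.standard_normal((1024, 16))  # toy per-point features
    _, ordered = serialize(pts, feats)
    mixed = fourier_mix(ordered)
    print(mixed.shape)
```

Sorting first matters because the FFT is order-sensitive: the curve gives the unordered point set a canonical 1D layout in which nearby indices tend to be nearby in space, so the frequency-domain mixing sees a consistent sequence regardless of how the points were stored.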