MultiModalPFN extends TabPFN to handle both tabular and non-tabular data types
The model uses modality projectors to transform non-tabular data into tabular-compatible tokens
Researchers introduced a multi-head gated MLP and a cross-attention pooler to enhance context extraction
MMPFN outperformed state-of-the-art methods on medical and general multimodal datasets
The research was accepted to CVPR 2026 conference
📖 Full Retelling
Researchers Wall Kim, Chaeyoung Song, and Hanul Kim introduced MultiModalPFN, a machine learning model that extends Prior-Data Fitted Networks to multimodal tabular learning, in a paper submitted to arXiv on February 23, 2026. The work addresses a key limitation of existing TabPFN models: they struggle with the diverse data types common in domains such as healthcare and marketing. By enabling the integration of heterogeneous modalities such as images and text, MultiModalPFN marks a significant step forward for foundation models on tabular data.

The architecture comprises per-modality encoders, modality projectors, and pre-trained foundation models. The modality projectors serve as the critical bridge, transforming non-tabular embeddings into tabular-compatible tokens for unified processing. To enhance performance, the researchers introduced a multi-head gated MLP and a cross-attention pooler that extract richer context from non-tabular inputs while mitigating the attention imbalance issue that arises in multimodal learning.

The effectiveness of MultiModalPFN was demonstrated through extensive experiments on both medical and general-purpose multimodal datasets, where the model consistently outperformed competitive state-of-the-art methods. The source code is publicly available.
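The summary above names the two projector components (a multi-head gated MLP and a cross-attention pooler) but includes no code. The following is a minimal, hypothetical PyTorch sketch of what a modality projector of that kind could look like: a gated MLP refines per-item embeddings, and a cross-attention pooler compresses a variable-length set of non-tabular embeddings (e.g., image-patch features) into a fixed number of tabular-compatible tokens. All module names, dimensions, head counts, and the gating scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedMLP(nn.Module):
    """Multi-head gated MLP (illustrative): splits features into heads and
    gates each head's value path with a sigmoid gate path."""
    def __init__(self, dim, num_heads=4, hidden_mult=2):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        head_dim = dim // num_heads
        hidden = head_dim * hidden_mult
        self.value = nn.Linear(head_dim, hidden)
        self.gate = nn.Linear(head_dim, hidden)
        self.out = nn.Linear(hidden, head_dim)

    def forward(self, x):                      # x: (B, N, dim)
        B, N, D = x.shape
        x = x.view(B, N, self.num_heads, D // self.num_heads)
        h = self.out(torch.sigmoid(self.gate(x)) * self.value(x))
        return h.reshape(B, N, D)

class CrossAttentionPooler(nn.Module):
    """Pools a variable-length set of modality embeddings into a fixed
    number of learned query tokens via cross-attention."""
    def __init__(self, dim, num_tokens=4, num_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                      # x: (B, N, dim)
        q = self.queries.unsqueeze(0).expand(x.size(0), -1, -1)
        pooled, _ = self.attn(q, x, x)         # (B, num_tokens, dim)
        return pooled

class ModalityProjector(nn.Module):
    """Maps non-tabular embeddings to tabular-compatible tokens."""
    def __init__(self, in_dim, token_dim, num_tokens=4):
        super().__init__()
        self.proj = nn.Linear(in_dim, token_dim)
        self.mlp = GatedMLP(token_dim)
        self.pooler = CrossAttentionPooler(token_dim, num_tokens)

    def forward(self, emb):                    # emb: (B, N, in_dim)
        h = self.mlp(self.proj(emb))
        return self.pooler(h)                  # (B, num_tokens, token_dim)

# Example: project 16 image-patch embeddings (dim 768) into 4 tokens of dim 96
proj = ModalityProjector(in_dim=768, token_dim=96, num_tokens=4)
tokens = proj(torch.randn(2, 16, 768))
print(tokens.shape)  # torch.Size([2, 4, 96])
```

Whatever the paper's exact design, the pooling step is what makes the bridge work: the downstream tabular model sees a fixed, small number of extra "columns" per modality regardless of how many patches or text tokens the encoder produced.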
🏷️ Themes
Machine Learning, Multimodal Learning, Foundation Models, Data Integration
TabPFN (Tabular Prior-data Fitted Network) is a machine learning model for tabular datasets proposed in 2022. It uses a transformer architecture. It is intended for supervised classification and regression analysis on small- to medium-sized datasets, e.g., up to 10,000 samples.
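TabPFN's distinguishing trait is in-context learning: the labeled training rows and the unlabeled query rows are fed to a transformer in a single forward pass, and predictions come out with no gradient updates at prediction time. The toy, untrained sketch below illustrates only that interface shape; the encodings, dimensions, and architecture are illustrative and not TabPFN's actual design.

```python
import torch
import torch.nn as nn

class ToyPFN(nn.Module):
    """Toy prior-data fitted network: attends over (x, y) context pairs and
    query rows in one sequence; untrained, for interface shape only."""
    def __init__(self, n_features, n_classes, dim=64):
        super().__init__()
        self.x_enc = nn.Linear(n_features, dim)
        self.y_enc = nn.Embedding(n_classes + 1, dim)  # extra id = "unknown"
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_classes)
        self.unknown = n_classes

    def forward(self, X_train, y_train, X_test):
        # Context tokens carry features plus true labels; query tokens
        # use the "unknown" label embedding instead.
        ctx = self.x_enc(X_train) + self.y_enc(y_train)
        y_unk = torch.full((X_test.size(0),), self.unknown, dtype=torch.long)
        qry = self.x_enc(X_test) + self.y_enc(y_unk)
        h = self.encoder(torch.cat([ctx, qry], dim=0).unsqueeze(0)).squeeze(0)
        return self.head(h[X_train.size(0):])  # logits for query rows only

model = ToyPFN(n_features=5, n_classes=3)
logits = model(torch.randn(20, 5),             # 20 labeled context rows
               torch.randint(0, 3, (20,)),     # their labels
               torch.randn(4, 5))              # 4 query rows
print(logits.shape)  # torch.Size([4, 3])
```

In the real model, the transformer is pre-trained on millions of synthetic datasets drawn from a prior, which is what lets a single forward pass act as the "fit" step on a new small dataset.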
Machine learning methods using multiple input modalities
Multimodal learning is a type of deep learning that integrates and processes multiple types of data, referred to as modalities, such as text, audio, images, or video. This integration allows for a more holistic understanding of complex data, improving model performance in tasks like visual question answering.
Study of algorithms that improve automatically through experience
Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions.
Original Source
Computer Science > Machine Learning
arXiv:2602.20223 [Submitted on 23 Feb 2026]
Title: MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning
Authors: Wall Kim, Chaeyoung Song, Hanul Kim
Abstract: Recently, TabPFN has gained attention as a foundation model for tabular data. However, it struggles to integrate heterogeneous modalities such as images and text, which are common in domains like healthcare and marketing, thereby limiting its applicability. To address this, we present the Multi-Modal Prior-data Fitted Network (MMPFN), which extends TabPFN to handle tabular and non-tabular modalities in a unified manner. MMPFN comprises per-modality encoders, modality projectors, and pre-trained foundation models. The modality projectors serve as the critical bridge, transforming non-tabular embeddings into tabular-compatible tokens for unified processing. To this end, we introduce a multi-head gated MLP and a cross-attention pooler that extract richer context from non-tabular inputs while mitigating the attention imbalance issue in multimodal learning. Extensive experiments on medical and general-purpose multimodal datasets demonstrate that MMPFN consistently outperforms competitive state-of-the-art methods and effectively exploits non-tabular modalities alongside tabular features. These results highlight the promise of extending prior-data fitted networks to the multimodal setting, offering a scalable and effective framework for heterogeneous data learning. The source code is available at this https URL.
Comments: Accepted to CVPR 2026
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.20223 [cs.LG] (or arXiv:2602.20223v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2602.20223