TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion
#TabDLM #tabular data generation #diffusion models #numerical-language modeling #synthetic data #machine learning research #data augmentation #foundation models
📌 Key Takeaways
- TabDLM is a unified framework for free-form tabular data generation using joint numerical-language diffusion modeling
- The approach combines the strengths of diffusion models and LLMs while overcoming their individual limitations
- TabDLM handles text through masked diffusion and numerical features through continuous diffusion with specialized tokens
- Experiments show TabDLM outperforms existing diffusion-based and LLM-based baseline approaches
📖 Full Retelling
Donghong Cai and five colleagues introduced TabDLM, a framework for free-form tabular data generation via joint numerical-language diffusion modeling, in a paper submitted to arXiv on February 26, 2026. The work addresses a growing challenge in machine learning: real-world tabular datasets increasingly contain free-form text fields, such as reviews or clinical notes, alongside structured numerical and categorical attributes, and generating such heterogeneous tables requires jointly modeling these different data types. Existing methods fall into two camps: diffusion-based approaches, which capture dependencies among structured features but struggle with text quality, and LLM-based generators, which produce fluent text but whose discrete tokenization distorts precise or wide-range numerical values. TabDLM combines the strengths of both. It models textual and categorical features through masked diffusion while handling numerical features with a continuous diffusion process over specialized numeric token embeddings; bidirectional attention then captures cross-modality interactions within a single unified model. Extensive experiments across diverse benchmarks show that TabDLM outperforms strong diffusion-based and LLM-based baselines, a notable advance for synthetic tabular data generation with applications in data augmentation, foundation models, and privacy preservation.
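The joint noising scheme described above — masking discrete tokens while adding Gaussian noise to numeric values — can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the noise schedule, the reserved `MASK_ID`, and the function names are all assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = 0  # reserved [MASK] token id (illustrative assumption)

def mask_discrete(tokens, t):
    """Masked diffusion on text/categorical tokens: each token is
    independently replaced by [MASK] with probability t in [0, 1]."""
    tokens = np.asarray(tokens)
    keep = rng.random(tokens.shape) >= t
    return np.where(keep, tokens, MASK_ID)

def noise_continuous(x0, t):
    """Continuous (Gaussian) diffusion on numeric features:
    x_t = sqrt(1 - t) * x0 + sqrt(t) * eps, a simple
    variance-preserving-style schedule (assumed, not the paper's)."""
    eps = rng.standard_normal(np.shape(x0))
    return np.sqrt(1.0 - t) * np.asarray(x0, dtype=float) + np.sqrt(t) * eps

# One forward-noising step over a mixed table row at noise level t = 0.5.
row_text = [5, 17, 42, 9]   # token ids for a free-form text field
row_nums = [3.2, -0.7]      # structured numeric attributes
t = 0.5
noisy_text = mask_discrete(row_text, t)   # some tokens become MASK_ID
noisy_nums = noise_continuous(row_nums, t)  # values perturbed by noise
```

During training, a single bidirectional model would receive both corrupted views and be asked to recover the originals; the model itself (attention layers, numeric token embeddings) is omitted here, since only the two-channel noising is being illustrated.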
🏷️ Themes
Machine Learning, Data Generation, Diffusion Models
Original Source
Computer Science > Machine Learning — arXiv:2602.22586 [Submitted on 26 Feb 2026]
Title: TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion
Authors: Donghong Cai, Jiarui Feng, Yanbo Wang, Da Zheng, Yixin Chen, Muhan Zhang
Abstract: Synthetic tabular data generation has attracted growing attention due to its importance for data augmentation, foundation models, and privacy. However, real-world tabular datasets increasingly contain free-form text fields (e.g., reviews or clinical notes) alongside structured numerical and categorical attributes. Generating such heterogeneous tables with joint modeling of different modalities remains challenging. Existing approaches broadly fall into two categories: diffusion-based methods and LLM-based methods. Diffusion models can capture complex dependencies over numerical and categorical features in continuous or discrete spaces, but extending them to open-ended text is nontrivial and often leads to degraded text quality. In contrast, LLM-based generators naturally produce fluent text, yet their discrete tokenization can distort precise or wide-range numerical values, hindering accurate modeling of both numbers and language. In this work, we propose TabDLM, a unified framework for free-form tabular data generation via a joint numerical-language diffusion model built on masked diffusion language models. TabDLM models textual and categorical features through masked diffusion, while modeling numerical features with a continuous diffusion process through learned specialized numeric token embeddings; bidirectional attention then captures cross-modality interactions within a single model. Extensive experiments on diverse benchmarks demonstrate the effectiveness of TabDLM compared to strong diffusion- and LLM-based baselines.