DMCD: Semantic-Statistical Framework for Causal Discovery
#DMCD #Causal Discovery #Large Language Model #Semantic Reasoning #Statistical Validation #Directed Acyclic Graph #Conditional Independence Testing
📌 Key Takeaways
- DMCD integrates LLM-based semantic drafting with statistical validation for causal discovery
- The framework uses a large language model to create a sparse draft DAG as a semantically informed prior
- Phase II audits and refines this draft through conditional independence testing (a test of this kind is sketched after this list)
- DMCD achieved competitive or leading performance against diverse baselines, with particularly large gains in recall and F1 score
- Improvements come from semantic reasoning rather than memorization of benchmark graphs
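To make the Phase II primitive concrete, here is a minimal sketch of a conditional independence test of the kind such an audit could use: Fisher's z test on a partial correlation, implemented with NumPy and SciPy. The function name and toy data are illustrative assumptions; the paper summarized here does not specify which CI test DMCD uses.

```python
# A generic conditional independence (CI) test sketch, not code from the DMCD paper.
import numpy as np
from scipy import stats


def fisher_z_ci_test(data: np.ndarray, i: int, j: int, cond: list[int],
                     alpha: float = 0.05) -> bool:
    """Return True if columns i and j of `data` look conditionally independent
    given the columns in `cond`, using Fisher's z on the partial correlation."""
    cols = [i, j] + list(cond)
    corr = np.corrcoef(data[:, cols], rowvar=False)
    prec = np.linalg.pinv(corr)                          # precision matrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])   # partial correlation of i, j given cond
    r = float(np.clip(r, -0.999999, 0.999999))
    n = data.shape[0]
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(cond) - 3)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return p_value > alpha                               # fail to reject independence


# Toy check on the chain X -> Z -> Y: X and Y are marginally dependent
# but conditionally independent given Z.
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
z = x + 0.5 * rng.normal(size=5000)
y = z + 0.5 * rng.normal(size=5000)
d = np.column_stack([x, y, z])
print(fisher_z_ci_test(d, 0, 1, []))    # False: marginally dependent
print(fisher_z_ci_test(d, 0, 1, [2]))   # True: independent once Z is conditioned on
```

On the toy chain the test behaves as expected: it rejects independence marginally and accepts it once the mediator is in the conditioning set, which is exactly the kind of discrepancy signal a Phase II audit would act on.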
📖 Full Retelling
Samarth KaPatel, Sofia Nikiforova, Giacinto Paolo Saggese, and Paul Smith introduced DMCD (DataMap Causal Discovery), a two-phase causal discovery framework, on arXiv on February 23, 2026, aiming to improve causal structure learning by combining semantic reasoning with statistical validation. In Phase I, a large language model analyzes variable metadata and proposes a sparse draft Directed Acyclic Graph (DAG), which serves as a semantically informed prior over possible causal structures. In Phase II, the draft is audited and refined through conditional independence testing, with detected discrepancies guiding targeted edge revisions. This hybrid approach pairs the contextual understanding of modern language models with the statistical rigor of traditional causal discovery methods. The researchers evaluated the framework on three metadata-rich real-world benchmarks spanning industrial engineering, environmental monitoring, and IT systems analysis, where DMCD achieved competitive or leading performance against diverse causal discovery baselines, with particularly large gains in recall and F1 score. Probing and ablation experiments suggest these gains come from semantic reasoning over metadata rather than memorization of benchmark graphs.
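The paragraph above describes the two phases at a high level; the sketch below shows, under stated assumptions, how they could fit together in code. Everything here is hypothetical: `llm_propose_edges` stands in for an LLM call, the prompt wording is invented, and the refinement rule (drop a drafted edge when its endpoints test as independent given the effect's other drafted parents) is one plausible reading of "discrepancies guide targeted edge revisions", not the authors' implementation.

```python
# Hypothetical sketch of a DMCD-style two-phase loop; names and rules are illustrative.
from typing import Callable
import numpy as np


def phase_one_draft(metadata: dict[str, str],
                    llm_propose_edges: Callable[[str], list[tuple[str, str]]]) -> set[tuple[str, str]]:
    """Phase I: prompt an LLM with variable names and descriptions and collect a
    sparse set of proposed (cause, effect) edges as a semantically informed prior."""
    prompt = ("Variables:\n"
              + "\n".join(f"- {name}: {desc}" for name, desc in metadata.items())
              + "\nList only plausible direct cause->effect pairs; keep the graph sparse and acyclic.")
    return set(llm_propose_edges(prompt))


def phase_two_refine(draft: set[tuple[str, str]], names: list[str], data: np.ndarray,
                     ci_independent: Callable[[np.ndarray, int, int, list[int]], bool]) -> set[tuple[str, str]]:
    """Phase II: audit each drafted edge with a conditional independence test and
    drop edges whose endpoints are independent given the effect's other drafted
    parents; such discrepancies drive the targeted revisions."""
    idx = {name: k for k, name in enumerate(names)}
    refined = set(draft)
    for cause, effect in sorted(draft):
        other_parents = [idx[c] for c, e in draft if e == effect and c != cause]
        if ci_independent(data, idx[cause], idx[effect], other_parents):
            refined.discard((cause, effect))  # semantic edge not supported by the data
    return refined


# Toy wiring: a stand-in "LLM" that over-proposes one spurious edge, and a crude
# marginal-correlation threshold standing in for a proper CI test (e.g. the
# Fisher z sketch earlier in this article).
rng = np.random.default_rng(1)
temp = rng.normal(size=20_000)
load = temp + 0.3 * rng.normal(size=20_000)
noise = rng.normal(size=20_000)
data = np.column_stack([temp, load, noise])
names = ["temperature", "cpu_load", "fan_noise"]
metadata = {n: f"description of {n}" for n in names}

fake_llm = lambda prompt: [("temperature", "cpu_load"), ("fan_noise", "cpu_load")]
corr_check = lambda d, i, j, cond: abs(np.corrcoef(d[:, i], d[:, j])[0, 1]) < 0.05

draft = phase_one_draft(metadata, fake_llm)
print(phase_two_refine(draft, names, data, corr_check))
# Expected: {('temperature', 'cpu_load')} -- the spurious fan_noise edge is dropped.
```

In a real pipeline the threshold stand-in would be replaced by a proper CI test such as the Fisher z sketch above, and the revision step could also propose additions or reorientations rather than only deletions; the skeleton only illustrates the draft-then-audit structure.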
🏷️ Themes
Artificial Intelligence, Causal Discovery, Machine Learning
📚 Related People & Topics
Large language model
Type of machine learning model
A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation. The largest and most capable LLMs are generative pre-trained transformers (GPTs).
Entity Intersection Graph
Connections for Large language model:
- Educational technology (4 shared)
- Reinforcement learning (3 shared)
- Machine learning (2 shared)
- Artificial intelligence (2 shared)
- Benchmark (2 shared)
Original Source
Computer Science > Artificial Intelligence
arXiv:2602.20333 [cs.AI] (Submitted on 23 Feb 2026)
Title: DMCD: Semantic-Statistical Framework for Causal Discovery
Authors: Samarth KaPatel, Sofia Nikiforova, Giacinto Paolo Saggese, Paul Smith
Abstract: We present DMCD (DataMap Causal Discovery), a two-phase causal discovery framework that integrates LLM-based semantic drafting from variable metadata with statistical validation on observational data. In Phase I, a large language model proposes a sparse draft DAG, serving as a semantically informed prior over the space of possible causal structures. In Phase II, this draft is audited and refined via conditional independence testing, with detected discrepancies guiding targeted edge revisions. We evaluate our approach on three metadata-rich real-world benchmarks spanning industrial engineering, environmental monitoring, and IT systems analysis. Across these datasets, DMCD achieves competitive or leading performance against diverse causal discovery baselines, with particularly large gains in recall and F1 score. Probing and ablation experiments suggest that these improvements arise from semantic reasoning over metadata rather than memorization of benchmark graphs. Overall, our results demonstrate that combining semantic priors with principled statistical verification yields a high-performing and practically effective approach to causal structure learning.
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.20333 [cs.AI] (arXiv:2602.20333v1 [cs.AI] for this version)
DOI: https://doi.org/10.48550/arXiv.2602.20333 (arXiv-issued DOI via DataCite, pending registration)
Submission history: From Giacinto Paolo Saggese; [v1] Mon, 23 Feb 2026 20:29:35 UTC (138 KB)
Read full article at source