BravenNow
K-Gen: A Multimodal Language-Conditioned Approach for Interpretable Keypoint-Guided Trajectory Generation


#K-Gen #multimodal #language-conditioned #keypoint-guided #trajectory generation #interpretable AI #robotics

📌 Key Takeaways

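  • K-Gen is a multimodal framework that generates trajectories for autonomous driving simulation from rasterized BEV map inputs and textual scene descriptions.
  • It builds on Multimodal Large Language Models (MLLMs) rather than relying only on structured inputs such as vectorized maps.
  • Instead of predicting full trajectories directly, it generates interpretable keypoints with reasoning that reflects agent intentions; a refinement module then turns them into trajectories.
  • Keypoint generation is further tuned with T-DAPO, a trajectory-aware reinforcement fine-tuning algorithm, and the authors report that K-Gen outperforms existing baselines on WOMD and nuPlan.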
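
To make the keypoint-then-refine idea concrete, here is a minimal sketch in Python. It is not the K-Gen refinement module, which the abstract does not specify; plain linear interpolation stands in for it. The `(time_s, x_m, y_m)` keypoint layout, the function name `densify_keypoints`, and the `dt` parameter are all assumptions made for illustration.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical keypoint layout: (time in seconds, x in metres, y in metres).
Keypoint = Tuple[float, float, float]


def densify_keypoints(keypoints: List[Keypoint], dt: float) -> List[Keypoint]:
    """Linearly interpolate sparse keypoints into a fixed-rate trajectory.

    Illustrative stand-in only: K-Gen uses its own refinement module,
    whose details are not given in the abstract.
    """
    if len(keypoints) < 2:
        raise ValueError("need at least two keypoints")
    if dt <= 0:
        raise ValueError("dt must be positive")

    pts = sorted(keypoints)  # order by time
    times = [p[0] for p in pts]
    if any(b <= a for a, b in zip(times, times[1:])):
        raise ValueError("keypoint times must be strictly increasing")

    t0, t1 = times[0], times[-1]
    n_steps = int(round((t1 - t0) / dt))

    trajectory: List[Keypoint] = []
    for i in range(n_steps + 1):
        # Clamp so the final sample never runs past the last keypoint.
        t = min(t0 + i * dt, t1)
        # Index of the segment end: first keypoint strictly after t.
        j = max(1, min(bisect_right(times, t), len(pts) - 1))
        (ta, xa, ya), (tb, xb, yb) = pts[j - 1], pts[j]
        w = (t - ta) / (tb - ta)  # interpolation weight in [0, 1]
        trajectory.append((t, xa + w * (xb - xa), ya + w * (yb - ya)))
    return trajectory


if __name__ == "__main__":
    # Three made-up keypoints: drive 10 m forward, then 10 m sideways.
    sparse = [(0.0, 0.0, 0.0), (2.0, 10.0, 0.0), (4.0, 10.0, 10.0)]
    for point in densify_keypoints(sparse, dt=1.0):
        print(point)
```

The appeal of this kind of split is that the sparse keypoints stay small enough for a person to inspect, while the dense trajectory is derived from them in a separate step.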

📖 Full Retelling

arXiv:2603.04868v1. Generating realistic and diverse trajectories is a critical challenge in autonomous driving simulation. While Large Language Models (LLMs) show promise, existing methods often rely on structured data like vectorized maps, which fail to capture the rich, unstructured visual context of a scene. To address this, the authors propose K-Gen, an interpretable keypoint-guided multimodal framework that leverages Multimodal Large Language Models (MLLMs) to unify rasterized BEV map inputs with textual scene descriptions.

🏷️ Themes

AI Trajectory Generation, Multimodal Learning


Original Source
arXiv:2603.04868 [Submitted on 5 Mar 2026]

Title: K-Gen: A Multimodal Language-Conditioned Approach for Interpretable Keypoint-Guided Trajectory Generation

Authors: Mingxuan Mu, Guo Yang, Lei Chen, Ping Wu, Jianxun Cui

Abstract: Generating realistic and diverse trajectories is a critical challenge in autonomous driving simulation. While Large Language Models show promise, existing methods often rely on structured data like vectorized maps, which fail to capture the rich, unstructured visual context of a scene. To address this, we propose K-Gen, an interpretable keypoint-guided multimodal framework that leverages Multimodal Large Language Models to unify rasterized BEV map inputs with textual scene descriptions. Instead of directly predicting full trajectories, K-Gen generates interpretable keypoints along with reasoning that reflects agent intentions, which are subsequently refined into accurate trajectories by a refinement module. To further enhance keypoint generation, we apply T-DAPO, a trajectory-aware reinforcement fine-tuning algorithm. Experiments on WOMD and nuPlan demonstrate that K-Gen outperforms existing baselines, highlighting the effectiveness of combining multimodal reasoning with keypoint-guided trajectory generation.

Subjects: Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.04868 [cs.AI] (or arXiv:2603.04868v1 [cs.AI] for this version)

DOI: https://doi.org/10.48550/arXiv.2603.04868 (arXiv-issued DOI via DataCite, pending registration)

Submission history: From Prof. Jianxun Cui. [v1] Thu, 5 Mar 2026 06:48:12 UTC (3,377 KB)

Source

arxiv.org
