Multimodal Generative Retrieval Model with Staged Pretraining for Food Delivery on Meituan

2/9/2026 | USA | technology

#Meituan #Multimodal Retrieval #Generative AI #Food Delivery #Staged Pretraining #Search Optimization #arXiv

📌 Key Takeaways

Meituan researchers introduced a multimodal generative retrieval model to improve food delivery search accuracy.
The model addresses the 'modality dominance' issue, where certain modalities (text, for example) dominate training and overshadow others such as visual data.
A staged pretraining strategy was implemented to ensure balanced learning across different data types.
The new architecture moves beyond traditional dual-tower models to provide more precise and diverse search results.

📖 Full Retelling

Researchers and engineers at Meituan, China’s leading food delivery and retail platform, published a technical paper on the arXiv preprint server in February 2026 detailing a multimodal generative retrieval model designed to improve search precision for food delivery. The team developed the architecture to align diverse user queries with a vast inventory of food items and merchants by drawing on both visual and textual data. The work responds to growing demand for more personalized and accurate search results in high-frequency e-commerce environments, where traditional search methods often fail to capture the nuances of user intent.

The paper highlights a flaw in mainstream retrieval systems, which typically use a "dual-tower" architecture. In these models, queries and items are processed separately through dedicated towers, and intra-tower and inter-tower tasks are optimized jointly. The Meituan researchers observed that this joint optimization often leads to modality dominance: one type of data, such as text, dominates training, and the model underuses other information such as food imagery or merchant metadata.

To overcome this limitation, the proposed model introduces a staged pretraining strategy that balances the influence of the different modalities. By decoupling the training phases, the system aims to ensure that both visual features and semantic text descriptions contribute to retrieval. This allows the generative model to better predict and surface relevant results, whether user queries are vague or highly specific. The deployment of this technology at Meituan is a notable step in applying generative AI to real-world logistics and consumer service platforms.

Ultimately, the research suggests that the future of digital marketplaces lies in the careful integration of multimodal data. For a platform like Meituan, which handles millions of orders daily, precise, context-aware retrieval can improve both user satisfaction and merchant visibility. The staged pretraining framework not only improves search accuracy but also provides a sturdier foundation for future AI-driven features in the competitive food delivery market.
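
The dual-tower setup and the idea of staging the training objectives can be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: the source excerpt does not describe the encoders, the losses, or how the stages are split, so every name here (`item_embedding`, `in_batch_contrastive_loss`, `JOINT_SCHEDULE`, `STAGED_SCHEDULE`, the loss names, and the `text_weight` parameter) is a hypothetical stand-in. The in-batch softmax loss is a common choice for inter-tower training and is assumed here rather than taken from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale to unit length; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def item_embedding(text_vec, image_vec, text_weight=0.5):
    # Hypothetical item tower: weighted sum of normalized text and image vectors.
    # A text_weight near 1.0 mimics text dominating the fused representation.
    t, im = normalize(text_vec), normalize(image_vec)
    fused = [text_weight * a + (1.0 - text_weight) * b for a, b in zip(t, im)]
    return normalize(fused)


def in_batch_contrastive_loss(query_vecs, item_vecs, temperature=0.1):
    # Assumed inter-tower objective: softmax cross-entropy where the item at
    # the same index as the query is the positive and the rest are negatives.
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [dot(q, item) / temperature for item in item_vecs]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(query_vecs)


# Joint optimization: every objective is active in a single phase.
JOINT_SCHEDULE = [
    {"name": "joint",
     "active_losses": ["image_text_alignment", "query_item_contrastive"]},
]

# Staged alternative (hypothetical split): intra-tower first, inter-tower second.
STAGED_SCHEDULE = [
    {"name": "stage_1_intra_tower", "active_losses": ["image_text_alignment"]},
    {"name": "stage_2_inter_tower", "active_losses": ["query_item_contrastive"]},
]


def stage_loss(losses, stage):
    # Sum only the loss terms that the given stage activates.
    return sum(losses[name] for name in stage["active_losses"])


queries = [normalize([1.0, 0.0]), normalize([0.0, 1.0])]
items = [
    item_embedding([1.0, 0.0], [0.9, 0.1]),
    item_embedding([0.0, 1.0], [0.1, 0.9]),
]
aligned_loss = in_batch_contrastive_loss(queries, items)
swapped_loss = in_batch_contrastive_loss(queries, items[::-1])
```

In this toy setup the loss is lower when each query is paired with its matching item than when the items are swapped. The two schedules differ only in which loss terms are summed per phase, which is the sense in which a staged schedule keeps one objective from swamping another during a given phase.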

🏷️ Themes

Artificial Intelligence, E-commerce, Machine Learning

📚 Related People & Topics

Generative artificial intelligence

Subset of AI using generative models

Generative artificial intelligence (also referred to as generative AI or GenAI) is a specialized subfield of artificial intelligence focused on the creation of original content. Utilizing advanced generative models, these systems are capable ...

Meituan

Chinese group buying and food delivery website

Meituan (Chinese: 美团; pinyin: Měituán; formerly Meituan–Dianping) is a Chinese technology company that operates a platform for local services including on‑demand food delivery, in‑store services and consumer reviews under the Dazhong Dianping brand, hotel and travel bookings, and instant retail. The...

🔗 Entity Intersection Graph

Connections for Generative artificial intelligence:

🌐 Machine learning (4 shared articles)
🌐 Large language model (3 shared articles)
🌐 ChatGPT (3 shared articles)
🏢 Databricks (2 shared articles)
🌐 Software as a service (2 shared articles)
🌐 Meta (2 shared articles)
🌐 Artificial intelligence (2 shared articles)
🌐 Chatbot (2 shared articles)
🌐 Apple (2 shared articles)
🏢 OpenAI (2 shared articles)
🏢 Enterprise software (1 shared article)
👤 Ali Ghodsi (1 shared article)

📄 Original Source Content

arXiv:2602.06654v1. Abstract: Multimodal retrieval models are becoming increasingly important in scenarios such as food delivery, where rich multimodal features can meet diverse user needs and enable precise retrieval. Mainstream approaches typically employ a dual-tower architecture between queries and items, and perform joint optimization of intra-tower and inter-tower tasks. However, we observe that joint optimization often leads to certain modalities dominating the traini
