3/2/2026 | USA | technology | ✓ Verified - arxiv.org

Hyperdimensional Cross-Modal Alignment of Frozen Language and Image Models for Efficient Image Captioning

#hyperdimensional computing #cross-modal alignment #frozen language model #frozen image model #image captioning #binding #bundling #similarity retrieval #zero‑shot baseline #semantic grounding

📌 Key Takeaways

Introduction of HDFLIM, a framework that aligns frozen language and image models without parameter updates.
Use of hyperdimensional computing to project unimodal embeddings into a common high‑dimensional space.
Implementation of lightweight symbolic operations—binding, bundling, and similarity‑based retrieval—to create cross‑modal associations in one pass over the data.
Caption generation is achieved via high‑dimensional memory retrieval rather than iterative gradient optimization.
HDFLIM achieves performance on par with end‑to‑end vision‑language training and produces captions that are more semantically grounded than zero‑shot baselines.
Results suggest a new paradigm for foundation‑model alignment that relies on structured representational mappings instead of large‑scale retraining.
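
The three operations named above can be illustrated with a toy sketch. This is generic hyperdimensional computing, not the authors' implementation: the dimensionality `DIM`, the random sign projection, the helper names (`make_projection`, `encode`, `bind`, `bundle`, `retrieve`), and the random vectors standing in for frozen-encoder outputs are all assumptions made for illustration. The retrieval step picks among three fixed labels, which is far simpler than generating a caption.

```python
import numpy as np

DIM = 10_000  # hypervector size; an assumed value, not taken from the paper


def make_projection(in_dim, dim=DIM, seed=0):
    """Fixed random Gaussian matrix; it is never trained."""
    return np.random.default_rng(seed).standard_normal((in_dim, dim))


def encode(embedding, projection):
    """Map dense embedding(s) to bipolar (+1/-1) hypervector(s)."""
    return np.where(embedding @ projection >= 0, 1, -1).astype(np.int32)


def bind(a, b):
    """Binding: elementwise product, self-inverse for bipolar vectors."""
    return a * b


def bundle(vectors):
    """Bundling: elementwise sum that superimposes several hypervectors."""
    return np.sum(vectors, axis=0)


def retrieve(memory, image_hv, text_hvs, labels):
    """Unbind the memory with an image key, return the most similar label."""
    noisy_text = bind(memory, image_hv)
    scores = text_hvs @ noisy_text / image_hv.size
    return labels[int(np.argmax(scores))]


# Stand-ins for frozen-encoder outputs (random, purely illustrative).
rng = np.random.default_rng(42)
labels = ["dog", "cat", "car"]
image_embs = rng.standard_normal((3, 128))
text_embs = rng.standard_normal((3, 64))

img_proj = make_projection(128, seed=1)
txt_proj = make_projection(64, seed=2)
image_hvs = encode(image_embs, img_proj)
text_hvs = encode(text_embs, txt_proj)

# One pass over the pairs: bind each image to its text, then bundle.
memory = bundle([bind(i, t) for i, t in zip(image_hvs, text_hvs)])

# Query with a slightly perturbed copy of the second image embedding.
query = encode(image_embs[1] + 0.05 * rng.standard_normal(128), img_proj)
result = retrieve(memory, query, text_hvs, labels)
print(result)
```

No gradients or parameter updates appear anywhere in the sketch: the projections are fixed, and the memory is built by a single sum over the image-text pairs. That is the property the takeaways attribute to HDFLIM, shown here only in miniature.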

📖 Full Retelling

Abhishek Dalvi and Vasant Honavar submitted their paper to arXiv on 27 February 2026, in the Computer Vision and Pattern Recognition category. It presents HDFLIM (HyperDimensional computing with Frozen Language and Image Models), a framework that aligns frozen language and image foundation models for image captioning by projecting their embeddings into a shared hyperdimensional space. The approach aims to avoid the computational cost of multimodal fine‑tuning and the risk of perturbing pretrained representations, while reaching caption quality comparable to end‑to‑end vision‑language training methods.

🏷️ Themes

Foundation models, Cross‑modal alignment, Hyperdimensional computing, Frozen model integration, Computational efficiency in multimodal learning, Symbolic operations in deep learning


Original Source

arXiv:2602.23588 (Computer Science > Computer Vision and Pattern Recognition), submitted on 27 Feb 2026

Title: Hyperdimensional Cross-Modal Alignment of Frozen Language and Image Models for Efficient Image Captioning

Authors: Abhishek Dalvi, Vasant Honavar

Abstract: Large unimodal foundation models for vision and language encode rich semantic structures, yet aligning them typically requires computationally intensive multimodal fine-tuning. Such approaches depend on large-scale parameter updates, are resource intensive, and can perturb pretrained representations. Emerging evidence suggests, however, that independently trained foundation models may already exhibit latent semantic compatibility, reflecting shared structures in the data they model. This raises a fundamental question: can cross-modal alignment be achieved without modifying the models themselves? Here we introduce HDFLIM (HyperDimensional computing with Frozen Language and Image Models), a framework that establishes cross-modal mappings while keeping pretrained vision and language models fully frozen. HDFLIM projects unimodal embeddings into a shared hyperdimensional space and leverages lightweight symbolic operations -- binding, bundling, and similarity-based retrieval to construct associative cross-modal representations in a single pass over the data. Caption generation emerges from high-dimensional memory retrieval rather than iterative gradient-based optimization. We show that HDFLIM achieves performance comparable to end-to-end vision-language training methods and produces captions that are more semantically grounded than zero-shot baselines. By decoupling alignment from parameter tuning, our results suggest that semantic mapping across foundation models can be realized through symbolic operations on hyperdime...


Source

arxiv.org
