LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

Yikun Liu1,2, Pingan Chen3, Jiayin Cai3, Xiaolong Jiang3, Yao Hu3,
Jiangchao Yao2, Yanfeng Wang1, Weidi Xie1
1School of Artificial Intelligence, Shanghai Jiao Tong University, China
2CMIC, Shanghai Jiao Tong University, China    3Xiaohongshu Inc., China
Teaser

The LamRA framework empowers Large Multimodal Models (LMMs) with advanced retrieval and reranking capabilities. (a) LamRA equips LMMs with universal retrieval and reranking abilities by inserting lightweight LoRA modules into the model. (b) Examples across varied retrieval tasks illustrate LamRA's ability to handle diverse query and candidate modalities. (c) Performance comparison on the M-BEIR test set shows LamRA's superior results across a wide range of retrieval tasks; for instance, \( q^t \to c^i \) denotes text-to-image retrieval.
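For a concrete picture of how the lightweight adapters are attached, the sketch below wraps a pretrained LMM backbone with LoRA modules using HuggingFace PEFT. This is an illustration only, not the authors' exact configuration: the checkpoint path, target modules, and LoRA hyper-parameters are placeholders.

# Minimal sketch (illustrative): insert LoRA adapters into a pretrained backbone.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint name; LamRA builds on an off-the-shelf LMM backbone.
backbone = AutoModelForCausalLM.from_pretrained("path/to/lmm-backbone")

# Hypothetical LoRA hyper-parameters, chosen for illustration only.
lora_cfg = LoraConfig(
    r=16,                      # low-rank dimension
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(backbone, lora_cfg)
model.print_trainable_parameters()  # only the LoRA weights are updated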

Abstract

With the rapid advancement of multimodal information retrieval, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models, often those trained with image-text contrastive learning. In this paper, we explore the possibility of re-purposing generative Large Multimodal Models (LMMs) for retrieval. This approach enables unifying all retrieval tasks under the same formulation and, more importantly, allows for extrapolation towards unseen retrieval tasks without additional training. Our contributions can be summarised in the following aspects: (i) We introduce LamRA, a versatile framework designed to empower LMMs with sophisticated retrieval and reranking capabilities. (ii) For retrieval, we adopt a two-stage training strategy comprising language-only pre-training and multimodal instruction tuning to progressively enhance the LMM's retrieval performance. (iii) For reranking, we employ joint training for both pointwise and listwise reranking, offering two distinct ways to further boost retrieval performance. (iv) Extensive experimental results underscore the efficacy of our method in handling more than ten retrieval tasks, demonstrating robust performance in both supervised and zero-shot settings, including scenarios involving previously unseen retrieval tasks.
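To illustrate the unified formulation, the sketch below ranks candidates by cosine similarity between query and candidate embeddings produced by the same LMM. The instruction-wrapping and last-token pooling shown here are assumptions made for illustration, not necessarily the paper's exact recipe; `lmm` and `tokenizer` are placeholder HuggingFace objects.

# Minimal sketch: embedding-based retrieval with a generative LMM (assumptions noted above).
import torch
import torch.nn.functional as F

@torch.no_grad()
def embed(lmm, tokenizer, text: str, instruction: str) -> torch.Tensor:
    # Prepend a task-specific instruction, then pool the last hidden state of the final token.
    inputs = tokenizer(f"{instruction}\n{text}", return_tensors="pt")
    hidden = lmm(**inputs, output_hidden_states=True).hidden_states[-1]
    return F.normalize(hidden[:, -1], dim=-1)  # (1, D), L2-normalised

def retrieve(lmm, tokenizer, query: str, candidates: list[str], instruction: str, k: int = 5):
    q = embed(lmm, tokenizer, query, instruction)
    c = torch.cat([embed(lmm, tokenizer, x, instruction) for x in candidates])
    scores = (q @ c.T).squeeze(0)  # cosine similarity, since embeddings are unit norm
    return scores.topk(min(k, len(candidates))).indices.tolist()

Because the same formulation covers every task, switching tasks only changes the instruction string and the candidate pool, which is what allows extrapolation to unseen retrieval tasks without additional training.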

Method

Overall Structure

Overview of the proposed LamRA framework. LamRA consists of two components: LamRA-Ret and LamRA-Rank. The top section illustrates LamRA-Ret, encompassing both the pre-training and instruction-tuning stages, where contrastive learning is employed to enhance the retrieval capability of LMMs. The pre-training stage improves the LMM's feature extraction capability through text-to-text retrieval, while the instruction-tuning stage adapts the LMM to various retrieval tasks by fine-tuning on diverse tasks with task-specific instructions. The bottom section depicts the joint training process of LamRA-Rank, which integrates both pointwise and listwise reranking.
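To make the two training signals concrete, the sketch below pairs a symmetric in-batch contrastive (InfoNCE) loss, of the kind used to train LamRA-Ret, with a pointwise reranking score read from the logits of "Yes"/"No" answer tokens, one common way to score relevance with a generative model. The temperature, prompt format, and answer-token choices are illustrative assumptions rather than the paper's exact settings.

# Minimal sketch: contrastive retrieval loss and a pointwise reranking score (assumptions noted above).
import torch
import torch.nn.functional as F

def info_nce(q_emb: torch.Tensor, c_emb: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """Symmetric in-batch contrastive loss; q_emb and c_emb are (B, D), L2-normalised, row-aligned."""
    logits = q_emb @ c_emb.T / tau                      # (B, B) similarity matrix
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

@torch.no_grad()
def pointwise_score(lmm, tokenizer, prompt: str, yes_token: str = "Yes", no_token: str = "No") -> float:
    """Score a (query, candidate) prompt by the probability of the model answering 'Yes'."""
    inputs = tokenizer(prompt, return_tensors="pt")
    next_logits = lmm(**inputs).logits[0, -1]           # logits for the next token
    yes_id = tokenizer.convert_tokens_to_ids(yes_token)
    no_id = tokenizer.convert_tokens_to_ids(no_token)
    return torch.softmax(next_logits[[yes_id, no_id]], dim=-1)[0].item()

In this view, pointwise reranking scores each retrieved candidate independently, while listwise reranking presents several candidates in a single prompt and asks the model to order them; the two objectives are trained jointly in LamRA-Rank.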

Results

Quantitative Results

Comparison with state-of-the-art methods on the M-BEIR test set. The first row indicates the retrieval task type: \( q^t \) for text queries, \( q^i \) for image queries, \( c^t \) for text candidates, and \( c^i \) for image candidates. Abbreviations used include VN for VisualNews, F200K for Fashion200K, InfoS for InfoSeek, and FIQ for FashionIQ. Evaluation follows the UniIR protocol: FashionIQ and Fashion200K use Recall@10, while all other tasks use Recall@5. The best and second-best numbers are shown in red and blue, respectively. For more detailed quantitative results, please refer to the paper.
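For reference, Recall@K counts a query as a hit if any relevant candidate appears among its top-K retrieved results. The small helper below is illustrative, not the official evaluation script.

# Minimal sketch: Recall@K as used in the table (Recall@10 for FashionIQ/Fashion200K, Recall@5 elsewhere).
def recall_at_k(retrieved: list[list[int]], relevant: list[set[int]], k: int) -> float:
    hits = sum(1 for preds, gold in zip(retrieved, relevant) if set(preds[:k]) & gold)
    return hits / len(retrieved)

# Example with two queries: the first finds its relevant item, the second does not.
print(recall_at_k([[3, 7, 1, 9, 4], [5, 2, 8, 0, 6]], [{9}, {11}], k=5))  # 0.5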

Qualitative Results


BibTeX

@article{liu2024lamra,
  title={LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant},
  author={Yikun Liu and Pingan Chen and Jiayin Cai and Xiaolong Jiang and Yao Hu and Jiangchao Yao and Yanfeng Wang and Weidi Xie},
  journal={arXiv preprint arXiv:2412.01720},
  year={2024}
}