Zero-shot Composed Text-Image Retrieval


Yikun Liu1,2
Jiangchao Yao1,3
Ya Zhang1,3

Yanfeng Wang1,3
Weidi Xie1,3

1CMIC, Shanghai Jiao Tong University
2Beijing University of Posts and Telecommunications
3Shanghai AI Lab

Code [GitHub]

Paper [arXiv]

Cite [BibTeX]


Abstract

In this paper, we consider the problem of composed image retrieval (CIR), with the goal of developing models that can understand and combine multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending users' expressive capability. We make the following contributions: (i) we introduce a scalable pipeline to automatically construct datasets for training CIR models, by simply exploiting a large-scale dataset of image-text pairs, e.g., a subset of LAION-5B; (ii) we introduce a transformer-based adaptive aggregation model, TransAgg, which employs a simple yet efficient fusion mechanism to adaptively combine information from diverse modalities; (iii) we conduct extensive ablation studies to investigate the usefulness of our proposed data construction procedure and the effectiveness of the core components in TransAgg; (iv) when evaluated on publicly available benchmarks under the zero-shot scenario, i.e., training on the automatically constructed datasets and then directly performing inference on target downstream datasets, e.g., CIRR and FashionIQ, our proposed approach either performs on par with or significantly outperforms the existing state-of-the-art (SOTA) models.



Architecture

An overview of our proposed architecture, which consists of a visual encoder, a text encoder, a Transformer module, and an adaptive aggregation module.
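To make the four components concrete, below is a minimal PyTorch-style sketch of a TransAgg-like model. The encoder choices, feature dimensions, and the learned weighted-sum aggregation are illustrative assumptions based on the description above, not the exact released implementation.

```python
# Minimal sketch of a TransAgg-like model (illustrative only): CLIP-like
# encoders, a Transformer fusion module, and an adaptive aggregation head
# that learns to weight the fused, image, and text features.
import torch
import torch.nn as nn


class TransAggSketch(nn.Module):
    def __init__(self, visual_encoder, text_encoder, dim=512, num_layers=2, num_heads=8):
        super().__init__()
        self.visual_encoder = visual_encoder  # e.g., a CLIP image encoder (assumed)
        self.text_encoder = text_encoder      # e.g., a CLIP text encoder (assumed)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Adaptive aggregation: predict one weight each for the fused, image, and text features.
        self.agg = nn.Linear(dim, 3)

    def forward(self, reference_image, relative_caption_tokens):
        img_feat = self.visual_encoder(reference_image)        # (B, D)
        txt_feat = self.text_encoder(relative_caption_tokens)  # (B, D)
        tokens = torch.stack([img_feat, txt_feat], dim=1)      # (B, 2, D)
        fused = self.fusion(tokens).mean(dim=1)                # (B, D)
        # Adaptively combine the three sources with softmax-normalised weights.
        w = torch.softmax(self.agg(fused), dim=-1)             # (B, 3)
        feats = torch.stack([fused, img_feat, txt_feat], dim=1)
        query = (w.unsqueeze(-1) * feats).sum(dim=1)           # (B, D)
        return nn.functional.normalize(query, dim=-1)
```

Retrieval then reduces to cosine similarity between this composed-query embedding and the embeddings of candidate target images produced by the same visual encoder.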



The Dataset Construction Procedure

An overview of our proposed dataset construction procedure, based on sentence templates (left) or large language models (right). To train the CIR model, we need to construct a dataset of triplet samples, i.e., (reference image, relative caption, target image). Specifically, we start from LAION-COCO, which contains a massive number of image-caption pairs, and then edit the captions with sentence templates or large language models to retrieve the target images.
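The sketch below illustrates the template-based branch of this procedure. The template strings, the caption-editing helper, and the retrieval helper are hypothetical placeholders; the paper's exact editing rules and LLM prompts may differ.

```python
# Illustrative sketch of template-based triplet construction:
# edit a caption, describe the edit with a template, then retrieve the
# image whose caption best matches the edited caption.
import random

TEMPLATES = [
    "replace {source} with {target}",
    "show {target} instead of {source}",
]

def build_triplet(reference_image, reference_caption, edit_fn, retrieve_fn):
    """Return one (reference image, relative caption, target image) triplet."""
    # 1. Edit the caption, e.g., swap one noun or attribute for another.
    edited_caption, source_word, target_word = edit_fn(reference_caption)
    # 2. The relative caption describes the change, filled from a template.
    relative_caption = random.choice(TEMPLATES).format(
        source=source_word, target=target_word)
    # 3. Retrieve the target image whose caption best matches the edited
    #    caption (e.g., nearest neighbour in a text-embedding space over LAION-COCO).
    target_image = retrieve_fn(edited_caption)
    return reference_image, relative_caption, target_image
```

The LLM-based branch follows the same three steps, but replaces the template fill-in with a language model that rewrites the caption and generates the relative caption directly.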



Results

We compare our method with various zero-shot composed image retrieval methods on CIRR and FashionIQ. As shown in Table 3, on the CIRR dataset our proposed model achieves state-of-the-art results on all metrics except Recall@50. On the FashionIQ dataset, our TransAgg model trained on the automatically constructed dataset also ranks among the top two models, performing competitively with the concurrent work CompoDiff [9]. Note that CompoDiff is trained on over 18M triplet samples, while ours is trained on only 16k/32k triplets, making it significantly more data-efficient than CompoDiff.
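For reference, the Recall@K numbers reported above follow the standard retrieval protocol: embed each composed query, rank the gallery by cosine similarity, and check whether the ground-truth target appears in the top K. The snippet below is a generic sketch of that computation, not the official evaluation code.

```python
# Generic Recall@K for composed image retrieval (tensor names are illustrative).
import torch

def recall_at_k(query_embs, gallery_embs, target_indices, k=10):
    """query_embs: (N, D) L2-normalised composed-query embeddings.
    gallery_embs: (M, D) L2-normalised candidate-image embeddings.
    target_indices: (N,) index of the ground-truth target in the gallery."""
    sims = query_embs @ gallery_embs.t()                    # (N, M) cosine similarities
    topk = sims.topk(k, dim=1).indices                      # (N, k) retrieved indices
    hits = (topk == target_indices.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```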



Acknowledgements

Based on a template by Phillip Isola and Richard Zhang.