A Sanity Check on Composed Image Retrieval

1School of Artificial Intelligence, Shanghai Jiao Tong University, China
2CMIC, Shanghai Jiao Tong University, China   

Our Motivation. We find that existing CIR models frequently struggle to retrieve the target image when confronted with an indeterminate composed query on mainstream benchmarks. To evaluate CIR models more accurately, we propose an evaluation suite to better monitor progress, comprising a novel CIR benchmark and an automated multi-round evaluation framework.

Abstract

Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that specifies the desired modification. Despite the rapid development of CIR models, their performance is not well characterized by existing benchmarks, which inherently contain indeterminate queries that degrade evaluation (i.e., multiple candidate images, rather than solely the target image, meet the query criteria), and which do not consider model effectiveness in multi-round systems. Motivated by this, we improve the evaluation procedure from two aspects: 1) we introduce FISD, a Fully-Informed Semantically-Diverse benchmark, which employs generative models to precisely control the variables of reference-target image pairs, enabling a more accurate evaluation of CIR methods across six dimensions, without query ambiguity; 2) we propose an automated multi-round agentic evaluation framework to probe the potential of existing models in interactive scenarios. By observing how models adapt and refine their choices over successive rounds of queries, this framework provides a more realistic appraisal of their efficacy in practical applications. Extensive experiments and comparisons demonstrate the value of our evaluation suite on representative CIR methods.

Method

Overall Structure

An overview of our FISD benchmark. The left side shows the data construction process, which includes two stages: caption generation and image generation. The right side displays example data for different semantic aspects, where the reference image is marked with an orange frame, the target image with a red frame, and hard-negative images with gray frames.
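The two-stage construction described above can be sketched as a simple pipeline. Everything here is a hypothetical stand-in: `generate_caption` represents the caption-generation stage, `generate_image` the generative image-editing stage, and the aspect string is illustrative rather than one of the benchmark's actual six dimensions.

```python
# Minimal sketch of the two-stage FISD construction pipeline.
# generate_caption / generate_image are hypothetical stand-ins for the
# caption-generation and image-generation stages described above.

def generate_caption(aspect: str) -> str:
    # Stage 1: propose a relative caption that changes exactly one
    # controlled variable (the chosen semantic aspect).
    return f"change the {aspect} of the main object"

def generate_image(reference: str, caption: str) -> str:
    # Stage 2: a generative model edits the reference image according
    # to the caption, yielding an unambiguous target image.
    return f"edit({reference}, '{caption}')"

def build_sample(reference: str, aspect: str) -> dict:
    # Compose both stages into one reference / caption / target triplet.
    caption = generate_caption(aspect)
    target = generate_image(reference, caption)
    return {"reference": reference, "caption": caption, "target": target}
```

Because the target is generated directly from the reference and the caption, only one gallery image can satisfy the composed query, which is what removes the ambiguity present in mined reference-target pairs.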

Overall Architecture

Overview of our automated multi-round evaluation framework. The user initially provides a reference image and a relative caption. The CIR model takes these inputs to generate a composed query feature, which is stored in the history list. The ranker then uses the history list and all image features to determine candidate images. Finally, the selected candidate image becomes the reference image for the next round and is fed into the user simulator to generate the relative caption for the subsequent round.
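The loop above can be sketched as follows. All three components are toy stand-ins: `compose_query` for the CIR model, `rank_images` for the ranker, and `simulate_user` for the user simulator; a real system would use learned image and text encoders in their place.

```python
# Sketch of the automated multi-round evaluation loop. compose_query,
# rank_images, and simulate_user are hypothetical stand-ins for the CIR
# model, the ranker, and the user simulator described in the caption.

def compose_query(ref_image: str, caption: str) -> list[float]:
    # Stand-in CIR model: fuse reference image and relative caption
    # into a deterministic toy feature vector.
    s = ref_image + caption
    return [sum(ord(c) for c in s[i::4]) % 97 / 97.0 for i in range(4)]

def rank_images(history: list[list[float]],
                gallery: dict[str, list[float]]) -> str:
    # Stand-in ranker: score each gallery image against the mean of all
    # composed query features accumulated in the history list.
    mean_q = [sum(f[i] for f in history) / len(history) for i in range(4)]
    def score(feat):
        return -sum((v - m) ** 2 for v, m in zip(feat, mean_q))
    return max(gallery, key=lambda name: score(gallery[name]))

def simulate_user(candidate: str, target: str) -> str:
    # Stand-in user simulator: describe how the candidate should change
    # to move toward the target.
    return f"make {candidate} look more like {target}"

def multi_round_eval(ref_image, caption, gallery, target, max_rounds=3):
    history = []
    for round_idx in range(1, max_rounds + 1):
        history.append(compose_query(ref_image, caption))  # store query feature
        candidate = rank_images(history, gallery)          # retrieve top candidate
        if candidate == target:
            return True, round_idx
        ref_image = candidate                              # candidate -> next reference
        caption = simulate_user(candidate, target)         # next relative caption
    return False, max_rounds
```

The key design point the sketch illustrates is that the history list accumulates across rounds, so the ranker conditions on the whole interaction rather than only the latest query.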

Results

Quantitative Results

Multi-round evaluation on various state-of-the-art CIR models across a range of benchmarks.

Qualitative Results


BibTeX

@article{liu2026versavit,
        title={VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization}, 
        author={Yikun Liu and Yuan Liu and Shangzhe Di and Haicheng Wang and Zhongyin Zhao and Le Tian and Xiao Zhou and Jie Zhou and Jiangchao Yao and Yanfeng Wang and Weidi Xie},
        journal={arXiv preprint arXiv:2602.09934}, 
        year={2026}
}