POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch

Yikun Liu*1,2, Yuan Liu*3, Le Tian3, Xiao Zhou3,
Jiangchao Yao2, Yanfeng Wang1, Weidi Xie1
1School of Artificial Intelligence, Shanghai Jiao Tong University
2CMIC, Shanghai Jiao Tong University    3WeChat AI, Tencent
Teaser

Overview of POINTS-Seeker. Our model adaptively interacts with external tools to solve complex, multi-hop VQA tasks. By leveraging V-Fold compression, stale history is rendered into compact visual tokens, effectively bypassing long-context performance degradation while achieving superior efficiency and reasoning fidelity.

Abstract

While Large Multimodal Models (LMMs) demonstrate impressive visual perception, they remain epistemically constrained by their static parametric knowledge. To transcend these boundaries, multimodal search models actively interact with the external environment to retrieve evidence. Diverging from prevailing paradigms that merely retrofit general LMMs with search tools as modular extensions, we explore the potential of building a multimodal agentic search model from scratch. Specifically, we make the following contributions: (i) we introduce Agentic Seeding, a dedicated phase that instills the foundational capabilities needed to elicit agentic behaviors; (ii) we uncover a performance bottleneck in long-horizon interactions, where the growing volume of interaction history overwhelms the model's ability to locate ground-truth evidence. To mitigate this, we propose V-Fold, an adaptive history-aware compression scheme that preserves recent dialogue turns in high fidelity while folding historical context into the visual space via rendering; and (iii) we develop POINTS-Seeker-8B, a state-of-the-art multimodal agentic search model that consistently outperforms existing models across six diverse benchmarks, effectively resolving the challenges of long-horizon, knowledge-intensive visual reasoning.

Method

Overall Structure

The overall framework of our proposed agentic search system. (a) Training stages including Agentic Seeding, trajectory-based SFT, and tool-augmented RL for policy optimization. (b) V-Fold adaptively renders stale interaction history into visual tokens to mitigate performance degradation in long-context scenarios while preserving reasoning efficacy through history-aware selective rendering.
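The history-aware selective rendering described above can be sketched at the logic level: recent turns are kept verbatim while stale turns are folded into a fixed visual-token budget. This is a minimal illustrative sketch, not the authors' implementation; the names (`fold_history`, `keep_recent`, `tokens_per_turn`, `folded_tokens`) and the fixed-budget accounting are assumptions for illustration.

```python
def fold_history(turns, keep_recent=2, tokens_per_turn=512, folded_tokens=64):
    """Split interaction history into high-fidelity recent turns and a
    folded placeholder for stale turns; return (context, token_count)."""
    if len(turns) <= keep_recent:
        # Short histories need no folding.
        return list(turns), len(turns) * tokens_per_turn

    stale, recent = turns[:-keep_recent], turns[-keep_recent:]
    # In the actual scheme the stale turns would be rendered into an image
    # and re-encoded as visual tokens; here a placeholder marks the fold.
    folded = f"<folded_image: {len(stale)} turns rendered>"
    context = [folded] + recent
    # Folded history costs a fixed visual-token budget rather than growing
    # linearly with the number of past turns.
    token_count = folded_tokens + len(recent) * tokens_per_turn
    return context, token_count
```

Under this accounting, the context length stays bounded by `folded_tokens + keep_recent * tokens_per_turn` no matter how long the interaction runs, which is the efficiency argument made for V-Fold above.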

Results

Quantitative Results

Comparison with state-of-the-art models.

Qualitative Results


BibTeX

@article{liu2024lamra,
        title={LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant}, 
        author={Yikun Liu and Pingan Chen and Jiayin Cai and Xiaolong Jiang and Yao Hu and Jiangchao Yao and Yanfeng Wang and Weidi Xie},
        journal={arXiv preprint arXiv:2412.01720}, 
        year={2024}
}