VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization

Yikun Liu1,2, Yuan Liu3, Shangzhe Di1,2, Haicheng Wang1,2, Zhongyin Zhao3,
Le Tian3, Xiao Zhou3, Jie Zhou3, Jiangchao Yao2, Yanfeng Wang1, Weidi Xie1
1School of Artificial Intelligence, Shanghai Jiao Tong University, China
2CMIC, Shanghai Jiao Tong University, China    3WeChat AI, Tencent Inc., China

Overcoming the dense feature limitation of the vision backbone within MLLMs. The vision encoder inside an MLLM (top) yields strong VQA performance but suboptimal dense features. Our multi-task collaborative post-training (bottom) overcomes this limitation by comprehensively enhancing the vision encoder's capabilities.

Abstract

Multimodal Large Language Models (MLLMs) have recently achieved remarkable success in vision-language understanding, demonstrating strong high-level semantic alignment within their vision encoders. An important question thus arises: Can these encoders serve as versatile vision backbones, capable of reliably performing classic vision-centric tasks as well? To address this question, we make the following contributions: (i) We identify that the vision encoders within MLLMs exhibit deficiencies in their dense feature representations, as evidenced by their suboptimal performance on dense prediction tasks (e.g., semantic segmentation, depth estimation). (ii) We propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task collaborative post-training framework. This framework optimizes the vision backbone via lightweight task heads under multi-granularity supervision. (iii) Extensive experiments across various downstream tasks demonstrate the effectiveness of our method, yielding a versatile vision backbone suited for both language-mediated reasoning and pixel-level understanding.

Method

Overall Structure

Overview of the proposed multi-task collaborative training framework. The framework jointly trains on three distinct tasks: VQA and Image Captioning, Monocular Depth Estimation, and Image Referring Segmentation. By incorporating lightweight task heads, this collaborative training strategy enhances the representational capabilities of the underlying vision backbone.
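
The project page does not include training code, so the following is only a minimal PyTorch sketch of how a shared backbone with lightweight per-task heads and a weighted multi-task loss could be wired together. The class name MultiTaskWrapper, the specific head architectures, and the loss weights are illustrative assumptions, not the paper's actual implementation (the real heads, e.g. the language model for VQA/captioning and the decoders for depth and referring segmentation, are richer).

import torch
import torch.nn as nn


class MultiTaskWrapper(nn.Module):
    """Shared vision backbone with lightweight per-task heads (illustrative sketch)."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_seg_classes: int):
        super().__init__()
        self.backbone = backbone                          # ViT taken from the MLLM, kept trainable
        # Hypothetical lightweight heads; the paper's actual heads are not reproduced here.
        self.depth_head = nn.Conv2d(feat_dim, 1, kernel_size=1)
        self.seg_head = nn.Conv2d(feat_dim, num_seg_classes, kernel_size=1)
        self.vision_proj = nn.Linear(feat_dim, feat_dim)  # connector feeding the language model

    def forward(self, images: torch.Tensor) -> dict:
        # Assumes the backbone returns a dense feature map of shape (B, C, H, W).
        feats = self.backbone(images)
        tokens = feats.flatten(2).transpose(1, 2)         # (B, H*W, C) visual tokens for the LLM
        return {
            "visual_tokens": self.vision_proj(tokens),
            "depth": self.depth_head(feats),
            "seg_logits": self.seg_head(feats),
        }


def multi_task_loss(outputs: dict, targets: dict, weights=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """Weighted sum of the three task losses; the weights are placeholders."""
    loss_vqa = targets["lm_loss"]                         # autoregressive loss from the language head
    loss_depth = nn.functional.l1_loss(outputs["depth"], targets["depth"])
    loss_seg = nn.functional.cross_entropy(outputs["seg_logits"], targets["seg_mask"])
    return weights[0] * loss_vqa + weights[1] * loss_depth + weights[2] * loss_seg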

Results

Quantitative Results

Comparison of various methods on the OpenCompass benchmarks, along with linear-probing results for semantic segmentation and monocular depth estimation using frozen backbones. Our method achieves superior performance across all evaluated tasks, indicating that the multi-task collaborative training framework strengthens the vision backbone for both high-level semantic understanding and dense prediction.
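
For reference, here is a minimal sketch of a standard linear-probing protocol for semantic segmentation with a frozen backbone, assuming the backbone yields a dense feature map of shape (B, C, H, W). The function name, hyperparameters, and data-loader interface are hypothetical and not taken from the paper's evaluation code.

import torch
import torch.nn as nn


def linear_probe_segmentation(backbone: nn.Module, feat_dim: int, num_classes: int,
                              loader, epochs: int = 10, lr: float = 1e-3):
    """Linear probing: freeze the backbone, train only a 1x1 convolution head.

    `loader` is assumed to yield (image, label_mask) pairs with integer class masks.
    """
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)                           # keep the backbone frozen

    head = nn.Conv2d(feat_dim, num_classes, kernel_size=1)
    opt = torch.optim.AdamW(head.parameters(), lr=lr)

    for _ in range(epochs):
        for images, masks in loader:
            with torch.no_grad():
                feats = backbone(images)                  # dense features, no gradients
            logits = head(feats)
            logits = nn.functional.interpolate(           # upsample to label resolution
                logits, size=masks.shape[-2:], mode="bilinear", align_corners=False)
            loss = nn.functional.cross_entropy(logits, masks)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head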

Qualitative Results


BibTeX

@article{liu2026versavit,
  title={VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization},
  author={Yikun Liu and Yuan Liu and Shangzhe Di and Haicheng Wang and Zhongyin Zhao and Le Tian and Xiao Zhou and Jie Zhou and Jiangchao Yao and Yanfeng Wang and Weidi Xie},
  journal={arXiv preprint arXiv:2602.09934},
  year={2026}
}