Multimodal Large Language Models (MLLMs) have recently achieved remarkable success in vision-language understanding, demonstrating strong high-level semantic alignment within their vision encoders. An important question thus arises: Can these encoders serve as versatile vision backbones, capable of reliably performing classic vision-centric tasks as well? To address this question, we make the following contributions: (i) We identify that the vision encoders within MLLMs exhibit deficiencies in their dense feature representations, as evidenced by their suboptimal performance on dense prediction tasks (e.g., semantic segmentation, depth estimation). (ii) We propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task collaborative post-training framework, which optimizes the vision backbone through lightweight task heads with multi-granularity supervision. (iii) Extensive experiments across diverse downstream tasks demonstrate the effectiveness of our method, yielding a versatile vision backbone suited for both language-mediated reasoning and pixel-level understanding.
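To make the shared-backbone-plus-lightweight-heads idea concrete, below is a minimal PyTorch sketch of multi-task post-training: a single vision encoder is fine-tuned jointly through small per-task heads supervised at different granularities (here, segmentation and depth). This is an illustrative assumption of the setup, not the paper's implementation; all module names, dimensions, and loss weights are hypothetical.

```python
# Minimal sketch (not the released VersaViT code) of multi-task collaborative
# post-training: a shared vision backbone is optimized through lightweight task
# heads, each supervised at a different granularity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyViTBackbone(nn.Module):
    """Stand-in for an MLLM vision encoder: patch embedding + transformer blocks."""
    def __init__(self, img_size=224, patch=16, dim=384, depth=4, heads=6):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.grid = img_size // patch
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                                        # x: (B, 3, H, W)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.encoder(tokens)                              # dense patch features


class DenseHead(nn.Module):
    """Lightweight head: project patch tokens to a per-pixel prediction map."""
    def __init__(self, dim, out_ch, grid):
        super().__init__()
        self.proj = nn.Conv2d(dim, out_ch, kernel_size=1)
        self.grid = grid

    def forward(self, feats, out_size):
        b, n, d = feats.shape
        fmap = feats.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        return F.interpolate(self.proj(fmap), size=out_size,
                             mode="bilinear", align_corners=False)


backbone = ToyViTBackbone()
seg_head = DenseHead(384, out_ch=21, grid=backbone.grid)    # semantic segmentation
depth_head = DenseHead(384, out_ch=1, grid=backbone.grid)   # monocular depth

params = (list(backbone.parameters()) + list(seg_head.parameters())
          + list(depth_head.parameters()))
optimizer = torch.optim.AdamW(params, lr=1e-5)  # small LR: post-training, not from scratch

# One toy optimization step with random tensors standing in for a multi-task batch.
images = torch.randn(2, 3, 224, 224)
seg_gt = torch.randint(0, 21, (2, 224, 224))
depth_gt = torch.rand(2, 1, 224, 224)

feats = backbone(images)                              # shared dense features
seg_loss = F.cross_entropy(seg_head(feats, (224, 224)), seg_gt)
depth_loss = F.l1_loss(depth_head(feats, (224, 224)), depth_gt)
loss = seg_loss + 0.5 * depth_loss                    # illustrative task weighting
loss.backward()
optimizer.step()
```

Because the heads are small and trained jointly, gradients from every task flow back into the same encoder, which is the mechanism by which the dense features of the backbone improve while its original semantic alignment is retained.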
@article{liu2026versavit,
  title={VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization},
  author={Yikun Liu and Yuan Liu and Shangzhe Di and Haicheng Wang and Zhongyin Zhao and Le Tian and Xiao Zhou and Jie Zhou and Jiangchao Yao and Yanfeng Wang and Weidi Xie},
  journal={arXiv preprint arXiv:2602.09934},
  year={2026}
}