Olympus: A Universal Task Router for Computer Vision Tasks

1University of Oxford      2Microsoft     

Given a user prompt, a trainable MLLM can route requests across a wide range of specialist models. In this design, the MLLM solves multimodal understanding tasks (e.g., VQA) with its inherent capabilities, while it delegates multimodal generative and classic vision tasks (e.g., image generation and depth estimation) to appropriate specialist models, then aggregates their results and delivers a response to the user.

Abstract

We introduce Olympus, a new approach that transforms Multimodal Large Language Models (MLLMs) into a unified framework capable of handling a wide array of computer vision tasks. Utilizing a controller MLLM, Olympus delegates over 20 specialized tasks across images, videos, and 3D objects to dedicated modules. This instruction-based routing enables complex workflows through chained actions without the need for training heavy generative models. Olympus integrates easily with existing MLLMs, expanding their capabilities with comparable performance. Experimental results demonstrate that Olympus achieves an average routing accuracy of 94.75% across 20 tasks and a precision of 91.82% in chained-action scenarios, showcasing its effectiveness.

Method

The framework of Olympus. Tasks such as VQA are solved directly through the inherent capabilities of the MLLM. For other tasks, e.g., image editing, Olympus generates a response consisting of task-specific routing tokens and refined prompts, which are then used to schedule specialist models to address diverse user requests.
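The routing step described above can be sketched as follows. This is a minimal illustration, not the actual Olympus implementation: the token format (`<task>refined prompt</task>`), the task names, and the specialist registry are all assumptions for demonstration purposes.

```python
import re

# Hypothetical specialist registry; the names and callables are
# illustrative stand-ins, not the modules used by Olympus.
SPECIALISTS = {
    "image_gen": lambda prompt: f"[image generated for: {prompt}]",
    "depth": lambda prompt: f"[depth map for: {prompt}]",
}

# Assumed routing-token format: <task>refined prompt</task>.
ROUTING_TOKEN = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

def dispatch(mllm_response: str) -> list[str]:
    """Parse the controller MLLM's response and schedule specialists.

    If the response contains no routing tokens, it is treated as a
    direct answer from the MLLM itself (e.g., a VQA answer). Otherwise,
    each (task, refined prompt) pair is sent to the matching specialist
    in order, which also covers chained-action requests.
    """
    matches = ROUTING_TOKEN.findall(mllm_response)
    if not matches:
        return [mllm_response]
    return [
        SPECIALISTS[task](prompt.strip())
        for task, prompt in matches
        if task in SPECIALISTS
    ]
```

A chained-action request simply yields multiple routing tokens in one response, e.g. `dispatch("<image_gen>a dog</image_gen><depth>the generated image</depth>")` invokes both specialists in sequence.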

Dataset Statistics


Versatile Capabilities

Olympus addresses a broad spectrum of vision tasks, covering over 20 different tasks across images, videos, and even 3D content.

Multimodal Understanding Performance


Multimodal evaluation across 11 distinct benchmarks.

Task Routing Performance

Evaluation results on OlympusBench under the single-task setting.

Evaluation results on OlympusBench under the chain-of-action setting.

Diverse Applications

More Results

Human Evaluation

BibTeX

@article{lin2024olympus,
  title={Olympus: A Universal Task Router for Computer Vision Tasks},
  author={Lin, Yuanze and Li, Yunsheng and Chen, Dongdong and Xu, Weijian and Clark, Ronald and Torr, Philip HS},
  journal={arXiv preprint arXiv:2412.09612},
  year={2024}
}