Given a user prompt, a trainable MLLM can perform routing across a wide range of specialized models. In this design, the MLLM solves multimodal understanding tasks (e.g., VQA) directly with its inherent capabilities, while allocating appropriate specialist models to handle multimodal generative and classic vision tasks (e.g., image generation and depth estimation); it then aggregates the results and delivers a response to the user.
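To make this concept concrete, here is a minimal sketch of the routing idea, assuming a hypothetical controller interface; the function and task names are illustrative and not the actual Olympus API. The controller either answers an understanding query directly or returns a route to a specialist model.

```python
# Minimal sketch of the routing concept (all names are illustrative,
# not the actual Olympus API): the controller MLLM answers understanding
# queries directly and emits a route to a specialist model otherwise.

from dataclasses import dataclass
from typing import Optional


@dataclass
class Route:
    task: str     # e.g. "image_generation", "depth_estimation"
    prompt: str   # refined prompt forwarded to the specialist


def controller(user_prompt: str) -> Optional[Route]:
    """Hypothetical stand-in for the trained MLLM controller.
    Returns None when the MLLM should answer directly (e.g. VQA),
    otherwise a Route describing which specialist to call."""
    lowered = user_prompt.lower()
    if "generate an image" in lowered:
        return Route("image_generation", user_prompt)
    if "depth" in lowered:
        return Route("depth_estimation", user_prompt)
    return None  # handled by the MLLM's own multimodal understanding


def respond(user_prompt: str) -> str:
    route = controller(user_prompt)
    if route is None:
        return f"[MLLM answer to] {user_prompt}"
    # In the real system, the specialist's output would be aggregated
    # into the final response returned to the user.
    return f"[{route.task} output for] {route.prompt}"


print(respond("What color is the car in this photo?"))
print(respond("Please generate an image of a sunset over mountains."))
```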
We introduce Olympus, a new approach that transforms Multimodal Large Language Models (MLLMs) into a unified framework capable of handling a wide array of computer vision tasks. Using a controller MLLM, Olympus delegates over 20 specialized tasks across images, videos, and 3D objects to dedicated modules. This instruction-based routing enables complex workflows through chained actions without the need to train heavy generative models. Olympus integrates easily with existing MLLMs, expanding their capabilities while maintaining comparable performance. Experimental results demonstrate that Olympus achieves an average routing accuracy of 94.75% across 20 tasks and a precision of 91.82% in chained-action scenarios, showcasing its effectiveness as a universal task router.
The framework of Olympus. Olympus solves tasks such as VQA directly through the MLLM's inherent capabilities. For other tasks, e.g., image editing, Olympus generates a response consisting of task-specific routing tokens and refined prompts, which are then used to schedule specialist models that address diverse user requests.
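The sketch below illustrates how such a response might be parsed and dispatched; it is an assumption-laden example, not the Olympus implementation. The token format (e.g., `<image_edit> {...}`), the specialist registry, and the model stubs are all hypothetical, chosen only to show how routing tokens plus refined prompts can schedule specialists, including chained actions when a response contains several tokens.

```python
# Illustrative sketch (not the actual Olympus implementation) of parsing
# routing tokens and refined prompts from an MLLM response and dispatching
# them to specialist models. The token format, registry, and specialist
# stubs are assumptions for demonstration.

import re
from typing import Callable, Dict, List, Tuple

# Hypothetical registry mapping routing tokens to specialist callables.
SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "<image_gen>": lambda p: f"[generated image for: {p}]",
    "<image_edit>": lambda p: f"[edited image for: {p}]",
    "<depth_est>": lambda p: f"[depth map for: {p}]",
}

# Assumed token format: "<task_token> {refined prompt}".
TOKEN_PATTERN = re.compile(r"(<[a-z_]+>)\s*\{(.*?)\}")


def parse_routes(mllm_response: str) -> List[Tuple[str, str]]:
    """Extract (routing_token, refined_prompt) pairs, e.g.
    '<image_edit> {make the sky pink}' -> ('<image_edit>', 'make the sky pink')."""
    return TOKEN_PATTERN.findall(mllm_response)


def dispatch(mllm_response: str) -> List[str]:
    """Schedule the specialist model for each parsed route; multiple
    routes in one response correspond to a chain of actions."""
    outputs = []
    for token, refined_prompt in parse_routes(mllm_response):
        specialist = SPECIALISTS.get(token)
        if specialist is not None:
            outputs.append(specialist(refined_prompt))
    return outputs


# A response with two routing tokens illustrates a chained workflow.
response = "<image_gen> {a cat on a skateboard} <image_edit> {add a red hat}"
print(dispatch(response))
```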
Olympus addresses a broad spectrum of vision tasks across images, videos, and even 3D content, covering over 20 different tasks.
Multimodal evaluation across 11 distinct benchmarks.
Evaluation results on OlympusBench under the single-task setting.
Evaluation results on OlympusBench under the chain-of-action setting.
@article{lin2024olympus,
  title={Olympus: A Universal Task Router for Computer Vision Tasks},
  author={Lin, Yuanze and Li, Yunsheng and Chen, Dongdong and Xu, Weijian and Clark, Ronald and Torr, Philip HS},
  journal={arXiv preprint arXiv:2412.09612},
  year={2024}
}