This monograph presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants. The research landscape encompasses five core topics, categorized into two classes. (i) We start with a survey of well-established research areas: multimodal foundation models pre-trained for specific purposes, including two topics – methods of learning vision backbones for visual understanding and text-to-image generation. (ii) Then, we present recent advances in exploratory, open research areas: multimodal foundation models that aim to play the role of general-purpose assistants, including three topics – unified vision models inspired by large language models (LLMs), end-to-end training of multimodal LLMs, and chaining multimodal tools with LLMs. The target audiences of the monograph are researchers, graduate students, and professionals in computer vision and vision-language multimodal communities who are eager to learn the basics and recent advances in multimodal foundation models.
Li, C., Gan, Z., Yang, Z., Yang, J., Li, L., Wang, L., & Gao, J. (2024). Multimodal Foundation Models: From Specialists to General-Purpose Assistants. Foundations and Trends in Computer Graphics and Vision, 16(1–2), 1–214. https://doi.org/10.1561/0600000110