Original Paper: https://arxiv.org/abs/2403.09611
By: Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, Yinfei Yang
Abstract
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision-language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, including both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.
Summary Notes
Figure: MM1 can perform in-context predictions thanks to its large-scale multimodal pre-training. This allows MM1 to (a) count objects and follow custom formatting, (b) refer to parts of the images and perform OCR, (c) demonstrate common-sense and word knowledge about everyday objects, and (d) perform basic math functions. Images are from the COCO 2014 validation set [72].
In the realm of artificial intelligence, the fusion of language understanding and image interpretation marks a significant milestone. Multimodal Large Language Models (MLLMs) are at the forefront of this evolution, integrating textual and visual data to produce coherent, context-aware outputs. This blog post delves into the intricate process of developing MM1, a state-of-the-art MLLM, highlighting the methodologies, key findings, and potential applications of this groundbreaking research.
Introduction
The advent of Large Language Models (LLMs) and Vision Foundation Models has transformed how machines interpret language and images. However, the next frontier lies in MLLMs, which merge these capabilities to achieve superior performance across diverse tasks. In this post, we explore the journey of creating MM1, a family of MLLMs that sets new benchmarks in few-shot learning and multi-image reasoning.
Key Methodologies
- Model Architecture: MM1 pairs a pre-trained image encoder with a vision-language connector that feeds visual tokens into the LLM. The image encoder is a Vision Transformer (ViT) pre-trained with a CLIP objective on large-scale image-text data, and the connector, using configurations such as the C-Abstractor, maps visual features into the LLM's token space with an efficient, fixed token budget (a minimal sketch of this pipeline follows this list).
- Data Composition: MM1's success hinges on a carefully constructed data mixture of image-caption pairs, interleaved image-text documents, and text-only sequences. This blend balances zero-shot and few-shot performance, letting the model retain language comprehension while excelling at multimodal tasks (see the sampling sketch after this list).
- Training Procedure: The recipe scales the model from 3 billion to 30 billion parameters and adds mixture-of-experts (MoE) variants that increase capacity while keeping inference cost roughly constant. The peak learning rate for the largest models is extrapolated from grid searches on smaller models, avoiding expensive hyperparameter sweeps at scale (see the scaling-fit sketch after this list).
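To make the architecture concrete, here is a minimal sketch (not the authors' code) of how a CLIP-pretrained ViT, a C-Abstractor-style connector, and the LLM embedding space fit together. The dimensions, token count, and pooling choice below are illustrative assumptions, not MM1's exact configuration.

```python
# Minimal sketch of an MM1-style pipeline: a ViT/CLIP image encoder produces
# patch features, a C-Abstractor-like connector pools them to a fixed number
# of visual tokens, and those tokens are projected into the LLM embedding
# space. vit_dim, llm_dim, and num_tokens are illustrative assumptions.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vit_dim=1024, llm_dim=4096, num_tokens=144):
        super().__init__()
        # Adaptive pooling reduces the patch sequence to num_tokens visual
        # tokens, loosely mimicking the abstractor's token reduction.
        self.pool = nn.AdaptiveAvgPool1d(num_tokens)
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats):           # (B, num_patches, vit_dim)
        x = patch_feats.transpose(1, 2)       # (B, vit_dim, num_patches)
        x = self.pool(x).transpose(1, 2)      # (B, num_tokens, vit_dim)
        return self.proj(x)                   # (B, num_tokens, llm_dim)

# Usage: patch features come from a CLIP-ViT encoder; the projected visual
# tokens are then concatenated with text embeddings before the LLM.
connector = VisionLanguageConnector()
visual_tokens = connector(torch.randn(2, 576, 1024))  # 576 patches ~ 336px / 14px
print(visual_tokens.shape)  # torch.Size([2, 144, 4096])
```

The data mixture can likewise be pictured as weighted sampling over the three source types. The sketch below is a toy illustration: the 45/45/10 weights reflect the interleaved / caption / text-only split described in the paper, while the sampling code itself is ours and the dataset iterators are placeholders.

```python
# Hedged sketch of weighted sampling across the three pre-training sources.
import random

MIXTURE_WEIGHTS = {
    "interleaved_image_text": 0.45,
    "image_caption_pairs":    0.45,
    "text_only":              0.10,
}

def sample_source(rng: random.Random) -> str:
    """Pick which data source the next training sequence is drawn from."""
    names, weights = zip(*MIXTURE_WEIGHTS.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIXTURE_WEIGHTS}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly proportional to the 45/45/10 mixture
```

Learning-rate extrapolation amounts to a simple log-log fit over small-model grid searches. The sketch below shows the idea only; the (parameter count, learning rate) pairs and the resulting fit are placeholder values, not the paper's measurements.

```python
# Hedged sketch of extrapolating the peak learning rate from small-model
# grid searches via a power-law (linear-in-log-space) fit.
import numpy as np

# (parameter count, best peak LR found by grid search) -- illustrative values
small_model_results = [
    (9e6, 2.5e-4),
    (85e6, 1.2e-4),
    (302e6, 8.0e-5),
    (1.2e9, 5.0e-5),
]

log_n = np.log([n for n, _ in small_model_results])
log_lr = np.log([lr for _, lr in small_model_results])
slope, intercept = np.polyfit(log_n, log_lr, deg=1)  # fit lr = exp(b) * N**a

def predicted_lr(num_params: float) -> float:
    """Extrapolate the peak LR for a larger model from the power-law fit."""
    return float(np.exp(intercept + slope * np.log(num_params)))

print(f"predicted peak LR for a 30B-parameter model: {predicted_lr(30e9):.2e}")
```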
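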
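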
Main Findings and Results
- Image Resolution and Tokens: Image resolution has the largest impact of the ablated design choices; 336 px images outperform lower resolutions across metrics, and increasing the number of image tokens fed to the LLM further improves performance on visually complex inputs (the arithmetic sketch after this list shows how resolution and token count are linked).
- Data Mixing Strategy: The inclusion of interleaved data is pivotal for few-shot and text-only tasks, while caption data is essential for zero-shot performance. A balanced mix of these data types ensures robust multimodal capabilities.
- Model Scaling and MoE: Scaling MM1 to 30 billion parameters, along with MoE implementations, results in a family of models that outperform existing MLLMs in few-shot evaluations on tasks like captioning and visual question answering.
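A quick back-of-the-envelope calculation shows why resolution and image-token count are coupled: a ViT with 14 px patches (as in common CLIP encoders, assumed here for illustration) produces (resolution / 14)² patch tokens, so going from 224 px to 336 px more than doubles the token count before any pooling by the connector.

```python
# Quick arithmetic: the number of patch tokens a ViT produces grows
# quadratically with image resolution. Patch size 14 is an assumption
# matching common CLIP-ViT variants.
def vit_patch_tokens(resolution: int, patch_size: int = 14) -> int:
    return (resolution // patch_size) ** 2

for res in (224, 336, 448):
    print(f"{res}px -> {vit_patch_tokens(res)} patch tokens")
# 224px -> 256, 336px -> 576, 448px -> 1024
```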
Implications and Applications
The development of MM1 opens new avenues for applications across various domains:
- Enhanced In-Context Learning: MM1's ability to perform in-context predictions and multi-image reasoning is invaluable for applications in education, content creation, and interactive AI systems.
- Few-Shot Chain-of-Thought Reasoning: The model's proficiency in few-shot learning enables it to tackle complex reasoning tasks with minimal examples, making it suitable for scientific research, legal analysis, and decision support systems.
- Real-World Integration: From autonomous navigation to personalized virtual assistants, MM1's advanced multimodal understanding can be integrated into real-world applications, enhancing user interaction and decision-making processes.
Conclusion
The journey of building MM1 underscores the importance of integrating diverse data types, optimizing model architectures, and employing scalable training techniques. As MLLMs continue to evolve, the insights gained from MM1 will guide future developments, ensuring these models remain at the cutting edge of AI capabilities.
In the ever-expanding landscape of artificial intelligence, MM1 stands as a testament to the power of merging language and vision, paving the way for more intelligent, versatile, and human-like AI systems.