Mixtral of Experts



Original Paper: https://arxiv.org/abs/2401.04088

By: Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed

Abstract:

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts).

For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep.

As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks.

In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks.

We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks.

Both the base and instruct models are released under the Apache 2.0 license.

Summary Notes


Figure: Mixture of Experts Layer. Each input vector is assigned to 2 of the 8 experts by a router. The layer's output is the weighted sum of the outputs of the two selected experts. In Mixtral, an expert is a standard feedforward block as in a vanilla transformer architecture.
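Written out in the paper's notation, the layer output for an input token x is a weighted sum over the selected experts, with the weights given by a softmax over the router's top-K logits:

$$ y \;=\; \sum_{i=0}^{n-1} \mathrm{Softmax}\big(\mathrm{TopK}(x \cdot W_g)\big)_i \cdot \mathrm{SwiGLU}_i(x) $$

where W_g is the router's linear layer, n = 8 is the number of experts, K = 2 experts are kept per token, and the SwiGLU feedforward blocks are the experts themselves.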

Introduction

In the ever-evolving landscape of artificial intelligence, language models continue to push the boundaries of what's possible.

The recent introduction of Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model, is a testament to this progress. Developed by Mistral AI, Mixtral is designed to optimize performance while minimizing computational costs.

This blog post delves into the intricacies of Mixtral, exploring its architecture, key findings, and potential implications for the future of AI.


Key Methodologies

Mixtral leverages a Sparse Mixture of Experts (SMoE) architecture, which differentiates it from conventional dense models. Here's a breakdown of the core methodologies that underpin Mixtral:

  1. Mixture of Experts Layer: Each layer in Mixtral comprises eight feedforward blocks (experts). For every token, at every layer, a router network selects two of the eight experts to process it. This selective activation means only a fraction of the model's parameters is used for any given token, which keeps inference cost low.
  2. Gating Mechanism: A gating (router) network decides which experts handle each token. It applies a softmax over the top-K logits of a linear layer, so that only the K most relevant experts are activated per token; a minimal code sketch of this routing follows this list. This setup lets the model grow its total parameter count while keeping per-token compute roughly constant.
  3. Multilingual Pretraining: Mixtral is pretrained with a context size of 32k tokens on data that includes a significant proportion of multilingual text. The long context window and multilingual mix enable the model to excel at tasks requiring long-range dependencies and multilingual understanding.
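To make the routing concrete, here is a minimal PyTorch-style sketch of a Mixtral-like MoE layer with top-2 gating. It is an illustration under simplifying assumptions, not Mistral's reference implementation: the SwiGLUExpert module, the dimension names, and the per-slot routing loop are chosen for readability rather than speed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """A standard SwiGLU feedforward block (one 'expert')."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class MoELayer(nn.Module):
    """Sparse MoE layer: route each token to the top-k of n experts."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(SwiGLUExpert(d_model, d_ff) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                                # (num_tokens, n_experts)
        top_logits, top_idx = logits.topk(self.top_k, dim=-1)  # keep top-k logits per token
        weights = F.softmax(top_logits, dim=-1)                # softmax over the selected logits only
        out = torch.zeros_like(x)
        # Weighted sum of the selected experts' outputs (looped per slot for clarity, not speed).
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Because the softmax is taken only over the two retained logits, each token's output is a convex combination of exactly two expert outputs, which is the behavior shown in the figure above.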


Main Findings and Results

The Mixtral 8x7B model has demonstrated remarkable performance across various benchmarks, often outperforming larger models like Llama 2 70B and GPT-3.5. Here are some key highlights:

  1. Benchmark Performance: Mixtral matches or exceeds the performance of Llama 2 70B and GPT-3.5 across a wide range of benchmarks, including commonsense reasoning, world knowledge, reading comprehension, mathematics, and code generation. Notably, it outperforms Llama 2 70B by a wide margin on mathematics and code-related tasks.
  2. Efficiency: Despite using only 13B active parameters per token (compared to Llama 2 70B's 70B), Mixtral achieves superior or comparable results. This efficiency comes from its SMoE architecture, which activates only two of the eight experts per layer for each token; a back-of-the-envelope parameter count follows this list.
  3. Instruction Following: The fine-tuned Mixtral 8x7B – Instruct model surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B – chat in human evaluation benchmarks. This version also exhibits reduced biases and a more balanced sentiment profile.
  4. Multilingual Capabilities: Mixtral's performance on multilingual benchmarks is particularly strong. It outperforms Llama 2 70B in French, German, Spanish, and Italian, thanks to the upsampled proportion of multilingual data during pretraining.
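As a back-of-the-envelope check on the 47B-total vs. 13B-active figures, the sketch below tallies parameters from Mixtral 8x7B's publicly documented configuration (hidden size 4096, FFN size 14336, 32 layers, 32 query heads and 8 key-value heads of dimension 128, 32k vocabulary). The accounting is approximate, ignores norm weights, and is our reconstruction rather than the authors' own bookkeeping.

```python
# Approximate parameter counts for Mixtral 8x7B from its published configuration.
d_model, d_ff, n_layers, vocab = 4096, 14336, 32, 32000
n_heads, n_kv_heads, head_dim = 32, 8, 128
n_experts, top_k = 8, 2

embed = vocab * d_model                      # token embeddings
lm_head = vocab * d_model                    # output projection (untied)
attn = n_layers * (d_model * n_heads * head_dim            # W_q
                   + 2 * d_model * n_kv_heads * head_dim   # W_k, W_v (grouped-query attention)
                   + n_heads * head_dim * d_model)          # W_o
router = n_layers * d_model * n_experts      # one gating linear layer per block
expert = 3 * d_model * d_ff                  # SwiGLU: gate, up, and down projections

total = embed + lm_head + attn + router + n_layers * n_experts * expert
active = embed + lm_head + attn + router + n_layers * top_k * expert

print(f"total  ~ {total / 1e9:.1f}B")   # ~ 46.7B
print(f"active ~ {active / 1e9:.1f}B")  # ~ 12.9B
```

Since only two of the eight experts fire per layer, roughly a quarter of the expert parameters are touched per token; adding the shared attention, embedding, and router weights brings the active count to about 13B out of roughly 47B total, matching the figures quoted in the paper.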


Implications and Potential Applications

The innovative architecture and superior performance of Mixtral have several significant implications and potential applications:

  1. Scalability: Mixtral's SMoE architecture allows for scalability without a proportional increase in computational costs. This makes it a viable option for large-scale deployments where resource efficiency is crucial.
  2. Multilingual NLP: Given its strong performance in multilingual benchmarks, Mixtral can be a powerful tool for applications requiring multilingual natural language processing, such as translation services, global customer support, and cross-lingual information retrieval.
  3. Specialized Task Performance: Mixtral's proficiency in mathematics and code generation opens up possibilities for its use in educational tools, automated coding assistants, and complex problem-solving applications in various scientific domains.
  4. Bias Mitigation: The reduced biases observed in Mixtral–Instruct suggest its potential to develop fairer and more balanced AI systems, which is crucial in sensitive applications like hiring, law enforcement, and social media moderation.


Conclusion

Mixtral 8x7B represents a significant advancement in the field of language models. By effectively combining a Sparse Mixture of Experts architecture with extensive multilingual pretraining, it achieves remarkable performance while maintaining computational efficiency.

As the AI community continues to explore and refine these techniques, the potential applications of models like Mixtral are vast and varied.

The open release of Mixtral's weights ensures that researchers and developers worldwide can contribute to and benefit from this groundbreaking work, driving further innovation in AI.


For those interested in exploring Mixtral further, the model and its instruct version are available under the Apache 2.0 license, providing broad accessibility for both academic and commercial use.

Quote from Researchers
"Mixtral's architecture showcases how strategic parameter utilization can lead to significant performance gains without commensurate increases in computational costs," says Albert Q. Jiang, one of the lead researchers.

Future Research Directions:

While Mixtral has set a new benchmark, there are still areas ripe for exploration. Future research could focus on optimizing the routing mechanism to further reduce latency, improving load balancing across GPUs, and exploring additional fine-tuning techniques to enhance specialized task performance.


Suggested Visuals:

  1. Diagram of the Mixtral Architecture: An infographic highlighting the Sparse Mixture of Experts layer and the gating mechanism.
  2. Benchmark Performance Chart: A comparative graph showing Mixtral's performance against Llama 2 70B and GPT-3.5 across various benchmarks.


By sharing these insights, we hope to inspire further advancements in AI and contribute to the ongoing dialogue within the engineering community about the future of language models.
