Original Paper: https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf
By: Gemma Team, Google DeepMind
Abstract:
This work introduces Gemma, a family of lightweight, state-of-the-art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations.
Summary Notes
Figure: Language understanding and generation performance of Gemma 7B across different capabilities compared to similarly sized open models. We group together standard academic benchmark evaluations by capability and average the respective scores; see Table 6 for a detailed breakdown of performance.
In the ever-evolving landscape of artificial intelligence, the development of powerful and efficient language models remains a cornerstone. Today, we're excited to introduce Gemma, a family of state-of-the-art open models derived from the research and technology behind Google's Gemini models. With impressive capabilities and a focus on safety and responsibility, Gemma is set to redefine what's possible with lightweight language models.
What is Gemma?
Gemma represents a significant leap forward in the field of language models. Built on the foundation of Google's Gemini models, Gemma offers robust performance across various academic benchmarks, including language understanding, reasoning, and safety. The family includes two model sizes—2 billion and 7 billion parameters—catering to different computational needs and applications. Both pretrained and fine-tuned checkpoints are available, providing a versatile toolkit for developers and researchers.
Key Features:
- Strong Performance: Outperforms similarly sized open models on 11 out of 18 text-based tasks.
- Versatile Deployment: Available in two sizes (2B and 7B parameters) for different computational environments.
- Safety and Responsibility: Comprehensive evaluations of safety and responsibility aspects.
How Gemma Works: Key Methodologies
Gemma models are built using advanced architectures and training techniques inspired by the Gemini family. Let's dive into the core methodologies that make Gemma stand out:
Transformer Architecture
At the heart of Gemma lies the transformer decoder architecture, a proven framework for handling various natural language processing (NLP) tasks. The models are trained on a context length of 8192 tokens, ensuring they can manage extensive and complex inputs effectively.
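As a rough illustration, a decoder-only setup like the one described above is often captured in a small configuration object. Only the 8192-token context length below comes from the paper; the remaining field names and values are hypothetical placeholders, not Gemma's actual hyperparameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    """Hypothetical decoder-only transformer configuration.

    Only context_length is taken from the Gemma report; the other
    values are illustrative placeholders.
    """
    context_length: int = 8192  # training context length (from the paper)
    num_layers: int = 28        # placeholder
    d_model: int = 3072         # placeholder
    num_heads: int = 16         # placeholder


cfg = DecoderConfig()
```

A frozen dataclass keeps the hyperparameters immutable and self-documenting; real training stacks typically load such a config from a file rather than hard-coding it.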
Training Infrastructure
Gemma models are trained using TPUv5e chips, deployed in massive pods configured into a 2D torus. For the 7B model, training spans 16 pods (totaling 4096 TPUv5e chips), while the 2B model utilizes 2 pods (512 TPUv5e chips). This robust infrastructure allows for efficient and scalable model training.
Improvements and Innovations
Several key enhancements have been integrated into the transformer architecture:
- Multi-Query Attention: The 2B model uses multi-query attention (a single key/value head shared across all query heads), which works well at smaller scales; the 7B model retains standard multi-head attention.
- RoPE Embeddings: Replaces absolute positional embeddings with rotary positional embeddings in each layer; input and output embeddings are also shared to reduce model size.
- GeGLU Activations: Replaces the standard ReLU non-linearity with the GeGLU activation function for better performance.
- RMSNorm: Stabilizes training by normalizing the input of each transformer sub-layer (both the attention and feed-forward layers).
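To make two of these components concrete, here is a minimal NumPy sketch of RMSNorm and a GeGLU feed-forward gate. The formulas are the standard published definitions; the function names, the tanh approximation of GELU, and the toy dimensions are choices of this note, not Gemma's actual implementation.

```python
import numpy as np


def rms_norm(x, weight=None, eps=1e-6):
    """RMSNorm: scale x by the reciprocal root-mean-square of its features."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    out = x / rms
    return out if weight is None else out * weight


def gelu(x):
    """Tanh approximation of the GELU non-linearity."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def geglu(x, w_gate, w_up):
    """GeGLU: a GELU-gated linear unit, GELU(x @ W_gate) * (x @ W_up)."""
    return gelu(x @ w_gate) * (x @ w_up)


# Toy usage with illustrative dimensions.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))    # (batch, d_model)
h = rms_norm(x)                # normalized sub-layer input
y = geglu(h, rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))
```

In a real transformer block, the normalization is applied to the sub-layer input and the two GeGLU projections are learned weight matrices; here they are random only to show the shapes.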
Data and Pretraining
Gemma models are trained on vast datasets comprising web documents, mathematics, and code. The 2B and 7B models are trained on 3 trillion and 6 trillion tokens, respectively. The training data is filtered rigorously to minimize the risk of unwanted or unsafe outputs.
Filtering Techniques:
- Heuristics and Model-Based Classifiers: Used to remove harmful or low-quality content.
- Contamination Analyses: Training data is checked against evaluation sets to reduce the risk of benchmark leakage.
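As a hedged sketch of what a heuristic quality filter in such a pipeline might look like: the function below drops very short documents and documents dominated by non-alphabetic characters. The rules and thresholds are invented for illustration; the report does not publish its actual filtering heuristics.

```python
def passes_heuristics(text, min_chars=200, min_alpha_ratio=0.6):
    """Illustrative document-quality heuristics (thresholds are made up):
    reject very short documents and documents whose character mix is
    mostly non-alphabetic (e.g. markup debris or boilerplate)."""
    if len(text) < min_chars:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / len(text) >= min_alpha_ratio


docs = [
    "x" * 50,                                              # too short: filtered out
    "A well-formed sentence about mathematics and code. " * 10,
]
kept = [d for d in docs if passes_heuristics(d)]
```

Production pipelines layer many such rules with model-based classifiers; this sketch only shows the heuristic-filter pattern of mapping a predicate over a corpus.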
Main Findings and Results
Gemma models exhibit exceptional performance across a range of benchmarks, often surpassing other open models of similar or even larger sizes. Here are some highlights:
Automated Benchmarks
Gemma models excel in various domains, including:
- Physical and Social Reasoning (ARC, CommonsenseQA): Achieves higher scores compared to other open models.
- Mathematics and Coding (GSM8K, HumanEval): Outperforms alternatives by at least 10 points on GSM8K and demonstrates superior coding capabilities.
Human Preference Evaluations
In human side-by-side evaluations, Gemma models consistently outshine Mistral v0.2 7B Instruct models:
- Instruction Following: Gemma 7B IT achieves a 61.2% win rate.
- Safety Protocols: Gemma 7B IT achieves a 63.5% win rate.
Safety Evaluations
Safety and responsibility are paramount in the development of Gemma. The models undergo rigorous testing to minimize the risk of generating harmful or biased content, and Gemma outperforms competitors on six standard safety benchmarks, reflecting a focus on responsible AI development.
Implications and Potential Applications
The release of Gemma models opens up numerous possibilities for both research and practical applications. Here are some potential areas where Gemma can make a significant impact:
Research and Innovation
By providing open access to both pretrained and fine-tuned checkpoints, Gemma encourages further research into AI safety, transparency, and interpretability. Researchers can build upon Gemma's robust foundation to explore new frontiers in NLP and AI.
Real-World Applications
Gemma's versatile capabilities make it suitable for various applications, including:
- Dialogue Systems: Enhancing chatbots and virtual assistants with improved understanding and responsiveness.
- Educational Tools: Developing interactive learning platforms that provide accurate and engaging content.
- Scientific Research: Assisting in the analysis and generation of scientific literature and data.
Conclusion
Gemma represents a new era in the development of lightweight, open language models. With its strong performance, commitment to safety, and versatile deployment options, Gemma is poised to drive the next wave of innovations in AI. As we continue to explore the capabilities and potential of Gemma, we invite the AI community to join us in this exciting journey.
Whether you're a researcher, developer, or enthusiast, Gemma offers a powerful and responsible toolset to push the boundaries of what's possible with language models. Let's build the future of AI together, one token at a time.