Original Paper: https://arxiv.org/abs/2401.16818
By: Philipp Singer, Pascal Pfeiffer, Yauhen Babakhin, Maximilian Jeblick, Nischay Dhankhar, Gabor Fodor, Sri Satish Ambati
Abstract:
We present H2O-Danube, a series of small 1.8B language models consisting of H2O-Danube-1.8B, trained on 1T tokens, and the incrementally improved H2O-Danube2-1.8B, trained on an additional 2T tokens. Our models exhibit highly competitive metrics across a multitude of benchmarks and, as of the time of this writing, H2O-Danube2-1.8B achieves the top ranking on the Open LLM Leaderboard for all models below the 2B parameter range. The models follow core principles of Llama 2 and Mistral, and we leverage and refine various techniques for pre-training large language models. We additionally release chat models trained with supervised fine-tuning followed by direct preference optimization. We make all models openly available under the Apache 2.0 license, further democratizing LLMs to a wider audience economically.
Summary Notes
Figure: Training logs. Training (top left) and validation (top right) cross-entropy loss, learning rate schedule (bottom left), and sequence length (bottom right). The x-axis shows the number of tokens trained up to each step.
Introduction
In the world of artificial intelligence, language models have become indispensable for various tasks, including text generation, translation, and summarization. The recent release of H2O-Danube, a series of 1.8B parameter language models, marks a significant advancement in this domain. Developed by H2O.ai, these models have been meticulously trained on extensive datasets and exhibit competitive performance across numerous benchmarks. This blog post delves into the intricacies of H2O-Danube, exploring its architecture, training methodologies, and the impressive results it achieves.
The Research Question
The primary objective of the H2O-Danube project was to develop highly efficient and competitive language models within the 2B parameter range. The research aimed to refine pre-training techniques, optimize training on diverse datasets, and ensure robust performance on various benchmarks while maintaining model efficiency.
Methodologies
Model Architecture
H2O-Danube models are built on the Llama 2 architecture, featuring approximately 1.8B parameters. The architecture includes:
- Hidden Size: 2,560
- Intermediate Size: 6,912
- Hidden Layers: 24
- Vocabulary Size: 32,000
- Context Length: 16,384
Key architectural components include (a configuration sketch collecting these settings follows the list):
- Sliding Window Attention: Adopted from Mistral and used via its FlashAttention-2 implementation; attention is restricted to a fixed sliding window of 4,096 tokens.
- Rotary Positional Embedding (RoPE): Models dependencies of elements at different positions in a sequence.
- Grouped-Query Attention: Utilizes 32 attention heads and 8 key-value heads to reduce memory bandwidth overhead.
- RMS Normalization: Applied separately for pre- and post-normalization to stabilize training.
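To make these hyperparameters concrete, the sketch below collects them into a Hugging Face configuration object. This is purely illustrative: it assumes the `MistralConfig` class from the `transformers` library (the released checkpoints follow a Mistral-style architecture) and is not the authors' training code.

```python
# Illustrative only: the reported H2O-Danube hyperparameters expressed as a
# Hugging Face config. Assumes transformers' MistralConfig, which covers
# sliding-window attention, RoPE, grouped-query attention, and RMSNorm.
from transformers import MistralConfig, MistralForCausalLM

config = MistralConfig(
    vocab_size=32_000,               # vocabulary size
    hidden_size=2_560,               # hidden size
    intermediate_size=6_912,         # MLP intermediate size
    num_hidden_layers=24,            # hidden layers
    num_attention_heads=32,          # query heads
    num_key_value_heads=8,           # grouped-query attention: 8 KV heads
    max_position_embeddings=16_384,  # final context length
    sliding_window=4_096,            # Mistral-style sliding window attention
)

model = MistralForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```

Instantiating a model from this configuration lands at roughly the 1.8B-parameter scale reported in the paper.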
Training Process
The training of H2O-Danube models was conducted on a single node with 8xH100 GPUs using Distributed Data Parallel (DDP). The training involved several stages with varying sequence lengths to optimize token throughput and compute efficiency:
- 700B tokens with a sequence length of 2,048
- 100B tokens with a sequence length of 4,096
- 100B tokens with a sequence length of 8,192
- 100B tokens with a sequence length of 16,384
Training leveraged recent advances in 8-bit floating-point (FP8) computation on the Hopper architecture and used the AdamW optimizer with a cosine learning rate scheduler.
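As a rough illustration of this setup, here is a minimal PyTorch sketch of AdamW combined with a linear-warmup-plus-cosine learning rate schedule. The concrete values (learning rate, weight decay, warmup and total steps) are placeholders rather than the paper's settings, and the tiny stand-in module only keeps the example self-contained.

```python
# Minimal sketch of the described optimizer setup: AdamW with a cosine
# learning-rate schedule. All hyperparameter values are placeholders,
# not the settings used to train H2O-Danube.
import math
import torch

model = torch.nn.Linear(2_560, 2_560)  # tiny stand-in for the actual LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1)

warmup_steps, total_steps = 2_000, 100_000  # placeholder schedule lengths

def cosine_with_warmup(step: int) -> float:
    """Multiplier on the base lr: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=cosine_with_warmup)

# Inside the training loop, call scheduler.step() after each optimizer.step().
```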
Key Findings
Evaluation Benchmarks
H2O-Danube-1.8B was evaluated against several open-source language models, including TinyLlama, Falcon, OPT, Cerebras-GPT, Pythia-deduped, Qwen, and Stable LM 2. The models were assessed with the Language Model Evaluation Harness framework across the following benchmarks (a minimal evaluation sketch follows the list):
- Commonsense Reasoning: Evaluated using ARC easy and challenge, HellaSwag, OpenBookQA, PIQA, and Winogrande benchmarks.
- World Knowledge: Assessed with 5-shot performance on TriviaQA.
- Reading Comprehension: Evaluated using BoolQ.
- Open LLM Leaderboard: Benchmarked using a combination of ARC challenge, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8k.
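For readers who want to reproduce numbers like these, the sketch below runs a handful of the listed tasks through the harness's Python entry point. It assumes the `lm_eval` package's `simple_evaluate` function and the released `h2oai/h2o-danube-1.8b-base` checkpoint; the exact argument names and few-shot settings can differ between harness versions and are not taken from the paper.

```python
# Sketch of evaluating the base model on a few of the listed benchmarks with
# EleutherAI's lm-evaluation-harness. The API surface (simple_evaluate) and
# the few-shot setting here are assumptions, not the paper's exact protocol.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=h2oai/h2o-danube-1.8b-base",
    tasks=["arc_easy", "arc_challenge", "hellaswag", "winogrande"],
    num_fewshot=0,   # illustrative; the paper uses benchmark-specific settings
    batch_size=8,
)
print(results["results"])
```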
Performance Highlights
H2O-Danube-1.8B delivered highly competitive results across the benchmarks, outperforming Qwen in most categories and remaining close to Stable LM 2 despite being trained on fewer tokens. Notably, the incrementally improved H2O-Danube2-1.8B achieved, at the time of writing, the top ranking on the Hugging Face Open LLM Leaderboard among models below the 2B parameter range.
Implications and Applications
The advancements in H2O-Danube models have significant implications for the application of language models in various fields:
- Efficient Inference: The smaller model size allows for efficient inference on consumer hardware and edge devices (a short inference sketch follows this list).
- Open-Source Accessibility: The models are released under the Apache 2.0 license, promoting accessibility and democratization of LLMs.
- Enhanced Fine-Tuning: The release of chat variants optimized through supervised fine-tuning and direct preference optimization (DPO) enhances their usability in conversational AI applications.
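To give a feel for how lightweight local use looks in practice, here is a short inference sketch using the Hugging Face `transformers` pipeline. The model ID follows the paper's open release and the chat-template call is standard `transformers` usage, but the generation settings are illustrative rather than recommended defaults.

```python
# Illustrative local inference with the released DPO-tuned chat model via
# the Hugging Face transformers pipeline. Generation settings are arbitrary.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="h2oai/h2o-danube-1.8b-chat",  # chat variant from the open release
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain sliding window attention in two sentences."}]
prompt = pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(pipe(prompt, max_new_tokens=128, do_sample=False)[0]["generated_text"])
```

At 1.8B parameters in bfloat16, the weights alone take roughly 3.6 GB of memory, which is what makes consumer-hardware and edge deployment realistic.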
Conclusion
H2O-Danube represents a substantial leap forward in the development of efficient and competitive language models. By refining pre-training techniques and optimizing training on diverse datasets, H2O.ai has created models that excel across multiple benchmarks. The open-source release under the Apache 2.0 license further democratizes access to advanced LLMs, paving the way for broader applications and future research.
The future of language models is here, and H2O-Danube is at the forefront, setting new standards for efficiency and performance in the AI landscape.