Original Paper: https://arxiv.org/abs/2404.14219
By: Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Qin Cai, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Yen-Chun Chen, Yi-Ling Chen, Parul Chopra, Xiyang Dai, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Victor Fragoso, Dan Iter, Mei Gao, Min Gao, Jianfeng Gao, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahmoud Khademi, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Yunsheng Li, Chen Liang, Lars Liden, Ce Liu, Mengchen Liu, Weishung Liu, Eric Lin, Zeqi Lin, Chong Luo, Piyush Madan, Matt Mazzola, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Swadheen Shukla, Xia Song, Masahiro Tanaka, Andrea Tupini, Xin Wang, Lijuan Wang, Chunyu Wang, Yu Wang, Rachel Ward, Guanhua Wang, Philipp Witte, Haiping Wu, Michael Wyatt, Bin Xiao, Can Xu, Jiahang Xu, Weijian Xu
Abstract:
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide some initial parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench). Moreover, we also introduce phi-3-vision, a 4.2 billion parameter model based on phi-3-mini with strong reasoning capabilities for image and text prompts.
Summary Notes
Figure: Toy illustration of the blocksparse attention in phi-3-small, with 2 local blocks and a vertical stride of 3. The table shows the keys/values that a query token in block 8 attends to: blue = local blocks, orange = remote/vertical blocks, gray = skipped blocks.
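To make the attention pattern in the caption concrete, here is a minimal NumPy sketch of a block-level blocksparse mask. It is illustrative only: the function name, the causal convention, and the rule for which key blocks count as "vertical" (every third block) are assumptions, not the exact mask construction used in phi-3-small.

```python
import numpy as np

def blocksparse_mask(num_blocks: int, local_blocks: int = 2, vert_stride: int = 3) -> np.ndarray:
    """Block-level boolean mask: mask[q, k] is True if query block q may attend to key block k."""
    mask = np.zeros((num_blocks, num_blocks), dtype=bool)
    for q in range(num_blocks):
        for k in range(q + 1):                        # causal: no attending to future blocks
            is_local = (q - k) < local_blocks          # the query block and its immediate predecessor
            is_vertical = (k + 1) % vert_stride == 0   # every vert_stride-th block (assumed convention)
            mask[q, k] = is_local or is_vertical
    return mask

# Row 7 (block 8 in the figure's 1-indexed terms) shows which key blocks that query block attends to.
print(blocksparse_mask(num_blocks=8)[7].astype(int))
```

With these assumptions, a query in block 8 attends to its two local blocks (7 and 8) plus the strided remote blocks (3 and 6), which matches the kind of pattern the figure describes.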
Introduction
Imagine having a language model with near ChatGPT-level capabilities running directly on your smartphone. Sounds far-fetched, right? Yet, this is precisely what Microsoft has achieved with their latest innovation, Phi-3-Mini. This 3.8 billion parameter language model, trained on a staggering 3.3 trillion tokens, is setting new benchmarks for performance and accessibility. In this blog post, we will unpack the technical marvel that is Phi-3-Mini, exploring its methodologies, key findings, and the implications of running such a powerful AI locally on your device.
Training Methodologies: The Secret Sauce
The core of Phi-3-Mini’s success lies in its data-driven training methodology. Unlike the traditional approach of merely scaling up model parameters, Microsoft focused on optimizing the quality of the training data. Here are the key steps involved:
- High-Quality Data Filtering: The training dataset was meticulously curated, combining heavily filtered web data and synthetic data generated by other language models. This approach ensures that the model learns from high-quality and relevant information.
- Two-Phase Pre-Training: The training process was divided into two distinct phases. The first phase focused on general knowledge and language understanding using a broad range of web sources. The second phase fine-tuned the model's logical reasoning and niche skills using even more selectively filtered data.
- Data Optimal Regime: Instead of following the "compute optimal" or "over-train" regimes, Phi-3-Mini was trained under a "data optimal" regime. This means the training data was calibrated to optimize the model's learning capacity, prioritizing reasoning ability over the mere accumulation of factual knowledge.
- Post-Training Refinements: Post-training involved supervised fine-tuning (SFT) followed by direct preference optimization (DPO). These stages used high-quality datasets to improve the model's performance in targeted domains such as math, coding, reasoning, and conversational safety (a sketch of the DPO objective follows this list).
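The paper does not publish its exact post-training recipe, but the DPO stage optimizes the standard direct-preference objective: given a preferred and a rejected response, the policy is pushed to raise the preferred response's likelihood relative to a frozen reference model. Below is a minimal PyTorch sketch of that loss; the tensor values in the usage example are made up.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage: summed log-probabilities of each full response under the policy and reference models.
policy_chosen = torch.tensor([-12.0, -9.5])
policy_rejected = torch.tensor([-14.0, -11.0])
ref_chosen = torch.tensor([-12.5, -10.0])
ref_rejected = torch.tensor([-13.0, -10.5])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```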
Key Findings: Benchmarking Excellence
Phi-3-Mini’s performance is nothing short of impressive. Despite its smaller size, it rivals much larger models such as GPT-3.5 and Mixtral 8x7B across academic benchmarks:
- MMLU (5-Shot): 68.8%
- HellaSwag (5-Shot): 76.7%
- GSM-8K (8-Shot; CoT): 82.5%
- ARC-C (10-Shot): 84.9%
These results highlight Phi-3-Mini’s exceptional reasoning and understanding capabilities, making it a formidable contender in the AI landscape.
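These numbers come from few-shot evaluation (e.g., 5-shot for MMLU, 8-shot with chain-of-thought for GSM-8K). The paper's exact prompt templates are not reproduced here; the snippet below is a generic sketch of how a k-shot multiple-choice prompt is typically assembled, with hypothetical helper and field names.

```python
def build_fewshot_prompt(examples, question, choices):
    """Assemble a k-shot multiple-choice prompt (MMLU-style): k solved demos, then the test question."""
    letters = "ABCD"
    blocks = []
    for ex in examples:  # each demo: {"question": str, "choices": list[str], "answer": int index}
        opts = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(ex["choices"]))
        blocks.append(f"Question: {ex['question']}\n{opts}\nAnswer: {letters[ex['answer']]}")
    opts = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    blocks.append(f"Question: {question}\n{opts}\nAnswer:")
    return "\n\n".join(blocks)

demos = [{"question": "What is 2 + 2?", "choices": ["3", "4", "5", "22"], "answer": 1}]
print(build_fewshot_prompt(demos, "Which planet is known as the Red Planet?",
                           ["Venus", "Mars", "Jupiter", "Saturn"]))
```

The model's next-token continuation after the final "Answer:" is scored against the gold letter, which is how the percentages above are computed in standard harnesses.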
Innovation in Deployment: Running Locally on Phones
One of the most groundbreaking aspects of Phi-3-Mini is its deployment capability. The model is small enough to run locally on modern smartphones. Here’s how it’s done:
- Model Quantization: Phi-3-Mini can be quantized to 4-bits, reducing its memory footprint to approximately 1.8GB. This makes it feasible to run on devices like the iPhone 14 with an A16 Bionic chip.
- Efficient Inference: Even in its 4-bit quantized form, Phi-3-Mini generates more than 12 tokens per second on the phone, keeping interactions smooth and responsive (see the loading sketch after this list).
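The paper's on-device deployment runs on a native runtime, but the same 4-bit weight quantization can be tried server-side. Below is a hedged sketch using Hugging Face transformers with bitsandbytes; the checkpoint id, generation settings, and crude throughput measurement are assumptions for illustration, not the paper's setup.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed Hub checkpoint id
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                 # 4-bit weights: roughly ~1.8 GB for a 3.8B-parameter model
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)

prompt = "Explain why the sky is blue in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.time()
output = model.generate(**inputs, max_new_tokens=64)
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(tokenizer.decode(output[0], skip_special_tokens=True))
print(f"{new_tokens / elapsed:.1f} tokens/sec")  # throughput depends heavily on hardware
```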
Implications and Applications: The Future of AI Accessibility
The ability to run a highly capable language model locally on a phone has far-reaching implications:
- Privacy and Security: Local inference means sensitive data never leaves the user’s device, enhancing privacy and security.
- Offline Functionality: Users can leverage advanced AI capabilities without needing a constant internet connection, making it ideal for remote or low-connectivity environments.
- Democratization of AI: By making powerful AI accessible on everyday devices, Microsoft is democratizing the benefits of advanced language models, enabling a broader range of applications from personal assistants to educational tools.
Conclusion: A New Era of AI
Phi-3-Mini represents a significant leap forward in the field of language models. Its efficient training methodology, combined with the ability to deploy locally on smartphones, sets a new standard for what’s possible in AI. As we look to the future, the implications of this technology will continue to unfold, bringing powerful AI tools into the hands of millions.
Stay tuned for more updates as Microsoft continues to push the boundaries of AI innovation.