Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Original Paper: https://arxiv.org/abs/2404.14219

By: Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Qin Cai, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Yen-Chun Chen, Yi-Ling Chen, Parul Chopra, Xiyang Dai, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Victor Fragoso, Dan Iter, Mei Gao, Min Gao, Jianfeng Gao, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahmoud Khademi, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Yunsheng Li, Chen Liang, Lars Liden, Ce Liu, Mengchen Liu, Weishung Liu, Eric Lin, Zeqi Lin, Chong Luo, Piyush Madan, Matt Mazzola, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Swadheen Shukla, Xia Song, Masahiro Tanaka, Andrea Tupini, Xin Wang, Lijuan Wang, Chunyu Wang, Yu Wang, Rachel Ward, Guanhua Wang, Philipp Witte, Haiping Wu, Michael Wyatt, Bin Xiao, Can Xu, Jiahang Xu, Weijian Xu, et al.

Abstract:

We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone.

The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data.

The model is also further aligned for robustness, safety, and chat format.

We also provide some initial parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench, respectively).

We also introduce phi-3-vision, a 4.2 billion parameter model based on phi-3-mini with strong reasoning capabilities for image and text prompts.

Summary Notes

Figure: Toy illustration of the blocksparse attention in phi-3-small, with 2 local blocks and a vertical stride of 3. The table shows the keys/values that a query token in block 8 attends to. Blue = local blocks, orange = remote/vertical blocks, gray = blocks skipped.
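
This sparsity pattern is straightforward to express as a block-level boolean mask. Below is a minimal NumPy sketch (our own illustration, not the optimized attention kernels used for phi-3-small); anchoring the vertical blocks at block 0 is an assumption chosen to match the figure.

```python
import numpy as np

def blocksparse_mask(n_blocks: int, local_blocks: int = 2, vert_stride: int = 3) -> np.ndarray:
    """Block-level attention mask: mask[i, j] is True if query block i may attend to key block j."""
    mask = np.zeros((n_blocks, n_blocks), dtype=bool)
    for i in range(n_blocks):
        for j in range(i + 1):                 # causal: current and past blocks only
            local = (i - j) < local_blocks     # the `local_blocks` most recent blocks
            vertical = (j % vert_stride) == 0  # every `vert_stride`-th block (assumed anchor at 0)
            mask[i, j] = local or vertical
    return mask

# A query token in block 8, with 2 local blocks and vertical stride 3:
print(np.where(blocksparse_mask(10)[8])[0])    # [0 3 6 7 8]; all other blocks are skipped
```

Because each query block touches only the local and strided key blocks instead of every earlier block, both attention FLOPs and KV-cache traffic shrink accordingly.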

Introduction

Imagine having a language model with near ChatGPT-level capabilities running directly on your smartphone. Sounds far-fetched, right?

Yet, this is precisely what Microsoft has achieved with their latest innovation, Phi-3-Mini. This 3.8 billion parameter language model, trained on a staggering 3.3 trillion tokens, is setting new benchmarks for performance and accessibility.

In this blog post, we will unpack the technical marvel that is Phi-3-Mini, exploring its methodologies, key findings, and the implications of running such a powerful AI locally on your device.


Training Methodologies: The Secret Sauce

The core of Phi-3-Mini’s success lies in its data-driven training methodology. Unlike the traditional approach of merely scaling up model parameters, Microsoft focused on optimizing the quality of the training data. Here are the key steps involved:

  1. High-Quality Data Filtering: The training dataset was meticulously curated, combining heavily filtered web data and synthetic data generated by other language models. This approach ensures that the model learns from high-quality and relevant information.
  2. Two-Phase Pre-Training: The training process was divided into two distinct phases. The first phase focused on general knowledge and language understanding using a broad range of web sources. The second phase fine-tuned the model's logical reasoning and niche skills using even more selectively filtered data.
  3. Data Optimal Regime: Instead of following the "compute optimal" or "over-train" regimes, Phi-3-Mini was trained under a "data optimal" regime: the data mix was calibrated to what a model of this size can actually use, prioritizing reasoning ability over the mere accumulation of factual knowledge. For example, a web page listing one day's sports results might be useful training data for a frontier model, but it was filtered out here to leave more model capacity for reasoning.
  4. Post-Training Refinements: Post-training involved supervised fine-tuning (SFT) and direct preference optimization (DPO). These stages used high-quality datasets to improve the model's performance in diverse domains, including math, coding, and conversational safety (a minimal sketch of the DPO objective follows this list).
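
The paper does not release its post-training code, but the DPO objective it refers to is the standard one from Rafailov et al. (2023). Here is a minimal PyTorch sketch of that loss, assuming you have already summed per-sequence log-probabilities under the policy being trained and under a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss; each tensor holds one value per preference pair."""
    # How much more (or less) likely the policy finds each answer than the reference does.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Push the chosen answer's log-ratio above the rejected one's; beta sets how hard.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

The appeal of DPO over RLHF-style pipelines is that it needs no separate reward model and no on-policy sampling: preference pairs plus a frozen reference model are enough.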


Key Findings: Benchmarking Excellence

Phi-3-Mini’s performance is nothing short of impressive. Despite its smaller size, it rivals much larger models like GPT-3.5 and Mixtral 8x7B on various academic benchmarks:

  • MMLU (5-Shot): 68.8%
  • HellaSwag (5-Shot): 76.7%
  • GSM-8K (8-Shot; CoT): 82.5%
  • ARC-C (10-Shot): 84.9%

These results highlight Phi-3-Mini’s exceptional reasoning and understanding capabilities, making it a formidable contender in the AI landscape.
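
For readers unfamiliar with the n-shot notation: the model sees n worked examples in the prompt before the test question. Here is a schematic of how such a prompt is assembled (the template below is illustrative; it is not the evaluation harness the paper used):

```python
def few_shot_prompt(exemplars: list[dict], question: str, choices: list[tuple[str, str]]) -> str:
    """Build an n-shot multiple-choice prompt; the format is an illustrative assumption."""
    blocks = []
    for ex in exemplars:  # e.g., 5 solved exemplars for a 5-shot evaluation
        options = "\n".join(f"{label}. {text}" for label, text in ex["choices"])
        blocks.append(f"Q: {ex['question']}\n{options}\nA: {ex['answer']}")
    options = "\n".join(f"{label}. {text}" for label, text in choices)
    blocks.append(f"Q: {question}\n{options}\nA:")  # the model completes the final answer
    return "\n\n".join(blocks)
```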


Innovation in Deployment: Running Locally on Phones

One of the most groundbreaking aspects of Phi-3-Mini is its deployment capability. The model is small enough to run locally on modern smartphones. Here’s how it’s done:

  • Model Quantization: Phi-3-Mini can be quantized to 4 bits, reducing its memory footprint to approximately 1.8GB. This makes it feasible to run on devices like an iPhone 14 with the A16 Bionic chip (see the sketch after this list).
  • Efficient Inference: Even in its quantized form, Phi-3-Mini generates over 12 tokens per second, ensuring smooth and responsive interactions.
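
The 1.8GB figure is essentially arithmetic: 3.8B parameters × 4 bits ≈ 1.9GB of weights before runtime overhead. If you want to try a comparable setup on a GPU, here is a sketch using the Hugging Face transformers + bitsandbytes stack; this is one of several quantization routes, not the on-device runtime described in the paper, and it assumes the published microsoft/Phi-3-mini-4k-instruct checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumes the public Hugging Face checkpoint
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 3.8B params * 4 bits ~= 1.9GB of weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16 while weights stay 4-bit
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

inputs = tokenizer("Explain blocksparse attention in one sentence.",
                   return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```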


Implications and Applications: The Future of AI Accessibility

The ability to run a highly capable language model locally on a phone has far-reaching implications:

  1. Privacy and Security: Local inference means sensitive data never leaves the user’s device, enhancing privacy and security.
  2. Offline Functionality: Users can leverage advanced AI capabilities without needing a constant internet connection, making it ideal for remote or low-connectivity environments.
  3. Democratization of AI: By making powerful AI accessible on everyday devices, Microsoft is democratizing the benefits of advanced language models, enabling a broader range of applications from personal assistants to educational tools.


Conclusion: A New Era of AI

Phi-3-Mini represents a significant leap forward in the field of language models. Its efficient training methodology, combined with the ability to deploy locally on smartphones, sets a new standard for what’s possible in AI.

As we look to the future, the implications of this technology will continue to unfold, bringing powerful AI tools into the hands of millions.

Stay tuned for more updates as Microsoft continues to push the boundaries of AI innovation.
