Quantized LoRA: Fine-Tuning Large Language Models with Ease
Language models like GPT-4 have established themselves as the leading standard in the NLP industry for building advanced applications. These models can perform diverse tasks and adapt easily to new ones using prompt engineering techniques.
However, they also present a massive challenge around training and fine-tuning. Because of their enormous number of parameters, training these models costs millions of dollars. Hence, smaller models are usually chosen for production settings.
While smaller models offer advantages in cost and deployment, they often struggle to generalize across multiple tasks. Consequently, we end up maintaining separate models for different tasks and users.
This fragmentation increases the complexity of managing and maintaining various models. To address these challenges, PEFT techniques like LoRA (Low-Rank Adaptation) come in handy.
Instead of retraining large models fully, LoRA enables more efficient training by freezing the pre-trained weights and introducing trainable adapters within the Transformer architecture.
During fine-tuning, only adapters are updated for specific tasks and then merged back into the model without changing the original model's core parameters.
Building upon LoRA, Quantized Low-Rank Adaptation (QLoRA) improves efficiency further by combining LoRA with quantization, achieving roughly a 4x reduction in memory usage compared to standard LoRA.
Quantization reduces the precision of the model's weights to lower-bit formats, such as 4-bit or 8-bit integers, rather than using standard 32-bit floating-point numbers. By doing so, QLoRA further reduces memory usage and computational requirements to fine-tune large models.
Different fine-tuning methods and their memory requirements | Source
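To make the adapter idea concrete, below is a minimal PyTorch sketch of a LoRA-style linear layer. It is purely illustrative (the class name, rank r, and scaling factor alpha are arbitrary choices for this example), not the implementation used by the peft library we apply later in this article.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA-style wrapper around a frozen linear layer."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pre-trained weights
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # trainable down-projection
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # trainable up-projection
        nn.init.zeros_(self.lora_B.weight)  # adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen base output plus the trainable low-rank update
        return self.base(x) + self.lora_B(self.lora_A(x)) * self.scaling

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 768)).shape)  # torch.Size([2, 768])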
In this article, we’ll understand QLoRA, its method of quantization, its implementation, and its advantages.
What is Quantization?
Quantization is a deep learning technique that reduces the numerical precision of model weights and activations.
It converts high-precision floating-point numbers (such as 32-bit) to lower-precision formats such as 8-bit or even 4-bit integers.
In simple words, when you hear “quantizing,” think of splitting a range of numbers into buckets or bins. Quantization can greatly reduce a model's memory usage and accelerate its inference speed.
Quantization: FP32-bit to INT8 | Source
In quantization, converting a model's parameters involves two main steps: scaling and rounding.
- Scaling: This step maps the range of values in the model's weights to a smaller range of integer values, such as -128 to 127, in an 8-bit format. This is achieved by multiplying the original weights by a scaling factor that brings them into the desired integer range.
- Rounding: After scaling, the floating-point values are rounded to the nearest integer, converting the continuous values into discrete ones. A small numeric sketch of both steps follows the figure below.
Quantizing numbers via whole numbers or 10s | Source
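As mentioned above, here is a small numeric sketch of these two steps using absmax-style INT8 quantization; the weight values are made up purely for illustration.
import numpy as np

# Hypothetical FP32 weight values
weights = np.array([0.42, -1.35, 0.07, 2.10, -0.88], dtype=np.float32)

# Scaling: map the weights' absolute range onto the INT8 range [-127, 127]
scale = 127 / np.abs(weights).max()

# Rounding: round the scaled values to the nearest integer
quantized = np.round(weights * scale).astype(np.int8)

# Dequantizing recovers approximate FP32 values when they are needed for compute
dequantized = quantized.astype(np.float32) / scale

print(quantized)    # [ 25 -82   4 127 -53]
print(dequantized)  # close to the original weights, with small rounding error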
Benefits of Quantization
Quantization offers several important benefits for deploying large-scale models on hardware with limited resources.
- Reduced Memory: Quantization decreases the size of the model by reducing the number of bits required to store its parameters, as the rough calculation after this list illustrates.
- Faster Inference: Integer computations are generally faster than floating-point operations.
- Lower Power Consumption: Quantized models require less power, benefiting mobile and edge devices.
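To see why the memory benefit matters at LLM scale, here is a rough back-of-the-envelope calculation for a 20-billion-parameter model (the scale we fine-tune later in this article); it counts only weight storage and ignores activations, gradients, and optimizer states.
params = 20_000_000_000  # roughly GPT-NeoX-20B scale

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gigabytes = params * bits / 8 / 1e9
    print(f"{name}: ~{gigabytes:.0f} GB just to store the weights")
# FP32: ~80 GB, FP16: ~40 GB, INT8: ~20 GB, INT4: ~10 GB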
Understanding QLoRA
QLoRA, or Quantized Low-Rank Adaptation, sets itself apart from standard fine-tuning methods by combining three new concepts with LoRA to make the most of a machine's limited memory without sacrificing model performance. These concepts are:
- 4-bit Normal Float
- Double Quantization
- Paged Optimizers
1. 4-bit Normal Float
QLoRA's first approach to quantization is 4-bit NormalFloat, or NF4. It is a new information-theoretically optimal data type built on quantile quantization techniques. NF4 estimates the 2^k + 1 quantiles (where k is the number of bits) of a normal distribution within a normalized range of [−1, 1].
This ensures that each quantization bin contains an equal number of data points, optimizing the representation of normally distributed values such as large language model weights.
Unlike standard quantization methods that divide the value range into equally spaced bins, NF4 sizes its bins according to data density, so that each bin holds roughly the same number of values (equally sized rather than equally spaced). This preserves precision and dynamic range even within a limited 4-bit format.
Difference between equally-spaced and equally-sized buckets or bins | Source
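Below is a simplified sketch of the idea behind quantile-based 4-bit levels, not the exact NF4 construction from the QLoRA paper: it derives 2^k + 1 quantile boundaries from a standard normal distribution (assuming SciPy is available), normalizes the resulting levels into [−1, 1], and maps each weight to its nearest level.
import numpy as np
from scipy.stats import norm

k = 4  # number of bits
# Estimate 2^k + 1 quantile boundaries of N(0, 1), clipping the probability
# range slightly to avoid the infinite quantiles at 0 and 1.
probs = np.linspace(0.005, 0.995, 2**k + 1)
boundaries = norm.ppf(probs)

# Midpoints between boundaries give 2^k representative levels,
# normalized into [-1, 1] so weights can be rescaled onto them.
levels = (boundaries[:-1] + boundaries[1:]) / 2
levels /= np.abs(levels).max()

def quantize(weights: np.ndarray) -> np.ndarray:
    """Map each weight to the index of its nearest level (a 4-bit code)."""
    scaled = weights / np.abs(weights).max()  # per-tensor absmax scaling
    return np.abs(scaled[:, None] - levels[None, :]).argmin(axis=1)

codes = quantize(np.random.randn(8).astype(np.float32))
print(levels.round(3))  # levels are denser near 0, where most weights lie
print(codes)            # one 4-bit integer code per weight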
2. Double Quantization
Double quantization in QLoRA further reduces memory usage by applying quantization to the constants from the initial quantization process. This two-step process starts with quantizing the model weights to a 4-bit precision using NormalFloat (NF4).
Next, the quantization constants (the per-block scaling factors) from the first step are themselves quantized to a lower 8-bit precision.
This is important because QLoRA uses Block-wise quantization. Instead of quantizing all weights together, this method divides them into smaller blocks or chunks of weights, which are then quantized independently.
While this block-wise approach improves accuracy, it also generates multiple quantization constants, which can consume more memory.
By applying a second level of quantization to these constants, QLoRA effectively reduces their storage requirements and optimizes memory utilization during training or fine-tuning.
Comparison of standard vs block-wise quantization | Source
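Here is a toy sketch that puts block-wise quantization and double quantization together; the block size, bit widths, and absmax scheme are simplifications, not the exact constants QLoRA uses.
import numpy as np

def blockwise_quantize(weights: np.ndarray, block_size: int = 64):
    """Toy block-wise quantization: each block keeps its own FP32 scale constant."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1)  # one FP32 constant per block
    codes = np.round(blocks / scales[:, None] * 7).astype(np.int8)  # 4-bit range [-7, 7]
    return codes, scales

def quantize_constants(scales: np.ndarray):
    """Second (double) quantization: compress the per-block scales to 8 bits."""
    outer_scale = scales.max()
    scale_codes = np.round(scales / outer_scale * 255).astype(np.uint8)
    return scale_codes, outer_scale

weights = np.random.randn(4096).astype(np.float32)
codes, scales = blockwise_quantize(weights)
scale_codes, outer_scale = quantize_constants(scales)

# The per-block constants shrink from 32 bits to 8 bits each
print(scales.nbytes, "bytes of constants before double quantization")
print(scale_codes.nbytes, "bytes of constants after double quantization")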
3. Paged Optimizers
Paged optimizers use NVIDIA's unified memory feature to prevent out-of-memory errors during training.
When the GPU reaches its memory limit, it transfers memory pages to the CPU, similar to how memory is managed between CPU RAM and machine storage.
This memory paging feature specifically transfers optimizer states between the CPU and GPU as needed. It's important because sudden memory increases during training can stop the process.
Illustration of data transfers between the CPU and GPU | Source
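If you want to use a paged optimizer directly, bitsandbytes exposes paged variants of AdamW. The snippet below is a minimal sketch with a placeholder model; in the guide that follows, we enable the same behavior through the Trainer's optim="paged_adamw_8bit" argument instead.
import torch.nn as nn
import bitsandbytes as bnb

model = nn.Linear(10, 10)  # placeholder model purely for illustration

# Paged 8-bit AdamW keeps optimizer states in unified (pageable) memory,
# so they can spill to CPU RAM when the GPU hits a memory spike during training.
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=2e-4)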
Now, let's apply these techniques using the Hugging Face and PEFT libraries and explore how they streamline the QLoRA fine-tuning process step by step.
QLoRA Fine-tuning with Hugging Face: A Step-by-Step Guide
Hugging Face provides an excellent ecosystem for working with and fine-tuning pre-trained models. For QLoRA fine-tuning, we'll need the bitsandbytes library for 4-bit quantization, which optimizes memory usage.
Additionally, we'll use the peft library for LoRA fine-tuning to make the adaptation process efficient for pre-trained models with minimal computation.
The code in this guide has been re-implemented from the notebook provided by HuggingFace.
1. Set Up the Environment
Start by installing the required libraries in your environment.
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets
2. Load and Quantize the Model
Next, load the pre-trained language model and tokenizer; we use GPT-NeoX-20B here. Note that the model itself is around 40 GB in half precision. We'll then configure bitsandbytes to load the model in 4-bit precision using NF4 and double quantization.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
# Specify the model ID for GPT-NeoX-20B
model_id = "EleutherAI/gpt-neox-20b"
# Configure BitsAndBytes for 4-bit quantization
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, # Load the model in 4-bit precision
bnb_4bit_use_double_quant=True, # Enable double quantization
bnb_4bit_quant_type="nf4", # Use 4-bit NormalFloat (NF4)
bnb_4bit_compute_dtype=torch.bfloat16 # Set compute dtype to bfloat16
)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the model with the specified quantization configuration
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map={"":0} # Load the model on the first available GPU
)
3. Prepare the Model for Training
Set up the model for low-bit training using the prepare_model_for_kbit_training method from the PEFT library. Enable gradient checkpointing to optimize memory usage.
from peft import prepare_model_for_kbit_training
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
To verify the number of trainable parameters, define and call the following function:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0  # Counter for trainable parameters
    all_param = 0  # Counter for all parameters
    for _, param in model.named_parameters():
        all_param += param.numel()  # Add the parameter's element count to the running total
        if param.requires_grad:
            trainable_params += param.numel()  # Count only parameters that will be updated
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
4. Configure LoRA (Low-Rank Adaptation)
Now, let's apply LoRA to the model using the peft library. We'll configure LoRA with specific parameters like rank (r), alpha (lora_alpha), target modules, and dropout.
from peft import LoraConfig, get_peft_model
config = LoraConfig(
r=8,
lora_alpha=32,
target_modules=["query_key_value"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)
print_trainable_parameters(model)
Output
trainable params: 8650752 || all params: 10597552128 || trainable%: 0.08162971878329976
5. Load and Prepare the Dataset
Load the dataset for fine-tuning. Here, we'll use a dataset of English quotes. Tokenize the data and handle padding.
from datasets import load_dataset
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
# needed for gpt-neo-x tokenizer
tokenizer.pad_token = tokenizer.eos_token
6. Train the Model
Use the Hugging Face Trainer API for training. Configure the training arguments, such as batch size, learning rate, optimizer, number of training steps, and data collator.
import transformers
trainer = transformers.Trainer(
model=model, # The model to be fine-tuned
train_dataset=data["train"], # Training dataset
args=transformers.TrainingArguments(
per_device_train_batch_size=1, # Batch size per GPU
gradient_accumulation_steps=4, # Accumulate gradients over multiple steps
warmup_steps=2, # Number of warmup steps for the learning rate scheduler
max_steps=10, # Total number of training steps
learning_rate=2e-4, # Learning rate
fp16=True, # Use mixed precision training (fp16)
logging_steps=1, # Log training information every step
output_dir="outputs", # Directory to save model checkpoints
optim="paged_adamw_8bit", # Use the paged AdamW optimizer for memory efficiency
report_to="none" # Disable reporting to online services like WandB
),
data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False), # Data collator for language modeling
)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
trainer.train()
Output
For the sake of the demo, we ran it for only a few steps to showcase how to use this integration with existing tools in the Hugging Face ecosystem.
Steps and training loss from the short demo run
7. Save and Test the Fine-tuned Model
After training, save the fine-tuned model and use it for inference.
model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model
model_to_save.save_pretrained("outputs")
# Load and test
lora_config = LoraConfig.from_pretrained('outputs')
model = get_peft_model(model, lora_config)
text = "Elon Musk "
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Output
Elon Musk
Elon Musk is the founder of Tesla, SpaceX, and the Boring Company.
This guide demonstrated fine-tuning a large language model using QLoRA with Hugging Face Transformers. By following these steps, you can effectively adapt large language models (LLMs) to specific tasks on memory-constrained hardware.
Advantages of QLoRA Over Standard LoRA
QLoRA has some important benefits compared to standard LoRA, mainly because of its use of quantization. Here are some of them:
- Improved Memory Efficiency: QLoRA reduces memory usage compared to standard LoRA. Quantizing model weights to lower precision minimizes the model's size, enabling fine-tuning on consumer-grade GPUs with limited memory.
- Faster Training: Reducing memory size through quantization leads to faster training times. This helps in quicker experimentation and iteration, accelerating the development process.
- Comparable Performance: QLoRA achieves accuracy comparable to both full fine-tuning and standard LoRA.
- Block-wise Quantization for Fine-Grained Control: QLoRA's use of block-wise quantization ensures that each segment of model weights is optimized independently, allowing flexibility and additional compression without affecting the model's performance.
Conclusion
The QLoRA method for fine-tuning large language models helps address the challenges of memory and computational cost. It combines two techniques, quantization and Low-Rank Adaptation, to improve efficiency while maintaining good performance.
To see its practical application, we demonstrated the implementation of QLoRA with Hugging Face by fine-tuning a GPT-NeoX-20B model on an English quotes dataset.