Quantized LoRA: Fine-Tuning Large Language Models with Ease
Language models like GPT-4 have established themselves as the leading standard in the NLP industry for building advanced applications. These models can perform diverse tasks and adapt easily to new ones using prompt engineering techniques.
However, they also present a massive challenge around training and fine-tuning. Because of their enormous number of parameters, training these models costs millions of dollars. Hence, smaller models are usually chosen for production settings.
While smaller models offer advantages in cost and deployment, they often struggle to generalize across multiple tasks. Consequently, we end up maintaining separate models for different tasks and users.
This fragmentation increases the complexity of managing and maintaining various models. To address these challenges, PEFT techniques like LoRA (Low-Rank Adaptation) come in handy.
Instead of retraining large models fully, LoRA enables more efficient training by freezing the pre-trained weights and introducing trainable adapters within the Transformer architecture.
During fine-tuning, only adapters are updated for specific tasks and then merged back into the model without changing the original model's core parameters.
Building upon LoRA, Quantized Low-Rank Adaptation (QLoRA) improves efficiency further by combining LoRA with quantization, achieving roughly a 4x reduction in memory usage compared to standard LoRA.
Quantization reduces the precision of the model's weights to lower-bit formats, such as 4-bit or 8-bit integers, rather than using standard 32-bit floating-point numbers. By doing so, QLoRA further reduces memory usage and computational requirements to fine-tune large models.
Different fine-tuning methods and their memory requirements | Source
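To make the adapter idea concrete, below is a minimal PyTorch sketch of a LoRA-style linear layer. It is purely illustrative (the class name, rank r, and scaling factor alpha are arbitrary choices for this example), not the implementation used by the peft library we apply later in this article.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA-style wrapper around a frozen linear layer."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pre-trained weights
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # trainable down-projection
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # trainable up-projection
        nn.init.zeros_(self.lora_B.weight)  # adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen base output plus the trainable low-rank update
        return self.base(x) + self.lora_B(self.lora_A(x)) * self.scaling

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 768)).shape)  # torch.Size([2, 768])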
In this article, we’ll understand QLoRA, its method of quantization, its implementation, and its advantages.
What is Quantization?
Quantization is a deep learning technique that reduces the numerical precision of model weights and activations.
It converts high-precision floating-point numbers (such as 32-bit) to lower-precision formats such as 8-bit or even 4-bit integers.
In simple words, when you hear “quantizing,” think of splitting a range of numbers into buckets or bins. Quantization can greatly reduce a model's memory usage and accelerate its inference speed.
Quantization: FP32-bit to INT8 | Source
In quantization, converting a model's parameters involves two main steps: scaling and rounding.
- Scaling: This step maps the range of values in the model's weights to a smaller range of integer values, such as -128 to 127, in an 8-bit format. This is achieved by multiplying the original weights by a scaling factor that brings them into the desired integer range.
- Rounding: After scaling, the floating-point values are rounded to the nearest integer, converting the continuous values into discrete ones. A small numeric sketch of both steps follows the figure below.
Quantizing numbers via whole numbers or 10s | Source
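As mentioned above, here is a small numeric sketch of these two steps using absmax-style INT8 quantization; the weight values are made up purely for illustration.
import numpy as np

# Hypothetical FP32 weight values
weights = np.array([0.42, -1.35, 0.07, 2.10, -0.88], dtype=np.float32)

# Scaling: map the weights' absolute range onto the INT8 range [-127, 127]
scale = 127 / np.abs(weights).max()

# Rounding: round the scaled values to the nearest integer
quantized = np.round(weights * scale).astype(np.int8)

# Dequantizing recovers approximate FP32 values when they are needed for compute
dequantized = quantized.astype(np.float32) / scale

print(quantized)    # [ 25 -82   4 127 -53]
print(dequantized)  # close to the original weights, with small rounding error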
Benefits of Quantization
Quantization offers several important benefits for deploying large-scale models on hardware with limited resources.
- Reduced Memory: Quantization decreases the size of the model by reducing the number of bits required to store its parameters, as the rough calculation after this list illustrates.
- Faster Inference: Integer computations are generally faster than floating-point operations.
- Lower Power Consumption: Quantized models require less power, benefiting mobile and edge devices.
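To see why the memory benefit matters at LLM scale, here is a rough back-of-the-envelope calculation for a 20-billion-parameter model (the scale we fine-tune later in this article); it counts only weight storage and ignores activations, gradients, and optimizer states.
params = 20_000_000_000  # roughly GPT-NeoX-20B scale

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gigabytes = params * bits / 8 / 1e9
    print(f"{name}: ~{gigabytes:.0f} GB just to store the weights")
# FP32: ~80 GB, FP16: ~40 GB, INT8: ~20 GB, INT4: ~10 GB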
Understanding QLoRA
QLoRA, or Quantized Low-Rank Adaptation, sets itself apart from standard fine-tuning methods by combining three new concepts with LoRA to make the most of a machine's limited memory without sacrificing model performance. These concepts are:
- 4-bit Normal Float
- Double Quantization
- Paged Optimizers
1. 4-bit Normal Float
QLoRA's first approach to quantization is 4-bit NormalFloat, or NF4. It is a new information-theoretically optimal data type built on quantile quantization techniques. NF4 estimates the 2^k + 1 quantiles (where k is the number of bits) of a normal distribution within a normalized range of [−1, 1].
This ensures that each quantization bin contains an equal number of data points, optimizing the representation of normally distributed values such as large language model weights.
Unlike standard quantization methods that divide the value range into equally spaced bins, NF4 sizes its bins according to data density, so that each bin holds roughly the same number of values (equally sized rather than equally spaced). This preserves precision and dynamic range even within a limited 4-bit format.
Difference between equally-spaced and equally-sized buckets or bins | Source
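Below is a simplified sketch of the idea behind quantile-based 4-bit levels, not the exact NF4 construction from the QLoRA paper: it derives 2^k + 1 quantile boundaries from a standard normal distribution (assuming SciPy is available), normalizes the resulting levels into [−1, 1], and maps each weight to its nearest level.
import numpy as np
from scipy.stats import norm

k = 4  # number of bits
# Estimate 2^k + 1 quantile boundaries of N(0, 1), clipping the probability
# range slightly to avoid the infinite quantiles at 0 and 1.
probs = np.linspace(0.005, 0.995, 2**k + 1)
boundaries = norm.ppf(probs)

# Midpoints between boundaries give 2^k representative levels,
# normalized into [-1, 1] so weights can be rescaled onto them.
levels = (boundaries[:-1] + boundaries[1:]) / 2
levels /= np.abs(levels).max()

def quantize(weights: np.ndarray) -> np.ndarray:
    """Map each weight to the index of its nearest level (a 4-bit code)."""
    scaled = weights / np.abs(weights).max()  # per-tensor absmax scaling
    return np.abs(scaled[:, None] - levels[None, :]).argmin(axis=1)

codes = quantize(np.random.randn(8).astype(np.float32))
print(levels.round(3))  # levels are denser near 0, where most weights lie
print(codes)            # one 4-bit integer code per weight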
2. Double Quantization
Double quantization in QLoRA further reduces memory usage by applying quantization to the constants from the initial quantization process. This two-step process starts with quantizing the model weights to a 4-bit precision using NormalFloat (NF4).
Next, the quantization constants (the per-block scaling factors) from the first step are themselves quantized to a lower 8-bit precision.
This is important because QLoRA uses Block-wise quantization. Instead of quantizing all weights together, this method divides them into smaller blocks or chunks of weights, which are then quantized independently.
While this block-wise approach improves accuracy, it also generates multiple quantization constants, which can consume more memory.
By applying a second level of quantization to these constants, QLoRA effectively reduces their storage requirements and optimizes memory utilization during training or fine-tuning.
Comparison of standard vs block-wise quantization | Source
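Here is a toy sketch that puts block-wise quantization and double quantization together; the block size, bit widths, and absmax scheme are simplifications, not the exact constants QLoRA uses.
import numpy as np

def blockwise_quantize(weights: np.ndarray, block_size: int = 64):
    """Toy block-wise quantization: each block keeps its own FP32 scale constant."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1)  # one FP32 constant per block
    codes = np.round(blocks / scales[:, None] * 7).astype(np.int8)  # 4-bit range [-7, 7]
    return codes, scales

def quantize_constants(scales: np.ndarray):
    """Second (double) quantization: compress the per-block scales to 8 bits."""
    outer_scale = scales.max()
    scale_codes = np.round(scales / outer_scale * 255).astype(np.uint8)
    return scale_codes, outer_scale

weights = np.random.randn(4096).astype(np.float32)
codes, scales = blockwise_quantize(weights)
scale_codes, outer_scale = quantize_constants(scales)

# The per-block constants shrink from 32 bits to 8 bits each
print(scales.nbytes, "bytes of constants before double quantization")
print(scale_codes.nbytes, "bytes of constants after double quantization")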
3. Paged Optimizers
Paged optimizers use NVIDIA's unified memory feature to prevent out-of-memory errors during training.
When the GPU reaches its memory limit, it transfers memory pages to the CPU, similar to how memory is managed between CPU RAM and machine storage.
This memory paging feature specifically transfers optimizer states between the CPU and GPU as needed. It's important because sudden memory increases during training can stop the process.
Illustration of data transfers between the CPU and GPU | Source
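If you want to use a paged optimizer directly, bitsandbytes exposes paged variants of AdamW. The snippet below is a minimal sketch with a placeholder model; in the guide that follows, we enable the same behavior through the Trainer's optim="paged_adamw_8bit" argument instead.
import torch.nn as nn
import bitsandbytes as bnb

model = nn.Linear(10, 10)  # placeholder model purely for illustration

# Paged 8-bit AdamW keeps optimizer states in unified (pageable) memory,
# so they can spill to CPU RAM when the GPU hits a memory spike during training.
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=2e-4)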
Now, let's apply these techniques using the Hugging Face and PEFT libraries and explore how they streamline the QLoRA fine-tuning process step by step.
QLoRA Fine-tuning with Hugging Face: A Step-by-Step Guide
Hugging Face provides an excellent ecosystem for working with and fine-tuning pre-trained models. For QLoRA fine-tuning, we'll need the bitsandbytes library for 4-bit quantization, which optimizes memory usage.
Additionally, we'll use the peft library for LoRA fine-tuning to make the adaptation process efficient for pre-trained models with minimal computation.
The code in this guide has been re-implemented from the notebook provided by HuggingFace.
1. Set Up the Environment
Start by installing the required libraries in your environment.
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets
2. Load and Quantize the Model
Next, load the pre-trained language model and tokenizer; we use GPT-NeoX-20B here. Note that the model itself is around 40 GB in half precision. We'll then configure bitsandbytes to load the model in 4-bit precision using NF4 and double quantization.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
# Specify the model ID for GPT-NeoX-20B
model_id = "EleutherAI/gpt-neox-20b"
# Configure BitsAndBytes for 4-bit quantization
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, # Load the model in 4-bit precision
bnb_4bit_use_double_quant=True, # Enable double quantization
bnb_4bit_quant_type="nf4", # Use 4-bit NormalFloat (NF4)
bnb_4bit_compute_dtype=torch.bfloat16 # Set compute dtype to bfloat16
)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the model with the specified quantization configuration
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map={"":0} # Load the model on the first available GPU
)
3. Prepare the Model for Training
Set up the model for low-bit training using the prepare_model_for_kbit_training method from the PEFT library. Enable gradient checkpointing to optimize memory usage.
from peft import prepare_model_for_kbit_training
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
To verify the number of trainable parameters, define and call the following function:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0  # Counter for trainable parameters
    all_param = 0  # Counter for all parameters
    for _, param in model.named_parameters():
        all_param += param.numel()  # Add the parameter's element count to the running total
        if param.requires_grad:
            trainable_params += param.numel()  # Count only parameters that will be updated
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
4. Configure LoRA (Low-Rank Adaptation)
Now, let's apply LoRA to the model using the peft library. We'll configure LoRA with specific parameters like rank (r), alpha (lora_alpha), target modules, and dropout.
from peft import LoraConfig, get_peft_model
config = LoraConfig(
r=8,
lora_alpha=32,
target_modules=["query_key_value"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)
print_trainable_parameters(model)
Output
trainable params: 8650752 || all params: 10597552128 || trainable%: 0.08162971878329976
5. Load and Prepare the Dataset
Load the dataset for fine-tuning. Here, we'll use a dataset of English quotes. Tokenize the data and handle padding.
from datasets import load_dataset
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
# needed for gpt-neo-x tokenizer
tokenizer.pad_token = tokenizer.eos_token
6. Train the Model
Use the Hugging Face Trainer API for training. Configure the training arguments, such as batch size, learning rate, optimizer, number of training steps, and data collator.
import transformers
trainer = transformers.Trainer(
model=model, # The model to be fine-tuned
train_dataset=data["train"], # Training dataset
args=transformers.TrainingArguments(
per_device_train_batch_size=1, # Batch size per GPU
gradient_accumulation_steps=4, # Accumulate gradients over multiple steps
warmup_steps=2, # Number of warmup steps for the learning rate scheduler
max_steps=10, # Total number of training steps
learning_rate=2e-4, # Learning rate
fp16=True, # Use mixed precision training (fp16)
logging_steps=1, # Log training information every step
output_dir="outputs", # Directory to save model checkpoints
optim="paged_adamw_8bit", # Use the paged AdamW optimizer for memory efficiency
report_to="none" # Disable reporting to online services like WandB
),
data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False), # Data collator for language modeling
)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
trainer.train()
Output
For the sake of the demo, we ran it for only a few steps to showcase how to use this integration with existing tools in the Hugging Face ecosystem.
Steps and training loss from the short demo run
7. Save and Test the Fine-tuned Model
After training, save the fine-tuned model and use it for inference.
model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model
model_to_save.save_pretrained("outputs")
# Load and test
lora_config = LoraConfig.from_pretrained('outputs')
model = get_peft_model(model, lora_config)
text = "Elon Musk "
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Output
Elon Musk
Elon Musk is the founder of Tesla, SpaceX, and the Boring Company.
This guide demonstrated fine-tuning a large language model using QLoRA with Hugging Face Transformers. By following these steps, you can effectively adapt large language models (LLMs) to specific tasks on memory-constrained hardware.
Advantages of QLoRA Over Standard LoRA
QLoRA has some important benefits compared to standard LoRA, mainly because of its use of quantization. Here are some of them:
- Improved Memory Efficiency: QLoRA reduces memory usage compared to standard LoRA. Quantizing model weights to lower precision minimizes the model's size, enabling fine-tuning on consumer-grade GPUs with limited memory.
- Faster Training: Reducing memory size through quantization leads to faster training times. This helps in quicker experimentation and iteration, accelerating the development process.
- Comparable Performance: QLoRA achieves accuracy comparable to both full fine-tuning and standard LoRA.
- Block-wise Quantization for Fine-Grained Control: QLoRA's use of block-wise quantization ensures that each segment of model weights is optimized independently, allowing flexibility and additional compression without affecting the model's performance.
Conclusion
The QLoRA method for fine-tuning large language models helps address the challenges of memory and computational cost. It combines two techniques, quantization and Low-Rank Adaptation, to improve efficiency while maintaining good performance.
To see its practical application, we demonstrated the implementation of QLoRA with Hugging Face by fine-tuning a GPT-NeoX-20B model on an English quotes dataset.