Unlocking Affordable Brilliance: Fine-Tuning LLMs with LoRA for Maximum Cost-Effectiveness

by Aharsh MS · May 16, 2023

Welcome to our guide on Fine-Tuning LLMs with LoRA. Integrating AI and machine learning into various sectors has become a norm as we navigate the accelerating currents of technology. Amid this digital revolution, Large Language Models have emerged as a game-changing tool, opening up a world of untold possibilities.

However, these models are often critiqued for their substantial computational needs, leading to significant cost implications. Enter LoRA – an exciting new method poised to revolutionize the AI industry with its promise of fine-tuning LLMs at a fraction of the cost while retaining or enhancing their performance capabilities.

In this blog, we will dive deep into how combining LLMs and LoRA democratizes access to advanced AI and maximizes cost-effectiveness. We’ll explore what LoRA is, why it’s a game-changer for fine-tuning LLMs, and how it is making AI technology more accessible and affordable than ever before. Let’s get ready to unlock the brilliance without breaking the bank!

Getting to Know LoRA
Behind the Scenes with LoRA
The Advantages of Using LoRA
A Mini Wiki-How on Fine-Tuning LLMs with LoRA
Example: Fine-tuning T5 Cost-effectively with LoRA
Hacks to get the best out of LoRA

Getting to Know LoRA

LoRA (Low-Rank Adaptation of Large Language Models) is a technique developed by Microsoft researchers to address the challenges of fine-tuning large language models. In their 2021 study, the researchers compared LoRA to other fine-tuning techniques on various tasks. They found that LoRA could match or outperform those techniques while training far fewer parameters and using less memory. They also found that LoRA could generalize to new tasks better than other techniques. Before jumping into the specifics of LoRA, let’s quickly brush up on the basics of fine-tuning and its relevance.

What is LLM fine-tuning?

Fine-tuning Large Language Models is an important step in harnessing their immense potential for various natural language processing tasks. Organizations can achieve higher performance, cost-effectiveness, and efficiency by customizing pretrained LLMs to specific domains, tasks, or contexts.

Pre-trained LLMs possess a wealth of knowledge acquired during their training. However, this knowledge is often generic and needs to be tailored to specific tasks or domains. Fine-tuning allows us to adapt these models to the nuances of a particular task, resulting in improved performance and better alignment with the desired objectives. By fine-tuning, we can equip LLMs to excel in various applications such as sentiment analysis, machine translation, text summarization, and more.

For example, pretrained models can be customized to comprehend and generate text specific to a particular domain’s terminology, jargon, or context. By training on task-specific datasets, LLMs can be fine-tuned to better understand and generate domain-specific language, enhancing accuracy and relevance in the desired domain. This transfer learning capability allows LLMs to generalize well to unseen data and tasks, paving the way for rapid deployment and scalability across a wide range of natural language processing applications.

Moreover, fine-tuning LLMs is cost-effective compared to training models from scratch. Pre-training large models is computationally expensive and time-consuming, but once pre-trained, they can serve as a starting point for various downstream tasks. By fine-tuning existing models, we can significantly reduce the computational requirements and training time while achieving competitive performance. This efficiency makes fine-tuning particularly appealing for organizations with limited resources or tight timelines.

Fine-tuning is not a one-time process; it provides an avenue for iterative refinement and continuous learning. By evaluating the model’s performance on validation or test datasets, we can identify areas for improvement and fine-tune the LLM accordingly. This iterative process helps to fine-tune hyperparameters, adjust architecture, and enhance the model’s understanding of task-specific nuances, resulting in progressively better performance over time.

How LoRA works

LoRA offers an efficient way to fine-tune pre-trained language models for specific tasks without modifying the entire model. Instead of adjusting all parameters, LoRA introduces a smaller set of parameters to represent the desired adjustments. This low-rank representation reduces memory and computation requirements.

In a nutshell, LoRA accelerates training while consuming less memory through the following mechanisms:

Preserving Pretrained Weights

LoRA takes a unique approach by freezing the previously pretrained weights of the model. This approach mitigates the risk of catastrophic forgetting, ensuring the model retains the valuable knowledge it acquired during pretraining. By keeping the pretrained weights intact, LoRA enables more effective adaptation without compromising the model’s existing capabilities.

Efficient Rank-Decomposition

The key feature of LoRA lies in adding rank-decomposition weight matrices, known as update matrices, to the existing weights. These update matrices have significantly fewer parameters than the original model, making them highly memory-efficient. By training only these newly added weights, LoRA achieves an accelerated training process with reduced memory requirements.
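To make the savings concrete, here is a small back-of-the-envelope sketch. The 768×768 shape and the rank of 8 are illustrative assumptions (768 is the hidden size of T5-base), not values taken from the text above.

```python
def full_update_params(d: int, k: int) -> int:
    """Parameters in a dense update matrix of shape d x k."""
    return d * k


def lora_update_params(d: int, k: int, r: int) -> int:
    """Parameters in the two update matrices B (d x r) and A (r x k)."""
    return d * r + r * k


d, k, r = 768, 768, 8  # illustrative shape and rank (assumptions)
full = full_update_params(d, k)
lora = lora_update_params(d, k, r)
print(f"dense update: {full:,} parameters")
print(f"LoRA update:  {lora:,} parameters ({full / lora:.0f}x fewer)")
```

The ratio grows as the matrices get larger or the rank gets smaller, which is why the savings are most visible on very large models.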

Integration with Attention Layers

LoRA matrices are typically added to the attention layers of the original model, where they adjust the projections that capture complex language patterns. Library support makes this straightforward. The Diffusers library (used for diffusion models), for example, provides the load_attn_procs() method, which loads LoRA weights into a model’s attention layers. A scale parameter then controls how strongly the LoRA update is applied, allowing flexible adaptation to new training data.

Memory-Optimized Training

One of the remarkable advantages of LoRA is its memory efficiency. Because only the low-rank matrices are trained, the memory needed for gradients and optimizer state is significantly reduced. This allows running fine-tuning tasks on a single modest GPU, such as the Tesla T4, RTX 3080, or even the RTX 2080 Ti. With readily accessible GPUs like the T4 available on platforms like Kaggle or Google Colab, the benefits of LoRA become accessible to a broader range of researchers and developers.
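A rough estimate shows where the memory goes. The sketch below assumes fp32 storage and the Adam optimizer, and it ignores activations, so treat the numbers as an order-of-magnitude guide only. The 220 million figure is the approximate size of T5-base, and the LoRA parameter count is a hypothetical value chosen for illustration.

```python
def training_memory_gb(total_params: int, trainable_params: int,
                       bytes_per_value: int = 4) -> float:
    """Rough memory for weights, gradients, and Adam state, in GB.

    Assumes every value is stored in fp32 (4 bytes) and ignores activations.
    """
    weights = total_params * bytes_per_value
    gradients = trainable_params * bytes_per_value
    adam_state = 2 * trainable_params * bytes_per_value  # first and second moments
    return (weights + gradients + adam_state) / 1e9


total = 220_000_000       # approximate size of T5-base
lora_trainable = 900_000  # hypothetical LoRA parameter count (assumption)

full_ft = training_memory_gb(total, trainable_params=total)
lora_ft = training_memory_gb(total, trainable_params=lora_trainable)
print(f"full fine-tuning: ~{full_ft:.2f} GB")
print(f"LoRA fine-tuning: ~{lora_ft:.2f} GB")
```

Under these assumptions, the frozen weights still have to be held in memory, but the gradient and optimizer-state terms nearly disappear.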

Neural networks often include dense layers, which perform matrix multiplication using weight matrices that typically have full rank. Prior work has shown that pre-trained language models have a low “intrinsic dimension,” allowing them to learn effectively even when randomly projected onto a smaller subspace. Building on this idea, LoRA hypothesizes that weight updates during adaptation also exhibit a low “intrinsic rank.” The update to a pre-trained weight matrix W0 ∈ R^(d×k) is therefore constrained to a low-rank decomposition, W0 + ∆W = W0 + BA, where B ∈ R^(d×r), A ∈ R^(r×k), and the rank r is much smaller than min(d, k).

During training, W0 is frozen and does not receive gradient updates, while A and B contain the trainable parameters. The same input is multiplied by both W0 and ∆W = BA, and the two output vectors are summed element-wise. For h = W0x, the modified forward pass becomes h = W0x + ∆Wx = W0x + BAx.

A is initialized with random Gaussian values, while B is initialized to zero, so that ∆W = BA is zero at the start of training. ∆Wx is then scaled by α/r, where α is a constant in r. When optimizing with Adam, tuning α is roughly the same as tuning the learning rate, provided the initialization is scaled appropriately. The authors therefore simply set α to the first value of r they try and do not tune it further. This scaling reduces the need to retune hyperparameters when varying r.
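The forward pass and initialization can be sketched in a few lines of NumPy. This is a minimal illustration rather than a training implementation: the sizes are tiny and arbitrary, and the update is scaled by α/r as in the LoRA paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 6, 4, 2, 2.0  # tiny illustrative sizes (assumptions)

W0 = rng.normal(size=(d, k))             # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, k))  # Gaussian initialization
B = np.zeros((d, r))                     # zero initialization


def lora_forward(x, W0, A, B, alpha, r):
    """Compute h = W0 x + (alpha / r) * B A x."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))


x = rng.normal(size=k)
h_start = lora_forward(x, W0, A, B, alpha, r)
# Because B is zero, the update vanishes and the output equals the pretrained one
print(np.allclose(h_start, W0 @ x))

# Once training has moved B away from zero, the output changes
B_trained = rng.normal(size=(d, r))
h_trained = lora_forward(x, W0, A, B_trained, alpha, r)
print(np.allclose(h_trained, W0 @ x))
```

Starting from an exact copy of the pretrained behavior means the first training steps refine the model instead of disturbing it.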

In short, LoRA uses a low-rank parameterization for the updates to weight matrices in neural networks. It hypothesizes that these weight updates have a low “intrinsic rank” and represents them with a low-rank decomposition.

However, LoRA has a limitation when batching inputs from different tasks. If A and B are merged into W0 to eliminate extra inference latency, a single forward pass cannot easily serve inputs that need different A and B matrices. When latency is less critical, this can be addressed by leaving the weights unmerged and dynamically selecting the appropriate LoRA modules for each sample in the batch.
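The two deployment options can be contrasted in a short NumPy sketch. The adapter names task_a and task_b, the random matrices, and the sizes are all hypothetical and stand in for trained LoRA modules.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 6, 4, 2, 2.0
scale = alpha / r

W0 = rng.normal(size=(d, k))  # shared frozen weight
# Two hypothetical task adapters, each stored as a (B, A) pair
adapters = {
    "task_a": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
    "task_b": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
}

# Option 1: merge one adapter into the weight. Inference is a single matmul,
# but the merged matrix serves only that one task.
B, A = adapters["task_a"]
W_merged = W0 + scale * (B @ A)


# Option 2: keep the adapters separate and pick one per sample in a mixed batch.
def forward_unmerged(x, task):
    B, A = adapters[task]
    return W0 @ x + scale * (B @ (A @ x))


batch = [(rng.normal(size=k), "task_a"), (rng.normal(size=k), "task_b")]
outputs = [forward_unmerged(x, task) for x, task in batch]

x0 = batch[0][0]
print(np.allclose(W_merged @ x0, outputs[0]))  # merged and unmerged agree for task_a
```

The unmerged path costs two extra small matrix products per sample, which is the latency trade-off described above.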

Even so, LoRA provides a parameter-efficient approach to fine-tuning pre-trained language models, enabling reduced memory usage, faster training, and cost-effective task switching during deployment.

The Advantages of Using LoRA

In the paper introducing LoRA, published in 2021, the authors evaluated the technique on various tasks, including text classification, question answering, and summarization. They reported that LoRA matched or outperformed full fine-tuning while requiring a fraction of the trainable parameters and memory.

LoRA introduces a more efficient method for fine-tuning large language models. It significantly reduces the number of trainable parameters and GPU memory requirements by freezing the pretrained model weights and injecting trainable rank decomposition matrices into each layer of the Transformer architecture. Despite having fewer trainable parameters, LoRA performs on par with or better than traditional fine-tuning on models like RoBERTa, DeBERTa, GPT-2, and GPT-3.

The 2021 paper, LoRA: Low-Rank Adaptation of Large Language Models, by Edward Hu et al., includes evaluation results that compare the performance of various adaptation methods on GPT-3 175B. The models were tested on three tasks: logical form validation accuracy on WikiSQL, MultiNLI-matched validation accuracy, and Rouge-1/2/L on SAMSum. LoRA outperformed all other methods, including full fine-tuning. For the three metrics, WikiSQL had a fluctuation of 0.5%, MNLI-m had a fluctuation of 0.1%, and SAMSum had a fluctuation of 0.2/0.2/0.1.

In other words, LoRA is a more efficient and effective method of adapting GPT-3 175B to new tasks. It can achieve results comparable to or better than full fine-tuning while requiring less time and resources.

There are several benefits to using LoRA, including:

Speed: LoRA is much faster than other fine-tuning techniques. This is because LoRA only updates a small number of parameters rather than all of the parameters in the LLM.

Efficiency: LoRA is more efficient than other fine-tuning techniques regarding memory and storage requirements. This is because LoRA only needs to store the low-rank matrix rather than all of the parameters in the LLM.

Generalization: LoRA is better at generalizing to new tasks than other fine-tuning techniques. This is because LoRA only updates the parameters relevant to the target task.

How to Fine-Tune LLMs with LoRA

Now that we have a basic understanding of LoRA, let’s look at the practical side of using it to fine-tune an LLM. With LoRA, training the update matrices is the fine-tuning step: the base model stays frozen, and only the small LoRA matrices learn from your data. In outline, you prepare a dataset, select and configure a LoRA implementation, train the LoRA matrices on the prepared data, and then use the base model together with the trained matrices for your task. The following steps give a general overview:

Prepare your data: You will need a text or code dataset relevant to your task to fine-tune the LLM. The dataset should be large enough to provide the LLM with enough data to learn from.

Choose a LoRA implementation: There are several LoRA implementations available. You can choose one that is based on your programming language of choice.

Configure the LoRA implementation: The LoRA approximation has certain factors that significantly impact its performance. One such factor is the rank of the approximation, which determines how expressive the update can be. Higher ranks allow more precise approximations but require more memory and computation time. Another crucial factor is the learning rate, which governs the speed at which the LoRA approximation is updated. A higher learning rate leads to faster updates, while a lower rate results in slower ones. Finally, the number of epochs determines how many passes over the dataset are used to train the LoRA approximation. By adjusting these factors, the overall efficiency and effectiveness of the LoRA approximation can be optimized.

Train the LoRA approximation: After configuring the LoRA implementation, you can train the LoRA approximation. This involves feeding the prepared data through the LLM with its pretrained weights frozen, so that only the LoRA matrices are updated.

Use the fine-tuned LLM: Once the LoRA approximation has been trained, the fine-tuning is complete. You run the LLM with the trained LoRA matrices added to its frozen weights, or merge them into those weights for deployment.

These are a few general steps to help you understand how to approach fine-tuning with LoRA. Now, I’ll give you a more specific example.

Example: Fine-tuning T5 with LoRA

As a demonstration, here is an example of how to fine-tune the T5 language model cost-effectively using LoRA. The code below fine-tunes T5 on a text dataset: the pretrained T5 weights are frozen, and only the small LoRA matrices added to the model are trained. The steps are as follows:

Environment Setup: Ensure you have Python installed along with the necessary libraries. The main ones are PyTorch, Transformers, and Hugging Face Datasets. The T5 tokenizer additionally needs SentencePiece. You can install them via pip:

pip install torch
pip install transformers
pip install datasets
pip install sentencepiece

Load the Pretrained T5 Model and Tokenizer: Specify which version of the T5 model to load. In this case, it’s the “base” version, one of the smaller and faster versions of the model. The tokenizer converts text data into a format the T5 model can understand.

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')

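Prepare your Dataset: You need to prepare your dataset for the task at hand. For simplicity, we’ll use the Hugging Face Datasets library to load the “squad” dataset, but you can replace this with your own dataset. SQuAD has question, context, and answers columns, so the question and context form the input and the first answer forms the target:

from datasets import load_dataset

dataset = load_dataset('squad')

# Tokenize the dataset: question + context as input, first answer as target
def tokenize(batch):
    inputs = [f"question: {q} context: {c}"
              for q, c in zip(batch['question'], batch['context'])]
    targets = [a['text'][0] for a in batch['answers']]
    enc = tokenizer(inputs, max_length=384, padding='max_length', truncation=True)
    labels = tokenizer(text_target=targets, max_length=32,
                       padding='max_length', truncation=True)['input_ids']
    # Replace padding ids with -100 so they are ignored by the loss
    enc['labels'] = [[t if t != tokenizer.pad_token_id else -100 for t in seq]
                     for seq in labels]
    return enc

dataset = dataset.map(tokenize, batched=True)
dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])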

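Define the Low-Rank Matrices: LoRA uses two low-rank matrices per adapted weight, A and B, whose product forms the weight update. Their dimensions depend on the layer being adapted. Following the LoRA method, A is initialized with small Gaussian values and B with zeros:

import torch

rank = 8  # This is a hyperparameter that you'll need to choose

def make_lora_matrices(d_out, d_in):
    A = torch.nn.Parameter(torch.randn(rank, d_in) * 0.01)  # r x k, Gaussian init
    B = torch.nn.Parameter(torch.zeros(d_out, rank))        # d x r, zero init
    return A, B

Modify the Forward Pass of the Model: You need to modify the forward pass of the T5 model to include the low-rank matrices. Rather than replacing model.forward, one workable approach is to wrap the query and value projections in T5’s self-attention layers so that each adds the low-rank update to its frozen output. The alpha value of 16 is an illustrative choice:

class LoRALinear(torch.nn.Module):  # h = W0 x + (alpha / r) * B A x
    def __init__(self, base, alpha=16):
        super().__init__()
        self.base, self.scale = base, alpha / rank
        self.A, self.B = make_lora_matrices(base.out_features, base.in_features)

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Freeze the pretrained weights, then wrap the query and value projections
for p in model.parameters():
    p.requires_grad = False
for block in list(model.encoder.block) + list(model.decoder.block):
    attn = block.layer[0].SelfAttention
    attn.q, attn.v = LoRALinear(attn.q), LoRALinear(attn.v)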

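Fine-Tuning: Finally, you can fine-tune the model on your dataset. You’ll need to define a suitable optimizer. The loss is computed by the model itself when labels are passed in:

from torch.optim import Adam
from torch.utils.data import DataLoader

num_epochs = 1  # adjust to your dataset and budget
train_loader = DataLoader(dataset['train'], batch_size=8, shuffle=True)

# Optimize only parameters with requires_grad=True
# (with LoRA, these are just the update matrices)
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = Adam(trainable, lr=1e-3)  # learning rate is a starting point to tune

model.train()
for epoch in range(num_epochs):
    for batch in train_loader:
        optimizer.zero_grad()

        outputs = model(input_ids=batch['input_ids'],
                        attention_mask=batch['attention_mask'],
                        labels=batch['labels'])
        loss = outputs.loss
        loss.backward()
        optimizer.step()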

This approach is cost-effective because it lets you adapt T5 to a large dataset without training the entire model, let alone training one from scratch. The LoRA update matrices are small and can be trained quickly, enabling you to fine-tune T5 without wasting time or money.

Hacks to Get the Best out of LoRA

Though LoRA is a powerful technique with considerable potential, a few practices help ensure it performs at its best: use a large enough training set and a good optimization algorithm, evaluate the model on a holdout set, regularize the model, and use a small rank. By following these good practices, you can get the best possible performance from LoRA.

Use a large enough training set. The training set’s size will affect the LoRA model’s performance. A larger training set will generally lead to a better-performing model.
Use a good optimization algorithm. The choice of an optimization algorithm can affect the performance of the LoRA model. A good optimization algorithm will help the model converge to a good solution.
Evaluate the model on a holdout set. Evaluating the model on a holdout set that was not used for training is important. This will give us a more accurate estimate of the model’s performance.
Regularize the model. Regularization can prevent the model from overfitting the training data. There are a variety of regularization techniques that can be used, such as L2 regularization.
Use a small rank. The rank of the LoRA model can affect its performance. A smaller rank will generally lead to a faster training time and a smaller model size. However, a smaller rank may also lead to lower performance. It is important to experiment with different ranks to find the best tradeoff between performance and speed.
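The rank trade-off in the last tip can be illustrated with a synthetic experiment. The sketch below uses truncated SVD on a made-up update matrix as a stand-in for real training, so it only shows how approximation error and parameter count move with the rank. In an actual project you would sweep the rank and compare holdout metrics instead.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, true_rank = 64, 48, 4

# Synthetic "weight update": a rank-4 signal plus a little noise (an assumption
# for illustration; real updates are not guaranteed to look like this)
delta_W = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, k))
delta_W += 0.01 * rng.normal(size=(d, k))

U, S, Vt = np.linalg.svd(delta_W, full_matrices=False)

results = []
for r in (1, 2, 4, 8, 16):
    approx = (U[:, :r] * S[:r]) @ Vt[:r]  # best rank-r approximation
    rel_error = np.linalg.norm(delta_W - approx) / np.linalg.norm(delta_W)
    n_params = r * (d + k)                # size of B (d x r) plus A (r x k)
    results.append((r, n_params, rel_error))
    print(f"r={r:2d}  params={n_params:5d}  relative error={rel_error:.4f}")
```

In this toy setting the error drops sharply once the rank reaches the true rank of the signal, and larger ranks mostly add parameters.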

So, to Sum up:

Though LoRA is still a young technique, it can make LLMs more accessible to a wider range of users. It overcomes the challenges associated with traditional fine-tuning techniques by reducing the number of trainable parameters and GPU memory requirements. Fine-tuning LLMs with LoRA enables quick task switching while maintaining high model quality, and it introduces no extra inference latency once the update matrices are merged into the base weights. LoRA is an exciting technique for fine-tuning LLMs that has the potential to reduce training and deployment costs significantly.

Written by

Aharsh MS

Aharsh is a tech entrepreneur, a visionary. He believes that each of us is here to do something purposeful, something which counts to leverage our species beyond any limits. And greatly inspired by thoughts and ideas supporting such purposes, to building a future with profound possibilities, a future where technology negates all human miseries!
