-- DRAFT --

LoRA and the Efficiency Revolution

Fine-tuning a 7-billion-parameter language model used to require a cluster of A100 GPUs, hundreds of gigabytes of memory, and a budget to match. Today, you can do it on a single consumer GPU by training a set of added parameters less than 1% the size of the model. The trick is a simple linear algebra insight that changed everything.

The story of LoRA (Low-Rank Adaptation) is a story about waste. When you fully fine-tune a large language model, you update every single parameter. For a 7B model, that means adjusting 7 billion floating-point numbers, even though the downstream task might only require a tiny adjustment to the model's behavior. It's like renovating an entire house because you wanted to repaint one room.

Craig Trim

SLP3 §10.4 makes explicit what "full fine-tuning" means at the architectural level. When fine-tuning BERT for classification, you add a single classifier head on top and then update the entire pretrained network end-to-end via backpropagation. Jurafsky and Martin note that the embedding layers, all the Transformer blocks, and the new head all receive gradient updates. For a 7B model, that means 7 billion parameters being nudged by every single training example. LoRA's insight is that most of those nudges are redundant.

In 2021, a team at Microsoft Research proposed something different: what if most of those weight updates are redundant? What if the meaningful changes during fine-tuning live in a much smaller space than the full parameter count suggests?

They were right. And the consequences have been enormous.

The Full Fine-Tuning Problem

To understand why LoRA matters, you need to understand what full fine-tuning actually costs. The numbers are worse than most people realize.

A 7B parameter model (like LLaMA-2 7B) stores its weights as floating-point numbers. In full precision (fp32), each parameter occupies 4 bytes. That's 28 GB just for the model weights. But training requires far more than just storing the weights.

Craig Trim

Widdows and Cohen provide useful context on floating-point precision in Section 5.3.5. They trace how the industry moved from 64-bit doubles as best practice (circa 2007) to FP16 and even FP8 formats, noting that neural network weights are clustered around zero with a few significant outliers, making reduced-precision formats surprisingly effective. This directly frames why the fp32 costs described here are often unnecessarily expensive. Widdows & Cohen, Issue #45

The Memory Arithmetic

During training, you need to hold three additional objects in memory for every parameter:

Gradients: One gradient value per parameter, computed during the backward pass. Another 28 GB in fp32.
Optimizer first moment (mean): The Adam optimizer maintains a running average of gradients. Another 28 GB.
Optimizer second moment (variance): Adam also tracks the running average of squared gradients. Another 28 GB.

Add them up: 28 GB (weights) + 28 GB (gradients) + 28 GB (first moment) + 28 GB (second moment) = 112 GB. And that's before you account for activations (the intermediate values stored during the forward pass for use in backpropagation), which can easily add another 50-100 GB depending on batch size and sequence length.
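The arithmetic above is easy to reproduce. The following is a minimal sketch, assuming 1 GB = 10^9 bytes and the same precision for all four copies; the function name `training_state_gb` is made up for illustration, and activations are deliberately left out because they depend on batch size and sequence length.

```python
def training_state_gb(n_params: float, bytes_per_value: int = 4) -> dict:
    """Memory in GB (1 GB = 1e9 bytes) for weights, gradients, and Adam moments."""
    per_copy = n_params * bytes_per_value / 1e9
    state = {
        "weights": per_copy,
        "gradients": per_copy,
        "adam_first_moment": per_copy,
        "adam_second_moment": per_copy,
    }
    # The right-hand side is evaluated before "total" is added to the dict,
    # so this sums exactly the four copies above.
    state["total"] = sum(state.values())
    return state


state_7b = training_state_gb(7e9)
for name, gb in state_7b.items():
    print(f"{name:>20}: {gb:6.1f} GB")
```

Calling `training_state_gb(70e9)` gives the 70B column of the table below, again before activations.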

Craig Trim

SLP3 §6.6.3 explains why activations consume so much memory. During backpropagation, the chain rule requires multiplying each layer's local gradient by the activations from the forward pass. Every intermediate result must be stored until the backward pass reaches that layer. Jurafsky and Martin illustrate this with computation graphs where each node's gradient depends on values computed in the forward direction. For a 32-layer Transformer with batch size 4 and sequence length 2048, that is 32 layers times the full hidden-state tensor at each step, which is where the 50-100 GB activation memory comes from.

Component	7B Model (fp32)	13B Model (fp32)	70B Model (fp32)
Model weights	28 GB	52 GB	280 GB
Gradients	28 GB	52 GB	280 GB
Optimizer (Adam)	56 GB	104 GB	560 GB
Activations (est.)	50-100 GB	100-200 GB	500+ GB
Total	~160-210 GB	~310-410 GB	~1.6+ TB

Memory requirements for full fine-tuning with Adam optimizer in fp32.
An A100 GPU has 80 GB of VRAM.

A single A100 GPU (the workhorse of modern AI) has 80 GB of VRAM. Full fine-tuning of a 7B model in fp32 doesn't fit on one card. A 70B model requires a cluster of 20+ GPUs just to hold the training state in memory.

Mixed-precision training (using fp16 or bf16 for some operations) reduces these numbers, but the fundamental problem remains. Full fine-tuning of large models is expensive, slow, and inaccessible to most researchers and practitioners.

There had to be a better way.

The LoRA Insight

Craig Trim

The textbook frames fine-tuning as 'continuing the training process on domain-specific documents,' warning that it risks catastrophic forgetting where new training overwrites earlier knowledge. LoRA sidesteps this by freezing the original weights entirely. See GH #3, Ch. 5.

In June 2021, Edward Hu and colleagues at Microsoft Research published "LoRA: Low-Rank Adaptation of Large Language Models." The core insight was elegant: the weight updates that occur during fine-tuning have low intrinsic rank.

What does that mean?

Consider a weight matrix W in a neural network with dimensions d x d. During fine-tuning, this matrix changes by some amount ΔW. Full fine-tuning computes and stores this entire d x d update. But Hu et al. hypothesized (and demonstrated) that ΔW can be well-approximated by a low-rank decomposition.

Instead of computing a full d x d update matrix, decompose it as:

ΔW = B × A

where:
  B is d × r
  A is r × d
  r << d

If d = 4096 (a typical hidden dimension for a 7B model) and r = 8, the full update matrix has 4096 x 4096 = 16,777,216 parameters (about 16.8 million). The low-rank decomposition has (4096 x 8) + (8 x 4096) = 65,536 parameters. That's a 256x reduction.
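The same count can be checked in a few lines. This is a small sketch of the arithmetic only; the helper names are hypothetical and no model is involved.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Parameters in a dense update matrix of shape d_out x d_in."""
    return d_out * d_in


def low_rank_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in the factors B (d_out x r) and A (r x d_in)."""
    return d_out * r + r * d_in


d, r = 4096, 8
full = full_update_params(d, d)
low_rank = low_rank_params(d, d, r)
reduction = full // low_rank

print(f"full update: {full:,}")
print(f"low-rank:    {low_rank:,}")
print(f"reduction:   {reduction}x")
```

Because the low-rank count grows linearly in r while the full count is fixed, doubling the rank halves the reduction factor.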

The original weight matrix W stays frozen. During the forward pass, the model computes:

output = Wx + BAx

The first term is the frozen pretrained computation. The second term is the learned adaptation. Only B and A receive gradients. Only B and A get optimizer states.
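To make the two-term forward pass concrete, here is a minimal pure-Python sketch on toy dimensions (d = 6, r = 2). The helper names and random values are illustrative assumptions; a real implementation would use a tensor library. Note that the adapter path computes B(Ax), so the d x d product BA is never materialized.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x):
    """Compute Wx + B(Ax) without forming the d x d matrix BA."""
    base = matvec(W, x)        # frozen pretrained path
    down = matvec(A, x)        # project to r dimensions
    update = matvec(B, down)   # project back to d dimensions
    return [b + u for b, u in zip(base, update)]


rng = random.Random(0)
d, r = 6, 2
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # frozen
A = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(r)]  # trainable, r x d
B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d)]  # trainable, d x r
x = [rng.gauss(0, 1) for _ in range(d)]

y = lora_forward(W, A, B, x)
print(len(y))  # output has the same dimension d as the frozen layer's output
```

In a training framework, only `A` and `B` would be registered as trainable, which is why gradients and optimizer states shrink along with them.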

This changes the memory equation dramatically.

Why Low Rank?

The low-rank hypothesis isn't arbitrary. There's both theoretical and empirical support for it.

Aghajanyan et al. (2020) showed that pretrained language models have a low "intrinsic dimensionality," meaning that the effective parameter space for fine-tuning is far smaller than the total parameter count. They demonstrated that restricting fine-tuning to a randomly projected subspace of dimension 200-800 (out of hundreds of millions of total parameters) still produced competitive task performance.

Craig Trim

Widdows and Cohen draw a direct lineage from LoRA back to Latent Semantic Analysis (LSA) in Section 2.4 and Section 5.3.4. They note that LSA used principal component analysis on term-document matrices to find the most significant axes for word embeddings, and that LoRA uses a low-rank approximation to find the most significant axes for adjusting the layers of a neural network in exactly the same spirit. They also observe that random projections in high dimensions achieved much of the performance of full PCA decompositions, a mathematical insight from the 1980s that foreshadows the random initialization used in LoRA. Widdows & Cohen, Issue #45

Intuitively, this makes sense. A pretrained model has already learned rich representations of language. Fine-tuning for a specific task (say, converting freeform text into structured JSON) doesn't require rewriting the model's understanding of grammar, semantics, or world knowledge. It requires a small adjustment, a nudge in the right direction.

Craig Trim

SLP3 §6.3 provides the foundation for understanding why this works. Feedforward networks in Transformers project inputs through hidden layers that are typically wider than necessary. The hidden dimension (e.g., 4096) creates a representation space with far more capacity than any single task uses. Jurafsky and Martin explain that these layers learn distributed representations where multiple features share the same neurons. If the useful information for a downstream task occupies only a small subspace of the full hidden dimension, then the weight updates needed to adapt the model also live in that small subspace. That is precisely the low-rank structure LoRA exploits.

Low-rank matrices are exactly the mathematical tool for expressing such nudges.

LoRA Mechanics in Detail

Which Layers Get Adapters

A Transformer model contains many types of weight matrices: attention projections (Q, K, V, and output), feed-forward layers (up-projection and down-projection), embedding layers, and layer normalization parameters.

In the original LoRA paper, the authors experimented with applying adapters to different subsets of these matrices. They found that adapting the attention projection matrices (particularly Q and V) produced the best results for a given parameter budget. This is now the default configuration in most LoRA implementations.

Craig Trim

Widdows and Cohen explain in Section 4.3 that Q, K, and V are projection matrices whose learned weights map embedding vectors into lower-dimensional subspaces (e.g., from 512 dimensions to 64 per attention head). This clarifies why LoRA targets these matrices specifically: they are the core trainable projections that determine how the model allocates attention, and adjusting them with low-rank updates is a natural fit for the projection-based mathematics they already embody. Widdows & Cohen, Issue #45

However, subsequent research has shown that including the K and output projections, and even the feed-forward layers, can improve performance, especially for more demanding tasks. The Hugging Face PEFT library allows you to specify exactly which modules receive adapters:

# Common configurations for target modules:

# Minimal (Q and V only, original paper default)
"target_modules": ["q_proj", "v_proj"]

# Attention-focused (all attention projections)
"target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"]

# Comprehensive (attention + feed-forward)
"target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"]

The module names vary by architecture. LLaMA uses q_proj, k_proj, etc. GPT-style models may use c_attn, c_proj. Always check the model's named modules before configuring LoRA.
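One way to do that check is to tally the leaf names of every linear-like module. The helper below is a hypothetical sketch: it accepts any iterable of `(name, module)` pairs, which is what PyTorch's `model.named_modules()` yields, and it matches on the class name containing "Linear" (an assumption that covers `Linear` and quantized variants such as `Linear4bit`, but may miss custom layers). The stand-in classes exist only so the example runs without a model loaded.

```python
from collections import Counter


def linear_leaf_names(named_modules):
    """Count the final name component of every module whose class looks linear."""
    counts = Counter()
    for name, module in named_modules:
        if "Linear" in type(module).__name__:
            counts[name.split(".")[-1]] += 1
    return counts


# Stand-in classes for illustration only; with a real model you would call
# linear_leaf_names(model.named_modules()) instead.
class Linear:
    pass


class LlamaRMSNorm:
    pass


fake_modules = [
    ("model.layers.0.self_attn.q_proj", Linear()),
    ("model.layers.0.self_attn.v_proj", Linear()),
    ("model.layers.0.input_layernorm", LlamaRMSNorm()),
    ("model.layers.1.self_attn.q_proj", Linear()),
    ("lm_head", Linear()),
]

candidates = linear_leaf_names(fake_modules)
print(sorted(candidates))
```

The output head (`lm_head` here) will usually show up in the tally as well; it is normally left out of `target_modules`.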

Rank Selection

The rank r is the single most important hyperparameter in LoRA. It controls the capacity of the adaptation: how much new information the adapter can encode.

Rank (r)	Trainable Params (7B model, Q/K/V/O)	% of Base Model	Typical Use Case
4	~4.2M	0.06%	Simple style transfer, formatting
8	~8.4M	0.12%	Domain adaptation, moderate tasks
16	~16.8M	0.24%	Instruction tuning, complex formatting
64	~67.1M	0.96%	Knowledge-intensive tasks, multilingual
256	~268M	3.83%	Approaching full fine-tuning territory

Trainable parameter counts for different LoRA ranks applied to the four attention
projections (Q, K, V, O) on a 7B model with 32 layers and hidden dimension 4096. Adapting Q and V only halves each count.
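These counts follow directly from the shapes. The sketch below assumes 32 layers and square 4096 x 4096 projections; the function name and defaults are illustrative. Under those assumptions, the figures in the table correspond to four adapted matrices per layer, and adapting only two (Q and V) gives half as many.

```python
def lora_trainable_params(r, d_model=4096, n_layers=32, matrices_per_layer=4):
    """Trainable LoRA parameters when every adapted matrix is d_model x d_model."""
    per_matrix = 2 * d_model * r  # B (d x r) plus A (r x d)
    return per_matrix * matrices_per_layer * n_layers


for r in (4, 8, 16, 64, 256):
    four = lora_trainable_params(r)
    two = lora_trainable_params(r, matrices_per_layer=2)
    print(f"r={r:>3}: {four / 1e6:7.1f}M with 4 matrices/layer, "
          f"{two / 1e6:7.1f}M with 2 matrices/layer")
```

Feed-forward projections are not square in most architectures, so this helper would need the real input and output dimensions to cover them.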

In practice, r = 8 or r = 16 covers the vast majority of use cases. The original paper demonstrated that even r = 4 performed surprisingly well on many benchmarks. Going above r = 64 shows rapidly diminishing returns for most tasks.

Craig Trim

Raschka implements LoRA from scratch in PyTorch, showing that matrix B is initialized to zeros so LoRA doesn't alter original weights initially. With rank=16 on GPT-2 124M, trainable parameters drop from 124M to 2.7M (a ~50x reduction) while achieving 98% accuracy on classification, matching full fine-tuning. See GH #4, App. E.

The Alpha Scaling Parameter

LoRA introduces a scaling factor α (alpha) that controls the magnitude of the adapter's contribution. The actual update applied is:

ΔW = (α / r) × BA

The ratio α/r acts as a learning rate multiplier for the adapter. A common convention is to set α = 2r (so α = 16 when r = 8), which gives a scaling factor of 2. Some practitioners set α = r for a factor of 1, or use larger values like α = 32 with r = 8 for a factor of 4.

The initialization matters too. Matrix B is initialized to zeros and matrix A is initialized with random Gaussian values. This means the adapter output starts at exactly zero, so the model begins training from the pretrained weights with no perturbation. The adaptation grows from nothing as training proceeds.
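A short sketch of both ideas, scaling and initialization, on toy dimensions. The standard deviation of 0.02 for A is an arbitrary assumption for illustration, and the helper names are hypothetical.

```python
import random


def init_lora(d_out, d_in, r, seed=0):
    """A gets small Gaussian values (std is an assumed 0.02); B starts at zero."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d_out)]
    return A, B


def lora_delta(A, B, x, alpha, r):
    """Adapter contribution (alpha / r) * B(Ax) for one input vector."""
    scale = alpha / r
    down = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    return [scale * sum(b * h for b, h in zip(row, down)) for row in B]


A, B = init_lora(d_out=4, d_in=4, r=2)
delta = lora_delta(A, B, x=[1.0, -2.0, 0.5, 3.0], alpha=4, r=2)
print(all(v == 0 for v in delta))  # B is all zeros, so the adapter adds nothing yet

# Scaling factors for the conventions mentioned above
for alpha, rank in [(16, 8), (8, 8), (32, 8)]:
    print(f"alpha={alpha}, r={rank}: scale={alpha / rank}")
```

Because the scale depends on α/r rather than α alone, changing the rank while holding α fixed also changes the effective size of the update.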

Merging: Zero-Cost Inference

One of LoRA's most practical advantages is that adapters can be merged back into the base weights after training:

W' = W + (α / r) × BA

After this merge, you have a single weight matrix W' with exactly the same dimensions as the original. There is no additional latency at inference time. No extra memory. No architectural change. The model looks identical to a fully fine-tuned model.

This merge is a simple matrix addition, computable in seconds. And it's reversible: subtract (α / r) × BA from W' to recover the original pretrained weights (up to floating-point rounding).
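Merging and un-merging can be sketched on a toy 3 x 3 example. The values below are chosen so the arithmetic is exact in floating point; the helper names `merge` and `unmerge` are hypothetical and are not the PEFT API.

```python
def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def merge(W, A, B, alpha, r):
    """Return W' = W + (alpha / r) * BA."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]


def unmerge(W_merged, A, B, alpha, r):
    """Return W = W' - (alpha / r) * BA."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w - scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W_merged, BA)]


W = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [0.0, 3.0, 1.0]]
B = [[1.0], [0.0], [-2.0]]   # d x r, with d = 3 and r = 1
A = [[0.5, 0.25, -1.0]]      # r x d

W_merged = merge(W, A, B, alpha=2, r=1)
W_back = unmerge(W_merged, A, B, alpha=2, r=1)
print(W_back == W)  # exact here; real weights round-trip only up to rounding
```

The merged matrix has the same shape as `W`, which is why a merged model needs no extra layers or inference code.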

Craig Trim

Widdows and Cohen emphasize this same property in Section 5.3.4: thanks to the linear nature of the mathematics involved, the updates can just be added to the original model at runtime, resulting in no extra burden on inference. They frame it as a key advantage over other PEFT methods, and cite up to a 10,000-fold reduction in trainable parameters. Widdows & Cohen, Issue #45

Multiple Adapters, One Base Model

Because adapters are small (typically 10-100 MB) and the base model is large (typically 10-30 GB), you can maintain one copy of the base model and swap adapters in and out for different tasks.

A single LLaMA-2 7B base model might serve as the foundation for a medical question-answering adapter, a code generation adapter, a legal document summarization adapter, and a customer support chatbot adapter. Each adapter is a set of small matrix pairs, one pair per adapted weight matrix. Loading a different adapter takes milliseconds.

Craig Trim

Widdows and Cohen demonstrate exactly this pattern in Section 5.2.3. They show a LLaMA-65B model fine-tuned with low-rank adaptation on just 52,000 instruction-response pairs (about 40 MB of data, 100,000 times less than pretraining). The result transformed a next-token predictor into an instruction follower. They call the gap between sentence completion and instruction following smaller than it may seem, which supports the article's claim that LoRA adapters encode small behavioral nudges rather than wholesale rewrites. Widdows & Cohen, Issue #45

This architecture is especially powerful in serving environments. A single GPU can hold the base model in memory and serve dozens of different fine-tuned behaviors by swapping LoRA adapters per request. Services like Predibase and Together AI have built entire platforms around this pattern.

QLoRA: Pushing Efficiency Further

LoRA reduced the trainable parameters by orders of magnitude, but the base model still needed to fit in GPU memory. For a 7B model in fp16, that's 14 GB, manageable on a high-end consumer GPU. For a 65B model, it's 130 GB, requiring multiple GPUs just to hold the frozen weights.

In May 2023, Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer at the University of Washington published "QLoRA: Efficient Finetuning of Quantized Language Models." Their contribution was combining aggressive quantization of the base model with LoRA training, achieving near-parity with full 16-bit fine-tuning.

The Three Innovations

1. NF4: 4-bit NormalFloat Quantization

Standard 4-bit quantization maps continuous weight values to 16 discrete levels (2^4 = 16). The mapping is usually uniform: equally spaced quantization bins across the weight range.

But neural network weights aren't uniformly distributed. They follow an approximately normal (Gaussian) distribution. Dettmers et al. introduced NormalFloat 4-bit (NF4), a data type where the 16 quantization levels are optimally spaced for normally distributed data. Each bin captures an equal probability mass under the normal curve, rather than an equal range of values.

The result: NF4 produces lower quantization error than standard int4 or fp4 for neural network weights, because the quantization bins match the actual distribution of values.
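The equal-probability-mass idea can be illustrated with the standard library. This is a simplified sketch, not the exact NF4 table: it places each level at the midpoint quantile of one of 16 equal-mass bins and rescales to [-1, 1], whereas the data type in the paper is constructed slightly differently (for example, it includes an exact zero). The block-wise absmax scaling and all function names are assumptions for illustration.

```python
from statistics import NormalDist


def equal_mass_levels(bits: int = 4) -> list:
    """Quantization levels at the midpoint quantiles of equal-probability bins."""
    n = 2 ** bits
    normal = NormalDist()  # standard normal: mean 0, std 1
    raw = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    peak = max(abs(v) for v in raw)
    return [v / peak for v in raw]  # rescale so the levels span about [-1, 1]


def quantize_block(block, levels):
    """Scale a block by its absolute maximum, then snap to the nearest level."""
    absmax = max(abs(w) for w in block) or 1.0
    codes = []
    for w in block:
        target = w / absmax
        codes.append(min(range(len(levels)), key=lambda k: abs(levels[k] - target)))
    return codes, absmax


def dequantize_block(codes, absmax, levels):
    """Map 4-bit codes back to approximate weight values."""
    return [levels[c] * absmax for c in codes]


levels = equal_mass_levels()
block = [0.01, -0.02, 0.15, -0.4, 0.03, 0.0, -0.07, 0.9]
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)

gaps = [b - a for a, b in zip(levels, levels[1:])]
print(f"narrowest gap (near 0): {min(gaps):.3f}")
print(f"widest gap (in tails):  {max(gaps):.3f}")
```

The gaps show the design choice: levels crowd together near zero, where most weights sit, and spread out in the tails.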

Craig Trim

Widdows and Cohen ground this insight in Section 5.3.5: most of the weights are often small and clustered around 0, with a few significant outliers that we really want to notice. They show how an E5M2 8-bit format allocates quantization levels to match this distribution, and note that reducing precision works better for inference than training because individual gradients need to be tracked closely. This is precisely the asymmetry QLoRA exploits: quantize the frozen base (inference-like) but train LoRA adapters in higher precision. Widdows & Cohen, Issue #45

2. Double Quantization

Quantization requires storing scaling factors that map between the quantized values and the original range. For blockwise quantization (where weights are divided into blocks of 64 or 128), these scaling factors add up. A 7B model with block size 64 needs about 110M scaling constants, each stored in fp32, consuming 440 MB.

Double quantization quantizes the quantization constants themselves. The fp32 scaling factors are quantized to 8-bit, reducing overhead from 0.5 bits per parameter to approximately 0.127 bits per parameter. The memory savings are modest in isolation but significant at scale.
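The per-parameter overhead figures can be reproduced as follows. The second-level block size of 256 is not stated above; it is taken from the QLoRA paper and should be read as an assumption of this sketch, as should the function names.

```python
def scale_overhead_bits(block_size=64, scale_bits=32):
    """Bits per parameter spent on one fp32 scale per block of weights."""
    return scale_bits / block_size


def double_quant_overhead_bits(block_size=64, inner_bits=8,
                               outer_bits=32, outer_block=256):
    """8-bit scales per block, plus one fp32 constant per group of 256 scales."""
    return inner_bits / block_size + outer_bits / (block_size * outer_block)


n_params = 7e9
before = scale_overhead_bits()
after = double_quant_overhead_bits()
saved_mb = (before - after) * n_params / 8 / 1e6  # bits -> bytes -> MB

print(f"before: {before:.3f} bits/param")
print(f"after:  {after:.3f} bits/param")
print(f"saved on a 7B model: about {saved_mb:.0f} MB")
```

A few hundred megabytes is small next to the model itself, but it can decide whether a run fits on a card that is already near its limit.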

3. Paged Optimizers

Even with a quantized base model and LoRA adapters, optimizer states can cause out-of-memory errors during training, particularly when processing long sequences. Paged optimizers use NVIDIA's unified memory feature to automatically page optimizer states between GPU and CPU memory. When the GPU runs low on memory, infrequently accessed optimizer pages are offloaded to CPU RAM and brought back on demand.

This prevents the dreaded OOM crash without requiring the user to manually manage memory allocation.

Craig Trim

A useful tangent: Widdows and Cohen discuss PagedAttention (Kwon et al.) in Section 5.3.5 as a related memory optimization for inference, where the KV cache does not need to be contiguous in memory, enabling multiple text streams to be generated simultaneously. The paged optimizers used in QLoRA apply a similar principle to training, paging optimizer states rather than KV caches. Both borrow the paging idea from operating-system virtual memory to avoid OOM errors, though only the paged optimizers rely on NVIDIA unified memory. Widdows & Cohen, Issue #45

The Combined Effect

Method	Base Model Precision	Trainable Params	Memory (7B model)
Full fine-tuning (fp32)	32-bit	7B (100%)	~160+ GB
Full fine-tuning (bf16)	16-bit	7B (100%)	~90+ GB
LoRA (bf16 base)	16-bit	~8-17M (0.1-0.2%)	~16-18 GB
QLoRA (NF4 base)	4-bit	~8-17M (0.1-0.2%)	~6-8 GB

Memory requirements comparison for fine-tuning a 7B parameter model.
QLoRA brings a 7B model within reach of a single consumer GPU with 8 GB VRAM.

QLoRA made it possible to fine-tune a 65B parameter model on a single 48 GB GPU. The team used this capability to train Guanaco, a chatbot family based on LLaMA, whose largest model reached 99.3% of ChatGPT's performance level on the Vicuna benchmark after only 24 hours of fine-tuning on a single GPU.

Craig Trim

Alammar & Grootendorst explain that LoRA decomposes large weight matrices into smaller rank-decomposed matrices (typically rank 8–64), fine-tuning only about 3.6% of parameters. QLoRA then combines 4-bit quantization with LoRA, reducing memory from roughly 4 GB to about 1 GB. Their rule of thumb: always use at least a 4-bit quantized model. See GH #5, Ch. 12 and Ch. 7.

The democratization implications were immediate. Researchers without institutional compute budgets could now fine-tune models with tens of billions of parameters. A graduate student with a single RTX 3090 could do work that previously required a cluster.

Adapter Methods Beyond LoRA

LoRA is the most popular parameter-efficient fine-tuning (PEFT) method, but it's not the only one. Several other approaches preceded it or offer complementary tradeoffs.

Adapter Layers (Houlsby et al., 2019)

The original "adapter" concept for Transformers came from Neil Houlsby and colleagues at Google Research in 2019. Their approach inserted small bottleneck modules (feedforward down-projection, nonlinearity, feedforward up-projection) into the Transformer stack. In the original paper, each Transformer layer receives two of them: one after the attention sublayer and one after the feed-forward sublayer.

During fine-tuning, only these inserted layers are trained. The original model parameters remain frozen.

flowchart TD
    A["Hidden state from<br/>Transformer layer<br/>dimension d (e.g. 4096)"] --> B["Compress: Linear(d → m)<br/>4096 → 64 dimensions"]
    B --> C["ReLU nonlinearity"]
    C --> D["Expand: Linear(m → d)<br/>64 → 4096 dimensions"]
    D --> E["Add original input back"]
    A -.->|"unchanged copy"| E
    E --> F["Modified hidden state<br/>passed to next layer"]

    style A fill:#fff,stroke:#999,color:#292929
    style B fill:#f6f8fa,stroke:#6f42c1,stroke-width:2px,color:#292929
    style C fill:#f6f8fa,stroke:#6f42c1,stroke-width:2px,color:#292929
    style D fill:#f6f8fa,stroke:#6f42c1,stroke-width:2px,color:#292929
    style E fill:#fff3cd,stroke:#ffc107,color:#856404
    style F fill:#fff,stroke:#999,color:#292929

Adapter layer architecture (Houlsby et al., 2019). The bottleneck layers (purple) are the only trainable parameters, inserted between existing frozen Transformer layers. The skip connection (amber) ensures the adapter defaults to a no-op if the learned adjustment is small.
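The bottleneck in the diagram can be sketched in a few lines of pure Python. Bias terms are omitted from the forward pass for brevity, the toy dimensions (d = 4, m = 2) are arbitrary, and the function names are hypothetical rather than taken from any adapter library.

```python
def linear(M, x):
    """Apply a weight matrix (list of rows) to a vector, without bias."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in M]


def adapter_forward(h, W_down, W_up):
    """Down-project, apply ReLU, up-project, then add the skip connection."""
    z = linear(W_down, h)               # d -> m
    z = [max(0.0, v) for v in z]        # ReLU nonlinearity
    u = linear(W_up, z)                 # m -> d
    return [hi + ui for hi, ui in zip(h, u)]


def adapter_params(d, m, bias=True):
    """Weights of both projections, plus optional bias vectors."""
    return 2 * d * m + ((d + m) if bias else 0)


h = [0.5, -1.0, 2.0, 0.0]
W_down = [[0.1, 0.2, -0.3, 0.4], [-0.5, 0.6, 0.7, -0.8]]   # m x d
W_up_zero = [[0.0, 0.0] for _ in range(4)]                 # d x m, all zeros

print(adapter_forward(h, W_down, W_up_zero) == h)  # zero up-projection is a no-op
print(f"{adapter_params(4096, 64):,} parameters per adapter at d=4096, m=64")
```

Unlike the LoRA update, the nonlinearity between the two projections means this module cannot be folded into an existing weight matrix, so it stays in the inference path.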

The key difference from LoRA: adapter layers add new parameters to the model architecture. They introduce additional computation at inference time. LoRA modifies existing weight matrices and can be merged away. This architectural distinction is why LoRA has largely superseded Houlsby-style adapters for production deployments where inference latency matters.

Craig Trim

Widdows and Cohen discuss distillation and pruning as complementary efficiency methods in Section 5.3.2. They note that DistilBERT reduced BERT's parameters by 40% while retaining 97% accuracy, and that early pruning work (Optimal Brain Damage) showed many neural network parameters can be discarded with minimal accuracy loss. These methods reduce the base model size, whereas adapter methods like LoRA reduce the fine-tuning cost, and can be combined. Widdows & Cohen, Issue #45

Prefix Tuning (Li and Liang, 2021)

Prefix tuning prepends a sequence of learnable "virtual tokens" to the input at every Transformer layer. These virtual tokens don't correspond to any real words. They are continuous vectors that the model learns to condition on.

For a prefix length of 20 and a model with 32 layers and hidden dimension 4096, the total trainable parameters are: 20 x 32 x 2 x 4096 = 5.2M (the factor of 2 accounts for key and value prefixes). The original model is untouched.

Prefix tuning works well for generation tasks but can underperform LoRA on discriminative tasks. It also slightly increases the effective sequence length (the prefix tokens consume part of the context window), which can be a practical limitation.

Craig Trim

Widdows and Cohen describe prefix tuning in Section 5.3.4, noting it was inspired by analogy with natural language prompt-prefixes such as "Please summarize:" or "Translate to Spanish:". Instead of retuning the whole LLM, a smaller neural network is trained to prepare the prefix vectors at each layer, adapting the combined network to a downstream task. This framing highlights that prefix tuning is conceptually closer to prompt engineering than to weight modification, which helps explain why it underperforms LoRA on tasks requiring deeper behavioral changes. Widdows & Cohen, Issue #45

Prompt Tuning (Lester et al., 2021)

Prompt tuning is a simplified version of prefix tuning. Instead of prepending virtual tokens at every layer, it only prepends them at the input embedding layer. The trainable parameters are just the embeddings of the virtual prompt tokens.

For a prompt length of 100 and embedding dimension 4096, the trainable parameters are just 100 x 4096 = 409,600. Less than half a million parameters, regardless of model size.
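Both virtual-token counts (this one and the prefix-tuning figure from the previous subsection) come from simple products. A small sketch, with hypothetical helper names:

```python
def prefix_tuning_params(prefix_len: int, n_layers: int, d_model: int) -> int:
    """Key and value prefix vectors at every layer."""
    return prefix_len * n_layers * 2 * d_model


def prompt_tuning_params(prompt_len: int, d_model: int) -> int:
    """Virtual-token embeddings at the input layer only."""
    return prompt_len * d_model


prefix = prefix_tuning_params(prefix_len=20, n_layers=32, d_model=4096)
prompt = prompt_tuning_params(prompt_len=100, d_model=4096)

print(f"prefix tuning: {prefix:,}")
print(f"prompt tuning: {prompt:,}")
```

Neither count depends on the number of weights inside the Transformer blocks, and the prompt-tuning count does not depend on depth at all.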

Lester et al. showed a remarkable scaling result: prompt tuning becomes more competitive with full fine-tuning as the model gets larger. At 10B+ parameters, prompt tuning nearly matched full fine-tuning performance on SuperGLUE, despite training fewer than 0.01% of the model's parameters.

Comparison at a Glance

Method	Where It Acts	Inference Overhead	Typical Params
Full fine-tuning	All weights	None	100%
Adapter layers	Inserted bottleneck layers	Yes (extra layers)	0.5-8%
Prefix tuning	Virtual tokens, all layers	Yes (longer sequence)	0.1-1%
Prompt tuning	Virtual tokens, input only	Yes (longer sequence)	<0.01%
LoRA	Low-rank updates to existing weights	None (after merge)	0.05-1%
QLoRA	LoRA + quantized base model	None (after merge)	0.05-1%

Parameter-efficient fine-tuning methods compared. LoRA's ability to merge into base weights,
eliminating inference overhead, is a significant practical advantage.

Practical Walkthrough: Fine-Tuning with PEFT and LoRA

Theory is useful. Running code is better. Here's a complete example of fine-tuning a LLaMA model with LoRA using Hugging Face's PEFT library, targeting the concrete task of teaching the model to produce structured JSON output from natural language input.

Setup and Configuration

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from peft import (
    LoraConfig,
    get_peft_model,
    TaskType,
    prepare_model_for_kbit_training,
)
from datasets import load_dataset
import torch

# Model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load model in 4-bit for QLoRA
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Prepare model for k-bit training (freezes base, enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

LoRA Configuration

# Define LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                         # Rank: 16 is a solid default
    lora_alpha=32,                  # Alpha: 2x rank is common
    lora_dropout=0.05,              # Small dropout for regularization
    target_modules=[
        "q_proj", "k_proj",
        "v_proj", "o_proj",
        "gate_proj", "up_proj",
        "down_proj",
    ],
    bias="none",                    # Don't train bias terms
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)

# Check trainable parameters
model.print_trainable_parameters()
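# Prints trainable and total parameter counts. For this configuration on Llama-2 7B,
# expect roughly 40M trainable parameters (about 0.6% of the base weights); the
# exact figures printed can vary with the PEFT version and with 4-bit loading.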

Data Preparation

# Example: teaching the model structured JSON output
# Each training example pairs a natural language query
# with a JSON response format

def format_example(example):
    """Format a single training example as an instruction-response pair."""
    prompt = f"""### Instruction:
Extract structured information from the following text and
return it as JSON.

### Input:
{example["text"]}

### Response:
{example["json_output"]}"""
    return tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding="max_length",
    )

# Load and format dataset
dataset = load_dataset("json", data_files="train_data.jsonl")
tokenized = dataset["train"].map(format_example, remove_columns=dataset["train"].column_names)

Training

# Training arguments
training_args = TrainingArguments(
    output_dir="./lora-json-adapter",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,       # Effective batch size: 16
    learning_rate=2e-4,                   # Higher LR than full fine-tuning
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,                            # Use bfloat16 for training
    gradient_checkpointing=True,          # Trade compute for memory
    optim="paged_adamw_8bit",             # Paged optimizer for QLoRA
    report_to="none",
)

# Data collator for causal language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=data_collator,
)

trainer.train()

Saving and Merging

# Save the LoRA adapter (small, ~100MB)
model.save_pretrained("./lora-json-adapter")
tokenizer.save_pretrained("./lora-json-adapter")

# Later: merge adapter into base model for deployment
from peft import AutoPeftModelForCausalLM

merged_model = AutoPeftModelForCausalLM.from_pretrained(
    "./lora-json-adapter",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
merged_model = merged_model.merge_and_unload()

# Save the merged model (same size as original, no adapter overhead)
merged_model.save_pretrained("./llama-7b-json-merged")

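The training portion of this walkthrough, from loading the quantized model to saving the adapter, targets a single GPU with around 8 GB of VRAM; if you run out of memory, reduce per_device_train_batch_size or max_length first. The final merge step reloads the base model in bf16, which takes about 14 GB for a 7B model, so it may need a larger card or a separate machine. Training time for a few thousand examples is typically 30 minutes to a few hours, depending on sequence length and dataset size.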

Craig Trim

SLP3 §6.6.2 details the Adam optimizer used in the training code above. Adam maintains per-parameter running estimates of the first moment (mean) and second moment (uncentered variance) of the gradients, then uses these to adaptively scale learning rates. The paged_adamw_8bit used in QLoRA combines three optimizations on top of standard Adam: weight decay (the "W"), 8-bit quantization of optimizer states (reducing Adam's memory footprint by ~4x), and paging to CPU when GPU memory is tight. Each of these is a response to the memory arithmetic from the article's opening section.

Rank Selection: A Practical Guide

Choosing the right rank is more art than science, but empirical patterns have emerged across hundreds of published LoRA experiments.

The Diminishing Returns Curve

Multiple studies have shown that task performance improves steeply from r = 1 to r = 8, more gradually from r = 8 to r = 32, and plateaus or shows negligible improvement beyond r = 64. The exact inflection points depend on the task, the base model, and the dataset size, but the shape of the curve is remarkably consistent.

Hu et al. reported in the original paper that for the GPT-3 175B model, evaluated on WikiSQL and MultiNLI, r = 4 captured most of the performance of r = 64. Smaller models tend to benefit more from higher ranks, presumably because they have less capacity in their pretrained weights and need more room for adaptation.

Rules of Thumb

r = 4-8 for tasks that primarily adjust output format or style (structured output, tone, verbosity control). The base model already "knows" the content; you're teaching it a new way to express it.
r = 16-32 for moderate domain adaptation (medical, legal, financial language) where the model needs to learn some new vocabulary and concepts but can build on existing knowledge.
r = 64+ for knowledge-intensive tasks where the model must internalize genuinely new information, or for multilingual adaptation where the model is learning new scripts or grammatical structures.

When in doubt, start with r = 16. It's a common choice in published PEFT configurations (the library's own LoraConfig default is r = 8) for good reason: it provides enough capacity for the majority of practical tasks without excessive parameter overhead.

The Relationship Between Rank and Data

A subtlety often missed: higher ranks require more training data to avoid overfitting. A rank-64 adapter on 500 training examples will almost certainly overfit, memorizing the training data instead of learning generalizable patterns. A rank-4 adapter has fewer parameters and generalizes better from limited data.

As a rough guideline, ensure you have at least 100-200 training examples per million trainable parameters. For r = 16 with ~17M trainable parameters, that's approximately 1,700-3,400 examples. For r = 64 with ~67M parameters, aim for 6,700-13,400 examples.
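
The arithmetic behind these figures can be sketched as follows. The model shape is an assumption (a LLaMA-2-7B-like network with hidden size 4096, 32 layers, and four square attention projections adapted per layer), and the function names are hypothetical; substitute your own model's dimensions and target modules.

```python
def lora_trainable_params(rank, hidden_size=4096, num_layers=32,
                          modules_per_layer=4):
    """Count LoRA parameters when adapting square projection matrices.

    Assumes each adapted matrix is hidden_size x hidden_size and that
    modules_per_layer of them are adapted in every layer.
    """
    # Each adapted matrix gets A (rank x hidden) and B (hidden x rank).
    per_matrix = 2 * hidden_size * rank
    return per_matrix * modules_per_layer * num_layers

def suggested_examples(num_params, per_million=(100, 200)):
    """Apply the 100-200 examples per million trainable parameters heuristic."""
    millions = num_params / 1e6
    return round(millions * per_million[0]), round(millions * per_million[1])

for rank in (4, 16, 64):
    n = lora_trainable_params(rank)
    low, high = suggested_examples(n)
    print(f"r={rank:>2}: {n / 1e6:5.1f}M trainable params, "
          f"roughly {low:,}-{high:,} examples")
```

Under these assumptions r = 16 comes to about 16.8M trainable parameters and r = 64 to about 67.1M, matching the rounded figures above. Adapting the MLP projections as well would raise both the parameter count and the suggested dataset size.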

Craig Trim

SLP3 §6.6.4 explains the regularization mechanisms that become critical at low data-to-parameter ratios. Dropout randomly zeroes neurons during training, forcing the network to distribute knowledge across multiple pathways. For LoRA, the lora_dropout=0.05 parameter in the configuration above applies dropout to the activations entering the adapter branch, leaving the frozen base path untouched. With a rank-64 adapter trained on only 500 examples, increasing this dropout rate (or adding weight decay) can be the difference between a model that generalizes and one that memorizes.
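
A toy forward pass shows where that dropout sits. This is a pure-Python sketch under stated assumptions: the tiny matrices, the helper names, and the placement of dropout on the input to the low-rank branch reflect how LoRA layers are commonly implemented, not a copy of any library's code.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def dropout(vector, p, rng, training):
    """Inverted dropout: zero each entry with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(vector)
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in vector]

def lora_forward(x, W, A, B, alpha, p_drop, rng, training=True):
    """h = W x + (alpha / r) * B A dropout(x); W is frozen, A and B are trained."""
    rank = len(A)                      # A has shape (rank, in_features)
    base = matvec(W, x)                # frozen path: never sees dropout
    low_rank = matvec(B, matvec(A, dropout(x, p_drop, rng, training)))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, low_rank)]

rng = random.Random(0)
W = [[1.0, 0.0], [0.0, 1.0]]           # toy frozen weight (2 x 2)
A = [[0.5, -0.5]]                      # rank-1 down-projection (1 x 2)
B = [[1.0], [2.0]]                     # rank-1 up-projection (2 x 1)
x = [2.0, 1.0]

eval_out = lora_forward(x, W, A, B, alpha=2, p_drop=0.05, rng=rng, training=False)
train_out = lora_forward(x, W, A, B, alpha=2, p_drop=0.5, rng=rng, training=True)
print("eval: ", eval_out)
print("train:", train_out)
```

At evaluation time dropout is a no-op and the output is deterministic. During training only the adapter's contribution is perturbed, which is why raising the rate regularizes the adapter without disturbing the pretrained path.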

When LoRA Isn't Enough

LoRA is powerful, but it has real limitations. Understanding when to reach for full fine-tuning, or a different approach entirely, is as important as knowing how to use LoRA.

Tasks That Need Full Fine-Tuning

Large-scale pretraining continuation. If you're extending a model's capabilities in a fundamental way, such as training it on an entirely new language or a massive domain-specific corpus, the low-rank constraint becomes a genuine bottleneck. The model needs to modify its representations broadly, not just nudge them.

Alignment and safety training. RLHF (Reinforcement Learning from Human Feedback) and related alignment techniques often require modifying the model's behavior in subtle, pervasive ways. Some alignment labs have found that LoRA-based RLHF produces weaker safety properties than full fine-tuning, though this is an area of active research.

Craig Trim

SLP3 §10.2 helps explain why alignment may resist low-rank adaptation. BERT's pretraining uses two objectives: masked language modeling (predicting masked tokens) and next sentence prediction (judging sentence coherence). These objectives shape representations across every layer of the network, not just the final output layers. Alignment training similarly needs to modify the model's behavior at every level, from how it represents concepts internally to how it generates outputs. A low-rank update to a few projection matrices may not reach deeply enough into these distributed representations to reliably change safety-relevant behaviors.

Multi-task generalization. If you need a single model to perform well across many diverse tasks simultaneously, the low-rank constraint can limit the model's ability to represent all the necessary task-specific information. Full fine-tuning, or very high-rank LoRA (which approaches full fine-tuning in cost), may be required.

Catastrophic Forgetting

LoRA partially mitigates catastrophic forgetting (the tendency of a fine-tuned model to lose its pretrained capabilities) because the base weights remain frozen. The adapter adds a low-rank update on top of those weights rather than overwriting them, so removing the adapter restores the base model exactly.

But "partially" is the operative word. While the frozen weights preserve the base model's knowledge, the adapter can still interfere with how that knowledge is accessed. A LoRA adapter trained heavily on medical text might degrade the model's general conversational abilities, not because the general knowledge is gone, but because the adapter's modifications route activations in ways that bypass it.

The mitigation is straightforward: evaluate on held-out general benchmarks during training, not just on your target task. If general performance drops below an acceptable threshold, reduce the rank, reduce the learning rate, or reduce the number of training steps.
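
The monitoring step above can be sketched as a simple threshold check run at each evaluation. Everything here is hypothetical: the scores in eval_log are made-up illustrative values, the 5% relative-drop threshold is an arbitrary choice, and check_forgetting is not part of any library.

```python
def check_forgetting(baseline, current, max_relative_drop=0.05):
    """Return True if the general-benchmark score fell too far below baseline."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    relative_drop = (baseline - current) / baseline
    return relative_drop > max_relative_drop

# Hypothetical evaluation log: (training step, target-task score, general score).
baseline_general = 0.62
eval_log = [
    (200, 0.71, 0.62),
    (400, 0.78, 0.61),
    (600, 0.83, 0.58),
    (800, 0.85, 0.54),
]

stop_step = None
for step, target_score, general_score in eval_log:
    if check_forgetting(baseline_general, general_score):
        stop_step = step
        break

print(f"first step breaching the threshold: {stop_step}")
```

In this made-up log the target score keeps climbing while the general score erodes. The check flags the first evaluation where the general score has fallen more than 5% below baseline, which is the point to try a lower rank, a lower learning rate, or fewer steps.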

Craig Trim

Widdows and Cohen provide a striking example of LoRA's practical value in Section 5.2.4. They show that parameter-efficient fine-tuning of a 405B LLaMA-3 model on just 1,000 reasoning examples from DeepSeek-R1 substantially improved the model's problem-solving ability (correctly counting palindromic primes below 1000). The general capabilities were not degraded. This suggests that for reasoning tasks at least, LoRA-scale fine-tuning can add new capabilities without triggering the catastrophic forgetting described here. Widdows & Cohen, Issue #45

The Efficiency-Performance Tradeoff

Published benchmarks consistently show that LoRA achieves 90-99% of full fine-tuning performance across a wide range of tasks. That last 1-10% matters in some contexts and doesn't in others.

For a production chatbot, 95% of full fine-tuning quality at 1% of the cost is an easy decision. For a safety-critical medical diagnostic system, the 5% gap might be unacceptable. For a research experiment exploring the limits of model capabilities, full fine-tuning provides the cleanest signal.

LoRA's contribution isn't making full fine-tuning obsolete. It's making fine-tuning accessible. The 99% of practitioners who don't have access to GPU clusters can now participate in model adaptation. That expansion of access has produced more innovation than any improvement in fine-tuning quality could have.

Craig Trim

Widdows and Cohen reinforce this accessibility theme in Section 5.1. They note that DeepSeek achieved dramatic cost reductions with a GPU training cost estimate under $6M, compared to over $100M for comparable models from Google and OpenAI, using techniques including mixture-of-experts and FlashAttention. They observe that some of the most famous models have been impactful because they found ways to do more with less. LoRA belongs squarely in this tradition of efficiency innovations that expand who can participate. Widdows & Cohen, Issue #45

Looking Forward

The LoRA paper has accumulated over 6,000 citations since its 2021 publication. An ecosystem of extensions and variants has emerged: DoRA (Weight-Decomposed Low-Rank Adaptation), AdaLoRA (adaptive allocation of the rank budget across weight matrices), LoRA+ (different learning rates for A and B matrices), and rsLoRA (rank-stabilized scaling).
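
As one small illustration of how these variants adjust the recipe, the sketch below compares the standard LoRA scaling factor, alpha / r, with the rank-stabilized factor alpha / sqrt(r) used by rsLoRA. The alpha value and function names are illustrative assumptions.

```python
import math

def lora_scale(alpha, rank):
    """Standard LoRA scaling applied to the B A update."""
    return alpha / rank

def rslora_scale(alpha, rank):
    """Rank-stabilized scaling: divide by sqrt(rank) instead of rank."""
    return alpha / math.sqrt(rank)

alpha = 16  # illustrative value, held fixed while the rank varies
for rank in (4, 16, 64):
    print(f"r={rank:>2}: LoRA scale={lora_scale(alpha, rank):.2f}, "
          f"rsLoRA scale={rslora_scale(alpha, rank):.2f}")
```

With alpha held fixed, the standard factor shrinks quickly as the rank grows, while the rank-stabilized factor shrinks more slowly. That slower decay is the motivation rsLoRA gives for making high ranks trainable.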

The core idea, that meaningful adaptation lives in a low-rank subspace, has proven robust across model scales, architectures, and application domains. As models grow to hundreds of billions of parameters, the economic argument for parameter-efficient fine-tuning only strengthens. Full fine-tuning of a 400B model is a six-figure compute commitment. LoRA makes the same model adaptable for hundreds of dollars.

The practical implication for anyone building on large language models is clear: LoRA should be your default starting point for fine-tuning. Try it first. Measure the gap to full fine-tuning on your specific task. In most cases, there won't be one worth paying for.

Craig Trim

SLP3 §10.3 on contextual embeddings provides an interesting lens for understanding LoRA's future. Jurafsky and Martin show that BERT produces context-dependent word representations: the same word gets different vectors depending on its context. They also note that these representations exhibit anisotropy, clustering in a narrow cone of the vector space. Future LoRA variants may need to address this geometric constraint directly, adapting not just the weight matrices but the shape of the representation space itself.

. . .

References

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685.
Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). "QLoRA: Efficient Finetuning of Quantized Language Models." arXiv:2305.14314.
Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., & Gelly, S. (2019). "Parameter-Efficient Transfer Learning for NLP." ICML 2019.
Li, X. L. & Liang, P. (2021). "Prefix-Tuning: Optimizing Continuous Prompts for Generation." ACL 2021.
Lester, B., Al-Rfou, R., & Constant, N. (2021). "The Power of Scale for Parameter-Efficient Prompt Tuning." EMNLP 2021.
Aghajanyan, A., Gupta, S., & Zettlemoyer, L. (2020). "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning." arXiv:2012.13255.
Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). "GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS 2022.
Liu, S., Wang, C., Yin, H., Molchanov, P., Wang, Y., Cheng, K., & Chen, M. (2024). "DoRA: Weight-Decomposed Low-Rank Adaptation." arXiv:2402.09353.
Zhang, Q., Chen, M., Bukharin, A., Karampatziakis, N., He, P., Cheng, Y., Chen, W., & Zhao, T. (2023). "AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning." ICLR 2023.
Hugging Face. (2024). "PEFT: Parameter-Efficient Fine-Tuning." GitHub.