LoRA
The Fine-Tuning Bottleneck in 2021
A pretrained language model's weights encode general linguistic and world knowledge that survives largely unchanged when the model is adapted to a downstream task. Full fine-tuning nonetheless updates every one of those weights. For a GPT-3 175B checkpoint, that means 175 billion trainable parameters per task, roughly 1.2 TB of training VRAM under Adam, and a 350 GB snapshot on disk for every downstream customer. The cost scales with the base model, not with how much the model actually needs to change to handle a new task.
LoRA argues the right question is not "how do we make fine-tuning cheaper" but "what is the intrinsic rank of the weight update a task requires." The paper's answer, almost always very small and often just one, motivates a reparametrization that trains a rank-\( r \) factorization in place of the full-rank update, cuts trainable parameters by up to four orders of magnitude, and incurs zero additional latency at inference.
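Concretely, for a pretrained weight matrix \( W_0 \in \mathbb{R}^{d \times k} \), the paper constrains the update \( \Delta W \) to a product of two small matrices and scales it by a constant:

\[
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k).
\]

\( A \) is initialized from a random Gaussian and \( B \) from zeros, so \( \Delta W = BA \) is zero at the start of training; only \( A \) and \( B \) receive gradients while \( W_0 \) stays frozen.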
What prior parameter-efficient methods already offered
Two families of prior work addressed the same cost problem, and the paper positions LoRA against both. Adapter layers (Houlsby et al., 2019) insert small bottleneck feed-forward modules between Transformer sub-layers and train only those. Prefix and prompt tuning (Li and Liang, 2021; Lester et al., 2021) prepend trainable continuous vectors to the input sequence while keeping the backbone frozen.
Each has a structural cost. Adapters add depth to the computation graph: every forward pass has to run through the extra bottleneck sub-layer even at deployment, and Table 1 of the paper reports a 30.3% inference-latency increase on GPT-2 Medium at batch size 1, sequence length 128, compared to the un-adapted base model. Prefix tuning consumes positions in the context window that the model can no longer use for real input, and the paper documents that performance is non-monotone in prefix length — accuracy drops as more tokens are reserved. LoRA is designed to keep the parameter-efficiency benefit while paying neither cost.
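For contrast, here is a minimal PyTorch sketch of a Houlsby-style series adapter; the class name, bottleneck width, and ReLU are illustrative choices rather than the original configuration. The nonlinearity between the two projections is what keeps the module from being folded into a neighbouring weight matrix, so its extra depth is paid on every forward pass.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Series adapter inserted after a Transformer sub-layer (illustrative sketch)."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # project down to the bottleneck
        self.up = nn.Linear(bottleneck, d_model)    # project back up

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Residual connection around the bottleneck; the nonlinearity makes the
        # block impossible to merge into the frozen weights around it.
        return h + self.up(torch.relu(self.down(h)))
```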
LoRA is not an adapter in the Houlsby sense
Common misconception: because both methods inject small trainable sub-modules, LoRA is just another adapter variant. LoRA sits in parallel with a pretrained weight matrix and has the exact same input and output shape, so at deployment \( W_0 + BA \) is precomputed and stored as a single matrix — the forward pass is indistinguishable from the original model's. Adapters sit in series and cannot be fused, so the extra depth persists at inference.
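A minimal PyTorch sketch of the parallel design and the merge step makes this concrete. The initialization and the \( \alpha / r \) scaling follow the paper; the class and method names (and the Gaussian scale) are illustrative, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W0 with a parallel low-rank update BA (sketch)."""

    def __init__(self, d_out: int, d_in: int, r: int = 4, alpha: float = 8.0):
        super().__init__()
        # W0 is loaded from the pretrained checkpoint in practice; it receives no gradients.
        self.weight = nn.Parameter(torch.empty(d_out, d_in), requires_grad=False)
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # Gaussian init (scale illustrative)
        self.B = nn.Parameter(torch.zeros(d_out, r))         # zero init, so BA = 0 at the start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both branches map d_in -> d_out, so they share input and output shapes.
        return x @ self.weight.T + self.scale * (x @ self.A.T @ self.B.T)

    @torch.no_grad()
    def merged_weight(self) -> torch.Tensor:
        # At deployment the update is folded into a single matrix: no extra latency.
        return self.weight + self.scale * (self.B @ self.A)
```

Switching tasks at deployment then amounts to subtracting one merged \( BA \) and adding another, which the paper notes can be done quickly and with little memory overhead.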
The paper's headline claim
Applied to GPT-3 175B, LoRA reduces trainable parameters by 10,000× (from 175B to 4.7M in the paper's main configuration, Table 4), cuts training VRAM from 1.2 TB to 350 GB by avoiding the Adam optimizer state on frozen weights, and shrinks per-task checkpoints from 350 GB to 35 MB at rank \( r = 4 \). Per-V100 training throughput on GPT-3 175B rises from 32.5 to 43.1 tokens/s, cutting the time per token by roughly a quarter, the 25% training speedup the paper reports. Downstream quality matches or exceeds full fine-tuning on GLUE (RoBERTa, DeBERTa), on the E2E NLG benchmark (GPT-2), and on WikiSQL / MNLI-m / SAMSum (GPT-3). The remainder of this study unpacks the reparametrization (§2), where it is plugged into the Transformer (§3), the empirical numbers that backed the claim (§4), and the ablations that showed rank 1–4 is already enough (§5).
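The 35 MB checkpoint figure quoted above is easy to sanity-check. A few lines of arithmetic, assuming GPT-3 175B's 96 layers and model dimension of 12,288, LoRA applied only to the query and value projections at rank 4, and FP16 storage, land within rounding distance of the reported size.

```python
# Back-of-the-envelope check of the 35 MB checkpoint claim.
# Assumptions: GPT-3 175B (96 layers, d_model = 12288), LoRA on W_q and W_v only,
# rank r = 4, parameters stored in FP16 (2 bytes each).
d_model, n_layers, r = 12288, 96, 4
adapted_matrices_per_layer = 2                   # W_q and W_v
params_per_matrix = d_model * r + r * d_model    # B (d x r) plus A (r x d)
total_params = n_layers * adapted_matrices_per_layer * params_per_matrix
checkpoint_mb = total_params * 2 / 1e6
print(total_params, round(checkpoint_mb, 1))     # ~18.9M params, ~37.7 MB, close to the reported 35 MB
```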