PEFT: Principle
- do not finetune all of the LLM parameters
- finetune/train a small number of (additional) parameters
We’ll focus on a few
- Additive finetuning: add new parameters
- Adapter-based: sequential adapter
- soft-prompt: prefix tuning
- others: ladder-side-networks
- Partial finetuning: modify existing parameters
- Lottery-ticket sparse finetuning
- Reparameterization finetuning: “reparameterize” weight matrices
- Hybrid finetuning: combine multiple PEFT
- manually designed: MAM, Compacter, UniPELT
- automatically searched: AutoPEFT, S3Delta-M
- Unified finetuning: unified framework
- AdaMix: MoE of LoRA or adapters
- SparseAdapter: prune adapters
- ProPETL: share masked sub-nets
Sequential adapters
\[X = ReLU(X \cdot W_{down}) \cdot W_{up} + X\]
with
\[W_{down} \in R^{d\times k} ~~~~ W_{up} \in R^{k\times d}\]
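A minimal PyTorch sketch of this bottleneck adapter (module and variable names are illustrative, not from a specific library):

```python
import torch
import torch.nn as nn

class SequentialAdapter(nn.Module):
    """Bottleneck adapter: down-project, ReLU, up-project, plus residual connection."""
    def __init__(self, d_model: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck, bias=False)  # W_down in R^{d x k}
        self.up = nn.Linear(bottleneck, d_model, bias=False)    # W_up in R^{k x d}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # X = ReLU(X . W_down) . W_up + X
        return self.up(torch.relu(self.down(x))) + x
```

In the sequential variant, such a module is inserted after a Transformer sub-layer (typically the FFN), and only \(W_{down}\) and \(W_{up}\) are trained.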
- Interesting extensions
- Parallel Adapter (parallel PEFT outperforms sequential PEFT)
- CoDA: skips tokens in the main branch, but not in the parallel adapter
- Tiny-Attention adapter: uses a small attention module as the adapter
- Adapter Fusion (see next slide)
- Train multiple adapters, then train fusion
Prefix tuning
- Concat \(P_k,P_v \in R^{l\times d}\) before \(K,V\) (see the code sketch after this list): \[head_i = Attn(xW_q^{(i)}, concat(P_k^{(i)},CW_k^{(i)}), concat(P_v^{(i)},CW_v^{(i)}))\]
- with \(C=\)context, \(l=\)prefix length
- ICLR’22 shows some form of equivalence with adapters
- Advantages:
- More expressive than adapters, as it modifies every attention head
- One of the best PEFT methods at very small parameter budgets
- Drawbacks:
- Does not benefit from increasing the number of parameters
- Limited to attention heads, while adapters may also adapt the FFN…
- … and adapting the FFN is always better
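A rough single-head sketch of the prefix-tuning attention above (tensor names and shapes are illustrative, not a specific library API):

```python
import torch
import torch.nn.functional as F

def prefix_attention(x, W_q, W_k, W_v, P_k, P_v):
    """One attention head with trainable prefixes prepended to keys and values.

    x           : (seq, d)      the context C
    W_q/W_k/W_v : (d, d_head)   frozen projection matrices of this head
    P_k, P_v    : (l, d_head)   trainable per-head prefix slices, l = prefix length
    """
    q = x @ W_q                               # x W_q
    k = torch.cat([P_k, x @ W_k], dim=0)      # concat(P_k, C W_k)
    v = torch.cat([P_v, x @ W_v], dim=0)      # concat(P_v, C W_v)
    scores = q @ k.T / k.shape[-1] ** 0.5     # scaled dot-product attention
    return F.softmax(scores, dim=-1) @ v      # (seq, d_head)
```

Only \(P_k\) and \(P_v\) are trained; the projection matrices of the LLM stay frozen.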
Performance comparison
qLoRA = LoRA + quantized LLM
- Advantages:
- de facto standard: supported in nearly all LLM frameworks
- Many extensions, heavily developed, hence good performance
- can be easily merged back into the LLM
- Drawbacks:
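Since LoRA is supported in nearly all frameworks, a typical usage sketch with the Hugging Face peft library looks like this (the model name and hyper-parameters are placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder base model

config = LoraConfig(
    r=8,                         # rank of the low-rank update
    lora_alpha=16,               # scaling of the update
    target_modules=["c_attn"],   # which weight matrices receive LoRA (GPT-2 attention here)
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()   # only the small LoRA matrices are trainable

# After training, the update can be folded back into the base weights:
merged = model.merge_and_unload()
```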
Adapter lib v3
- AdapterHub v3 integrates several families of adapters:
- Bottleneck = sequential
- Compacter = adapter with Kronecker prod to get up/down matrices
- Parallel
- Prefix, Mix-and-Match = combination of Parallel + Prefix
- Unified PEFT functions: add_adapter(), train_adapter()
- heads after adapters: add_classification_head(), add_multiple_choice_head()
- In the HF lib, you can pre-load multiple adapters and select one as active:
model.add_adapter(lora_config, adapter_name="adapter_1")
model.add_adapter(lora_config, adapter_name="adapter_2")
model.set_adapter("adapter_1")
Ladder-side-networks
- Advantages:
- No backprop through the main LLM!
- Only requires forward passes in the main LLM
- Drawbacks:
- LLM is just a “feature provider” to another model
- \(\simeq\) an enhanced “classification/generation head on top”
- The forward pass can be done “layer by layer” with “pipeline parallelism” (see the sketch below):
- load 1 layer \(L_i\) in RAM
- pass the whole corpus \(y_i=L_i(x_i)\)
- free memory and iterate with \(L_{i+1}\)
- LST: done only once for the whole training session!
- This approach received an outstanding award at ACL’2024
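A minimal sketch of the layer-by-layer feature extraction mentioned above (a conceptual illustration, not the exact LST implementation):

```python
import torch

@torch.no_grad()                      # no backprop through the frozen LLM
def precompute_features(layers, x):
    """Forward the corpus through the frozen LLM one layer at a time,
    caching each layer's output for the side network."""
    features = []
    for layer in layers:              # conceptually: load L_i in RAM, process, free, iterate
        x = layer(x)                  # y_i = L_i(x_i) over the whole corpus
        features.append(x.cpu())      # keep activations off-GPU until the side network needs them
    return features                   # done only once for the whole training session
```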
Partial finetuning
- Add a linear layer on top and train it
- You may further backprop gradients deeper in the top-N LLM layers
- … Or just FT the top-N layers without any additional parameters
- Simple, old-school, it usually works well
- Fill the continuum between full FT and classifier head FT:
- can FT top 10%, 50%, 80% params
- or FT bottom 10%, 50% params
- or FT intermediate layers / params
- or apply a sparse mask?
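A common way to implement the “FT the top-N layers” variant in PyTorch (a sketch assuming a GPT-2-like model whose blocks live in model.transformer.h):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder model
N = 2                                                  # how many top layers to finetune

for param in model.parameters():                       # freeze everything
    param.requires_grad = False
for block in model.transformer.h[-N:]:                 # unfreeze only the top-N Transformer blocks
    for param in block.parameters():
        param.requires_grad = True
```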
Lottery-ticket sparse finetuning
- Lottery Ticket Hypothesis:
- Each neural network contains a sub-network (winning ticket) that, if trained again in isolation, matches the performance of the full model.
- Advantages:
- Can remove 90% of the parameters with nearly no loss in performance (on image tasks)
- Drawbacks:
- Impossible to find the winning mask without first training the large model
- Can be applied to sparse FT:
- FT an LLM on a specific task/language
- extract the mask = the parameters that changed most
- rewind the LLM and re-FT with the mask
- Sparse finetunes can be combined without overlapping!
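A rough sketch of the mask-extraction step (the keep ratio and per-tensor thresholding are illustrative choices):

```python
import torch

def extract_sparse_mask(pretrained_state, finetuned_state, keep_ratio=0.01):
    """Keep only the parameters that moved most during the first finetuning pass."""
    masks = {}
    for name, w0 in pretrained_state.items():
        if not torch.is_floating_point(w0):
            continue
        delta = (finetuned_state[name] - w0).abs()
        k = max(1, int(keep_ratio * delta.numel()))
        threshold = delta.flatten().topk(k).values.min()   # per-tensor threshold
        masks[name] = delta >= threshold                   # boolean mask of "winning" parameters
    return masks

# Second pass: rewind to the pretrained weights, finetune again, and after each
# backward() zero the gradients outside the mask, e.g.:
#   for name, p in model.named_parameters():
#       if name in masks:
#           p.grad *= masks[name]
```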
Wrap-up
- Various PEFT methods:
- Reduce model storage? RAM requirements?
- Require backprop through the LLM?
- Additional inference cost?
Finetuning (PEFT or full): advantages
- greatly improves performance on a target task, language, or domain
- brings latent knowledge to the surface, ready to use
- gives the LLM desirable capabilities: instruction following, alignment with human preferences…
Finetuning (PEFT or full): drawbacks
Memorization, forgetting
Pretraining and FT use the same basic algorithm (SGD): what is the difference?
- Difference in scale:
- Pretraining ingests trillions of tokens
- Finetuning uses up to millions of tokens
- This leads to differences in regimes / behaviour:
- Pretraining learns new information
- Finetuning surfaces information the model already knows
Why such a difference in regimes?
- Because of the way SGD works:
- When it sees one piece of information, it partially stores it in a few parameters
- But not enough to retrieve it later!
- When it sees it again, it accumulates it in its weights \(\rightarrow\) Memorization
- If it never sees it again, it will be overridden \(\rightarrow\) Forgetting
- How many times shall a piece of information be seen?
- Finetuning hardly learns new knowledge:
- small data \(\rightarrow\) not enough exposure
- Why not repeat 1000x the finetuning dataset?
- Because previous knowledge will be forgotten!
Why doesn’t pretraining forget?
- It does!
- But by shuffling the dataset, each piece of information is repeated throughout training
- So how to add new knowledge?
- continued pretraining: replay + new data
- RAG, external knowledge databases
- LLM + tools (e.g., web search)
- knowledge editing (see ROME, MEND…)
Take home message
- PEFT is used to adapt to a domain, not to add knowledge
- RAG/LLM agents are used to add knowledge (but not at scale)