Retrieval Augmented Generation (RAG)
What are the components of a RAG system?
- Embedding model
- encodes all document “paragraphs”
- Vector store
- stores all “paragraph” embeddings
- Retriever
- Finds 10 most relevant “paragraphs” to question Q
- LLM
- Generates answer given 10 retrieved paragraphs and Q
- Embedding model is from the SBERT family
- LLM is from the GPT family
- Both are transformers!
What is the difference between the embedding model and the LLM? Why
not a single LLM?
- Embedder:
- role: generate an embedding that represents the semantics of the
paragraph
- small: its task is relatively easy: finding the semantically closest
paragraphs to the question;
- fast: there may be many paragraphs to encode
- size < 1b parameters (usually)
- LLM:
- role: understand context and answer question
- large: its task requires lots of knowledge and reasoning
- size > 7b parameters (usually)
- Embedders and LLM ~= transformers
If both embedders and LLM are transformers, what is their difference?
(Think training)
- Goal of embedders = compute 1 semantic embedding vector
- Training: masked language modeling, contrastive
- Goal of LLM = generate answer
- Training: causal language modeling
- Masked Language Modeling (MLM) objective:
- mask a random word in the input sentence and ask the model to
predict it
- Causal Language Modeling (LM) objective:
- remove the end of the input sentence and ask the model to predict
the next word
- In both cases (MLM and LM):
- we get the embedding \(z \in R^d\)
at the output of the transformer
- we predict the target word \(\hat w \in V\) from \(z\) through a linear classifier (a minimal sketch follows this list):
- logits \(y \in R^{|V|}\) = scores for each possible word: \(y = Ez\) with \(E\in R^{|V|\times d}\) \[\hat w = \arg\max_{1\leq i\leq |V|} y_i\]
- So why is \(z\) good for semantic
search with MLM but not with LM?
- LM: the final \(z\) only contains
information about the next word
- so no information about the input sentence itself!
- = GPT family
- MLM: we must be able to reconstruct any word from \(z\)
- so the input sentence must be fully contained within \(z\)
- = BERT family
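A minimal sketch of the word-prediction head described above (shapes are illustrative): a linear classifier over the vocabulary applied to the transformer output embedding z.
import torch
d, V = 768, 30522                # embedding dim and vocabulary size (typical BERT values)
z = torch.randn(d)               # output embedding of the transformer
E = torch.randn(V, d)            # classifier matrix E in R^{|V| x d}
y = E @ z                        # logits: one score per vocabulary word
w_hat = torch.argmax(y).item()   # index of the predicted word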
TP: RAG
- Implement a RAG using only python and ollama
- Several libraries can be used to implement RAG: transformers, sentence-transformers, llamaindex, langchain, haystack, DSPy…
- We’ll use ollama, whose main advantages are:
- Designed to be ready-to-use & easy-to-learn
- Designed to run locally on laptops
- It is fast, uses 4-bit models by default, supports embeddings and LLMs
- Data
- we’ll use headlines scraped from France Info in 2024
- The goal is to be able to query the LLM about recent news, in French and from a French point of view
- Embedding model:
- Embedding model: must support French, be lightweight
- See the HF
MTEB leaderboard
- We’ll use the paraphrase-multilingual-minilm
- LLM model:
- Must be good in French, and lightweight
- We’ll use Qwen2.5-7b quantized in 4 bits (reqs: RAM > 8GB)
- Here’s code you can copy/paste into your Python environment:
import ollama
from numpy.linalg import norm
import numpy as np

# first download data: wget https://olki.loria.fr/cerisara/lexres/frnews.txt

# embedding model:
em = "nextfire/paraphrase-multilingual-minilm"

def find_most_similar(needle, haystack):
    # cosine similarity between the query embedding and every stored embedding
    needle_norm = norm(needle)
    similarity_scores = [
        np.dot(needle, item) / (needle_norm * norm(item)) for item in haystack
    ]
    print("debug", similarity_scores)
    return sorted(zip(similarity_scores, range(len(haystack))), reverse=True)

SYSTEM_PROMPT = """You are a helpful reading assistant who answers questions
based on snippets of text provided in context. Answer only using the context provided,
being as concise as possible. If you're unsure, just say that you don't know.
Context:
"""

with open("frnews.txt", "r") as f:
    lines = f.readlines()

# build the vector store: one embedding per headline (first 50 lines only)
bdd = []
for i, l in enumerate(lines):
    if i >= 50:
        break
    # see https://sbert.net/examples/applications/computing-embeddings/README.html
    embeddings = ollama.embeddings(model=em, prompt=l)["embedding"]
    bdd.append(embeddings)
print("bdd built")

# retrieve the most similar headline to the question
q = "Dans quelle ville y a-t-il eu des canicules ?\n"
prompt_embedding = ollama.embeddings(model=em, prompt=q)["embedding"]
most_similar_chunks = find_most_similar(prompt_embedding, bdd)[:1]
print("retrieved:", most_similar_chunks, lines[most_similar_chunks[0][1]])

# generate the answer with the retrieved headline as context
response = ollama.chat(
    model="qwen2.5",
    messages=[
        {
            "role": "system",
            "content": SYSTEM_PROMPT
            + "\n".join([lines[x[1]] for x in most_similar_chunks]),
        },
        {"role": "user", "content": q},
    ],
)
print("\n\n")
print(response["message"]["content"])
# see https://decoder.sh/videos/rag-from-the-ground-up-with-python-and-ollama
- TODO:
- run the code and check that it’s working fine
- try to increase the size of the vector-DB (50 for now) and
optionally store the vectors on disk so that they don’t have to be
recomputed if it’s too slow
- invent 5 questions that have an answer in your database, and post
them here
- (opt) Find another domain than FR news with documents (often PDFs,
but they need to be converted into text) and adapt this code for this
other domain
- Notes:
- when the database becomes large, you must use a specialized vector database and/or a specialized search library such as Meta’s FAISS (a minimal sketch follows this list).
- in practice, most common issues with RAG come from the retriever,
which does not get the “most relevant” documents;
- real RAG applications require adapting the retriever to business
concepts: e.g., with FR news, the “date” should be a primary key to
retrieve relevant context and should be handled separately.
- many enhancements of this basic RAG pipeline have
been proposed.
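A minimal FAISS sketch (assuming faiss-cpu is installed; shapes and data are illustrative): index the paragraph embeddings once, then retrieve the 10 nearest to a question embedding.
import numpy as np
import faiss

d = 384                                              # embedding dimension (e.g., MiniLM)
xb = np.random.rand(10000, d).astype("float32")      # paragraph embeddings
xq = np.random.rand(1, d).astype("float32")          # question embedding

faiss.normalize_L2(xb)                               # normalize so inner product = cosine similarity
faiss.normalize_L2(xq)
index = faiss.IndexFlatIP(d)                         # exact inner-product search
index.add(xb)
scores, ids = index.search(xq, 10)                   # indices of the 10 most similar paragraphs
print(ids[0], scores[0])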
- You have seen MLM and LM training
- But RAG embedders \(\neq\) vanilla
BERT
- Embedders \(\in\)
SBERT (sentence-BERT) family
- continue training BERT contrastively
- Why is vanilla BERT not a good enough embedder?
- Because vanilla BERT produces an embedding space where paraphrases
are not always close together
- Contrastive learning is a training objective to:
- Control the “shape” of the embedding space
- align multimodal embeddings
BERT is not good enough: similar words are not close enough in its embedding space.
def: embedding = the vector (\(\in R^d\)) that BERT outputs for each word in the sentence. If you pass every English sentence into BERT, you obtain the full embedding space.
Let’s verify!
TP: visualizing embeddings
- Download distilBERT and compute the embeddings for a few sentences, some of which are paraphrases of each other
- Compute distances and project embeddings in 2D with t-SNE to plot
them
- Study the proximity (or not) of paraphrases
Hints below…
pip install transformers[torch]
from transformers import AutoTokenizer, AutoModel, pipeline
model = AutoModel.from_pretrained('distilbert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
nlp = pipeline('feature-extraction', model=model, tokenizer=tokenizer)
s = 'Do you like cakes ?'
features = nlp(s)
print([features[0][i][:2] for i in range(len(features[0]))])
- Look at the input tokens:
inputs = tokenizer.encode_plus(s, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]
text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
print(text_tokens)
- Compute the distance between 2 embedding vectors:
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(v1,v2))
- project a distance matrix:
from sklearn import manifold
# with metric="precomputed", m must be a distance matrix; init="random" is required by recent sklearn
tsne = manifold.TSNE(n_components=2, metric="precomputed", perplexity=2, init="random")
coords = tsne.fit_transform(m)
import matplotlib.pyplot as plt
plt.scatter(coords[:, 0], coords[:, 1], marker='o')
plt.show()
Contrastive objective
- Compute emb for sent A and B; when the sentences are paraphrases, minimize \(|s_A-s_B|\); when they’re different, maximize it.
- See also metric learning, siamese networks, ranking loss
- This makes it possible to control / shape the embedding space the way we want
- used for:
- Pretrained Dense Retrieval in RAG
- multimodal models (CLIP)
Contrastive losses
\[L=\begin{cases}
d(s_A,s_B) & \text{if positive pair}\\
\max(0,m-d(s_A,s_B)) & \text{if negative pair}
\end{cases}\]
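A minimal sketch of this pairwise loss (assuming Euclidean distance between the two sentence embeddings):
import torch

def contrastive_loss(s_a, s_b, positive, m=1.0):
    # d(s_A, s_B): Euclidean distance between the two sentence embeddings
    d = torch.norm(s_a - s_b, dim=-1)
    # pull positive pairs together, push negative pairs at least m apart
    return d if positive else torch.clamp(m - d, min=0.0)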
- triplet loss: \(L=\max(d(s_A,s_P) -
d(s_A,s_N) + \epsilon, 0)\)
- gives better embedding space
- InfoNCE: \(N\) batches with \(M\) samples each: 1 positive (index 0) and \(M-1\) negatives (\(1\dots M-1\)); a minimal sketch follows this list:
\[L= - \frac 1 N \sum_{i=1}^N \log \frac{e^{sim(s_{A_i},s_0)}}{\frac 1 M \sum_{j=0}^{M-1} e^{sim(s_{A_i},s_j)}}\]
- Main challenge: how to sample negative examples?
- easy neg: too far from pos, nothing is learnt
- hard neg: too close to pos, unstable learning
- semi-hard negatives!
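A minimal InfoNCE sketch (names and shapes are illustrative): for each anchor, column 0 of candidates is the positive and the remaining columns are negatives, so the loss reduces to a cross-entropy over similarity scores.
import torch
import torch.nn.functional as F

def info_nce(anchors, candidates, temperature=0.07):
    # anchors: (N, d); candidates: (N, M, d) with the positive at index 0
    a = F.normalize(anchors, dim=-1)
    c = F.normalize(candidates, dim=-1)
    sims = torch.einsum("nd,nmd->nm", a, c) / temperature   # (N, M) similarity scores
    targets = torch.zeros(a.size(0), dtype=torch.long)      # the positive is always column 0
    return F.cross_entropy(sims, targets)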
Once an embedding space is trained, how can you use it to directly
perform instance-based classification?
- refs: Lilian Weng blog
- You compare the unknown embedding with all known (training) embeddings, and assign it the class of the closest one (a minimal sketch follows)
- important to understand this method!
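A minimal sketch of this instance-based (nearest-neighbour) classification; train_emb, train_labels and query_emb are assumed to come from an already-trained embedder.
import numpy as np

def nearest_neighbor_class(query_emb, train_emb, train_labels):
    # cosine similarity between the query and every training embedding
    q = query_emb / np.linalg.norm(query_emb)
    t = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    sims = t @ q
    # assign the class of the closest training example
    return train_labels[int(np.argmax(sims))]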
Ranking vs. Cross-ent loss
(see github
blog)
- Train a ConvNet on MNIST with a 10-class cross-entropy loss
- No need for t-SNE when training: directly define a 2D embedding space by adding a final linear layer to the model
- With the cross-entropy loss: distances between classes are not good
- With the ranking loss: distances between classes are good
- We did not train for same-class embeddings to get closer, but we trained for same-class embeddings to be closer than inter-class embeddings
Other advantages of ranking
loss
- Cross-Ent loss not robust to noisy labels
- Many classes are costly with softmax
- Meaningful dist btw embeddings is desirable (S-BERT)
- Goal: train an embedding model with triplet loss
- Until now, you’ve only used already trained embedding
models
- Synthetic data:
- scalar input, 2 classes \(c\in
\{0,1\}\)
- \(x|c \sim
N(\mu_c,\sigma_c=0.1)\)
- Embedding dim = 5
- lightning = pytorch library that
- automates cpu/gpu runtime
- simplifies training loop
- generates tensorboard logs
- pytorch lightning in practice:
- replace and extend nn.Module:
import torch
import pytorch_lightning as pl

class Mod(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.W = torch.nn.Linear(1, 5)   # 1D input -> 5D embedding

    def configure_optimizers(self):
        opt = torch.optim.AdamW(self.parameters(), lr=1e-3)
        return opt

    def training_step(self, batch, batch_idx):
        anc, pos, neg = batch
        ea = self.W(anc)
        ep = self.W(pos)
        en = self.W(neg)
        dp = torch.nn.functional.triplet_margin_loss(ea, ep, en)
        self.log("train_loss", dp, on_step=False, on_epoch=True)
        return dp
- you need a dataset that generates anchors/pos/neg:
import random

class TripDS(torch.utils.data.Dataset):
    def __init__(self):
        super().__init__()

    def __len__(self):
        return 1000

    def __getitem__(self, i):
        if i % 2 == 0:
            # even index: sample an anchor from class 1
            z = random.randint(0, 1)
            if z == 0: xa = torch.randn(1)/10. - 0.5
            else: xa = torch.randn(1)/10. + 1.5
            z = random.randint(0, 1)
            if z == 0: xp = torch.randn(1)/10. - 0.5
            else: xp = torch.randn(1)/10. + 1.5
            xn = torch.randn(1)/10. + 0.5
        else:
            # odd index: sample an anchor from class 2
            xa = torch.randn(1)/10. + 0.5
            xp = torch.randn(1)/10. + 0.5
            z = random.randint(0, 1)
            if z == 0: xn = torch.randn(1)/10. - 0.5
            else: xn = torch.randn(1)/10. + 1.5
        return xa, xp, xn

traindata = TripDS()
trainloader = torch.utils.data.DataLoader(traindata, batch_size=1, shuffle=False)
mod = Mod()
logger = pl.loggers.TensorBoardLogger(save_dir="logs/", flush_secs=1)
trainer = pl.Trainer(limit_train_batches=1.0, max_epochs=1000, log_every_n_steps=1, logger=logger)
trainer.fit(model=mod, train_dataloaders=trainloader)
- TODO:
- run this training and observe the logs with:
tensorboard --logdir=logs/
- does it converge?
- adapt this code so that the model is a 2-layer MLP, the output embedding space is 2D, and plot 100 points with matplotlib before and after training
CLIP
- CLIP is a model that builds a joint text/image embedding space
- uses 2 transformers, resp. for image and text + cosine dist
- It is trained with a ranking loss: multi-class N-pair loss
- they show that the ranking loss is much faster to train
- Given an input image and several texts, it outputs similarity
scores
from transformers import CLIPProcessor, CLIPModel, CLIPTokenizer
from PIL import Image
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("difftrain.png")
inputs = processor(text=["a rabbit","a curve","a chair"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
Wrap-up
- You can define your own distance between embeddings with contrastive
learning
- For generation, use a large LLM (trained on next word
prediction)
- RAG is the first method to try when you have to deal with company’s
documents
- … but RAG is not magic, retrieval requires a lot of work specific to
each case
- RAG is good to quickly build something, but you’ll eventually find
it limited
- You often need to adapt the embedders and LLM more precisely to your task/language/context/…
- Main options to adapt:
- Prompt engineering
- Finetuning
- Integrate AI within soft. system:
- AI as features computer
- LLM agents
- AI as features computer
- AI is used as a black box to represent input
sentences/speech/images/…
- Each input is passed to AI model that outputs an embedding
- This embedding is used as input to the rest of the software
system
- LLM agents
- LLM controls (part of) the data flow
- Tools/Function calling: LLMs call APIs
- Code generation: LLMs generate (and execute)
code
- Planning: LLMs plan actions and
orchestrate their execution
- AI as features computer can be viewed as a special case of finetuning:
- X \(\rightarrow\) AI \(\rightarrow\) Embeddings \(\rightarrow\) Classifier
- The AI module may be kept frozen
- But the classifier must be trained on data:
- It’s often best to further train backward into the AI model, but
it’s much more costly
Finetuning
- Finetuning = continue training the AI model on domain-specific data
- The training objective may change (e.g., new image
classification)
- Or it may stay the same as pretraining (e.g., language
modeling)
- Pretraining \(\rightarrow\)
Foundation models
- Finetuning \(\rightarrow\)
Domain-specific models
- Why not just training a small model from scratch on the target
domain?
- Transfer learning: we expect to transfer
capabilities from the generic AI to get a better target model
- Small data: we often don’t have enough domain data to train a small model from scratch, but specializing the generic AI model usually requires little data
- Stochastic Gradient Descent (SGD) algorithm:
- You need a training corpus \(C =
\{x_i,y_i\}_{1\leq i\leq N}\)
- Initialize the model’s parameters randomly: \(\theta \sim \mathcal{N}(0,\Sigma)\)
- Forward pass: sample one example \(x_i \sim \mathcal{U}(C)\) and predict its
output: \(\hat y=f_{\theta}(x_i)\)
- Compute the loss = error made by the model: \[l(\hat y, y_i) = ||\hat y - y_i||^2\]
- Backward pass: compute the gradient of the loss
with respect to each parameter: \[\nabla
l(\hat y, y_i) = \left[ \frac {\partial l(\hat y, y_i)}{\partial
\theta_k}\right]\]
- Update parameters: \(\theta_k \leftarrow
\theta_k - \epsilon \frac {\partial l(\hat y, y_i)}{\partial
\theta_k}\)
- Iterate from the forward pass
- Backpropagation algorithm (for the backward pass):
- Compute the derivative of the loss wrt the output: \(\frac {\partial l(\hat y, y_i)}{\partial
\theta_T}\)
- Use the chain rule to deduce the derivative of the loss after the op
just before: \[\frac {\partial l(\hat y,
y_i)}{\partial \theta_{T-1}} = \frac {\partial l(\hat y, y_i)}{\partial
\theta_T} \times
\frac {\partial \theta_T}{\partial \theta_{T-1}}\]
- Only requires knowing the analytic derivative of each op individually
- Iterate back to the input of the model (a minimal sketch of one SGD step with autograd follows)
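A minimal sketch of one SGD iteration with PyTorch autograd (a toy linear model; names and data are illustrative):
import torch

theta = torch.randn(3, requires_grad=True)   # model parameters
x, y = torch.randn(3), torch.tensor(1.0)     # one training example (x_i, y_i)

y_hat = theta @ x                            # forward pass
loss = (y_hat - y) ** 2                      # squared-error loss
loss.backward()                              # backward pass: gradient of the loss wrt theta

with torch.no_grad():
    theta -= 1e-2 * theta.grad               # parameter update (epsilon = 1e-2)
    theta.grad.zero_()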
Motivation for PEFT
- PEFT = Parameter-Efficient Fine-Tuning
- It’s just finetuning, but cost-effective:
- only few parameters are finetuned
- cheaper to train
- cheaper to distribute
When do we need finetuning?
- Improve accuracy, adapt LLM behaviour
- Finetuning use cases:
- Follow instructions, chat…
- Align with user preferences
- Adapt to domain: healthcare, finance…
- Improve on a target task
- So finetuning is just training on more data?
- Yes:
- Same training algorithm (SGD)
- No:
- different hyperparameters (larger learning rate…)
- different type of data
- higher quality, focused on task
- far less training data, so much cheaper
- not the same objective:
- adaptation to domain/style/task/language…
Pretrained LLM compromise
- Training an LLM is fundamentally a compromise:
- training data mix: % code/FR/EN…
- text styles: twitter/books/PhD…
- Pretraining data mix defines where the LLM excels
- Finetuning modifies this equilibrium to our need
- The art of pretraining:
- finding the balance that fits most target users’ expectations
- finding the balance that maximizes the LLM’s capacities +
adaptability
- e.g., pretraining only on medical data gives lower performance even
in healthcare, because of limited data size and lack of variety.
- But for many specialized tasks, pretrained LLM does not give the
best performance:
- Finetuning adapts this compromise
- So finetuning is required for many specialized domains:
- enterprise documentations
- medical, finance…
- But it is costly to do for large LLMs:
- collecting, curating, cleaning, formatting data
- tracking training, preventing overfitting, limiting forgetting
- large LLMs require costly hardware to train
- For instance, finetuning LLama3.1-70b requires GPUs with approx. 1TB
of VRAM
- Can’t we avoid finetuning at all, but still adapt the LLM to our
task?
If the LLM is good enough, no need to finetune?
- Alternative: prompting
- “Be direct and answer with short responses”
- “Play like the World’s chess champion”
- Alternative: memory/long context/RAG
- “Adapt your answers to all my previous interactions with you”
- Alternative: function calling
- “Tell me about the events in July 2024”
Is it possible to get a good enough LLM?
- more data is always best (even for SmolLM!)
- So why not training the largest LLM ever on all data and use it
everywhere?
- Usage cost
- Obsolescence
- Data bottleneck
- So far, not good enough for most cases!
- Better approach (in 2024):
- For each task (domain, language):
- gather “few” data
- adapt an LLM to the task
- Because it is done multiple times, training costs become a
concern
- Parameter-efficient training (PEFT)
Which pretrained LLM to
finetune?
- Option 1: large LLM
- benefit from best capacities
- fine for not-so-much specialized tasks
- high cost
- Option 2: “small” LLM
- fine for very specialized task
- low cost
- hype: small agent LLMs, smolLM
- larger LLM \(\rightarrow\) less
forgetting
Challenges
- Choose pretrained LLM
- Depends on the task and expected performance, robustness…
- Collect quality data
- Finetuning data must be high quality!
- Format data
- Format similar to final task
- FT on raw text may impact instruction following
- Track & prevent overfitting, limit forgetting
- Cost of finetuning may be high
Cost
- Cost of inference << cost of finetuning
- quantization: we don’t know (yet) how to finetune well quantized
LLMs; so finetuning requires 16 or 32 bits
- inference: no need to store all activations: compute each layer output from its input only
- inference: no need to store gradients, momentum
- Inference can be done with RAM = nb of parameters / 2
- Full finetuning requires RAM = \(11\times\) nb of parameters (according to
Eleuther-AI), \(12-20\times\) according
to UMass
- each parameter byte requires +1 B (gradient) + 2 B (Adam optimizer state: 1st and 2nd gradient moments) (see next slide)
- Can be reduced to \(\simeq
5\times\):
- gradient checkpointing
- special optimizers (1bitAdam, Birder…)
- offloading…
- Adam equations (a minimal sketch follows this list):
- \(m^{(t)} = \beta_1 m^{(t-1)} + (1-\beta_1) \nabla L(\theta^{(t-1)})\)
- \(v^{(t)} = \beta_2 v^{(t-1)} + (1-\beta_2) \left(\nabla L(\theta^{(t-1)})\right)^2\)
- Bias correction:
- \(\hat m^{(t)} = \frac {m^{(t)}}{1-\beta_1^t}\)
- \(\hat v^{(t)} = \frac {v^{(t)}}{1-\beta_2^t}\)
- \(\theta^{(t)} = \theta^{(t-1)} - \lambda\frac{\hat m^{(t)}} {\sqrt{\hat v^{(t)}} + \epsilon}\)
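A sketch of one Adam step, which makes the memory cost explicit: besides each parameter we keep a gradient plus two moment buffers m and v (illustrative NumPy code).
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad                      # 1st moment (momentum)
    v = b2 * v + (1 - b2) * grad**2                   # 2nd moment
    m_hat = m / (1 - b1**t)                           # bias correction
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v                                # m and v must be stored between steps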
- PEFT greatly reduces RAM requirements:
- can keep LLM parameters frozen and quantized (qLoRA)
- store gradients + momentum only in 1% of parameters
- But:
- still need to backpropagate gradients through the whole LLM and save
all activations
- with large data, PEFT underperforms full finetuning
VRAM usage
| Method | Bits | 7B | 13B | 30B | 70B | 110B | 8x7B | 8x22B |
|---|---|---|---|---|---|---|---|---|
| Full | 32 | 120GB | 240GB | 600GB | 1200GB | 2000GB | 900GB | 2400GB |
| Full | 16 | 60GB | 120GB | 300GB | 600GB | 900GB | 400GB | 1200GB |
| LoRA/GaLore/BAdam | 16 | 16GB | 32GB | 64GB | 160GB | 240GB | 120GB | 320GB |
| QLoRA | 8 | 10GB | 20GB | 40GB | 80GB | 140GB | 60GB | 160GB |
| QLoRA | 4 | 6GB | 12GB | 24GB | 48GB | 72GB | 30GB | 96GB |
| QLoRA | 2 | 4GB | 8GB | 16GB | 24GB | 48GB | 18GB | 48GB |
Training methods
| Method | Data size | Approach |
|---|---|---|
| Pretraining | >10T | Full training |
| Cont. pretr. | \(\simeq 100\)b | update: PEFT? |
| Finetuning | 1k … 1b | Adapt to task: PEFT |
| Few-Shot learning | < 1k | Guide, help the LLM |
Wrap-up
- With enough compute, prefer full finetuning
- HF transformers, deepspeed, llama-factory, axolotl…
- With 1 “small” GPU, go for PEFT
- Without any GPU: look for alternatives
PEFT methods
- do not finetune all of the LLM parameters
- finetune/train a small number of (additional) parameters
We’ll focus on a few
- Additive finetuning: add new parameters
- Adapter-based: sequential adapter
- soft-prompt: prefix tuning
- others: ladder-side-networks
- Partial finetuning: modify existing parameters
- Lottery-ticket sparse finetuning
- Reparameterization finetuning: “reparameterize” weight matrices
- Hybrid finetuning: combine multiple PEFT
- manually: MAM, compacter, UniPELT
- auto: AutoPEFT, S3Delta-M
- Unified finetuning: unified framework
- AdaMix: MoE of LoRA or adapters
- SparseAdapter: prune adapters
- ProPETL: share masked sub-nets
Sequential adapters
\[X \leftarrow \mathrm{ReLU}(X\cdot W_{down}) \cdot W_{up} + X\]
with
\[W_{down} \in R^{d\times k}~~~~W_{up} \in R^{k\times d}\]
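A minimal PyTorch sketch of such a bottleneck adapter (down-projection, non-linearity, up-projection, residual connection):
import torch

class SequentialAdapter(torch.nn.Module):
    def __init__(self, d, k):
        super().__init__()
        self.down = torch.nn.Linear(d, k)   # W_down: d -> k (bottleneck)
        self.up = torch.nn.Linear(k, d)     # W_up:   k -> d

    def forward(self, x):
        # ReLU(x W_down) W_up + x  (residual connection)
        return self.up(torch.relu(self.down(x))) + x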
- Interesting extensions
- Parallel Adapter (parallel peft > sequential peft)
- CoDA: skip tokens in
the main branch, not in the parallel adapter
- Tiny-Attention
adapter: uses small attn as adapter
- Adapter
Fusion: (see next slide)
- Train multiple adapters, then train fusion
Prefix tuning
- Concat \(P_k,P_v \in R^{l\times d}\) before \(K,V\): \[head_i = Attn(xW_q^{(i)}, concat(P_k^{(i)},CW_k^{(i)}), concat(P_v^{(i)},CW_v^{(i)}))\]
- with \(C=\) context, \(l=\) prefix length (see the sketch at the end of this slide)
- ICLR22 shows some form of equivalence between prefix tuning and adapters
- Advantages:
- More expressive than adapters, as it modifies every attention
head
- One of the best PEFT methods at very small parameter budgets
- Drawbacks:
- Does not benefit from increasing the number of parameters
- Limited to attention heads, while adapters may adapt the FFN…
- … and adapting FFN is always better
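A minimal sketch of prefix tuning for one attention head (shapes are illustrative): learned prefix keys/values are concatenated in front of the context's keys and values, and only \(P_k, P_v\) would be trained.
import torch
import torch.nn.functional as F

l, n, d = 5, 10, 64                                 # prefix length, context length, head dim
P_k, P_v = torch.randn(l, d), torch.randn(l, d)     # learned prefix keys/values
q = torch.randn(n, d)                               # queries  x W_q
K = torch.randn(n, d)                               # keys     C W_k
V = torch.randn(n, d)                               # values   C W_v

K_full = torch.cat([P_k, K], dim=0)                 # (l+n, d)
V_full = torch.cat([P_v, V], dim=0)
attn = F.softmax(q @ K_full.T / d**0.5, dim=-1)     # (n, l+n) attention weights
head = attn @ V_full                                # (n, d) head output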
Performance comparison
qLoRA = LoRA + quantized LLM
- Advantages:
- de facto standard: supported in nearly all LLM frameworks
- Many extensions, heavily developed, so good performance
- can be easily merged back into the LLM
- Drawbacks:
Adapter lib v3
- AdapterHubv3
integrates several families of adapters:
- Bottleneck = sequential
- Compacter = adapter with Kronecker prod to get up/down matrices
- Parallel
- Prefix, Mix-and-Match = combination Parallel + Prefix
- Unified PEFT functions: add_adapter(), train_adapter()
- heads after adapters: add_classification_head(),
add_multiple_choice_head()
- In
HF lib, you can pre-load multiple adapters and select one
active:
from peft import LoraConfig          # assuming the peft library; config values below are illustrative
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model.add_adapter(lora_config, adapter_name="adapter_1")
model.add_adapter(lora_config, adapter_name="adapter_2")
model.set_adapter("adapter_1")
Ladder-side-networks
- Advantages:
- Do not backprop in the main LLM!
- Only requires forward passes in the main LLM
- Drawbacks:
- LLM is just a “feature provider” to another model
- \(\simeq\) enhanced
“classification/generation head on top”
- Forward pass can be done “layer by layer” with “pipeline
parallelism”
- load 1 layer \(L_i\) in RAM
- pass the whole corpus \(y_i=L_i(x_i)\)
- free memory and iterate with \(L_{i+1}\)
- LST: done only once for the whole training session!
- This approach received an outstanding award at ACL’2024:
Partial finetuning
- Add a linear layer on top and train it (a minimal sketch follows this list)
- You may further backprop gradients deeper in the top-N LLM layers
- … Or just FT the top-N layers without any additional parameters
- Simple, old-school, it usually works well
- Fill the continuum between full FT and classifier head FT:
- can FT top 10%, 50%, 80% params
- or FT bottom 10%, 50% params
- or FT intermediate layers / params
- or apply a sparse mask?
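A minimal sketch of the simplest variant, training only a classification head on top of a frozen pretrained model (model name and number of classes are illustrative):
import torch
from transformers import AutoModel

backbone = AutoModel.from_pretrained("distilbert-base-uncased")
for p in backbone.parameters():
    p.requires_grad = False                                  # keep the pretrained model frozen

head = torch.nn.Linear(backbone.config.hidden_size, 2)       # new 2-class linear head
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)    # only the head is trained

# forward pass sketch: take the [CLS] hidden state, then classify
# logits = head(backbone(**inputs).last_hidden_state[:, 0])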
Lottery-ticket sparse
finetuning
- Lottery Ticket
Hypothesis:
- Each neural network contains a sub-network (winning ticket) that, if
trained again in isolation, matches the performance of the full
model.
- Advantages:
- Can remove 90% parameters nearly without loss in performances (on
image tasks)
- Drawbacks:
- Impossible to find the winning mask without first training the large model
Can be applied to sparse FT:
- FT an LLM on a specific task/language
- extract the mask = the parameters that change the most
- rewind the LLM and re-FT with the mask
- sparse finetunes can be combined without overlapping!
Wrap-up
- Various PEFT methods:
- Reduce model storage? RAM requirements?
- Require backprop through the LLM?
- Additional inference cost?
Finetuning (PEFT or full):
advantages
- greatly improves performance on a target task, language, domain
- digs knowledge up to the surface, ready to use
- gives the LLM desirable capacities: instruction-following, alignment with human preferences…
Finetuning (PEFT or full):
drawbacks
Memorization, forgetting
Pretraining and FT use the same basic algorithm (SGD), but the differences in data size lead to differences in training regimes.
- Difference in scale:
- Pretraining ingests trillions of tokens
- Finetuning uses up to millions of tokens
- This leads to differences in regimes / behaviour:
- Pretraining learns new information
- Finetuning digs up information the LLM already knows
Why such a difference in regimes?
- Because of the way SGD works:
- When it sees one piece of information, it partially stores
it in a few parameters
- But not enough to retrieve it later!
- When it sees it again, it accumulates it in its weights
\(\rightarrow\)
Memorization
- If it never sees it again, it will be overwritten \(\rightarrow\)
Forgetting
- How many times shall a piece of information be seen?
- Finetuning hardly learns new knowledge:
- small data \(\rightarrow\) not
enough exposure
- Why not repeat 1000x the finetuning dataset?
- Because previous knowledge will be forgotten!
Why doesn’t pretraining forget?
- It does!
- But by shuffling the dataset, each information is repeated all along
training
- So how to add new knowledge?
- continued pretraining: replay + new data
- RAG, external knowledge databases
- LLM + tools (e.g., web search)
- knowledge editing (see ROME, MEND…)
Take home message
- PEFT is used to adapt to a domain, not to add knowledge
- RAG/LLM agents are used to add knowledge (but not at scale)
Debugging models: https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607