Prompt engineering
Workflow:
- define tasks
- write prompts
- test prompts
- evaluate results
- refine prompts; iterate from step 3
Prompt template
<OBJECTIVE_AND_PERSONA>
You are a [insert a persona, such as a "math teacher" or "automotive expert"]. Your task is to...
</OBJECTIVE_AND_PERSONA>
<INSTRUCTIONS>
To complete the task, you need to follow these steps:
1.
2.
...
</INSTRUCTIONS>
------------- Optional Components ------------
<CONSTRAINTS>
Dos and don'ts for the following aspects
1. Dos
2. Don'ts
</CONSTRAINTS>
<CONTEXT>
The provided context
</CONTEXT>
<OUTPUT_FORMAT>
The output format must be
1.
2.
...
</OUTPUT_FORMAT>
<FEW_SHOT_EXAMPLES>
Here we provide some examples:
1. Example #1
Input:
Thoughts:
Output:
...
</FEW_SHOT_EXAMPLES>
<RECAP>
Re-emphasize the key aspects of the prompt, especially the constraints, output format, etc.
</RECAP>
Notes
- few-shot examples are mainly used to define the format, not the content!
- attribute a role that is relevant for the task
- use prefixes for simple prompts:
TASK:
Classify the OBJECTS.
CLASSES:
- Large
- Small
OBJECTS:
- Rhino
- Mouse
- Snail
- Elephant
- use XML or JSON for complex prompts
Ask for explanations
What is the most likely interpretation of this sentence? Explain
your reasoning. The sentence: “The chef seasoned the chicken and put it
in the oven because it looked pale.”
- Llama3.1-7b: “[…] the chef thought the chicken was undercooked or not yet fully cooked due to its pale appearance […]”
CoT for complex tasks
Task 1: Extract the main issues and sentiments from the customer feedback on our telecom services.
Focus on comments related to service disruptions, billing issues, and customer support interactions.
Please format the output into a list with each issue/sentiment in a sentence, separated by semicolons.
Input: CUSTOMER_FEEDBACK
Task 2: Classify the extracted issues into categories such as service reliability, pricing concerns, customer support quality, and others.
Please organize the output into JSON format with each issue as the key and its category as the value.
Input: TASK_1_RESPONSE
Task 3: Generate detailed recommendations for each category of issues identified from the feedback.
Suggest specific actions to address service reliability, improve customer support, and adjust pricing models, if necessary.
Please organize the output into JSON format with each category as the key and the recommendation as the value.
Input: TASK_2_RESPONSE
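A minimal sketch of chaining these three prompts programmatically. The `llm(prompt)` helper is a hypothetical function standing in for any chat/completion API call:

```python
# Assumption: llm(prompt) is a hypothetical helper that sends the prompt to a model
# and returns the generated text as a string.

def analyze_feedback(customer_feedback: str, llm) -> str:
    # Task 1: extract issues/sentiments as a semicolon-separated list
    task1 = llm(
        "Extract the main issues and sentiments from the customer feedback on our "
        "telecom services. Format the output as sentences separated by semicolons.\n"
        "Input: " + customer_feedback
    )
    # Task 2: classify each extracted issue (JSON: issue -> category)
    task2 = llm(
        "Classify the extracted issues into categories such as service reliability, "
        "pricing concerns, customer support quality, and others. Return JSON with each "
        "issue as the key and its category as the value.\nInput: " + task1
    )
    # Task 3: generate recommendations per category (JSON: category -> recommendation)
    task3 = llm(
        "Generate detailed recommendations for each category of issues. Return JSON "
        "with each category as the key and the recommendation as the value.\nInput: " + task2
    )
    return task3
```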
CoT workflow
- break the problem into steps
- find a good prompt for each step in isolation
- tweak the steps to work well together
- enhance with finetuning:
- generate synthetic samples to tune each step
- finetune small LLMs on these samples
Alternative: DSPy
- optimize prompts automatically
- you define the target metric
- DSPy uses LM-driven optimizers to tune prompts and weights (a minimal sketch follows below)
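A minimal DSPy sketch, assuming a recent DSPy version and an OpenAI-compatible backend; the model name, signature, metric, and examples are illustrative placeholders:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Assumption: any LM backend supported by DSPy; the model name is a placeholder.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# Declare *what* the module should do; DSPy writes and optimizes the actual prompt.
classify = dspy.ChainOfThought("feedback -> category")

# You define the target metric...
def metric(example, prediction, trace=None):
    return prediction.category == example.category

# ...and a (tiny, illustrative) trainset of labeled examples.
trainset = [
    dspy.Example(feedback="My internet goes down every evening.",
                 category="service reliability").with_inputs("feedback"),
    dspy.Example(feedback="I was billed twice this month.",
                 category="billing").with_inputs("feedback"),
]

# The LM-driven optimizer tunes the prompt (instructions, demonstrations) against the metric.
optimized_classify = BootstrapFewShot(metric=metric).compile(classify, trainset=trainset)
```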
- Vanilla prompting
- Chain-of-thought (CoT)
- Self-consistency
- Ensemble refinement
- Automatic chain-of-thought (Auto-CoT)
- Complex CoT
- Program-of-thoughts (PoT)
- Least-to-Most
- Chain-of-Symbols (CoS)
- Structured Chain-of-Thought (SCoT)
- Plan-and-solve (PS)
- MathPrompter
- Contrastive CoT/Contrastive self-consistency
- Federated Same/Different Parameter self-consistency/CoT
- Analogical reasoning
- Synthetic prompting
- Tree-of-thoughts (ToT)
- Logical Thoughts (LoT)
- Maieutic Prompting
- Verify-and-edit
- Reason + Act (ReACT)
- Active-Prompt
- Thread-of-thought (ThOT)
- Implicit RAG
- System 2 Attention (S2A)
- Instructed prompting
- Chain-of-Verification (CoVe)
- Chain-of-Knowledge (CoK)
- Chain-of-Code (CoC)
- Program-Aided Language Models (PAL)
- Binder
- Dater
- Chain-of-Table
- Decomposed Prompting (DeComp)
- Three-Hop reasoning (THOR)
- Metacognitive Prompting (MP)
- Chain-of-Event (CoE)
- Basic with Term definitions
- Basic + annotation guideline + error-analysis
- lists the best prompting techniques for every possible NLP task.
- “chain-of-thought”:
- decompose a difficult task into steps
- only works with large models (>100B parameters)
- applied to solve grade-school math problems
CoT requires large models:
- Self-consistency greatly improves CoT prompting
- For one (CoT prompt, question) input, sample multiple outputs
- take a majority vote among the outputs (a minimal sketch follows below)
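A minimal self-consistency sketch; `llm(prompt, temperature)` and `extract_answer(text)` are hypothetical helpers for sampling one completion and parsing its final answer:

```python
from collections import Counter

def self_consistency(cot_prompt: str, question: str, llm, extract_answer, n: int = 10) -> str:
    """Sample n CoT completions for the same (prompt, question) and majority-vote the answers."""
    answers = []
    for _ in range(n):
        completion = llm(cot_prompt + "\n" + question, temperature=0.7)  # sampling, not greedy
        answers.append(extract_answer(completion))  # e.g., parse the text after "The answer is"
    return Counter(answers).most_common(1)[0][0]    # majority vote
```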
Analogy solving:
Directions: In the following question, a related pair of
words or phrases is followed by five pairs of words or
phrases. Choose the pair that best expresses a relationship
similar to that in the original pair.
braggart :: modesty
A) fledgling : experience
B) embezzler : greed
C) wallflower : timidity
D) invalid : malady
E) candidate : ambition
To solve this problem, first we need to understand the
relationship that exists between braggart and modesty.
According to the sentence, 'braggart' is a person who talks
too much about himself or herself and is usually not
believed. On the other hand, 'modesty' is the opposite of
this and denotes a person who does not talk too much about
himself or herself. Thus, for 'modesty' to be a suitable
answer, it should show the opposite of 'braggart'.
Now let's see whether each pair expresses a relationship
similar to that between braggart and modesty.
Next we have 'fledgling', which means a person who is
inexperienced and 'experience' which means knowledge gained
through practical involvement. Thus, 'fledgling' is a person
who has no experience and 'experience' is knowledge gained
through practical involvement. Thus, 'fledgling' is the
opposite of 'experience'. The relationship between these two
words is similar to that between braggart and modesty, hence
'fledgling' is the answer.
- Finetuning = continue training the AI model on domain-specific data
- The training objective may change (e.g., new image classification)
- Or it may stay the same as pretraining (e.g., language modeling)
- Pretraining \(\rightarrow\) Foundation models
- Finetuning \(\rightarrow\) Domain-specific models
- Why not just train a small model from scratch on the target domain?
- Transfer learning: we expect to transfer capabilities from the generic AI to get a better target model
- Small data: we often don’t have enough domain data to train a small model from scratch, but specializing the generic AI model usually requires little data
- Stochastic Gradient Descent (SGD) algorithm:
- You need a training corpus \(C = \{x_i,y_i\}_{1\leq i\leq N}\)
- Initialize the model’s parameters randomly: \(\theta \sim \mathcal{N}(\mu,\Sigma)\)
- Forward pass: sample one example \(x_i \sim \mathcal{U}(C)\) and predict its output: \(\hat y=f_{\theta}(x_i)\)
- Compute the loss = error made by the model: \(l(\hat y, y_i) = ||\hat y - y_i||^2\)
- Backward pass: compute the gradient of the loss with respect to each parameter: \(\nabla l(\hat y, y_i) = \left[ \frac {\partial l(\hat y, y_i)}{\partial \theta_k}\right]_k\)
- Update parameters: \(\theta_k \leftarrow \theta_k - \epsilon \frac {\partial l(\hat y, y_i)}{\partial \theta_k}\)
- Iterate from the forward pass (a minimal numerical sketch follows below)
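A minimal numpy sketch of this loop on a toy linear model, with the gradient computed analytically (assumptions: squared loss, learning rate \(\epsilon = 0.01\), synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy corpus C = {(x_i, y_i)}: y = 3x + noise
X = rng.normal(size=(100, 1))
Y = 3 * X + 0.1 * rng.normal(size=(100, 1))

theta = rng.normal(size=(1, 1))   # random initialization
eps = 0.01                        # learning rate

for step in range(1000):
    i = rng.integers(len(X))                    # sample one example x_i ~ U(C)
    y_hat = X[i] @ theta                        # forward pass
    loss = np.sum((y_hat - Y[i]) ** 2)          # l(y_hat, y_i) = ||y_hat - y_i||^2
    grad = 2 * X[i][:, None] * (y_hat - Y[i])   # analytic gradient dl/dtheta
    theta -= eps * grad                         # parameter update
```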
- Backpropagation algorithm (for the backward pass):
- Compute the derivative of the loss w.r.t. the output: \(\frac {\partial l(\hat y, y_i)}{\partial \theta_T}\)
- Use the chain rule to deduce the derivative of the loss w.r.t. the operation just before: \(\frac {\partial l(\hat y, y_i)}{\partial \theta_{T-1}} = \frac {\partial l(\hat y, y_i)}{\partial \theta_T} \times \frac {\partial \theta_T}{\partial \theta_{T-1}}\)
- Only requires knowing the analytic derivative of each operation individually
- Iterate back to the input of the model (a hand-written two-layer example follows below)
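A minimal numpy sketch of the chain rule on a two-layer network, keeping the squared-loss assumption from above; a framework's autograd does exactly this bookkeeping automatically:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))          # one input example
y = rng.normal(size=(2,))          # its target
W1 = rng.normal(size=(4, 8))       # first layer
W2 = rng.normal(size=(8, 2))       # second (last) layer

# Forward pass, keeping intermediate activations
h = np.maximum(0, x @ W1)          # ReLU hidden layer
y_hat = h @ W2
loss = np.sum((y_hat - y) ** 2)

# Backward pass: start from the derivative of the loss w.r.t. the output...
d_yhat = 2 * (y_hat - y)           # dl/dy_hat
# ...then apply the chain rule one operation at a time, back to the input
d_W2 = np.outer(h, d_yhat)         # dl/dW2
d_h = d_yhat @ W2.T                # dl/dh
d_pre = d_h * (h > 0)              # through the ReLU
d_W1 = np.outer(x, d_pre)          # dl/dW1
```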
Motivation for PEFT
- PEFT = Parameter-Efficient Fine-Tuning
- It’s just finetuning, but cost-effective:
- only a few parameters are finetuned
- cheaper to train
- cheaper to distribute
When do we need finetuning?
- Improve accuracy, adapt LLM behaviour
- Finetuning use cases:
- Follow instructions, chat…
- Align with user preferences
- Adapt to domain: healthcare, finance…
- Improve on a target task
- So finetuning is just training on more data?
- Yes:
- Same training algorithm (SGD)
- No:
- different hyperparameters (learning rate…)
- different type of data
- higher quality, focused on task
- far less training data, so much cheaper
- not the same objective:
- adaptation to domain/style/task/language…
Pretrained LLM compromise
- Training an LLM is fundamentally a compromise:
- training data mix: % code/FR/EN…
- text styles: twitter/books/PhD…
- Pretraining data mix defines where the LLM excels
- Finetuning modifies this equilibrium to our needs
- The art of pretraining:
- finding the balance that fits most target users’ expectations
- finding the balance that maximizes the LLM’s capacities + adaptability
- e.g., pretraining only on medical data gives lower performance even in healthcare, because of limited data size and lack of variety.
- But for many specialized tasks, a pretrained LLM does not give the best performance:
- Finetuning adapts this compromise
- So finetuning is required for many specialized domains:
- enterprise documentations
- medical, finance…
- But it is costly to do for large LLMs:
- collecting, curating, cleaning, formatting data
- tracking training, preventing overfitting, limiting forgetting
- large LLMs require costly hardware to train
- For instance, finetuning Llama3.1-70b requires GPUs with approx. 1TB of VRAM
- Can’t we avoid finetuning altogether, but still adapt the LLM to our task?
If the LLM is good enough, do we need finetuning at all?
- Alternative: prompting
- “Be direct and answer with short responses”
- “Play like the World’s chess champion”
- Alternative: memory/long context/RAG
- “Adapt your answers to all my previous interactions with you”
- Alternative: function calling
- “Tell me about the events in July 2024”
Is it possible to get a good enough LLM?
- more data is always best (even for SmolLM!)
- So why not train the largest LLM ever on all data and use it everywhere?
- Usage cost
- Obsolescence
- Data bottleneck
- So far, not good enough for most cases!
- Better approach (in 2024):
- For each task (domain, language):
- gather “a little” data
- adapt an LLM to the task
- Because this is done multiple times, training costs become a concern
- Parameter-efficient training (PEFT)
Which pretrained LLM to finetune?
- Option 1: large LLM
- benefit from best capacities
- fine for not-so-much specialized tasks
- high cost
- Option 2: “small” LLM
- fine for specialized task
- low cost
- hype: small agent LLMs, smolLM
- larger LLM \(\rightarrow\) less forgetting
Challenges
- Choose pretrained LLM
- Depends on the task and expected performance, robustness…
- Collect quality data
- Finetuning data must be high quality!
- Format data
- Format similar to final task
- FT on raw text may impact instruction following
- Track & prevent overfitting, limit forgetting
- Cost of finetuning may be high
Cost
- Cost of inference << cost of finetuning << cost of pretraining
- quantization: we don’t know (yet) how to finetune quantized LLMs well, so finetuning requires 16 or 32 bits
- inference: no need to store all activations: compute each layer’s output from its input only
- inference: no need to store gradients, momentum
- Inference can be done with RAM = nb of parameters / 2
- Full finetuning requires RAM = \(11\times\) the nb of parameters (according to EleutherAI), \(12\)-\(20\times\) according to UMass
- e.g., a 7b model then needs roughly \(7 \times 11 \simeq 77\) GB for full finetuning, vs. \(\simeq 3.5\) GB for inference with the rule above
- each parameter byte adds +1 B (gradient) + 2 B (Adam optimizer state: 1st and 2nd gradient moments) (see next slide)
- Can be reduced to \(\simeq 5\times\):
- gradient checkpointing
- special optimizers (1bitAdam, Birder…)
- offloading…
- Adam equations (a one-step numerical sketch follows below):
- \(m^{(t)} = \beta_1 m^{(t-1)} + (1-\beta_1) \nabla L(\theta^{(t-1)})\)
- \(v^{(t)} = \beta_2 v^{(t-1)} + (1-\beta_2) \left(\nabla L(\theta^{(t-1)})\right)^2\)
- Bias correction:
- \(\hat m^{(t)} = \frac{m^{(t)}}{1-\beta_1^t}\)
- \(\hat v^{(t)} = \frac{v^{(t)}}{1-\beta_2^t}\)
- \(\theta^{(t)} = \theta^{(t-1)} - \lambda\frac{\hat m^{(t)}}{\sqrt{\hat v^{(t)}} + \epsilon}\)
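A one-step numpy sketch of these updates, illustrating why Adam stores two extra tensors m and v of the same shape as the parameters (which is where much of the extra training memory goes); the hyperparameter values are the usual defaults:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m and v have the same shape as theta: two extra states per parameter."""
    m = beta1 * m + (1 - beta1) * grad          # 1st moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # 2nd moment
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```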
- PEFT greatly reduces RAM requirements:
- can keep LLM parameters frozen and quantized (qLoRA)
- store gradients + momentum only in 1% of parameters
- But:
- still need to backpropagate gradients through the whole LLM and save all activations
- with large data, PEFT underperforms full finetuning
VRAM usage
| Method | Bits | 7B | 13B | 30B | 70B | 110B | 8x7B | 8x22B |
|---|---|---|---|---|---|---|---|---|
| Full | 32 | 120GB | 240GB | 600GB | 1200GB | 2000GB | 900GB | 2400GB |
| Full | 16 | 60GB | 120GB | 300GB | 600GB | 900GB | 400GB | 1200GB |
| LoRA/GaLore/BAdam | 16 | 16GB | 32GB | 64GB | 160GB | 240GB | 120GB | 320GB |
| QLoRA | 8 | 10GB | 20GB | 40GB | 80GB | 140GB | 60GB | 160GB |
| QLoRA | 4 | 6GB | 12GB | 24GB | 48GB | 72GB | 30GB | 96GB |
| QLoRA | 2 | 4GB | 8GB | 16GB | 24GB | 48GB | 18GB | 48GB |
Training methods
| Method | Data size | Approach |
|---|---|---|
| Pretraining | >10T | Full training |
| Cont. pretr. | \(\simeq 100\)b | update: PEFT? |
| Finetuning | 1k … 1b | Adapt to task: PEFT |
| Few-Shot learning | < 1k | Guide, help the LLM |
Wrap-up
- With enough compute, prefer full-finetuning
- HF transformer, deepspeed, llama-factory, axolotl…
- With 1 “small” GPU, go for PEFT
- Without any GPU: look for alternatives
PEFT methods
- do not finetune all of the LLM parameters
- finetune/train a small number of (additional) parameters
We’ll focus on a few
- Additive finetuning: add new parameters
- Adapter-based: sequential adapter
- soft-prompt: prefix tuning
- others: ladder-side-networks
- Partial finetuning: modify existing parameters
- Lottery-ticket sparse finetuning
- Reparameterization finetuning: “reparameterize” weight matrices
- Hybrid finetuning: combine multiple PEFT
- manually: MAM, compacter, UniPELT
- auto: AutoPEFT, S3Delta-M
- Unified finetuning: unified framework
- AdaMix: MoE of LoRA or adapters
- SparseAdapter: prune adapters
- ProPETL: share masked sub-nets
Sequential adapters
\[X \leftarrow \mathrm{ReLU}(X\cdot W_{down}) \cdot W_{up} + X\]
with \(W_{down} \in R^{d\times k}\), \(W_{up} \in R^{k\times d}\)
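A minimal PyTorch sketch of this bottleneck adapter (dimensions d and k as above; a generic illustration, not the exact module of any specific library):

```python
import torch
import torch.nn as nn

class SequentialAdapter(nn.Module):
    """Bottleneck adapter: X <- ReLU(X · W_down) · W_up + X (residual connection)."""
    def __init__(self, d: int, k: int):
        super().__init__()
        self.down = nn.Linear(d, k)   # W_down in R^{d x k}
        self.up = nn.Linear(k, d)     # W_up in R^{k x d}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(torch.relu(self.down(x))) + x

# Inserted after a frozen transformer sub-layer; only the adapter's ~2·d·k parameters are trained.
adapter = SequentialAdapter(d=768, k=64)
```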
- Interesting extensions
- Parallel Adapter (parallel PEFT > sequential PEFT)
- CoDA: skip tokens in the main branch, not in the parallel adapter
- Tiny-Attention adapter: uses a small attention module as the adapter
- Adapter Fusion (see next slide):
- Train multiple adapters, then train the fusion
Prefix tuning
- Concat \(P_k,P_v \in R^{l\times d}\) before \(K,V\): \[head_i = Attn(xW_q^{(i)}, concat(P_k^{(i)},CW_k^{(i)}), concat(P_v^{(i)},CW_v^{(i)}))\]
- with \(C=\) context, \(l=\) prefix length (a usage sketch follows after this list)
- ICLR22 shows some form of equivalence
- Advantages:
- More expressive than adapters, as it modifies every attention head
- One of the best PEFT methods at very small parameter budgets
- Drawbacks:
- Does not benefit from increasing nb of parameters
- Limited to attention heads, while adapters may adapt the FFN…
- … and adapting the FFN is always better
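A minimal prefix-tuning sketch with the Hugging Face peft library; the model name and prefix length are placeholders, and `num_virtual_tokens` plays the role of \(l\) above:

```python
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")          # placeholder model
config = PrefixTuningConfig(task_type="CAUSAL_LM",
                            num_virtual_tokens=20)           # l = prefix length
model = get_peft_model(base, config)
model.print_trainable_parameters()   # only the prefix parameters are trainable
```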
Performance comparison
qLoRA = LoRA + quantized LLM
- Advantages:
- de facto standard: supported in nearly all LLM frameworks
- Many extensions, heavily developed, so good performance
- can be easily merged back into the LLM
- Drawbacks:
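A minimal LoRA / qLoRA sketch with transformers + peft; the model name, rank, and target modules are illustrative placeholders, and for qLoRA the frozen base model is loaded quantized (here 4-bit via bitsandbytes):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# qLoRA: keep the base LLM frozen and quantized (4-bit); train only the LoRA matrices
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B",   # placeholder
                                            quantization_config=bnb)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()   # typically well under 1% of the base parameters
```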
Adapter lib v3
- AdapterHubv3 integrates several families of adapters:
- Bottleneck = sequential
- Compacter = adapter with Kronecker prod to get up/down matrices
- Parallel
- Prefix, Mix-and-Match = combination Parallel + Prefix
- Unified PEFT functions: add_adapter(), train_adapter()
- heads after adapters: add_classification_head(), add_multiple_choice_head()
- In the HF lib, you can pre-load multiple adapters and select one active adapter:
# lora_config is a peft.LoraConfig; only the adapter selected with set_adapter() is active
model.add_adapter(lora_config, adapter_name="adapter_1")
model.add_adapter(lora_config, adapter_name="adapter_2")
model.set_adapter("adapter_1")
Ladder-side-networks
- Advantages:
- Do not backprop in the main LLM!
- Only requires forward passes in the main LLM
- Drawbacks:
- LLM is just a “feature provider” to another model
- \(\simeq\) enhanced “classification/generation head on top”
- Forward pass can be done “layer by layer” with “pipeline parallelism” (a sketch follows below):
- load 1 layer \(L_i\) in RAM
- pass the whole corpus: \(y_i=L_i(x_i)\)
- free memory and iterate with \(L_{i+1}\)
- LST: done only once for the whole training session!
- This approach received an outstanding award at ACL’2024.
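A minimal sketch of this layer-by-layer feature extraction. `load_layer(i)` is a hypothetical loader for the frozen layer \(L_i\); the point is that the activations fed to the side network are produced with forward passes only, one layer in memory at a time:

```python
import torch

@torch.no_grad()
def extract_features_layer_by_layer(load_layer, num_layers, corpus_embeddings):
    """Return the activations of every frozen layer, computed with forward passes only."""
    acts = [corpus_embeddings]        # activations later consumed by the side network
    x = corpus_embeddings
    for i in range(num_layers):
        layer = load_layer(i)         # load one layer L_i in RAM
        x = layer(x)                  # pass the whole corpus through it
        acts.append(x)
        del layer                     # free memory and iterate with L_{i+1}
    return acts
```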
Partial finetuning
- Add a linear layer on top and train it
- You may further backprop gradients deeper in the top-N LLM layers
- … Or just FT the top-N layers without any additional parameters
- Simple, old-school, it usually works well
- Fill the continuum between full FT and classifier-head FT (a layer-freezing sketch follows below):
- can FT the top 10%, 50%, 80% of params
- or FT the bottom 10%, 50% of params
- or FT intermediate layers / params
- or apply a sparse mask?
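A minimal PyTorch sketch of finetuning only the top-N layers plus a new linear head; the `gpt2` backbone and its `.h` block list are assumptions used for illustration:

```python
import torch.nn as nn
from transformers import AutoModel

base = AutoModel.from_pretrained("gpt2")          # placeholder backbone
N = 2                                             # number of top layers to finetune

for p in base.parameters():                       # freeze everything...
    p.requires_grad = False
for layer in base.h[-N:]:                         # ...then unfreeze the top-N transformer blocks
    for p in layer.parameters():
        p.requires_grad = True

head = nn.Linear(base.config.hidden_size, 2)      # new trainable linear layer on top
```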
Lottery-ticket sparse finetuning
- Lottery Ticket Hypothesis:
- Each neural network contains a sub-network (winning ticket) that, if trained again in isolation, matches the performance of the full model.
- Advantages:
- Can remove 90% of parameters with nearly no loss in performance (on image tasks)
- Drawbacks:
- Impossible to find the winning mask without first training the large model
- Can be applied to sparse FT (a mask-extraction sketch follows below):
- FT an LLM on a specific task/language
- extract the mask = the params that change most
- rewind the LLM and re-FT with the mask
- sparse finetunes can be combined without overlapping!
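A minimal sketch of the mask-extraction step: keep the parameters that moved most during the first finetuning pass. `theta_0` and `theta_ft` are assumed to be state_dicts of the original and finetuned model, and the 1% keep ratio is an arbitrary illustration:

```python
import torch

def extract_sparse_mask(theta_0, theta_ft, keep_ratio=0.01):
    """Mask = the keep_ratio fraction of parameters that changed most during the first FT."""
    masks = {}
    for name, p0 in theta_0.items():
        delta = (theta_ft[name] - p0).abs().flatten()
        k = max(1, int(keep_ratio * delta.numel()))
        threshold = delta.topk(k).values.min()
        masks[name] = (theta_ft[name] - p0).abs() >= threshold
    return masks

# Rewind to theta_0 and re-finetune, updating only the parameters where masks[name] is True.
```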
Wrap-up
- Various PEFT methods:
- Reduce model storage? RAM requirements?
- Require backprop through the LLM?
- Additional inference cost?
Finetuning (PEFT or full): advantages
- greatly improves performance on a target task, language, or domain
- digs knowledge up to the surface, ready to use
- gives the LLM desirable capabilities: instruction following, alignment with human preferences…
Finetuning (PEFT or full): drawbacks
Memorization, forgetting
Pretraining and FT use the same basic algorithm (SGD), but the differences in data size lead to differences in training regimes.
- Difference in scale:
- Pretraining ingests trillions of tokens
- Finetuning uses up to millions of tokens
- This leads to differences in regimes / behaviour:
- Pretraining learns new information
- Finetuning exhumes information it already knows
Why such a difference in regimes?
- Because of the way SGD works:
- When it sees one piece of information, it partially stores it in a few parameters
- But not enough to retrieve it later!
- When it sees it again, it accumulates it in its weights \(\rightarrow\) Memorization
- If it never sees it again, it will be overridden \(\rightarrow\) Forgetting
- How many times shall a piece of information be seen?
- Finetuning hardly learns new knowledge:
- small data \(\rightarrow\) not enough exposure
- Why not repeat 1000x the finetuning dataset?
- Because previous knowledge will be forgotten!
Why doesn’t pretraining forget?
- It does!
- But by shuffling the dataset, each piece of information is repeated all along training
- So how to add new knowledge?
- continued pretraining: replay + new data
- RAG, external knowledge databases
- LLM + tools (e.g., web search)
- knowledge editing (see ROME, MEND…)
Take home message
- PEFT is used to adapt to a domain, not to add knowledge
- RAG/LLM agents are used to add knowledge (but not at scale)
When to use PEFT?
| Method | Data size | Approach |
|---|---|---|
| Pretraining | >10T | Full training |
| Finetuning | 1k … 1b | Adapt to task: PEFT |
| Continual learning | 1k … 1b | update the LLM: PEFT? |
| Few-Shot learning | < 1k | Guide, help the LLM |
- Choose PEFT when constrained by available hardware:
- single GPU with VRAM < 40GB and an LLM larger than 1b \(\rightarrow\) PEFT!
- example: Adaptation to French
- full-finetune of 7b LLM on 1x 80GB-A100: CLAIRE
- CroissantLLM…
What can be done without any GPU?
- Running LLM is OK: see llama.cpp, llamafile, ollama…
- Avoid finetuning at all costs!
- llama.cpp: qLoRA supported, but not mature
- too slow to be usable
- With enough compute, prefer full-finetuning
- HF transformer, deepspeed, llama-factory, axolotl
- With 1 “small” GPU, go for PEFT (qLoRA)
- Without any GPU: look for alternatives (ICL, RAG…)