\[p_t^{(i)} = \begin{cases} \sin(w_k \cdot t) & \text{if } i=2k \\ \cos(w_k \cdot t) & \text{if } i=2k+1 \end{cases}\]
with \(d\) the encoding dimension and
\[w_k = \frac{1}{10000^{2k/d}}\]
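A minimal NumPy sketch of this encoding (assuming an even dimension \(d\); the function name is illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d: int) -> np.ndarray:
    """Return a (seq_len, d) matrix whose row t is the encoding p_t."""
    positions = np.arange(seq_len)[:, None]        # t
    k = np.arange(d // 2)[None, :]                 # pair index k
    w = 1.0 / (10000 ** (2 * k / d))               # w_k = 1 / 10000^(2k/d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(w * positions)            # even dimensions i = 2k
    pe[:, 1::2] = np.cos(w * positions)            # odd dimensions  i = 2k+1
    return pe

# Example: encodings for 4 positions with d = 8
print(sinusoidal_positional_encoding(4, 8).shape)  # (4, 8)
```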
Update Oct 2024: Transformers Learn Higher-Order Optimization Methods for In-Context Learning - they learn in-context learning algorithms that converge exponentially faster than SGD, apparently similar to Iterative Newton's method
BloomChat (Together.AI 2023)
Microsoft study (Nov. 2023)
Objectives:
Analyzing an LLM
Baidu paper 2017
OpenAI 2020
Chinchilla paper 2022
\(L=\) pretraining loss
Google 2022: paper1, paper2; FLOPs, Upstream (pretraining), Downstream (accuracy on 17 tasks), Params
GPT3 paper 2020
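These works fit power laws relating the pretraining loss \(L\) to compute, data, or parameter count. A minimal sketch of fitting such a law with SciPy (the data points below are synthetic placeholders, not results from any of these papers):

```python
# Fit a Chinchilla-style power law L(N) = E + A / N**alpha
# to (model size, pretraining loss) pairs.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(N, E, A, alpha):
    return E + A / N**alpha

N = np.array([1e8, 3e8, 1e9, 3e9, 1e10])   # parameter counts (synthetic)
L = np.array([3.1, 2.9, 2.7, 2.55, 2.4])   # pretraining losses (synthetic)

(E, A, alpha), _ = curve_fit(scaling_law, N, L, p0=[1.5, 10.0, 0.1], maxfev=10000)
print(f"L(N) ≈ {E:.2f} + {A:.1f} / N^{alpha:.2f}")
```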
"eat" becomes "ate"
"draw" becomes "drew"
"vote" becomes
Anthropic paper 2022
Jason Wei has listed 137 emergent capabilities:
During training, LLMs may abruptly reorganize their latent representation space
Grokking exhibits structured latent space
First proof that overfitting may be addressed by increasing the number of parameters!
Objectives: Create your own scaling law
(from huggingface)
\[h_t = Ah_{t-1} + Bx_t\] \[y_t = Ch_t + D x_t\]
Ex: Mamba
(from PMC24)
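A tiny NumPy sketch of this recurrence with a scalar input and output (illustrative matrices; selective SSMs such as Mamba make the discretized parameters input-dependent and compute the scan in parallel):

```python
import numpy as np

def ssm_scan(A, B, C, D, x):
    """h_t = A h_{t-1} + B x_t ; y_t = C h_t + D x_t."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B * x_t           # state update
        ys.append(C @ h + D * x_t)    # output
    return np.array(ys)

# Example with a 2-dimensional hidden state
A = np.array([[0.9, 0.0], [0.1, 0.8]])
B = np.array([1.0, 0.5])
C = np.array([1.0, -1.0])
D = 0.0
y = ssm_scan(A, B, C, D, x=np.sin(np.linspace(0, 3, 10)))
print(y.shape)   # (10,)
```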
Principle:
Which tasks for training?
Multilingual models:
Key ingredients for success:
Data parallelism, model sharding, tensor parallelism, sequence parallelism, pipeline parallelism…
Designing a good prompt is an art:
This is a novel written in the style of J.R.R. Tolkien’s Lord of the Rings fantasy novel trilogy. It is a parody of the following passage:
“S. Jane Morland was born in Shoreditch …”
Tolkien rewrote the previous passage in a high-fantasy style, keeping the same meaning but making it sound like he wrote it as a fantasy; his parody follows:
English: Writing about language models is fun.
Roish: Writingro aboutro languagero modelsro isro funro.
English: The weather is lovely!
Roish:
CoT requires large models:
Analogy solving:
Directions: In the following question, a related pair of words or phrases is followed by five pairs of words or phrases. Choose the pair that best expresses a relationship similar to that in the original pair.
braggart :: modesty
A) fledgling : experience
B) embezzler : greed
C) wallflower : timidity
D) invalid : malady
E) candidate : ambition
To solve this problem, first we need to understand the relationship that exists between braggart and modesty. According to the sentence, 'braggart' is a person who talks too much about himself or herself and is usually not believed. On the other hand, 'modesty' is the opposite of this and denotes a person who does not talk too much about himself or herself. Thus, for 'modesty' to be a suitable answer, it should show the opposite of 'braggart'.

Now let's see whether each pair expresses a relationship similar to that between braggart and modesty.

Next we have 'fledgling', which means a person who is inexperienced and 'experience' which means knowledge gained through practical involvement. Thus, 'fledgling' is a person who has no experience and 'experience' is knowledge gained through practical involvement. Thus, 'fledgling' is the opposite of 'experience'. The relationship between these two words is similar to that between braggart and modesty, hence 'fledgling' is the answer.
1. Chunk the long sequence and encode each chunk independently.
2. For each token to generate:
   - the long-context input is composed of the (long) past KV-cache + the current tokens
   - the KV-cache is composed of:
     - a (small) initial sequence that is kept
     - a (long) evicted sequence
   - evicted KV entries are stored in external memory
   - at test time, a lookup f() selects KV entries from external memory to add to the small context (see the sketch below)
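A toy sketch of this eviction + lookup scheme, with a simple dot-product score standing in for the lookup function f(); the class, function, and parameter names are illustrative, not from an actual long-context library:

```python
import numpy as np

class KVStore:
    """External memory holding evicted (key, value) pairs."""
    def __init__(self):
        self.keys, self.values = [], []

    def add(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def lookup(self, query, top_k=4):
        # f(): select the evicted KV pairs whose keys best match the query
        if not self.keys:
            return [], []
        scores = np.array([query @ k for k in self.keys])
        idx = np.argsort(-scores)[:top_k]
        return [self.keys[i] for i in idx], [self.values[i] for i in idx]

def evict(cache, store, n_initial=4, n_recent=64):
    """Keep the initial prefix and the recent window; move the middle to the store."""
    while len(cache) > n_initial + n_recent:
        k, v = cache.pop(n_initial)   # evict right after the kept prefix
        store.add(k, v)

# Usage on random placeholder keys/values
store = KVStore()
cache = [(np.random.randn(8), np.random.randn(8)) for _ in range(200)]
evict(cache, store)
ret_k, ret_v = store.lookup(query=np.random.randn(8), top_k=4)
print(len(cache), len(store.keys), len(ret_k))   # 68 132 4
```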
There’s some hope though…
Loss: UL2
MAGMAX: Leveraging Model Merging for Seamless Continual Learning
Forgetting by finetuning does not work: LLMs don’t forget internally: https://arxiv.org/abs/2409.02228
Finetune/train a small number of (additional) parameters (see the LoRA-style sketch below)
… or just finetune the top-N layers without any additional parameters
View the LLM as a feature provider
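To make the first point concrete, a minimal LoRA-style adapter in PyTorch: freeze the pretrained weight and train only a low-rank update \(W + \frac{\alpha}{r} BA\). This is a simplified sketch, not the actual peft library implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # freeze pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # only the small A and B matrices are trained
```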
Method | Bits | 7B | 13B | 30B | 70B | 110B | 8x7B | 8x22B |
---|---|---|---|---|---|---|---|---|
Full | AMP | 120GB | 240GB | 600GB | 1200GB | 2000GB | 900GB | 2400GB |
Full | 16 | 60GB | 120GB | 300GB | 600GB | 900GB | 400GB | 1200GB |
LoRA/GaLore/BAdam | 16 | 16GB | 32GB | 64GB | 160GB | 240GB | 120GB | 320GB |
QLoRA | 8 | 10GB | 20GB | 40GB | 80GB | 140GB | 60GB | 160GB |
QLoRA | 4 | 6GB | 12GB | 24GB | 48GB | 72GB | 30GB | 96GB |
QLoRA | 2 | 4GB | 8GB | 16GB | 24GB | 48GB | 18GB | 48GB |
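A rough sanity check on the "Full / AMP" row: with Adam and mixed precision, each parameter typically needs fp32 master weights, fp16 weights, fp16 gradients, and two fp32 optimizer moments (about 16 bytes in total), ignoring activations. A back-of-envelope helper using these usual approximations (not values taken from the table's source):

```python
def full_amp_finetune_gb(n_params: float) -> float:
    # 4 (fp32 master) + 2 (fp16 weights) + 2 (fp16 grads) + 8 (Adam moments) bytes/param
    bytes_per_param = 4 + 2 + 2 + 8
    return n_params * bytes_per_param / 1e9

print(full_amp_finetune_gb(7e9))   # ~112 GB, in the same ballpark as the ~120 GB above
```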
Method | Data size | Notes |
---|---|---|
Pretraining | >10T | Full training |
Finetuning | 1k … 1b | Adapt to task: PEFT |
Continual learning | 1k … 1b | Update the LLM: PEFT? |
Few-shot learning | < 1k | Guide, help the LLM |
Great pedagogical perspective on LLMs by Sasha Rush: video
Computing FLOPs: https://kipp.ly/transformer-inference-arithmetic/
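The linked post boils down to a standard rule of thumb: roughly \(2N\) FLOPs per generated token for a forward pass of an \(N\)-parameter model; the common companion rule for training is roughly \(6ND\) FLOPs over \(D\) tokens. A tiny helper applying both (the example numbers are illustrative):

```python
def inference_flops_per_token(n_params: float) -> float:
    return 2 * n_params              # forward pass only

def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens   # forward + backward over the whole dataset

print(f"{inference_flops_per_token(7e9):.1e}")   # ~1.4e10 FLOPs per token
print(f"{training_flops(7e9, 2e12):.1e}")        # ~8.4e22 FLOPs
```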