Transformers, scaling laws

Attention models

  • Intuition of hard attention:
    • Imagine a long text about history
    • You observe a new short text about year 1515
    • Your model shall focus / pick info from the relevant passage in the long text
  • How do you know if a passage is relevant?
    • compute a similarity between the passage and the query
    • similarity = dot-product
  • Select only the passage with the largest similarity
  • Intuition of soft attention:
    • Pick info from all passages, but weight them by their relevance
    • relevance = normalized similarity (by softmax)
  • Attention can be applied to many cases:
    • long texts, banks of vectors, output sequences of RNN…
  • more generally:
  • Let \(\{(k_i,v_i)\}\) be a bank of (key, value) vectors
  • Let \(q\) be a new vector: what is its value?
  • Compute the similarity between \(q\) and every \(k_i\): \(q \cdot k_i\)
  • Normalize with softmax: \(\alpha_i = Softmax (q \cdot k_i)\)
  • Compute the target value: \(v = \sum_i \alpha_i v_i\)
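A minimal sketch of this soft-attention lookup in PyTorch (random bank and query, purely illustrative):

import torch

d, n = 8, 5                            # vector dimension, number of (key, value) pairs
K = torch.randn(n, d)                  # bank of keys k_i
V = torch.randn(n, d)                  # bank of values v_i
q = torch.randn(d)                     # new query vector

scores = K @ q                         # similarities q . k_i, shape (n,)
alpha = torch.softmax(scores, dim=0)   # relevances, normalized with softmax
v = alpha @ V                          # target value: sum_i alpha_i v_i, shape (d,)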

Self-attention

  • attention of every \(x_t\) against all \(\{x_{t'}\}_{1\leq t'\leq T}\)
  • so every token generates a new “smoothed” version of itself, smoothed by all other tokens
  • so self-attention can be viewed as a layer that transforms an input seq into an output seq of the same length
  • problem: does not consider position at all (for now)
    • distance between tokens is irrelevant
  • relies on token content only
  • nb of parameters does not depend on seq length
  • For each target token:
    • Compute its dot-product with every other token
    • Normalize them all with softmax
    • Replace its vector by the weighted average of all tokens
  • Can be done with 2 inner FOR loops, or with matrix algebra
  • Let \(X_t\) be the embedding of input token \(t\)
  • 3 matrices (the parameters) are used to transform \(X_t\) into
    • a query vector: \(Q_t = X_t^T W_Q\)
    • a key vector: \(K_t = X_t^T W_K\)
    • a value vector: \(V_t = X_t^T W_V\)
  • Compute the similarities between \(Q_t\) and every \(K_{t'}\): \[a_{t'} = Q_t \cdot K_{t'}^T\]
  • Normalize to get the attention vector: \[\alpha = Softmax(a_1,\cdots,a_T)\]
  • gives the degree of relatedness between token \(X_t\) and all other tokens
  • Often represented by a heat map over the words of the sentence

Expressing dot-product with matrices:

\(Q=\begin{bmatrix} Q_1^T\\ \vdots\\ Q_T^T \end{bmatrix}\) \(K=\begin{bmatrix} K_1^T\\ \vdots\\ K_T^T \end{bmatrix}\)

\[\alpha = Softmax(QK^T)\]

  • warning: softmax over which dimension?

Scaled dot-product

  • \(q\) and \(k\) are \(d_k\)-dimensional vectors
  • assume the components of \(q\) and \(k\) have mean 0 and variance 1
  • then \(q\cdot k\) has variance \(d_k\)
  • so we scale the dot-product to bring the variance back to 1:

\[\alpha = Softmax\left( \frac{QK^T}{\sqrt{d_k}} \right) \]

  • We give each token a repr \(V_{w}\)
  • Pool them into a matrix for the seq: \(V = [V_{w_1},\dots,V_{w_N}]^T\)
  • We replace every target token repr by the average over its context (weighted by \(\alpha\))

\[ V' = Softmax\left( \frac{QK^T}{\sqrt{d_k}} \right) V\]

  • Caution about the matrix dimensions!
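Putting the pieces together, a minimal sketch of scaled dot-product self-attention in matrix form (random inputs and randomly initialized \(W_Q, W_K, W_V\); the softmax is taken over the key axis, i.e. each row of \(QK^T\)):

import torch

T, d_model, d_k = 10, 16, 16           # sequence length, embedding dim, key/query/value dim
X = torch.randn(T, d_model)            # token embeddings X_1 ... X_T

W_Q = torch.randn(d_model, d_k)
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V    # (T, d_k) each

scores = Q @ K.T / d_k ** 0.5          # (T, T) scaled dot-products
alpha = torch.softmax(scores, dim=-1)  # softmax over the key axis: each row sums to 1
V_out = alpha @ V                      # (T, d_k): each token becomes a weighted average of values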

Multi-head self-attention

  • Repeat with different \(W_Q,W_K,W_V\) to get multiple heads
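In practice the heads are rarely coded by hand; for instance PyTorch ships a module (toy shapes below, used here for self-attention, so query = key = value):

import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
x = torch.randn(2, 10, 16)             # (batch, sequence, embedding)
out, attn = mha(x, x, x)               # self-attention: query = key = value = x
print(out.shape, attn.shape)           # (2, 10, 16) and (2, 10, 10) (weights averaged over heads)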

Positional Encodings

  • Self-attention gives the same representation when you shuffle the words!
  • Inject information about position through a vector that encodes the position of each word
  • Naive approaches:
    • \(p=1,\dots,N\): not normalized + positions beyond the training lengths are never seen
    • \(p=0,0.06,\dots,1\): \(\Delta p\) depends on the sentence length
  • Better approach:
    • inspired by spectral analysis
    • positions are encoded along sinusoidal cycles of various frequencies

\[p_t^{(i)} = \begin{cases} \sin(w_k \cdot t) & \text{if } i=2k \\ \cos(w_k \cdot t) & \text{if } i=2k+1 \end{cases}\]

with \(d\) encoding dim and

\[w_k = \frac 1 {10000^{2k/d}}\]
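A minimal implementation sketch of these encodings (encoding dimension \(d\) assumed even here):

import torch

def positional_encoding(T, d):
    # p[t, 2k] = sin(w_k * t), p[t, 2k+1] = cos(w_k * t), with w_k = 1 / 10000^(2k/d)
    t = torch.arange(T, dtype=torch.float32).unsqueeze(1)        # positions, shape (T, 1)
    k = torch.arange(d // 2, dtype=torch.float32).unsqueeze(0)   # frequency index, shape (1, d/2)
    w = 1.0 / 10000 ** (2 * k / d)
    p = torch.zeros(T, d)
    p[:, 0::2] = torch.sin(w * t)
    p[:, 1::2] = torch.cos(w * t)
    return p                                                     # (T, d), same dim as word embeddings

pe = positional_encoding(T=50, d=16)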

  • Remarks:
    • by giving positional encodings the same dimension as word embeddings, we can sum them together
    • most positional information is encoded in the first dimensions, so summing them with word embeddings enables the model to leave the first dimensions free of semantics and dedicate them to positions.
  • Other types of positional encodings:
    • train 1 embed per absolute position
    • Alibi
    • Rotary

Transformer: “Attention is all you need”

  • Add a normalization after self-attention
    • parametric centering + scaling of each vector across its dimensions
    • better gradients (cf. paper)

  • Add another MLP to transform the output
  • Add residual connections: smoother loss landscape
  • Stack this block \(N\) times (see the sketch after this list)
    • not so much about abstraction
    • rather about redundancy, about having multiple paths
  • The transformer is designed for Seq2Seq
    • So it contains both an encoder and decoder
  • Same approach for the decoder
    • with cross-attention between matching encoder and decoder layers
    • with masks to prevent the decoder from looking at words \(t+1, t+2\dots\) when predicting word \(t\)
  • GPT family: only the decoder stack
  • BERT family: only the encoder stack (+ classifier on top)
  • T5, BART family: enc-dec
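A minimal sketch of one encoder block and a stack of \(N\) of them (illustrative dimensions; dropout, masking and pre-/post-norm details omitted):

import torch
import torch.nn as nn

class Block(nn.Module):
    """Self-attention + MLP, each followed by a residual connection and LayerNorm."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.att = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                  # x: (batch, sequence, d_model)
        a, _ = self.att(x, x, x)
        x = self.norm1(x + a)              # residual + normalization
        x = self.norm2(x + self.mlp(x))    # residual + normalization
        return x

encoder = nn.Sequential(*[Block() for _ in range(6)])   # stack the block N=6 times
y = encoder(torch.randn(2, 10, 64))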

QCM4

Q001 Q002 Q003 Q004 Q005 Q006 Q007 Q008 Q009
77.00% 69.00% 69.00% 85.00% 100.00% 0.00% 62.00% 92.00% 54.00%
Q010 Q011 Q012 Q013 Q014 Q015 Q016 Q017 Q018
100.00% 38.00% 77.00% 62.00% 46.00% 23.00% 77.00% 85.00% 38.00%

Self-supervised training

  • Manual annotation into supervised classes (e.g., emotion) is costly
  • Find an auxiliary task to learn representations
  • This task must force the model to learn useful information
  • Ex. for GPT:
    • Language modeling task:
    • predict the next word \(p(x_t|x_0,\dots,x_{t-1})\) (see the sketch after this list)
  • Ex. for BERT:
    • Masked language modeling:
    • predict a masked word \(p(x_t|x_0,\dots,x_{t-1},x_{t+1},\dots,x_T)\)
  • Ex. for ViT:
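For the language-modeling objective above, a minimal sketch with a pretrained Hugging Face causal LM (GPT-2 is used here purely as an illustration):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits                    # (1, T, vocab_size)
probs = torch.softmax(logits[0, -1], dim=-1)      # p(x_T | x_0, ..., x_{T-1})
print(tok.decode([probs.argmax().item()]))        # most likely next word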

  • Quest for models able to “absorb” huge datasets
    • Vision, speech: ConvNets (2012)
      • “Absorb” ImageNet
      • Higher layers == more abstract visual features
      • 2021: CoAtNet-7 (Conv + Attention): 2.4 billion parameters
    • Language: transformers (2018)
      • “Absorb” the written web
      • 2021: Wu Dao 2.0: 1.75 trillion parameters (10x GPT3)
  • Big NLP models absorb:
    • lexicon, syntax, semantics of NL
    • facts described in the web:
      • cooking recipes
      • historical events
      • famous people

Exploiting representations

  • With Zero-Shot learning
  • With few-shot learning
  • As frozen embeddings
  • With fine-tuning

Using LLMs: Zero-Shot Learning

  • “Give me a recipe to cook bacon”
    • “place the bacon in a large skillet and cook over medium heat until crisp.”
  • “The main characters in Shakespeare’s play Richard III are”
    • “Richard of Gloucester, his brother Edward, and his nephew, John of Gaunt”

(More on ZSL later…)

Using LLMs: Few-Shot Learning

  • can be done on a laptop
  • requires very few annotated data
  • a bit better than ZSL

Using LLMs: as frozen embeddings

  • can be done on a laptop
  • requires some annotated data
  • gives good results

Using LLMs: with fine-tuning

  • can be done on a desktop
  • requires some annotated data
  • gives the best results

Instruction finetuning

  • Foundation model = pretrained generic LLM that can be finetuned on various tasks
  • Instruction finetuning = pretrained LLM finetuned on instructions (QA, procedures, chat)
  • RLHF / DPO = finetuned LLM aligned with human values
  • pretrained LLM:
    • has all the knowledge inside
    • is not able to chat, answer questions…
  • instruction-finetuned LLM:
    • gains no new knowledge, but can interact with humans
  • aligned LLM:
    • does not say “bad” things

Zero-Shot learning

You may query LLM without further training:

  • “Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy”
    • “Positive”
  • “A is the son of B’s uncle. What is the family relationship between A and B?”
    • “cousins”
  • “A is the son of B’s uncle. What is B for A?”
    • “brother”
  • “On a shelf, there are five books: a gray book, a red book, a purple book, a blue book, and a black book. The red book is to the right of the gray book. The black book is to the left of the blue book. The blue book is to the left of the gray book. The purple book is the second from the right. Which book is the leftmost book?”
    • “The black book”

Prompt programming

  • is the art of designing prompts to perform a task
  • Prompting may be viewed as a way to constrain the generation
    • You may describe the task
    • You may give examples (few-shots)
    • You may give an imaginary context to “style” the result
  • How to describe the task:
    • direct task description
    • proxy task description
  • Direct task description:
    • “translate French to English”
  • Can be contextual:
    • “French: … English: …”
  • Direct description can combine tasks the model must know:
    • “rephrase this paragraph so that a 2nd grader can understand it, emphasizing real-world applications”
  • Proxy task description

This is a novel written in the style of J.R.R. Tolkien’s Lord of the Rings fantasy novel trilogy. It is a parody of the following passage:

“S. Jane Morland was born in Shoreditch …”

Tolkien rewrote the previous passage in a high-fantasy style, keeping the same meaning but making it sound like he wrote it as a fantasy; his parody follows:

  • Few-shot prompts:

English: Writing about language models is fun.
Roish: Writingro aboutro languagero modelsro isro funro.
English: The weather is lovely!
Roish:

  • “chain of thoughts”:
    • “think step by step…”
    • decompose a difficult task into steps
    • only works with large models (>100B parameters)
    • Efficient to solve math problems
  • New types of prompts discovered regularly:
    • Program of thoughts
    • Tree of thoughts
    • Analogical prompting
    • Prompting to verify its output
    • “Take a deep breath”
    • “This is important for my career”…
  • Going further: other applications of these models:
    • Vision Transformers (ViT)
    • protein folding (Nature: Jumper et al. 2021)
    • Code generation: Github Co-pilot
    • Image generation: DALL-E

https://web.stanford.edu/class/cs25/index.html

Transformers

  • Ubiquitous in Natural Language Processing
  • Dominating in Image Processing

LLMs, a new empirical research field

From observations to the beginnings of theories

  • Observations:
    • Memorization, compression, structuring and generalization
    • Emergent capabilities
  • Beginnings of theories: scaling
    • Chinchilla laws
    • Grokking
    • Double descent
    • Phase transitions, singularities…
  • Why are transformers so good?
  • Enable the model to scale:
    • with reasonable depth (<100)
    • multiply the number of “paths” information can flow through
    • but still share parameters across paths
  • RNN: cannot scale (training cost)
  • CNN: information only compressed through a “local funnel”, no paths
  • MLP: not efficient (lack of inductive bias)

Scaling laws

Scaling property of LLMs

  • scaling = if you add parameters, the model can store more information
  • assessed by measuring its performance on tasks
  • scaling law = power law = \(y(x) = ax^{-\gamma} +b\)
  • metric = test loss
  • \(\gamma\) = slope
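A quick way to fit such a law to measurements (the numbers below are made up for illustration), e.g. with scipy:

import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, gamma, b):
    return a * x ** (-gamma) + b

x = np.array([1e6, 1e7, 1e8, 1e9, 1e10])   # hypothetical scale (e.g. parameters or compute)
y = np.array([4.2, 3.1, 2.4, 1.9, 1.6])    # hypothetical test losses

(a, gamma, b), _ = curve_fit(power_law, x, y, p0=[10.0, 0.1, 1.0])
print(f"estimated exponent gamma = {gamma:.3f}")   # slope of the log-log curve (up to the offset b)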

Baidu paper 2017

GPT3-175b paper 2020

  • emergence of In-Context Learning!
    • provide examples of a task in the context, the output mimics the examples

Scaling laws for Neural LM 2020

Open-AI 2020

  • RL, protein, chemistry…
  • Scaling has been studied for a long time in ML (paper on learning curves)
  • But reducing test loss is linked to emergent abilities in transformers
    • Such emergence had never been observed in ML before

Chinchilla paper 2022

  • GPT3: train on 300b tokens, and scale parameters up
  • given a fixed FLOPs budget, there is an optimal balance between dataset size and number of parameters
  • Lesson: we need more data!

\(L(N)=\frac A {N^\alpha}\)

  • \(N\) = dataset size
  • \(\alpha \simeq 0.5\) (was 0.05 in GPT3 paper)
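A back-of-the-envelope consequence, using the roughly 20-tokens-per-parameter rule of thumb often quoted from the Chinchilla paper:

params = 70e9                   # e.g. a 70B-parameter model
tokens = 20 * params            # Chinchilla-style compute-optimal amount of training data
print(f"{tokens:.1e} tokens")   # ~1.4e12, i.e. about 1.4T training tokens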

  • Any really useful enhancement of transformers since 2017?
    • FlashAttention
    • Pre-layer norm
    • Parallel feed-forward and attention
    • Rotary + Alibi positional encodings

Anthropic paper 2022

  • smooth scaling results from a combination of abrupt emergences

Importance of scale

  • A large number of parameters is required simply to store all the factual information
  • But LLMs also compress information
    • and to efficiently compress, the LLM needs to store the algorithm instead of the data
  • Some new abilities of transformers only emerge after a certain scale:
    • Nobel-prize Philip Anderson: “Emergence is when quantitative changes in a system result in qualitative changes in behavior.”
  • for language: “Emergent Abilities of Large Language Models”

Emergence of few-shot prompting

Emergence of other abilities

Scaling with Mixture of Experts

  • Sparse activations per input
  • Each input goes through the same amount of computation, but along different paths
  • \(N\) experts \(E_i(x)\)
  • One gating network: \(p_i(x) = \frac {e^{h(x)_i}}{\sum_{j=1}^{M<N} e^{h(x)_j}}\)
  • First success in Translation: Switch Transformer
    • replace FFN in transformers with MoE
    • choose single highest-prob expert
  • Issue: unstable training; remedies:
    • selective precision: 32 bits for the exp computation, 16 bits for the rest
    • reduce init scale
    • more dropout for experts
  • MoE requires auxiliary loss to balance the load across experts (Shazeer et al., 2018)
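A minimal sketch of Switch-style top-1 routing (load-balancing loss and capacity limits omitted, dimensions illustrative):

import torch
import torch.nn as nn

class MoE(nn.Module):
    """Each token is routed to the single expert chosen by the gating network."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)   # h(x): one logit per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts))

    def forward(self, x):                            # x: (n_tokens, d_model)
        p = torch.softmax(self.gate(x), dim=-1)      # p_i(x)
        best = p.argmax(dim=-1)                      # single highest-prob expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = best == i
            if sel.any():
                out[sel] = p[sel, i:i+1] * expert(x[sel])   # scale the expert output by its gate prob
        return out

y = MoE()(torch.randn(32, 64))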

Scaling with finetuning

  • Small-data regime: finetuning is better than training from scratch

but catastrophic forgetting; hard to inject new knowledge

Opening LLMs

  • a.k.a. foundation models
  • a list, with openness level, available here
  • You have seen:
    • evolution from RNN to transformers
    • details of attention, self-attention, positional embeddings
    • hints at the reasons of their success
    • scaling law: we’re still far from limits
    • new scaling directions: sparse MoE

LLM costs

QCM5

Q001 Q002 Q003 Q004 Q005 Q006 Q007 Q008 Q009
17% 100% 75% 50% 33% 50% 17% 83% 92%
Q010 Q011 Q012 Q013 Q014 Q015 Q016 Q017 Q018
17% 50% 58% 33% 17% 50% 33% 42% 92%

Training costs

  • LLama2-70b: 5M€, 2M GPU.h
  • 1 epoch of training:
    • RedPyjama v2: 30T tokens
    • assume 100 tokens/s on one GPU: how long? (worked out after this list)
  • Hyper-parameter tuning???
    • cf. muP paper
  • Training huge LLM is very difficult: The LLM training handbook
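Back-of-the-envelope answer to the question above, assuming a single GPU at a constant 100 tokens/s:

tokens = 30e12                                 # RedPyjama v2: ~30T tokens
rate = 100                                     # assumed tokens per second on one GPU
years = tokens / rate / (3600 * 24 * 365)
print(f"{years:,.0f} years on a single GPU")   # ~9,500 years: massive parallelism is unavoidable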

Inference costs

  • Autoregressive generation
    • practice: estimate time/token when generating on your laptop
  • Serving many requests
    • the OpenAI API has become a “standard” interface for serving LLMs (including self-hosted ones)
    • benchmarks like HELM rely on this API
  • evaluating one model on HELM: $10k
  • distillation
  • unstructured pruning
    • see “Lottery ticket hypothesis”
    • result: sparse model, not GPU-compliant
  • structured pruning
    • see “Sheared Llama”
    • keep similar architecture
  • quantization
    • parameters: usually 16 bits, can get 4 bits
    • see GPTQ, bitsandbytes, AWQ
  • exercise: how much RAM is used with?
from transformers import AutoModelForCausalLM
m = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_4bit=True)
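A rough estimate, counting only the weights (4 bits ≈ 0.5 byte per parameter, ignoring activations and framework overhead):

params = 350e6                                               # opt-350m
print(f"~{params * 0.5 / 1e6:.0f} MB of weights in 4 bits")  # ~175 MB, plus overhead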
  • optimization (ollama, llama.cpp, vLLM, localai.io…)
  • often quantize to 4 bits by default
  • C++ code to run inference / serve LLM

Finetuning costs

  • Parameter-efficient finetuning
    • adapters
    • soft prompts
    • prefix tuning
    • low-rank adaptation
  • Main LLM parameters are frozen
  • Adapters
    • add a new sub-layer at every LLM layer
    • see Adapter hub
  • Soft prompts
    • Bypass the embedding layer and start the input with a few parameter vectors
    • see tuto
  • Prefix tuning
    • Same as soft prompt, but add new vectors at input of every layer
    • variants: P-tuning…
  • LoRA
    • augment MLP with parallel compress-uncompress layer
    • see blog
    • see HF how to
  • Finetuning with qLoRA
  • Full finetuning requires VRAM = 4x nb parameters
  • Better:
    • quantize to 4 bits, freeze all parameters, train only the 16-bit LoRA adapters
  • See HF how to
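A minimal sketch of the LoRA idea on a single linear layer (not the exact HF/PEFT implementation): the pretrained weight is frozen and only a low-rank update is trained:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = base(x) + scale * x A^T B^T, with base frozen and only A, B trainable."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)                 # freeze the pretrained weight
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)     # "compress" to rank r
        self.B = nn.Parameter(torch.zeros(d_out, r))           # "uncompress"; zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

y = LoRALinear(64, 64)(torch.randn(2, 64))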

Practice: self-attention

  • Create 3 random pytorch tensors, with dimension n: they represent your input sequence of length 3: \(X_1,X_2,X_3\)
  • Create 3 random matrices for queries, keys and values
  • Compute, in one operation, all the key tensors from all input tensors.
  • Do the same for values and queries.
  • Compute the dot-products between the queries Q and all keys K1, K2, K3
  • You should get 3x3 attention scores (scalars)
  • Normalize these attention scores with a softmax
  • Compute the sum of the input value tensors, weighted by their corresponding attention score
  • You obtain the output sequence of vectors.
  • Goal: perform classification after attention
  • Compress this variable-length final sequence into a single fixed-size vector with max-pooling
  • Add a final classification layer to predict 2 classes and put all this inside a pytorch nn.Module model

Limitations

  • Generate random scalar sequences:
    • class A: the sequence is composed of observations uniformly sampled between 0 and 1
    • class B: the sequence is composed of observations uniformly sampled between 1 and 2
  • Train your self-attentive classifier
  • Analyze
  • Generate random sequences:
    • class A: the first half of the sequence is composed of observations uniformly sampled between 0 and 1, and the second half between 1 and 2
    • class B: the first half between 1 and 2, and the second half between 0 and 1
  • Train your self-attentive classifier
  • Analyze

Refs

http://jalammar.github.io/illustrated-transformer/

Practice: attention

Buggy attentive RNN

class ModelAtt(Model):
    def __init__(self):
        super(ModelAtt, self).__init__()
        # learnable query vector used to score the RNN hidden states
        qnp = 0.1*np.random.rand(self.hiddensize)
        self.q = nn.Parameter(torch.Tensor(qnp))

    def forward(self, x):
        batch_size = x.size(0)
        hidden = self.init_hidden(batch_size)
        steps, last = self.rnn(x, hidden)             # steps: (batch, time, hidden)
        alpha = torch.matmul(steps, self.q)           # attention score per time step
        alpha = nn.functional.softmax(alpha, dim=1)   # normalize over time
        weighted = torch.mul(steps, alpha)            # intended: weight each hidden state by its attention
        rep = weighted.sum(dim=1)                     # pooled sequence representation
        out = self.fc(rep)
        return out, alpha
  • 2-class classification: perturbed sinusoid, up vs. down
def f(x,offset):
    return 0.3*math.sin(0.1*x+offset)+0.5

nex=100
nsteps=50
input_seqs = []
target_seqs = []
for ex in range(nex):
    offset = np.random.rand()
    input_seq=[f(x,offset) for x in range(nsteps)]
    cl = np.random.randint(2)
    target_seqs.append(cl)
    if cl==0: perturb = 0.05
    else: perturb = -0.05
    pos=np.random.randint(25,45)
    for t in range(pos,pos+5): input_seq[t]+=perturb
    input_seqs.append(input_seq)

input_seq = torch.Tensor(input_seqs)
input_seq = input_seq.view(nex,nsteps,1)
target_seq = torch.LongTensor(target_seqs)
  • Train attentive RNN:
    • Use 10000 epochs, LR=0.0001 and RMSprop optimizer
    • Use the CrossEntropyLoss() instead of the MSELoss() to learn the two classes
  • Does it learn to predict the two classes correctly? Is learning stable?
  • After training, plot both the input curve and the attention weights for the first 5 curves: does attention correctly spot the perturbation?
  • Try without the offset: what happens? Does attention spot the perturbation? Explain.
  • Try to find better hyper-parameters so that convergence is faster.
  • Modify the training loop so that random curves are generated directly inside the training loop: there are no more epochs, only an infinite stream of random batches: what happens?
  • Try with longer vs. shorter and smaller vs. bigger perturbations: in which cases does it work or not? How sensitive is the approach to perturbations?

Transformers zoo

  • See huggingface hub: 77k models!
  • NLP:
    • BERT, RoBERTA, ALBERT, GPT2, GPTJ, GPT-NeoX, BART, T5, T0pp…
    • French: CamemBERT, FlauBERT
    • Translation: T5, opus-MT, WMT, MT5
    • QA: longformer, gelectra, XLM
    • Summarization: BART, Pegasus, BigBird…
    • multilingual: XLM-R, MGPT, Bloom…
  • Image classification:
    • ResNet, CLIP-ViT, MIT-B2, Swin, DEIT…
  • Image segmentation:
    • Segformer, Detr-ResNet, Maskformer…
  • Image2image:
    • MirNET (enhance), cond-GAN, BiT, DiffusionCLIP, StableDiffusion…
  • Speech reco: wav2vec, HuBERT, Whisper…
  • Audio classif: XLSR, wav2vec, music-generation…
  • Text2Speech: Amadeus, TTS-hifigan, Tacotron2, Jets…
  • Audio2Audio: MTL-mimic, DCCR, DPRNN, Sepformer, DCUNet…
  • Text2image: Stable-diffusion, Waifu-diffusion, AI-dreamer, DALLE-Mega, BigGAN, DOODAD…
  • Image2Text: ViT-GPT2, TROCR, Donut, Flamingo…
  • Document-QA: LayoutLM
  • RL: decision-transformer, PPO, MLAgents…

Practice: huggingface / fairseq

  • how to use a transformer from HF

  • how to finetune

  • challenges: largest models

  • Go through “Huggingface lab”

Practice: scaling laws