Transformer and LLM

Transformer

  • Transformer: (Vaswani et al., Google, 2017)

Details of the stack

  • Bottom:
    • convert input tokens to static embedding vectors \(X_t\)
      • table lookup in Embeddings matrix
      • Embeddings trained along with all other parameters (\(\neq\) contrastive)
    • 3 matrices transform input embeddings \(X\) into \(Q,K,V\)

Positional Encodings

  • Self-attention gives the same representation when you shuffle the words!
  • Inject information about position through a vector that encodes the position of each word
  • Naive approaches:
    • \(p=1,\dots,N\): not normalized + positions beyond the training length \(N\) are never seen
    • \(p=0,0.06,\dots,1\): the step \(\Delta p\) depends on sentence length
  • Better approach:
    • inspired by spectral analysis
    • positions are encoded along sinusoidal cycles of various frequencies

\[p_t^{(i)} = \begin{cases} \sin(w_k \cdot t), & \text{if } i=2k \\ \cos(w_k \cdot t), & \text{if } i=2k+1 \end{cases}\]

with \(d\) encoding dim and

\[w_k = \frac 1 {10000^{2k/d}}\]
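
A minimal numpy sketch of these encodings (assuming an even encoding dimension \(d\)):

```python
import numpy as np

def sinusoidal_positional_encodings(seq_len: int, d: int) -> np.ndarray:
    """Return a (seq_len, d) matrix; row t is the positional encoding p_t."""
    P = np.zeros((seq_len, d))
    positions = np.arange(seq_len)[:, None]      # t = 0 .. seq_len-1
    k = np.arange(d // 2)[None, :]               # pair index k
    w = 1.0 / (10000 ** (2 * k / d))             # w_k = 1 / 10000^(2k/d)
    P[:, 0::2] = np.sin(w * positions)           # even dims: sin(w_k * t)
    P[:, 1::2] = np.cos(w * positions)           # odd dims:  cos(w_k * t)
    return P

# Same dimension as the word embeddings X, so they can simply be summed:
# X = X + sinusoidal_positional_encodings(T, d)
```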

  • Remarks:
    • by giving positional encodings the same dimension as word embeddings, we can sum them together
    • most positional information is encoded in the first dimensions, so summing them with word embeddings enables the model to leave the first dimensions free of semantics and dedicate them to positions.

Multi-head self-attention

  • several attention heads run in parallel
    • their outputs are concatenated after self-attention
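
A minimal PyTorch sketch of multi-head self-attention (head split and concatenation only; a real implementation adds masking and dropout):

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        # 3 matrices transform the input embeddings X into Q, K, V
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)  # applied after concatenating heads

    def forward(self, x):                            # x: (batch, seq, d_model)
        B, T, _ = x.shape
        def split(t):                                # (B, T, d_model) -> (B, heads, T, d_head)
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        att = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_head), dim=-1)
        out = att @ v                                # one attention per head, in parallel
        out = out.transpose(1, 2).reshape(B, T, -1)  # concat the head outputs
        return self.out_proj(out)
```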

Normalisation

  • Add a normalization after self-attention

MLP = feed forward

  • Add another MLP to store/inject knowledge
  • Add residual connections: smooth loss landscape

Layer

  • Stack this block \(N\) times
  • Gives time to reason = execution steps of a program
  • Enables redundancy: Mechanistic interpretability:
    • developed by Anthropic
    • multiple/concurrent circuits
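
A minimal sketch of one such block, combining the self-attention, normalization, MLP, and residual bullets above, then stacked \(N\) times (sizes are illustrative):

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)          # normalization around self-attention
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(                   # feed-forward MLP: stores/injects knowledge
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]               # residual connection
        x = x + self.mlp(self.norm2(x))             # residual connection
        return x

# Stack this block N times
layers = nn.ModuleList([Block() for _ in range(12)])
```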

Encoder-decoder

  • The transformer is designed for Seq2Seq
    • So it contains both an encoder and decoder
  • Same approach for decoder
    • with cross-attention from the encoder to the matching decoder layers
    • with masks to prevent the decoder from looking at words \(t+1, t+2\dots\) when predicting \(t\) (see the mask sketch after this list)
  • GPT family: only the decoder stack
  • BERT family: only the encoder stack (+ classifier on top)
  • T5, BART family: enc-dec
  • pure encoders (BERT) have been superseded by enc-dec (T5)
    • because T5 learns multiple tasks at once, vs. 1 task for BERT
  • the advantage of the denoising loss decreases with scale
  • denoising losses are less efficient than next-token prediction => the largest LLMs are all decoders
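
A toy sketch of that causal mask (assuming PyTorch; the scores are random stand-ins for \(q \cdot k\)):

```python
import torch

T = 5
# Causal mask: position t may only attend to positions <= t
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = torch.randn(T, T)                        # raw attention scores
scores = scores.masked_fill(mask, float("-inf"))  # hide words t+1, t+2, ...
weights = torch.softmax(scores, dim=-1)           # each row sums to 1 over visible positions
```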

Inductive bias of transformer

  • Assume discrete inputs (tokens)
  • All positions in sentence have equal importance
  • Relates tokens based on similarities / content
  • 2 major components with different roles:
    • self-att focuses on relations
    • MLP injects knowledge
  • Solve limitations of previous models
    • No bottleneck of information (as in ConvNet, RNN, seq2seq…)
    • No preference for “recent tokens” (as in RNN)
    • No constraints of locality (as in CNN)
      • … but “lost in the middle” effect
  • after learning, a transformer can be viewed as a semi-Turing Machine
  • proof that transformers learn in context with gradient descent
  • another proof that they can apply temporal difference (an RL algo)
  • deep learning models learn algorithms to compress information
  • conclusion: 2 reasoning paths:
    • depth = nb of layers
    • time = stacking computation again on top of the growing sequence (each generated token adds a step)

Update Oct 2024: Transformers Learn Higher-Order Optimization Methods for In-Context Learning - They learn exponentially better learning algorithms than SGD, apparently similar to Iterative Newton’s method

Desirable properties

  • Can scale
    • more layers => capture more information
  • Can “absorb” huge datasets
    • store information == same as database?
  • Can “compress” information
    • much better than database!
  • Transformers progressively replaced other models in many modalities:
    • Image: token = small piece of image
    • Audio: token = small segment of sound
    • Video, code, DNA…

  • Language models: a special place
    • “Absorb” the written web == all human knowledge
    • by far the largest transformers
    • 2021: Wu Dao 2.0: 1.75 trillion parameters
    • Central wrt other modalities (see multimodality)

Terminology

  • Activations: output of each layer
  • Embeddings: outputs of the embeddings layer
  • Latent representations: all activations
  • LM head: final linear layer that outputs one score per vocabulary unit

LLM

Life cycle of LLM

Open-source community

  • Extremely important for LLMs:
    • “We have no moat” (Google, 2023)
    • Main contributors in: pretraining, finetuning, model merging, dissemination, efficiency, evaluation

  • Main open-source actors:
    • Companies: Meta, HuggingFace, Eleuther AI, Mistral, Together AI, Cerebras…
    • Civil society: enthusiasts, geeks (TheBloke, Teknium…)
    • Academics
  • Online: Hugging Face Hub, Discord
  • Towards specialization:
    • Foundation: Meta, Eleuther, Mistral…
    • Prompting: CoT, PoT, AoT…
    • Finetuning: >600k models on HF
    • Integrators: LangChain, DSPy, Coala…
    • Academics: theory, app domains…
  • Unlike code, there is a continuum between “fully open” and “closed” source:
    • distribute model weights
    • distribute code
    • distribute training logs
    • distribute training data

LLM and AI-Act

Bloom: the first open-source LLM

  • Bloom training led by T. Le Scao & A. Fan (PhDs in Synalp)
  • Our participation in FR-MedQA (DEFT, June 2023):
    • (ZSL) qBloomZ 27.9%
    • (ZSL) Llama-65b 21.8%

BloomChat (Together.AI 2023)

Microsoft study (Nov. 2023)

Wrap-up

  • LLMs are novel tools to
    • interact naturally with humans
    • access world knowledge + common-sense
    • reason, plan, interact with code
  • They will be seamlessly integrated into most software
    • as modules within high-level programs
    • as data processors + generators
    • as cheap substitutes to humans in boring tasks

Practice: LLM 1

Objectives:

  • Intro to 2 libraries: transformers and ollama
  • Analyze an LLM to map course concepts onto the transformers library
  • Advanced use: function calling with ollama

Tutorial on function calling

Analyzing an LLM

  • load the smallest qwen2.5 with Huggingface transformers with mod=AutoModel.from_pretrained(…)
  • look at its structure with the help of “for n,p in mod.named_parameters()”
  • another option is “for n,p in mod.named_modules()”
  • Describe the model: How many layers? Hidden dimension size? Types of normalizations?…
  • Hint: “type()” gives you the full name of a variable; you can then view its source code in your conda/pip env
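
A possible starting sketch for this exercise; the model id "Qwen/Qwen2.5-0.5B" is assumed to be the smallest qwen2.5 checkpoint on the Hub, adapt it if needed:

```python
from transformers import AutoModel

# Assumed smallest qwen2.5 checkpoint
mod = AutoModel.from_pretrained("Qwen/Qwen2.5-0.5B")

# List every parameter tensor with its shape
for n, p in mod.named_parameters():
    print(n, tuple(p.shape))

# Or list the sub-modules (layers, normalizations, ...)
for n, m in mod.named_modules():
    print(n, type(m).__name__)
```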

Pretraining scaling laws

Chinchilla law

Scaling LLMs

  • The more data you train on
    • the more the LLM knows
    • the better the LLM generalizes
  • scaling law = power law = \(y(x) = ax^{-\gamma} +b\)
  • \(y(x) =\) test loss
  • \(\gamma\) = slope
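
A sketch of fitting such a power law with scipy; the (dataset size, test loss) points below are made up for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, gamma, b):
    return a * x ** (-gamma) + b

# Hypothetical (training set size, test loss) measurements
x = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
y = np.array([3.9, 3.1, 2.6, 2.3, 2.1])

(a, gamma, b), _ = curve_fit(power_law, x, y, p0=[10.0, 0.1, 1.0])
print(f"slope gamma = {gamma:.3f}, irreducible loss b = {b:.2f}")
```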

Baidu paper 2017

Scaling laws for Neural LM 2020

Open-AI 2020

  • RL, protein, chemistry…

Chinchilla paper 2022

  • GPT3 2020: inc. model capacity
  • Chinchilla 2022: inc. data

\(L=\) pretraining loss

Google 2022 (paper1, paper2): plots relating FLOPs, upstream (pretraining) performance, downstream accuracy (on 17 tasks), and number of params

Emerging capabilities

  • Scaling laws have existed in Machine Learning for a long time (cf. paper on learning curves)
  • But it’s the first time they result from emerging capabilities!

GPT3 paper 2020

  • emergence of “In-Context Learning”
    • = capacity to generalize the examples in the prompt
  • example:
"eat" becomes "ate"
"draw" becomes "drew"
"vote" becomes

Anthropic paper 2022

  • shows that the scaling law results from a combination of emerging capabilities

Jason Wei has exhibited 137 emerging capabilities:

  • In-Context Learning, Chain-of-thought prompting
  • PoT, AoT, Analogical prompting
  • procedural instructions, anagrams
  • modular arithmetics, simple maths problems
  • logical deduction, analytical deduction
  • physical intuition, theory of mind ?

Phase transitions during training

During training, LLMs may abruptly reorganize their latent representation space

Grokking exhibits structured latent space

LLM training dynamics

  • High-dim training is not a “continuous”/regular process:
    • Phase transitions first observed in 2015 in ConvNets
    • Later studied in LLMs
  • Traditional ML precepts invalidated:
    • Overfitting is not bad
    • Finding global optimum is not required

Definitions

  • Loss = LLM error function to be minimized during training
  • Loss landscape = surface/manifold of the loss \(L(\theta)\) as a function of LLM parameters \(\theta\)
  • SGD = training algorithm that iteratively updates \(\theta\) following the direction of the gradient \(-\nabla_{\theta} L\) to progressively reduce the error \(L(\theta)\) made by the LLM on the training corpus
  • Objective of training = finding a good \(\theta^*\) that gives the smallest possible \(L(\theta)\)
    • This can be viewed as navigating on the loss landscape to find a minimum/valley in the loss landscape
  • Overfitting = “learning by heart” the training dataset
    • = finding an optimal \(\theta^*\) that gives a minimum \(L(\theta)\) on the training corpus, but a large loss on another corpus
    • The real objective of training is to find a \(\theta^*\) that generalizes, i.e., that has a low error on other datasets than the training corpus
  • Regularization = Ways to prevent overfitting and still get generalization
    • reduce nb of parameters, minimize \(||\theta||^2\), dropout, reduce batch size, add noise to data…

Tishby’s bottleneck

Tishby, 2015:

  • deep learning training goes through 2 phases: fitting (memorization), then compression

Double descent

  • First proof that overfitting may be addressed by increasing the nb of parameters!

  • Belkin et al., 2018:

  • Double descent occurs when:
    • the model is over-parameterized
    • there’s a strong regularization (e.g., L2)
  • Intuition:
    • With more parameters, many optima exist
    • With the help of regularization, the model may find a really good one

Warning

  • grokking is a phase transition that occurs during training
  • double descent is not a phase transition that occurs during training!
    • double descent helps to understand the necessary conditions for grokking

Back to grokking

  • First observed by Google in Jan 2022
    • Trained on arithmetic tasks
    • Must wait well past the point of overfitting to generalize
  • It often occurs “at the onset of Slingshots” Apple paper
    • Slingshots = type of weak regularization = when “numerical underflow errors in calculating training losses create anomalous gradients which adaptive gradient optimizers like Adam propagate” blog
    • post-grokking solutions = flatter minima
  • Training involves multiple phases: grokking is one of them MIT paper
    • generalization, grokking, memorization and confusion
  • When representations become structured, generalization occurs:

  • The 4 phases depend on hyper-parameters:

  • Why do we observe phases during training?
    • NYU paper 2023
    • Because of competing sub-networks: a dense one for memorization and a sparse one for generalization

  • COLM’24 paper: dual view of double descent and grokking, and training regimes

Wrap-up: scaling parameters

  • It’s theoretically better to have over-parameterized models, as it unlocks double descent and potentially grokking
  • But Kaplan’s laws promoted overly large LLMs; Chinchilla’s law makes sizes more reasonable
  • Scaling laws are precious tools to design experiments before starting training
  • We have known since 2015 that training deep models involves multiple phases
  • We now know that LLMs have 4 possible phases, depending on the hyper-parameters
  • We know these phase transitions come from competing sub-networks (impossible without over-parameterization)

Practice: scaling laws

Objectives: Create your own scaling law

Tutorial

Design of LLM

Choice of architecture

  • Focus on representations: RAG, multimodal, recommendation
    • contrastive learning: S-BERT, BGE, E5
  • Classification: BERT? More and more: GPT
  • Focus on generation:
    • Transformer-based (GPT family)
    • MoE
    • SSM: S4, Mamba
    • Diffusion

Mixture of Experts

(from huggingface)

Mixture of Experts

  • Main advantage: reduced cost
    • Ex: 4-bit Mixtral-8x7b fits in 24GB VRAM, with only 7b params active during the forward pass
  • Drawbacks:

State-Space Models

  • Discrete SSM equation similar to RNN:

\[h_t = Ah_{t-1} + Bx_t\] \[y_t = Ch_t + D x_t\]

  • but A,B,C,D are computed from the input: they depend on the context
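
A toy sketch of this recurrence (here with fixed random matrices; in selective SSMs such as Mamba they are computed from the input):

```python
import numpy as np

def ssm_scan(x, A, B, C, D):
    """x: (T, d_in); returns y: (T, d_out) using h_t = A h_{t-1} + B x_t, y_t = C h_t + D x_t."""
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = A @ h + B @ x[t]          # state update (recurrent, like an RNN)
        ys.append(C @ h + D @ x[t])   # readout
    return np.stack(ys)

# Toy dimensions for illustration
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)) * 0.1
B, C, D = rng.normal(size=(4, 2)), rng.normal(size=(3, 4)), rng.normal(size=(3, 2))
y = ssm_scan(rng.normal(size=(10, 2)), A, B, C, D)
```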

Mamba

Mamba-minimal implementation

Ex: Mamba

  • scaling law comparable to the transformer (up to \(10^{20}\) FLOPs)
  • linear \(O(n)\) in context length
  • faster inference, but slower training

Diffusion transformer

  • Diffusion models: forward noise addition
    • Adds noise, step by step to input \[q(x_t|x_{t-1}) = N(x_t;\sqrt{1-\beta_t} x_{t-1}, \beta_t I)\]
  • backward denoising process
    • starts from white noise
    • reverse the forward process with a model (U-Net or transformer): \[p_\theta(x_{t-1}|x_t) = N(x_{t-1}; \mu_\theta(x_t,t), \Sigma_\theta(x_t,t))\]
    • iteratively denoise
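
A sketch of the forward noising step \(q(x_t|x_{t-1})\) with a linear \(\beta_t\) schedule (toy data, illustrative values):

```python
import torch

T_steps = 1000
betas = torch.linspace(1e-4, 0.02, T_steps)      # noise schedule beta_t

def forward_step(x_prev, t):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    beta = betas[t]
    return torch.sqrt(1 - beta) * x_prev + torch.sqrt(beta) * torch.randn_like(x_prev)

x = torch.randn(8, 16)                           # toy "clean" input
for t in range(T_steps):                         # after many steps, x is close to white noise
    x = forward_step(x, t)
```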

Many text diffusion models

(from PMC24)

Comparison

  • Diffusion models are much slower than autoregressive LLMs, because of the many sampling steps
  • But they produce a complete sentence at each step

Case study: Llama3.1

  • Main constraint = training compute = \(3.8\times 10^{25}\) FLOPs
  • Chinchilla scaling laws not precise enough at that scale, and not for end-task accuracy
  • so they recompute their own scaling laws on the ARC-Challenge benchmark!
  • They fix compute \(c\) (from \(10^{18}\) to \(10^{22}\) FLOPs)
  • They pick a model size \(s\), deduce from \((s,c)\) the nb of data points \(d\) the model can be trained on
  • They train it, get the dev loss \(l(s,c,d)\)
  • They plot the point \((x=d,y=l)\) and iterate for a few other \(s\)
    • For each \(c\), they get a quadratic curve: why?
  • Because:
    • the largest model is more costly to train, so it’s trained on the fewest data (\(x_{min}\))
    • when it’s too large, it’s trained on too few data \(\rightarrow\) high loss
    • the smallest model is trained on lots of data (\(x_{max}\))
    • but it has too few parameters to memorize information \(\rightarrow\) high loss
    • so there’s a best compromise in between
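
A sketch of how one IsoFLOP curve could be fit and its minimum located, assuming a quadratic in \(\log d\); the points are made up:

```python
import numpy as np

# Hypothetical (nb of training tokens, dev loss) points for one fixed compute budget c
d = np.array([2e9, 5e9, 1e10, 2e10, 5e10])
loss = np.array([2.31, 2.20, 2.17, 2.19, 2.28])

# Fit a quadratic in log(d): the IsoFLOP curve
a, b, c0 = np.polyfit(np.log(d), loss, deg=2)

# The compute-optimal data size is the minimum of the parabola
log_d_opt = -b / (2 * a)
print(f"compute-optimal nb of tokens ~ {np.exp(log_d_opt):.3e}")
```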

  • Terms to remind:
    • compute = nb of operations to train
    • IsoFLOP curve = plot obtained at a constant compute
    • compute-optimal model = minimum of an IsoFLOP curve
  • But these scaling laws are mostly computed at “low” FLOPs range
  • So they compute both the dev loss and the accuracy for these models and for the largest Llama2 models
  • Assume accuracy = sigmoid(test loss)
  • Extrapolate the scaling law with this relation, and get 405b parameters

Pretraining

  • After data processing, the most important step:
    • store all knowledge into the LLM
    • develop emergent capabilities
  • Most difficult step

Principle:

  • Train model on very large textual datasets
    • ThePile
    • C4
    • RefinedWeb

Which tasks for training?

  • predict next token (GPT)
  • predict masked token (BERT)
  • next sentence prediction (BERT)
  • denoising texts (BART)
  • pool of “standard” NLP tasks (T5, T0pp):
    • NLI, sentiment, MT, summarization, QA…

Multilingual models:

  • old models:
    • XLM-R
    • trained on 100 languages
    • Bloom, M-BERT, M-GPT…
  • new models:
    • Qwen2: 29 languages
    • all LLMs!

Training

  • Next token prediction

Training algorithm: SGD

  • Initialize parameters \(\theta\) of LLM \(f_\theta()\)
  • Stochastic Gradient Descent:
    • Sample 1 training example \((x_{1\dots T-1},y_T)\)
    • Compute predicted token \(\hat y_T = f_\theta(x_{1\dots T-1})\)
    • Compute cross-entropy loss \(l(\hat y_T, y_T)\)
    • Compute gradient \(\nabla_\theta l(\hat y_T, y_T)\)
    • Update \(\theta' \leftarrow \theta -\epsilon \nabla_\theta l(\hat y_T, y_T)\)
    • Iterate with next training sample
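
A minimal PyTorch sketch of this SGD step on next-token prediction (toy model and made-up data, not the actual LLM code):

```python
import torch
import torch.nn.functional as F

vocab, d = 100, 32
# Toy "LLM": embedding + linear LM head
model = torch.nn.Sequential(torch.nn.Embedding(vocab, d), torch.nn.Linear(d, vocab))
epsilon = 0.1

# Sample 1 training example: context x_{1..T-1} and next token y_T
x = torch.randint(0, vocab, (1, 8))
y = torch.randint(0, vocab, (1,))

logits = model(x)[:, -1, :]          # predicted scores for the next token
loss = F.cross_entropy(logits, y)    # cross-entropy loss l(y_hat, y)
loss.backward()                      # compute the gradient (back-propagation)

with torch.no_grad():                # theta' <- theta - epsilon * grad
    for p in model.parameters():
        p -= epsilon * p.grad
        p.grad = None
```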

Compute gradient: back-propagation

  • Compute final gradient \(\frac{\partial l(\hat y_T, y_T)}{\partial \theta_{L}}\)
  • Iterate backward:
    • Given “output” gradient \(\frac{\partial l(\hat y_T, y_T)}{\partial \theta_{i+1}}\)
    • Compute “input” gradient: \(\frac{\partial l(\hat y_T, y_T)}{\partial \theta_{i}} = \frac{\partial l(\hat y_T, y_T)}{\partial \theta_{i+1}} \frac{\partial \theta_{i+1}}{\partial \theta_i}\)
    • every operator in pytorch/tensorflow is equipped with its local derivative \(\frac{\partial \theta_{i+1}}{\partial \theta_i}\)
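
In practice this backward iteration is automated by autograd; a tiny PyTorch sketch:

```python
import torch

x = torch.tensor([1.0, 2.0])
w1 = torch.tensor([[0.5, -0.3], [0.1, 0.8]], requires_grad=True)
w2 = torch.tensor([1.0, -1.0], requires_grad=True)

h = torch.tanh(w1 @ x)      # layer i   (local derivative known to autograd)
y = w2 @ h                  # layer i+1 (local derivative known to autograd)
loss = (y - 3.0) ** 2

loss.backward()             # walks the graph backward, chaining local derivatives
print(w1.grad, w2.grad)     # gradients of the loss w.r.t. each parameter
```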

Key ingredients to success:

  • The prediction task is forcing the model to learn every linguistic level
  • The model must be able to attend long sequences
    • Impossible with ngrams
  • The model must have enough parameters
    • Impossible without modern hardware
  • The model must support a wide range of functions
    • Impossible before neural networks
  • The data source must be huge

LLMs store knowledge

  • “In order to cook bacon, you…”
    • “place the bacon in a large skillet and cook over medium heat until crisp.”
  • “The main characters in Shakespeare’s play Richard III are…”
    • “Richard of Gloucester, his brother Edward, and his nephew, John of Gaunt”

LLMs “reason”

  • “On a shelf, there are five books: a gray book, a red book, a purple book, a blue book, and a black book. The red book is to the right of the gray book. The black book is to the left of the blue book. The blue book is to the left of the gray book. The purple book is the second from the right. Which book is the leftmost book?”
    • “The black book”
  • “Reasoning” capacity acquired by next token prediction:
    • Ilya Sutskever’s famous example: imagine a detective book that distills cues and hints all along the story and finishes with: “Now, it is clear that the culprit is…”
  • “Reasoning” of an LLM:
    • Remember algorithms that generate training samples
    • Match/combine subgraphs

  • Pretraining wrap-up:
    • “next token prediction” objective \(\rightarrow\) knowledge storing + reasoning
    • Simple training algorithm to do that: SGD with backprop
    • But needs training at scale
  • Training at scale is very hard:
    • trillions of training tokens \(\rightarrow\) GPUs are mandatory (Llama3.1 trained on 16k H100s simultaneously)
    • billions of parameters \(\rightarrow\) need to shard data + LLM across GPUs
    • data/model/compute parallelization is key

Training at scale

Data parallelism, model sharding, tensor parallelism, sequence parallelism, pipeline parallelism…

  • Tooling:
    • Megatron-DS
    • DeepSpeed
    • Accelerate
    • Axolotl
    • Llama-factory
    • GPT-NeoX

Practice: training

Objectives:

  • Manual training loop with transformers, pytorch Dataset and DataLoader
  • Understanding batching in practice
  • Track the training curves
  • Download the following code, a draft for training distilGPT2 with batchsize>1: train.py
  • This code is buggy!
  • Debug it to make it run with batchsize=1
  • Then debug it to make it run with batchsize=4
  • In both cases, track the training loss and compare

Local usage

  • With Zero-Shot learning
  • With few-shot learning
  • As frozen embeddings
  • With fine-tuning
  • With prompt tuning
  • With Zero-label Unsupervised Data Generation

Zero-Shot Learning (teaser)

  • Zero-Shot because the LLM has been trained on task A (next word prediction) and is used on task B (sentiment analysis) without showing it any example.

In-context Few-Shot Learning

  • requires very few annotated data
  • In context because examples are shown as input
  • does not modify parameters
  • see paper “Towards Zero-Label Language Learning” (GoogleAI, sep 2021)

As frozen embeddings

  • does not modify the LLM parameters
  • for ML-devs, as the LLM is part of a larger ML architecture

With fine-tuning

  • does modify the LLM parameters!
  • often give the best results
  • Difference between finetuning and continued pretraining (continual learning, CL):
    • FT: adapt the LLM to a domain, a language, a task…
    • FT: the LLM will lose other capacities
    • CL: inject new knowledge into the LLM
    • CL: capture the language drift
    • CL: the LLM stays generic, and can still be FT on many tasks
  • Warning with Fine-tuning:
    • the resulting model gets specialized
    • forgetting: lose its previous knowledge
    • overfitting: cannot generalize any more
  • Solutions:
    • regularization, adapters, rehearsal…

With prompt-tuning

Zero-label Unsupervised Data Generation

  • Generate synthetic samples with few-shot prompting (just the task description and unlabeled samples; then ask for a sample for a given label)
  • Finetune a model on this dataset

Manual prompting

  • When used in ZSL, we want the model to perform a given task with a given input
  • Ex: “translate the following sentence in French: the weather is nice”
  • “translate the following sentence in French” == prompt
  • “the weather is nice” == input
  • Designing a good prompt is an art:

  • “Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy”
    • “Positive”
  • “A is the son of B’s uncle. What is the family relationship between A and B?”
    • “cousins”
  • “A is the son of B’s uncle. What is B for A?”
    • “brother”
  • A good prompt:
    • makes the model focus on a specific task/request
    • gives contextual information
    • enables controlling the output
  • ex: “summarize the previous text with simple language so that five-year-old children may understand it”

Prompt programming

  • is the art of designing prompts to perform a task
  • Prompting may be viewed as a way to constrain the generation
    • You may describe the task
    • You may give examples (in-context few-shots)
    • You may give an imaginary context to “style” the result
  • How to describe the task:
    • direct task description
    • proxy task description
  • Direct task description:
    • “translate French to English”
  • Can be contextual:
    • “French: … English: …”
  • Direct description can combine tasks the model must know:
    • “rephrase this paragraph so that a 2nd grader can understand it, emphasizing real-world applications”
  • Proxy task description

This is a novel written in the style of J.R.R. Tolkien’s Lord of the Rings fantasy novel trilogy. It is a parody of the following passage:

“S. Jane Morland was born in Shoreditch …”

Tolkien rewrote the previous passage in a high-fantasy style, keeping the same meaning but making it sound like he wrote it as a fantasy; his parody follows:

  • Few-shot prompts:

English: Writing about language models is fun.
Roish: Writingro aboutro languagero modelsro isro funro.
English: The weather is lovely!
Roish:

  • “chain of thoughts”:
    • decompose a difficult task into steps
    • only works with large models (>100B parameters)
    • Applied to solve 2nd grade math problems

CoT requires large models:

  • Self-consistency greatly improves CoT prompting
  • For one (CoT prompts, question) input, sample multiple outputs
  • take majority vote among outputs
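
A sketch of self-consistency; sample_answer is a hypothetical helper that sends one CoT prompt to the LLM with sampling enabled and extracts the final answer:

```python
from collections import Counter

def sample_answer(question: str) -> str:
    """Hypothetical helper: query an LLM with a CoT prompt and temperature > 0,
    then extract the final answer from the sampled reasoning chain."""
    raise NotImplementedError

def self_consistency(question: str, n_samples: int = 10) -> str:
    # For one (CoT prompt, question) input, sample multiple outputs...
    answers = [sample_answer(question) for _ in range(n_samples)]
    # ...then take the majority vote among the final answers
    return Counter(answers).most_common(1)[0][0]
```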

Analogy solving:

Directions: In the following question, a related pair of
words or phrases is followed by five pairs of words or
phrases. Choose the pair that best expresses a relationship
similar to that in the original pair.

braggart :: modesty

A) fledgling : experience

B) embezzler : greed

C) wallflower : timidity

D) invalid : malady

E) candidate : ambition

To solve this problem, first we need to understand the
relationship that exists between braggart and modesty.
According to the sentence, 'braggart' is a person who talks
too much about himself or herself and is usually not
believed. On the other hand, 'modesty' is the opposite of
this and denotes a person who does not talk too much about
himself or herself. Thus, for 'modesty' to be a suitable
answer, it should show the opposite of 'braggart'.
Now let's see whether each pair expresses a relationship
similar to that between braggart and modesty.

Next we have 'fledgling', which means a person who is
inexperienced and 'experience' which means knowledge gained
through practical involvement. Thus, 'fledgling' is a person
who has no experience and 'experience' is knowledge gained
through practical involvement. Thus, 'fledgling' is the
opposite of 'experience'. The relationship between these two
words is similar to that between braggart and modesty, hence
'fledgling' is the answer.

Adaptation and carbon costs

  • Pre-training costs:
    • Llama2-70b = $5M
    • $1M for training a 13b-LLM on 1.4T tokens
  • How to reduce these costs?
    • Break the scaling laws!
      • Hard to beat the transformer
      • Cleaning data?
      • Continual learning?
  • Finetuning costs: $10 for finetuning a 6b-LLM

Context Length

  • vanishing gradient in RNN
  • memory cost
  • methods:
    • sliding attention window
    • moving-average-equipped gated attention (Megalodon)
    • compressing context
    • hierarchical context HOMER
    • external memory InfLLM

InfLLM

  • Main issues with long context:
    • train/test mismatch, as LLMs have mostly been trained on small context sizes
    • long context contains a lot of “noise” (not relevant text)
  • Proposes to combine:
    • sliding attention window = use only local tokens in context
    • external memory

InfLLM algo

1- chunk the long seq, encode each chunk independently
2- for each token to generate:
  • the long-context input is composed of the (long) past KV-cache + current tokens
  • the KV-cache is composed of:
    • a (small) initial sequence that is kept
    • a (long) evicted sequence
  • evicted KV are stored in external memory
  • at test time, a lookup f() selects KV from external memory to add to the small context

  • the past KV are chunked into blocks, look-up is done at the block level.
  • each block is represented by r “best representative tokens”
  • representative score of token m in a block = avg q(m+j) k(m)
  • lookup is based on another (similar) relevance score btw q(X) and the representative tokens of a block.
  • all input tokens beyond X have the same positional encoding.

Continual Learning

  • Challenges:
    • Catastrophic forgetting
    • Evaluation: cf. realtime-QA
    • Data drift towards LLM-generated texts
    • How to restart training?

Catastrophic forgetting

There’s some hope though…

Continual training

  • Sheared Llama (Princeton, 2023)
    • structured pruning + cont. training w/ batch weighting

  • Sheared llama: resurgence of a scaling law w/ cont. training

  • Carbon footprint:
    • much research on reducing costs
    • also on tackling climate change with AI
  • Privacy:
    • models remember facts (cf. Carlini)
    • user modeling
  • Capture & remember long-term context
  • Limits of “pure text”: multimodal, grounded

Carbon footprint

  • Costs of LLMs:
    • training on GPU clusters
    • usage requires powerful machines
  • Many ways to reduce cost:
    • algorithms improvement
    • developing “heritage” (soft + hard)
    • hardware improvement
    • using LLMs to reduce the impact of other activities
      • e.g., fewer bugs, shorter dev cycles, reduced waste…

Algorithms improvement

  • Fast progress:
    • Training GPT3 cost 3M\$ in 2020, now “only” 150k\$
  • Quantization (see the loading sketch after this list)
    • reducing the nb of bits per parameter
    • standard: 32 (cpu), 16 & bf16 (gpu)
    • quantized: 8; prospect: 4, 2, 1
    • cf. bitsandbytes, ZeroQuant, SmoothQuant, LLM.int8()…
    • GLM-130B requires VRAM: 550GB (32b), 300GB (16b), 150GB (8b)
  • Pruning
    • Principle: remove some neurons and/or connections
    • Example:
      • remove all connections whose weight is close to 0
      • then finetune on target task, and iterate
    • Hard to do on large LMs
  • Many pruning methods:
    • data-free criteria:
      • magnitude pruning
    • data-driven criteria:
      • movement pruning
    • post-training pruning
    • pruning with retraining
    • structured pruning
    • unstructured pruning (sparsity)
  • distillation
    • Principle: train a small student model to reproduce the outputs of a large teacher model
  • Problems:
    • Limited by the (usually) small corpus used: does it generalize well?
    • Otherwise, very costly: why not train from scratch?
  • Parameter-efficient finetuning:
    • Principle: fine-tune only a small number of parameters
    • Adapters
    • Prompt tuning
  • Loss: UL2
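
For instance, a hedged sketch of 4-bit loading with bitsandbytes through transformers (the model id is only illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization config (NF4 + bf16 compute is a common choice)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# "mistralai/Mistral-7B-v0.1" is an illustrative model id; any causal LM works
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
```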

  • MAGMAX: Leveraging Model Merging for Seamless Continual Learning

  • Forgetting by finetuning does not work: LLMs don’t forget internally: https://arxiv.org/abs/2409.02228

Heritage & low-end computers

  • offloading
    • see deepspeed Zero
    • see accelerate
  • Collaborative training:
    • TogetherComputer: GPT-JT
    • PETALS & Hivemind
  • Mixture of experts
  • Challenge: privacy
    • Models remember facts (cf. Carlini)
  • Hard to solve ?
    • Differential Privacy: impacts usefulness
    • Post-edit model
      • find private info + delete it from model
    • Pre-clean data
  • Challenge: frugal AI
    • Vision, Signal processing with small models
    • Pruning, distillation, SAM, firefly NN…
  • Impossible? in NLP
    • Needs to store huge amount of information
    • Federated model ?
    • Continual learning ?
    • Sub-model editing ? (“FROTE”)
  • Capturing longer contexts
    • model structured state spaces (Annotated S4 on github)
  • Limit of text accessibility?
    • Grounding language: vision, haptic, games…
    • Annotate other tasks with language:
      • industrial data, maths proof…

Parameter-Efficient Training (PEFT)

Principle

  • do not finetune all of the LLM parameters
    • because not enough data (?)
    • because too costly (!)
  • finetune/train a small number of (additional) parameters

  • Remarks:
    • full finetuning usually works slightly better
    • Training-free alternatives: prompting, RAG, LLM agents…

Finetuning on top

  • Add a linear layer on top (classification head), and train it
  • You may further backprop gradients deeper in the top-N LLM layers
  • … Or just FT the top-N layers without any additional parameters

  • Simple, old-school, it usually works well
  • View the LLM as a features provider

PEFT methods

  • adapters, prefix tuning, soft prompts, P-tuning, LoRA, DoRA, Galore, Welore…
  • Ladder Side Networks, activation engineering
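
A minimal LoRA sketch with the peft library (the target module name is typical for GPT-2-style models such as distilgpt2; adapt it to your architecture):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("distilgpt2")  # small model for illustration

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,
    target_modules=["c_attn"],  # GPT-2-style attention projection; model-dependent
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters is trainable
```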

Advantages

  • greatly improves performance on a target task, language, or domain
  • dig knowledge up to the surface, ready to use
  • give the LLM desirable capacities: instruction-following, aligned with human preferences…

Drawbacks

  • forgetting
  • very slow to learn (1 bit)
  • increase hallucinations

VRAM usage

| Method            | Bits | 7B    | 13B   | 30B   | 70B    | 110B   | 8x7B  | 8x22B  |
|-------------------|------|-------|-------|-------|--------|--------|-------|--------|
| Full              | AMP  | 120GB | 240GB | 600GB | 1200GB | 2000GB | 900GB | 2400GB |
| Full              | 16   | 60GB  | 120GB | 300GB | 600GB  | 900GB  | 400GB | 1200GB |
| LoRA/GaLore/BAdam | 16   | 16GB  | 32GB  | 64GB  | 160GB  | 240GB  | 120GB | 320GB  |
| QLoRA             | 8    | 10GB  | 20GB  | 40GB  | 80GB   | 140GB  | 60GB  | 160GB  |
| QLoRA             | 4    | 6GB   | 12GB  | 24GB  | 48GB   | 72GB   | 30GB  | 96GB   |
| QLoRA             | 2    | 4GB   | 8GB   | 16GB  | 24GB   | 48GB   | 18GB  | 48GB   |

When to use PEFT?

| Method             | Data size | Notes                 |
|--------------------|-----------|-----------------------|
| Pretraining        | >10T      | Full training         |
| Finetuning         | 1k … 1b   | Adapt to task: PEFT   |
| Continual learning | 1k … 1b   | Update the LLM: PEFT? |
| Few-Shot learning  | < 1k      | Guide, help the LLM   |

  • Choose PEFT when constrained by available hardware:
    • single GPU with VRAM<40GB, LLM larger than 1b –> PEFT!
  • example: Adaptation to French
    • full-finetune of 7b LLM on 1x 80GB-A100: CLAIRE
    • CroissantLLM…

What can be done without any GPU?

  • Running LLMs is OK: see llama.cpp, llamafile, ollama…
  • Avoid finetuning at all costs!
    • llama.cpp: qLoRA supported, but not mature
    • too slow to be usable

Wrap-up

  • With enough compute, prefer full-finetuning
    • HF transformer, deepspeed, llama-factory, axolotl
  • With 1 “small” GPU, go for PEFT (qLoRA)
  • Without any GPU: look for alternatives (ICL, RAG…)

Best practices

  • tools to use
  • common configurations, hyper-parameters

References

  • Great pedagogical point of view about LLM by Sasha Rush: video

  • FLOPs computation: https://kipp.ly/transformer-inference-arithmetic/