LLM4All
Kickoff meeting
10h00 | WP0 + intro (LORIA)
10h30 | WP1: finetuning (LORIA)
11h00 | WP2: low-cost LLM (LIX)
14h00 | WP3: LLMs for spoken dialogues (Linagora)
14h30 | WP4: LLM + other data (APHP)
15h00 | WP5: communication (Linagora)
15h30 | Data session (Linagora + APHP)
16h15 | Set up agenda and next meetings
16h45 | Misc and wrap-up
WP0
Participants: all
Website
PMT
Advisory board
Data Management Plan
Website
- https://ia.loria.fr/llm4all
- Intranet: (see password in email from Sep 8th 2023 11:41)
- gitlab: https://gitlab.inria.fr/synalp/llm4all
- minutes, sources, website…
- mailing list: llm4all@inria.fr
- Suggestions? (mastodon?…)
Project Management Team
- PMT
  - 1 person/partner + WP leaders
  - shall meet every month: date/time?
- Advisory board
  - PMT + invited external experts + ANR
  - shall meet every year
First Deliverables
T0+6: D0.2: Data Management Plan
List all data produced/consumed:
- datasets, code, publications, internal & external reports, deliverables, website, blogs…
- public/private, licence, diffusion, where stored, security, when deleted, how long-term support is ensured
OPIDoR website? (painful!); LaTeX template!
WP1 “Finetuning”
- How to finetune
- Continual learning
Overview: LLMs
What is a Large Language Model?
- An LLM is a transformer that transforms texts into a representation (embedding), and predicts the next word (see the sketch below)
- Transformer invented by Google in 2017:
  - No bottleneck of information (compared to previous models)
  - It scales!
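A minimal sketch of next-word prediction with a pretrained causal LM, using Hugging Face `transformers`; GPT-2 is chosen here only as a small illustrative model, not a project decision:

```python
# Minimal next-word prediction sketch (assumes `transformers` and `torch` are installed;
# GPT-2 is used only as a small illustrative model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The kickoff meeting of the project takes place in"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits      # shape: (1, seq_len, vocab_size)

next_token_id = int(logits[0, -1].argmax())  # most probable next token
print(tokenizer.decode([next_token_id]))
```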
Scaling property of LLMs
- scaling = if you add parameters, the model can store more information
- measured by evaluating its performance on tasks
- scaling law = power law: \(y(x) = a x^{-\gamma} + b\) (see the fitting sketch below)
- metric = test loss
- \(\gamma\) = slope (on a log-log plot)
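A small sketch of how such a scaling law could be fitted with `scipy.optimize.curve_fit`; the data points and initial guesses below are made up for illustration:

```python
# Sketch: fitting the power-law form y(x) = a * x**(-gamma) + b to
# hypothetical test-loss measurements (the data points are invented).
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, gamma, b):
    return a * x ** (-gamma) + b

# x: model sizes (parameters), y: measured test loss (illustrative values only)
x = np.array([1e7, 1e8, 1e9, 1e10])
y = np.array([4.2, 3.6, 3.1, 2.8])

(a, gamma, b), _ = curve_fit(power_law, x, y, p0=[10.0, 0.1, 2.0])
print(f"fitted exponent gamma = {gamma:.3f}")
```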
GPT3-175b paper (2020)
- emergence of In-Context Learning!
  - provide examples of a task in the context; the output mimics the examples (see the prompt sketch below)
- Scaling has been known in ML for a long time (paper on learning curves)
- But reducing test loss is linked to emergent abilities in transformers
- Such emergence had never been observed in ML before
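A sketch of what an In-Context Learning prompt looks like; the task and examples are invented for illustration:

```python
# Sketch of In-Context Learning: a few task examples are placed in the prompt
# and the model is expected to continue in the same pattern, with no gradient update.
few_shot_prompt = """Translate French to English.

French: bonjour
English: hello

French: merci
English: thank you

French: au revoir
English:"""

# Feeding `few_shot_prompt` to a sufficiently large causal LM should yield
# "goodbye" as the continuation; small models such as GPT-2 may not do this reliably.
```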
Chinchilla paper (2022)
- GPT3: train on 300b tokens, and scale parameters up
- given a fixed FLOPs budget, there is an optimal balance between dataset size and parameters
- Lesson: need more data!
\(L(N) = \frac{A}{N^\alpha}\) (worked example below)
- \(N\) = dataset size
- \(\alpha \simeq 0.5\) (was 0.05 in the GPT3 paper)
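A quick worked example of how much the exponent matters; the constant `A` and the token counts are arbitrary illustrative values:

```python
# Effect of the exponent alpha on the data term A / N**alpha.
A = 100.0

def data_term(n_tokens, alpha):
    return A / n_tokens ** alpha

for alpha in (0.5, 0.05):
    before = data_term(300e9, alpha)   # 300b tokens
    after = data_term(600e9, alpha)    # doubling the data
    print(f"alpha={alpha}: term drops by {100 * (1 - after / before):.1f}% when data doubles")

# With alpha ~ 0.5 doubling the data cuts this term by ~29%,
# with alpha ~ 0.05 only by ~3.4%: hence "need more data".
```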
- Any really useful enhancements of transformers since 2017?
  - FlashAttention
  - Pre-layer norm
  - Parallel feed-forward and attention
  - Rotary (RoPE) + ALiBi positional encodings (RoPE sketch below)
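A short sketch of rotary positional embeddings (RoPE), one of the tweaks listed above; plain NumPy, written for clarity rather than fidelity to any particular implementation:

```python
# Rotary positional embeddings: rotate each (even, odd) feature pair of the
# query/key vectors by a position-dependent angle.
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """x: (seq_len, dim) vectors, dim must be even; positions: (seq_len,) ints."""
    seq_len, dim = x.shape
    # One frequency per feature pair: theta_i = base ** (-2i / dim)
    freqs = base ** (-np.arange(0, dim, 2) / dim)      # (dim/2,)
    angles = positions[:, None] * freqs[None, :]       # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)

    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    rotated = np.empty_like(x)
    rotated[:, 0::2] = x_even * cos - x_odd * sin      # 2D rotation per pair
    rotated[:, 1::2] = x_even * sin + x_odd * cos
    return rotated

# The dot product of rotated queries/keys then depends only on relative position.
q = apply_rope(np.random.randn(8, 64), np.arange(8))
```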
Anthropic paper (2022)
- smooth scaling results from a combination of abrupt emergences
- Prompters
- Finetuners
- Trainers (Eleuther, Meta, Anthropic, Mistral…)
- Integrators (LangChain, Coala…)
- Theoreticians (academics)
Finetuning
- Catastrophic forgetting is linear! (paper)
- LR scheduler (see the schedule sketch below)
  - pretraining: start with a large LR (big jumps), then decrease
  - finetuning starts from a “deep” optimum
  - redo big jumps? Will forget the previous optimum
  - continue with a small LR? Will not learn the new optimum
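A sketch of a typical finetuning LR schedule (short warmup to a modest peak LR, then cosine decay); the model, peak LR, and step counts are placeholders:

```python
# Warmup + cosine-decay schedule for finetuning (placeholder model and numbers).
import math
import torch

model = torch.nn.Linear(768, 768)                            # stand-in for the pretrained LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)   # peak LR kept small for finetuning

warmup_steps, total_steps = 100, 10_000

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                   # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))        # cosine decay to 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In the training loop:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```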
Limiting forgetting
- rehearsal (replay pretraining data during finetuning)
- regularization from the initial model (penalize drift from the pretrained weights; sketch below)
- growing networks (add new parameters instead of overwriting existing ones)
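A sketch of the "regularization from the initial model" idea: an L2 penalty pulling the finetuned weights back toward the pretrained ones; the model and coefficient are placeholders:

```python
# L2 penalty toward the pretrained weights (placeholder model and coefficient).
import copy
import torch

model = torch.nn.Linear(768, 768)        # stand-in for the pretrained LLM
initial = copy.deepcopy(model)           # frozen snapshot of the pretrained weights
for p in initial.parameters():
    p.requires_grad_(False)

def drift_penalty(model, initial, coeff=1e-3):
    penalty = 0.0
    for p, p0 in zip(model.parameters(), initial.parameters()):
        penalty = penalty + (p - p0).pow(2).sum()
    return coeff * penalty

# In the finetuning loop:
#   loss = task_loss + drift_penalty(model, initial)
#   loss.backward(); optimizer.step()
```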