LLM4All

Kickoff meeting

10h00 WP0 + intro (LORIA)
10h30 WP1: finetuning (LORIA)
11h00 WP2: low-cost LLM (LIX)
14h00 WP3: LLMs for spoken dialogues (Linagora)
14h30 WP4: LLM+other data (APHP)
15h00 WP5: communication (Linagora)
15h30 Session data: (Linagora + APHP)
16h15 Setup agenda and next meetings
16h45 Misc and wrap-up

WP0

  • Participants: all

  • Website

  • PMT

  • Advisory board

  • Data Management Plan

Website

  • https://ia.loria.fr/llm4all
  • Intranet: (see password in email from Sep 8th 2023, 11:41)
    • documents
  • gitlab: https://gitlab.inria.fr/synalp/llm4all
    • minutes, sources, website…
  • mailing list: llm4all@inria.fr
  • Suggestions? (mastodon?…)

Project Management Team

  • PMT
    • 1 person/partner + WP leader
    • shall meet every month: date / time?
  • Advisory board
    • PMT + invited external experts + ANR
    • shall meet every year

First Deliverables

  • T0+6: D0.2: Data Management Plan

  • List all data produced/consumed:

    • datasets, code, publications, internal & external reports, deliverables, website, blogs…
    • public/private, licence, diffusion, where it is stored, security, when it is deleted, how long-term support is ensured
  • OPIDoR website? (painful!); LaTeX template!

WP1 “Finetuning”

  • How to finetune
  • Continual learning

Overview: LLMs

What is a Large Language Model?

  • An LLM is a transformer that maps text to a representation (embedding) and predicts the next word (toy sketch below)
  • The transformer was invented by Google in 2017:
    • No information bottleneck (compared to previous models)
    • It scales!
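
A toy sketch of the next-word prediction loop; the `next_token_distribution` function is only a placeholder for the transformer forward pass, so the vocabulary and outputs are made up:

```python
# Toy sketch of next-word prediction (not a real transformer):
# the model repeatedly maps the current context to a probability
# distribution over the vocabulary and appends the most likely token.
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "."]
rng = np.random.default_rng(0)

def next_token_distribution(context):
    # Placeholder for the transformer forward pass: a random
    # distribution, just so the generation loop is runnable.
    logits = rng.normal(size=len(vocab))
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

context = ["the", "cat"]
for _ in range(4):
    probs = next_token_distribution(context)
    context.append(vocab[int(np.argmax(probs))])
print(" ".join(context))
```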

Scaling property of LLMs

  • scaling = if you add parameters, the model can store more information
  • checked by measuring its performance on tasks
  • scaling law = power law: \(y(x) = ax^{-\gamma} +b\) (fit sketch below)
  • metric = test loss
  • \(\gamma\) = slope (on a log-log plot)
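
A minimal sketch of how the exponent \(\gamma\) can be read off such a law; the (parameter count, test loss) points are made up, and the irreducible term \(b\) is treated as negligible so the fit is a straight line in log-log space:

```python
# Fit the power-law part of a scaling law: with b ~ 0,
# log y = log a - gamma * log x, so gamma is the slope of a
# linear fit on log-log axes.
import numpy as np

params = np.array([1e7, 1e8, 1e9, 1e10])   # hypothetical model sizes
loss = 5.0 * params ** -0.08                # hypothetical test losses

slope, intercept = np.polyfit(np.log(params), np.log(loss), 1)
gamma = -slope
print(f"gamma = {gamma:.3f}, a = {np.exp(intercept):.2f}")
```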

Baidu paper 2017

GPT-3 (175B) paper 2020

  • emergence of In-Context Learning!
    • provide examples of a task in the context; the output mimics the examples (prompt example below)
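
A sketch of a few-shot prompt illustrating in-context learning; the sentiment examples are invented, and with a real LLM one would send `prompt` to the model and read its completion:

```python
# In-context learning sketch: the "training" happens entirely in the
# prompt, no weights are updated.
examples = [
    ("I loved this film", "positive"),
    ("The plot was boring", "negative"),
    ("Great acting and music", "positive"),
]
query = "The ending made no sense"

prompt = "\n".join(f"Review: {text}\nSentiment: {label}" for text, label in examples)
prompt += f"\nReview: {query}\nSentiment:"
print(prompt)  # the model is expected to continue with "negative"
```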

Scaling Laws for Neural Language Models 2020

OpenAI 2020

  • RL, protein, chemistry…
  • Scaling has existed for a long time in ML (paper on learning curves)
  • But reducing test loss is linked to emergent abilities in transformers
    • Such emergence had never been observed in ML before

Chinchilla paper 2022

  • GPT-3: trained on 300B tokens, then scale up parameters
  • given a fixed FLOPs budget, there is an optimal balance between dataset size and parameters (budgeting sketch below)
  • Lesson: we need more data!

\(L(N)=\frac A {N^\alpha}\)

  • \(N\) = dataset size
  • \(\alpha \simeq 0.5\) (was 0.05 in the GPT-3 paper)
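
A rough budgeting sketch in the Chinchilla spirit, using the commonly cited approximations of \(C \approx 6ND\) training FLOPs and roughly 20 tokens per parameter at the compute-optimal point; the FLOPs budget below is hypothetical and the exact constants depend on the fit:

```python
# Compute-optimal sizing sketch: with C = 6*N*D and D = 20*N,
# solving for N gives N = sqrt(C / (6*20)).
def compute_optimal(flops_budget, tokens_per_param=20):
    n_params = (flops_budget / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = compute_optimal(1e23)  # hypothetical FLOPs budget
print(f"params = {n:.2e}, tokens = {d:.2e}")
```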

  • Any really useful enhancements of the transformer since 2017?
    • FlashAttention
    • Pre-layer norm
    • Parallel feed-forward and attention
    • Rotary and ALiBi positional encodings (RoPE sketch below)
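
A minimal numpy sketch of rotary positional embeddings (RoPE), one of the enhancements listed above, using the common variant that pairs dimension \(i\) with dimension \(i + d/2\); shapes and values are illustrative only:

```python
# RoPE sketch: each pair of embedding dimensions is rotated by a
# position-dependent angle before the attention dot product.
import numpy as np

def rope(x, base=10000.0):
    # x: (seq_len, dim) with dim even
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)       # per-pair frequencies
    angles = np.outer(np.arange(seq_len), freqs)    # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(8, 16)   # toy query vectors for 8 positions
print(rope(q).shape)         # (8, 16)
```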

Anthropic paper 2022

  • smooth scaling results from a combination of abrupt emergences

Open-source LLM community

  • Prompters
  • Finetuners
  • Trainers (Eleuther, Meta, Anthropic, Mistral…)
  • Integrators (LangChain, Coala…)
  • Theoreticians (academics)

Finetuning

  • Catastrophic forgetting is linear! paper

  • LR scheduler (schedule sketch below)
    • pretraining: start with a large LR (big jumps), then decrease
    • finetuning starts from a “deep” optimum
    • redo big jumps? The model will forget the previous optimum
    • continue with a small LR? It will not learn the new optimum
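
An illustrative sketch of the two regimes discussed above, with made-up hyperparameters: warmup plus cosine decay for pretraining versus a small, flat LR for finetuning so the model does not jump away from the optimum it already sits in:

```python
import math

def pretrain_lr(step, max_lr=3e-4, warmup=2000, total=100_000):
    if step < warmup:                      # linear warmup
        return max_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return 0.5 * max_lr * (1 + math.cos(math.pi * progress))  # cosine decay

def finetune_lr(step, lr=2e-5):
    return lr                              # small and constant: small steps only

print(pretrain_lr(50_000), finetune_lr(50_000))
```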

Limiting forgetting

  • rehearsal
  • regularization towards the initial model (sketch of rehearsal and regularization below)
  • growing networks
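
A minimal sketch of the first two mitigations on a toy linear model: rehearsal (mixing a fraction of old-task data into each batch) and an L2 penalty pulling the weights back towards the initial model; all data and coefficients are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
w_init = rng.normal(size=4)          # "pretrained" weights
w = w_init.copy()

old_x, old_y = rng.normal(size=(100, 4)), rng.normal(size=100)   # old task
new_x, new_y = rng.normal(size=(100, 4)), rng.normal(size=100)   # new task

lr, lam, replay_frac = 1e-2, 0.1, 0.25
for step in range(200):
    # rehearsal: a quarter of each batch comes from the old task's data
    n_old = int(32 * replay_frac)
    idx_old = rng.integers(0, len(old_x), n_old)
    idx_new = rng.integers(0, len(new_x), 32 - n_old)
    x = np.vstack([old_x[idx_old], new_x[idx_new]])
    y = np.concatenate([old_y[idx_old], new_y[idx_new]])

    grad = 2 * x.T @ (x @ w - y) / len(x)        # squared-error gradient
    grad += 2 * lam * (w - w_init)               # pull towards the initial model
    w -= lr * grad

print(np.linalg.norm(w - w_init))    # stays small thanks to the penalty
```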