WP2 - Appendix
Christophe Cerisara
CNRS, LORIA, Synalp team
Compromise: scaling laws vs. cost
- According to scaling laws, scaling up is always better
- SmolLMs are not Chinchilla-optimal but improve accessibility
- Approaches to reduce LLM size and cost:
- Optimization / Quantization / Algorithmic
- Distillation
- Pruning / Low-rank compression
Pruning: motivations
- Could the information stored in LLMs be sparse?
- Even though LLMs are trained on >10T words…
- In many target applications, a lot of knowledge is not required
\(\rightarrow\) generic vs. specific LLM
Pruning generic LLMs
- Can we prune a pretrained LLM so that it’s still generic?
- Is there “room” for pruning, given all the knowledge the LLM has memorized?
- \(\rightarrow\) Study the rank of parameters (see the sketch after this list)
- LoRA: finetune within a lower dimensional space
- GaLore: gradients low-rank projection
- LORD: low-rank decomposition of weight matrices
- CALDERA: joint quantization & low-rank projection
- …
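A minimal sketch of what studying the rank of parameters can look like in practice, assuming PyTorch: count how many singular values of a pretrained weight matrix are needed to retain most of its energy. The matrix shapes and the 99% threshold are illustrative.

```python
import torch

def effective_rank(W: torch.Tensor, energy: float = 0.99) -> int:
    """Number of leading singular values needed to retain `energy` of the squared spectrum."""
    s = torch.linalg.svdvals(W.float())
    cum = torch.cumsum(s**2, dim=0) / (s**2).sum()
    return int((cum < energy).sum().item()) + 1

# Illustrative usage on a random matrix standing in for a pretrained weight:
W = torch.randn(1024, 4096)
print(effective_rank(W), "components out of", min(W.shape), "retain 99% of the energy")
```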
Pruning and merging multiple specific LLMs
- Lottery Ticket Hypothesis:
- Each neural network contains a sub-network (the winning ticket) that, when retrained in isolation from the same initialization, matches the performance of the full model.
- Advantages:
- Can remove 90% of the parameters with almost no loss in performance (on image tasks)
- Drawbacks:
- Impossible to find the winning mask without first training the large model
- (Only?) for specialized models
- Can be applied to sparse fine-tuning (FT); see the sketch after this list:
- FT an LLM on a specific task/language
- Extract the mask = the parameters that change the most
- Rewind the LLM and re-FT under the mask
- Sparse fine-tunes can be combined without their masks overlapping!
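A minimal sketch of this sparse fine-tuning recipe, assuming PyTorch; `finetune_fn`, the 10% density and the way the mask is applied (a gradient hook) are illustrative choices, not the exact recipe used in the cited work.

```python
import torch

def sparse_finetune(model, finetune_fn, density=0.1):
    """Lottery-ticket-style sparse fine-tuning (sketch).

    1) Fine-tune on the target task/language.
    2) Keep a mask of the `density` fraction of parameters that changed most.
    3) Rewind to the pretrained weights and fine-tune again, updating only the
       masked parameters; the resulting sparse delta can later be merged with
       other sparse fine-tunes whose masks do not overlap.
    """
    # Remember the pretrained weights, then fine-tune once.
    pretrained = {n: p.detach().clone() for n, p in model.named_parameters()}
    finetune_fn(model)  # user-provided training loop (placeholder)

    # Mask = parameters with the largest absolute change.
    masks = {}
    for n, p in model.named_parameters():
        delta = (p.detach() - pretrained[n]).abs().flatten()
        k = max(1, int(density * delta.numel()))
        threshold = delta.topk(k).values.min()
        masks[n] = (delta >= threshold).reshape(p.shape).to(p.dtype)

    # Rewind and re-fine-tune, zeroing gradients outside the mask.
    with torch.no_grad():
        for n, p in model.named_parameters():
            p.copy_(pretrained[n])
    hooks = [p.register_hook(lambda g, m=masks[n]: g * m)
             for n, p in model.named_parameters() if p.requires_grad]
    finetune_fn(model)
    for h in hooks:
        h.remove()
    return masks
```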
- Approximation of activations (see the sketch after this list): \[\widehat{\Delta W} = \underset{\Delta W}{\mathrm{argmin}} \;\; \frac{1}{N}\sum\limits_{x \in \mathcal{D}}\|Wx - \Delta W x\|_{F}\]
- Optimal solution: PCA of covariance: \[\Sigma = \underset{y \in \mathcal{Y}}{\mathbb{E}}\left[yy^T\right] - \mathbb{E}[y]\mathbb{E}[y]^T\]
- Used for LSTM & BERT in (Chen,2021), for transformers in (Yu,2023), and see
- Proposal 1: Generalize to non-linear layers:
- The minimization objective can be viewed as Feature Distillation
- Replace SVD with gradient descent
- Teacher module \(\mathcal{T}^{(i)}(X; \Theta^{(i)})\) and Student module \(\mathcal{S}^{(i)}(X; \Delta \Theta^{(i)})\)
\[\widehat{\Delta \Theta^{(i)}} = \underset{{\Delta \Theta^{(i)}}}{\mathrm{argmin}} \;\; \mathcal{L}^{(i)}(Y^{(i)}, \; \widehat{Y}^{(i)})\]
- Because of instabilities, augment the L1 loss as in (Chang, 2022):
\[\mathcal{L}^{(i)} = \sum_{t=1}^{b} \left[ \frac{1}{D} \left\| Y^{(i)}_{t} - \widehat{Y}^{(i)}_{t} \right\|_1 - \log \sigma \left( \cos \left( Y^{(i)}_{t}, \widehat{Y}^{(i)}_{t} \right) \right) \right]\]
- Trained with SGD (see the sketch after this list)
- Converges towards the SVD solution in the linear case
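A minimal sketch of Proposal 1's objective, assuming PyTorch: the loss \(\mathcal{L}^{(i)}\) above (per-position L1 normalised by the hidden size \(D\), plus the \(-\log\sigma(\cos)\) term), used to distil a teacher linear module into an illustrative low-rank student with SGD. Module sizes and calibration data are placeholders.

```python
import torch
import torch.nn.functional as F

def feature_distill_loss(Y: torch.Tensor, Y_hat: torch.Tensor) -> torch.Tensor:
    """L^(i) = sum_t [ (1/D) ||Y_t - Y_hat_t||_1 - log sigma(cos(Y_t, Y_hat_t)) ]."""
    D = Y.shape[-1]
    l1 = (Y - Y_hat).abs().sum(dim=-1) / D         # per-position L1, normalised by hidden size
    cos = F.cosine_similarity(Y, Y_hat, dim=-1)    # per-position cosine similarity
    return (l1 - F.logsigmoid(cos)).sum()

# Usage sketch: distil a teacher linear module into a low-rank student with SGD.
teacher = torch.nn.Linear(1024, 1024, bias=False)
student = torch.nn.Sequential(torch.nn.Linear(1024, 64, bias=False),
                              torch.nn.Linear(64, 1024, bias=False))
opt = torch.optim.SGD(student.parameters(), lr=1e-2)
for _ in range(100):                    # calibration batches (placeholder data)
    x = torch.randn(8, 1024)
    with torch.no_grad():
        Y = teacher(x)                  # teacher activations Y^(i)
    loss = feature_distill_loss(Y, student(x))
    opt.zero_grad()
    loss.backward()
    opt.step()
```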
- Proposal 3: Module-level distillation
- Supports non-linear layers \(\rightarrow\) any stack of layers
- Better compromise between atomic/local distillation (matrix level) and global distillation (transformer level)
- We experiment at the layer level (see the sketch below)
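A short sketch of module-level distillation under the same objective, assuming PyTorch and reusing `feature_distill_loss` from the previous sketch; `teacher_layer` (e.g. one full transformer layer), `student_layer` (any cheaper stack) and `calib_loader` are placeholders.

```python
import torch

def distill_module(teacher_layer, student_layer, calib_loader, steps=1000, lr=1e-3):
    """Module-level distillation sketch: train `student_layer` to reproduce the
    outputs of `teacher_layer` on calibration activations, reusing the
    feature_distill_loss defined in the previous sketch."""
    opt = torch.optim.SGD(student_layer.parameters(), lr=lr)
    for step, hidden_states in zip(range(steps), calib_loader):
        with torch.no_grad():
            target = teacher_layer(hidden_states)   # Y^(i): teacher module outputs
        pred = student_layer(hidden_states)         # hat{Y}^(i): student module outputs
        loss = feature_distill_loss(target, pred)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student_layer
```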