WP2 - Appendix

Christophe Cerisara

CNRS, LORIA, Synalp team

Compromise: scaling laws vs. cost

  • According to scaling laws, scaling up always improves performance
  • SmolLMs are not Chinchilla-optimal but improve accessibility
  • Approaches to reduce LLM size and cost:
    • Optimization / Quantization / Algorithmic
    • Distillation
    • Pruning / Low-rank compression

Pruning: motivations

  • Could the information stored in LLMs be sparse?
    • Even though LLMs are trained on >10T words…
    • In many target applications, a lot of knowledge is not required

\(\rightarrow\) generic vs. specific LLM

Pruning generic LLMs

  • Can we prune a pretrained LLM so that it’s still generic?
  • Is there “room” for pruning, given all the knowledge the LLM has memorized?
  • \(\rightarrow\) Study the rank of parameters
    • LoRA: finetune within a lower dimensional space
    • GaLore: gradients low-rank projection
    • LORD: low-rank decomposition of weight matrices
    • CALDERA: joint quantization & low-rank projection
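As a minimal sketch of what a low-rank decomposition of a single weight matrix looks like (the rank r, the shapes, and the use of plain truncated SVD are illustrative assumptions, not the exact LORD or CALDERA recipes):

```python
import torch

def low_rank_decompose(W: torch.Tensor, r: int):
    """Approximate a weight matrix W (d_out x d_in) by two factors A @ B of rank r."""
    # Truncated SVD: W ~= U[:, :r] @ diag(S[:r]) @ Vh[:r, :]
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]   # (d_out, r)
    B = Vh[:r, :]          # (r, d_in)
    return A, B

W = torch.randn(4096, 4096)
A, B = low_rank_decompose(W, r=256)
# Storage drops from d_out*d_in to r*(d_out + d_in) parameters.
print(((A @ B - W).norm() / W.norm()).item())  # relative reconstruction error
```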

Pruning and merging multiple specific LLMs

  • Lottery Ticket Hypothesis:
    • Each neural network contains a sub-network (the winning ticket) that, when trained in isolation from the original initialization, matches the performance of the full model.
  • Advantages:
    • Can remove 90% of the parameters with nearly no loss in performance (on image tasks)
  • Drawbacks:
    • Impossible to find the winning mask without first training the large model
    • (Only?) for specialized models
  • Can be applied to sparse FT (see the sketch below):
    • FT an LLM on a specific task/language
    • extract the mask = the parameters that change the most
    • rewind the LLM and re-FT with the mask
    • sparse finetunes can be combined, as long as their masks do not overlap!
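A hedged sketch of the mask-extraction step described above (the 10% keep ratio, the function name, and the gradient-masking re-FT loop in the comments are assumptions for illustration):

```python
import torch

def extract_sparse_mask(base_model, finetuned_model, keep_ratio=0.1):
    """Mask = the parameters that changed the most during task/language fine-tuning."""
    masks = {}
    # Assumes both models share the same architecture and parameter ordering.
    for (name, p0), (_, p1) in zip(base_model.named_parameters(),
                                   finetuned_model.named_parameters()):
        delta = (p1.detach() - p0.detach()).abs().flatten()
        k = max(1, int(keep_ratio * delta.numel()))
        threshold = torch.topk(delta, k).values.min()
        masks[name] = (delta >= threshold).reshape(p0.shape)
    return masks

# Rewind to the base weights, then re-FT while updating only the masked entries,
# e.g. by zeroing gradients outside the mask after loss.backward():
#   for name, p in model.named_parameters():
#       p.grad.mul_(masks[name])
# Sparse finetunes for different tasks/languages can then be merged by adding
# their masked deltas, provided the masks do not overlap.
```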
  • Approximation of the activations, with \(\Delta W\) constrained to be low-rank: \[\widehat{\Delta W} = \underset{\Delta W}{\mathrm{argmin}} \;\; \frac{1}{N}\sum\limits_{x \in \mathcal{D}}\|Wx - \Delta W x\|_{F}\]
  • Optimal solution: PCA of the output covariance (see the sketch below): \[\Sigma = \underset{y \in \mathcal{Y}}{\mathbb{E}}\left[yy^T\right] - \mathbb{E}[y]\mathbb{E}[y]^T\]
  • Used for LSTM & BERT in (Chen, 2021) and for transformers in (Yu, 2023)
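A minimal numpy sketch of this closed-form solution (the function name and the choice of applying the rank-r projector on the output side are assumptions):

```python
import numpy as np

def pca_projector(Y: np.ndarray, r: int) -> np.ndarray:
    """Y: layer outputs y = W x stacked as rows, shape (N, D).
    Returns a rank-r projector P such that P @ y approximates y."""
    Yc = Y - Y.mean(axis=0)
    Sigma = Yc.T @ Yc / Y.shape[0]            # covariance E[y y^T] - E[y] E[y]^T
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
    U = eigvecs[:, -r:]                       # top-r principal directions, shape (D, r)
    return U @ U.T                            # (D, D) projector of rank r

# W can then be replaced by (U @ U.T) @ W, i.e. the product of a D x r matrix
# (U) and an r x d_in matrix (U.T @ W).
```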
  • Proposal 1: Generalize to non-linear layers:
    • The minimization objective can be viewed as Feature Distillation
    • Replace SVD with gradient descent
    • Teacher module \(\mathcal{T}^{(i)}(X; \Theta^{(i)})\) and Student module \(\mathcal{S}^{(i)}(X; \Delta \Theta^{(i)})\)

\[\widehat{\Delta \Theta^{(i)}} = \underset{{\Delta \Theta^{(i)}}}{\mathrm{argmin}} \;\; \mathcal{L}^{(i)}(Y^{(i)}, \; \widehat{Y}^{(i)})\]
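A minimal PyTorch sketch of this feature-distillation setup (a low-rank linear student, plain L1 as the loss before the augmentation below, and the calibration loader are all assumptions for illustration):

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Student module: W is replaced by the rank-r product B @ A (bias omitted)."""
    def __init__(self, d_in: int, d_out: int, r: int):
        super().__init__()
        self.A = nn.Linear(d_in, r, bias=False)
        self.B = nn.Linear(r, d_out, bias=False)

    def forward(self, x):
        return self.B(self.A(x))

def distill_module(teacher: nn.Module, student: nn.Module, loader, steps=1000, lr=1e-3):
    """Match the student's outputs to the teacher's on inputs from the calibration set."""
    opt = torch.optim.SGD(student.parameters(), lr=lr)
    teacher.eval()
    for _, x in zip(range(steps), loader):
        with torch.no_grad():
            y_teacher = teacher(x)        # Y^(i)
        y_student = student(x)            # \hat{Y}^(i)
        loss = (y_teacher - y_student).abs().mean()   # plain L1; see the augmented loss below
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student
```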

  • Because of instabilities, the L1 loss is augmented as in (Chang, 2022):

\[\mathcal{L}^{(i)} = \sum_{t=1}^{b} \left[ \frac{1}{D} \left\| Y^{(i)}_{t} - \widehat{Y}^{(i)}_{t} \right\|_1 - \log \sigma \left( \cos \left( Y^{(i)}_{t}, \widehat{Y}^{(i)}_{t} \right) \right) \right]\]
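A direct transcription of this loss (the shapes are assumed to be (b, D), one row per token):

```python
import torch
import torch.nn.functional as F

def augmented_l1_loss(Y: torch.Tensor, Y_hat: torch.Tensor) -> torch.Tensor:
    """Per-token L1 term normalized by the feature dimension D,
    plus -log sigmoid of the cosine similarity, summed over the batch."""
    D = Y.shape[-1]
    l1 = (Y - Y_hat).abs().sum(dim=-1) / D        # (b,)
    cos = F.cosine_similarity(Y, Y_hat, dim=-1)   # (b,)
    return (l1 - F.logsigmoid(cos)).sum()
```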

  • Trained with SGD
  • Converges towards the SVD solution in the linear case
  • Proposal 3: Module-level distillation
    • Support of non-linear layers \(\rightarrow\) any stack of layers
    • Better compromise between atomic/local distillation (matrix level) and global distillation (transformer level)
    • We experiment at the layer level
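A rough sketch of what layer-level distillation could look like, reusing the hypothetical LowRankLinear and distill_module helpers from the sketches above (replacing every nn.Linear of the layer and feeding hidden states as inputs are assumptions):

```python
import copy
import torch.nn as nn

def compress_layer(teacher_layer: nn.Module, rank: int) -> nn.Module:
    """Student = a copy of the layer where every linear sub-module is made low-rank."""
    student = copy.deepcopy(teacher_layer)
    linears = [(name, m) for name, m in student.named_modules()
               if name and isinstance(m, nn.Linear)]
    for name, m in linears:
        parent = student.get_submodule(name.rsplit(".", 1)[0]) if "." in name else student
        setattr(parent, name.rsplit(".", 1)[-1],
                LowRankLinear(m.in_features, m.out_features, rank))
    return student

# The whole layer is then distilled against its teacher on hidden states:
#   student_layer = distill_module(teacher_layer, compress_layer(teacher_layer, r), hidden_loader)
```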