Low-rank compression of LLMs

Christophe Cerisara

CNRS, LORIA, Synalp team

LLM brief overview

LLMs: a post-Machine Learning era?

  • Limitations of ML approaches:
    • Only understand vector inputs
    • Unable to learn from 2 examples
  • LLMs solve these limitations:
    • Understand English
    • Thanks to “reasoning”, can learn from 2 examples

Ex: last letter concatenation

(from Denny Zhou, Google)

Elon Musk → nk
Bill Gates → ls
  • Obvious for humans with 2 examples
  • ML approach:
    • an encoder-decoder trained on tons of labeled data
  • Qwen2.5-7b:

Perform last letter concatenation, as shown in these two examples.
Words: Elon Musk
Answer: nk
Words: Bill Gates
Answer: ls
Words: Barack Obama
Answer:

[…] So, the concatenation would be ka

  • Older models require more advanced prompting strategies: CoT, analogical prompting…

Keys to success

  • GPU
  • Data
  • Transformer
    • No data flow bottleneck

2017: the transformer

  • Reason over layer steps
  • Semi-Turing machine
  • Learns to learn (2nd-order GD, TD)
  • Reason over time steps

Transformer scaling laws

  • The more data you train on:
    • the more the LLM knows
    • the better the LLM generalizes
  • scaling law = power law: \(y(x) = ax^{-\gamma} + b\)
  • \(y(x) =\) test loss
  • \(\gamma\) = slope (on a log-log plot); see the fitting sketch below
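A minimal sketch of fitting such a power law to measurements of test loss vs. training tokens, assuming scipy is available; the data points and starting values below are purely illustrative:

```python
# Fit the scaling law y(x) = a * x**(-gamma) + b to hypothetical (tokens, test loss) points.
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, gamma, b):
    return a * x ** (-gamma) + b

tokens = np.array([1e8, 1e9, 1e10, 1e11])   # hypothetical training set sizes
loss = np.array([4.2, 3.4, 2.9, 2.6])       # hypothetical test losses

(a, gamma, b), _ = curve_fit(power_law, tokens, loss, p0=[10.0, 0.1, 2.0], maxfev=10000)
print(f"a={a:.3g}, slope gamma={gamma:.3g}, irreducible loss b={b:.3g}")
```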

Baidu paper 2017

Chinchilla scaling law 2022

\(L=\) pretraining loss

Google 2022: paper1, paper2 — plots of Upstream (pretraining) loss and Downstream (accuracy on 17 tasks) performance vs. FLOPs and Params
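As a back-of-the-envelope illustration of the Chinchilla recipe (roughly ~20 training tokens per parameter at compute-optimality; the exact fitted ratio varies with the compute budget), a toy calculation:

```python
# Toy Chinchilla-style estimate: compute-optimal training uses roughly ~20 tokens per parameter
# (rule of thumb popularized by the Chinchilla paper; the exact ratio depends on the fit).
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    return n_params * tokens_per_param

for n in (1e9, 7e9, 70e9):
    print(f"{n / 1e9:.0f}B params -> ~{chinchilla_optimal_tokens(n) / 1e12:.2f}T tokens")
```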

Scaling laws and pruning

  • Recent models are not Chinchilla-optimal
  • SmolLMs improve accessibility
  • Quantization impacts scaling laws

“Scaling Laws for Precision” (Nov. 2024)

LLM pruning

  • Reducing LLM size and cost:
    • Quantization
    • Distillation
    • Pruning
    • Low-rank compression

Motivations

  • In many target applications, a lot of knowledge is not required
  • Might the information stored in LLMs be sparse?
    • Despite the fact that LLMs are trained on >10T words…
  • Lottery Ticket Hypothesis:
    • Each neural network contains a sub-network (winning ticket) that, if trained again in isolation, matches the performance of the full model.
  • Advantages:
    • Can remove ~90% of parameters with almost no loss in performance (on image tasks)
  • Drawbacks:
    • Impossible to find the winning mask without first training the large model
  • Can be applied to sparse fine-tuning (FT):

  • FT an LLM on specific task/lang

  • extract the mask = params that change the most (see the sketch below)

  • rewind the LLM and re-FT with the mask

  • sparse finetunes can be combined without overlapping!
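A minimal PyTorch-style sketch of the mask extraction step, assuming `base_state` and `ft_state` are the state dicts of the model before and after fine-tuning (the names and the keep ratio are illustrative):

```python
import torch

def extract_sparse_mask(base_state, ft_state, keep_ratio=0.01):
    """Keep only the parameters that moved the most during fine-tuning."""
    deltas = {name: (ft_state[name] - base_state[name]).abs() for name in base_state}
    flat = torch.cat([d.flatten() for d in deltas.values()])
    k = max(1, int(keep_ratio * flat.numel()))
    threshold = torch.topk(flat, k).values.min()   # global magnitude threshold
    # Binary mask: 1 where the parameter changed at least as much as the threshold
    return {name: (d >= threshold).float() for name, d in deltas.items()}

# "Rewind and re-FT with the mask": restart from the base weights and fine-tune again,
# e.g. multiplying each parameter's gradient by its mask after every backward().
```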

  • Low-rank approaches with LLMs:
    • LoRA: finetune within a lower-dimensional space (sketch after this list)
    • GaLore: low-rank projection of gradients
    • LORD: low-rank decomposition of weight matrices
    • CALDERA: joint quantization & low-rank projection
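As an example, a minimal LoRA-style module (a sketch, not the reference implementation; the rank and scaling values are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # only the low-rank factors are trained
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```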

Pruning generic LLMs

  • Can we prune a pretrained generic LLM so that it’s still generic?
  • Is there “room” for pruning, given all the knowledge the LLM has memorized?
  • \(\rightarrow\) Study the rank of parameters

Low-rank matrices

  • Weight matrices in LLMs are “slightly low-rank”, or “not totally full-rank”
  • Activations are low-rank!
  • Approximate \(W\) from its activations with a low-rank \(\Delta W\): \[\widehat{\Delta W} = \underset{{\Delta W}}{\mathrm{argmin}} \;\; \frac{1}{N}\sum\limits_{x \in \mathcal{D}}\|Wx - {\Delta W}x\|_{F}\]
  • Optimal solution: PCA of the activation covariance (see the sketch after this list): \[\Sigma = \underset{y \in \mathcal{Y}}{\mathbb{E}}\left[yy^T\right] - \mathbb{E}[y]\mathbb{E}[y]^T\]
  • Used for LSTM & BERT in (Chen,2021), for transformers in (Yu,2023), and see LORD in (Kaushal,2023)
  • Limitations of the activation approximation:
    • Only linear layers
    • Only “Teacher” distillation (aka Atomic feature distillation in (Yu,2023))
    • Only local distillation
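A rough numpy sketch of the PCA solution (all names are illustrative): project the outputs of a linear layer onto the top principal directions of their covariance, which yields two small factors in place of one large matrix.

```python
import numpy as np

def lowrank_from_activations(W, X, rank):
    """Approximate W (d_out x d_in) using the top principal directions of
    its output activations Y = W @ X, computed on calibration data X (d_in x N)."""
    Y = W @ X
    mu = Y.mean(axis=1, keepdims=True)
    cov = (Y - mu) @ (Y - mu).T / X.shape[1]        # covariance of the outputs
    _, eigvecs = np.linalg.eigh(cov)
    U = eigvecs[:, -rank:]                          # top-`rank` principal directions
    # Two factors (d_out x rank) and (rank x d_in) replace the single d_out x d_in matrix
    return U, U.T @ W

# Reconstruction: W_hat = U @ (U.T @ W); on data distributed like X, W_hat @ x ≈ W @ x.
```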

Next: work of Yaya Sy, Ph.D. student in Synalp

  • Proposal 1: Generalize to non-linear layers:
    • The minimization objective can be viewed as Feature Distillation
    • Replace SVD with gradient descent
    • Teacher module \(\mathcal{T}^{(i)}(X; \Theta^{(i)})\) and Student module \(\mathcal{S}^{(i)}(X; \Delta \Theta^{(i)})\)

\[\widehat{\Delta \Theta^{(i)}} = \underset{{\Delta \Theta^{(i)}}}{\mathrm{argmin}} \;\; \mathcal{L}^{(i)}(Y^{(i)}, \; \widehat{Y}^{(i)})\]

  • Because of instabilities, augment the L1 loss as in (Chang,2022):

\[\mathcal{L}^{(i)} = \sum_{t=1}^{b} \left[ \frac{1}{D} \left\| Y^{(i)}_{t} - \widehat{Y}^{(i)}_{t} \right\|_1 - \log \sigma \left( \cos \left( Y^{(i)}_{t}, \widehat{Y}^{(i)}_{t} \right) \right) \right]\]
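A PyTorch-style sketch of this augmented loss, assuming `Y` and `Y_hat` are the \((b, D)\) teacher and student outputs of module \(i\):

```python
import torch.nn.functional as F

def feature_distillation_loss(Y, Y_hat):
    """Scaled L1 distance plus a -log(sigmoid(cosine)) term, summed over the b tokens,
    following the augmented loss above."""
    b, D = Y.shape
    l1 = (Y - Y_hat).abs().sum(dim=-1) / D              # (b,)
    cos = F.cosine_similarity(Y, Y_hat, dim=-1)         # (b,)
    return (l1 - F.logsigmoid(cos)).sum()
```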

  • Trained with SGD
  • Converges towards the SVD solution in the linear case
  • Proposal 2: Better teacher/student inputs compromise
  • Teacher-only: fast convergence, but train/test mismatch
  • Student-only: no mismatch, but slow convergence due to error propagation
  • Teacher+Student: best compromise
  • Proposal 3: Module-level distillation
    • Support of non-linear layers \(\rightarrow\) any stack of layers
    • Better compromise between atomic/local distillation (matrix level) and global distillation (transformer level)
    • We experiment at the layer level
  • Should every layer have the same rank?
  • Bottom-first compression (sketch below):
    • Low memory requirements:
      • process 1 layer at a time
      • no backprop from the top needed
    • Low cost:
      • partial forward pass
      • initialize with SVD: little calibration data needed
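A high-level sketch of the bottom-first loop, under the assumption that `make_student` builds an SVD-initialized low-rank copy of a layer (all names here are illustrative, not the actual implementation):

```python
import torch
import torch.nn as nn

def compress_bottom_first(layers, calib_inputs, make_student, n_steps=100, lr=1e-3):
    """Toy bottom-first compression: distill each layer into a smaller student,
    one layer at a time, using only a partial forward pass (no backprop from the top)."""
    hidden = calib_inputs                       # activations entering the current layer
    compressed = []
    for layer in layers:
        with torch.no_grad():
            teacher_out = layer(hidden)         # teacher targets for this layer
        student = make_student(layer)           # e.g. SVD-initialized low-rank copy
        opt = torch.optim.SGD(student.parameters(), lr=lr)
        for _ in range(n_steps):                # local feature distillation
            opt.zero_grad()
            loss = (student(hidden) - teacher_out).abs().mean()
            loss.backward()
            opt.step()
        compressed.append(student)
        hidden = teacher_out                    # teacher inputs for the next layer (input compromise)
    return nn.Sequential(*compressed)
```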

Results

  • Compress Mixtral-48B, Gemma-27B on only 1xA100
  • Good results with Phi3-14B, Phi2-3B, Mistral-7B
  • Mixtral-48B now fits with 2048-context & batch=4 on a single A100!
  • Compress Mamba-3B, FalconMamba-7B, Whisper-med

Conclusion

  • Low-rank is everywhere in LLM tools
  • But it’s not enough to make LLMs commodities:
    • Quantization is more efficient
    • Hardware/software optimization is key!

Happy to chat! cerisara@loria.fr, @cerisara@mastodon.online