LLM4ALL
LLM compression
cost reduction

Christophe Cerisara

CNRS, LORIA, Synalp team

Why do we want to compress LLMs?

Larger models are always better!

Bottom line: reduce inference costs

How to reduce compute at test time

  • Algorithmic optimizations: advanced KV-cache management, FIRP, LAMPS, speculative decoding, FlashAttention-3, …, and optimized CPU kernels:

    MatMul kernel                          Throughput
    matmul in pure Python                    0.042 GFLOPS
    numpy (FORTRAN)                         29 GFLOPS
    reimplementation of numpy in C++        47 GFLOPS
    BLAS with multithreading                85 GFLOPS
    llama.cpp (matrix-vector focus)        233 GFLOPS
    Intel’s MKL (closed source)            384 GFLOPS
    OpenMP (512x512 matrix)                810 GFLOPS
    exported in llamafile                  790 GFLOPS
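
For intuition about where these gaps come from, here is a minimal timing sketch in Python (an illustration, not the actual benchmark behind the table above) contrasting a pure-Python matmul with numpy's call into BLAS:

```python
# Minimal sketch (illustration only, not the benchmark behind the table above):
# measure matmul throughput in GFLOPS for numpy/BLAS vs. a pure-Python loop.
import time
import numpy as np

def time_gflops(fn, n):
    t0 = time.perf_counter()
    fn()
    return 2 * n**3 / (time.perf_counter() - t0) / 1e9   # ~2*n^3 flops per n x n matmul

n = 512
A, B = np.random.rand(n, n), np.random.rand(n, n)
print(f"numpy/BLAS  : {time_gflops(lambda: A @ B, n):8.2f} GFLOPS")

m = 128                                  # smaller size: the pure-Python loop is very slow
a, b = A[:m, :m].tolist(), B[:m, :m].tolist()
def py_matmul():
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(m)] for i in range(m)]
print(f"pure Python : {time_gflops(py_matmul, m):8.3f} GFLOPS")
```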

All these optimizations are complementary to compression!

How to reduce VRAM requirements

  • Quantization: up to -75% memory (e.g. 16-bit \(\rightarrow\) 4-bit weights)! But is quantization hitting a ceiling?
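
As a minimal sketch of what weight quantization does (plain round-to-nearest here; real 4-bit schemes such as GPTQ or AWQ also use calibration data and pack two weights per byte):

```python
# Minimal sketch of naive symmetric round-to-nearest weight quantization.
# Storing 4-bit integers plus one scale per row instead of 16/32-bit floats
# is where the ~75% memory reduction comes from.
import numpy as np

def quantize(W, bits=4):
    qmax = 2 ** (bits - 1) - 1                             # e.g. 7 for signed 4-bit
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax    # one scale per output row
    Q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return Q, scale                                        # int8 container; real kernels pack 2 per byte

def dequantize(Q, scale):
    return Q.astype(np.float32) * scale

W = 0.02 * np.random.randn(4096, 4096).astype(np.float32)
Q, s = quantize(W, bits=4)
print("mean abs quantization error:", np.abs(W - dequantize(Q, s)).mean())
```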

(Digression) scaling laws

Scaling laws are the best thermometer/caliper we have for LLMs

Scaling laws govern LLM training:
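
For reference, the training power laws reported by Kaplan et al. (2020) (the figure source listed at the end) take the form

\[L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}\]

with fitted exponents around \(\alpha_N \approx 0.076\) for model size \(N\) and \(\alpha_D \approx 0.095\) for dataset size \(D\).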

Scaling laws govern test-time compute reasoning:

Scaling laws govern in-context learning:

Is compression better than quantization with respect to scaling laws?

Reducing LLM size

Compression needs calibration data
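
A minimal sketch of how calibration activations can be collected with a PyTorch forward hook (the toy model below is only a stand-in for a real LLM):

```python
# Minimal sketch of collecting calibration activations with a forward hook
# (PyTorch; the toy model stands in for a real LLM loaded elsewhere).
import torch
import torch.nn as nn

def collect_layer_inputs(model, layer, calib_batches):
    """Run calibration data through `model` and cache the inputs seen by `layer`."""
    cache = []
    hook = layer.register_forward_hook(lambda mod, inp, out: cache.append(inp[0].detach().cpu()))
    with torch.no_grad():
        for batch in calib_batches:           # a few hundred samples are often enough
            model(batch)
    hook.remove()
    return torch.cat(cache, dim=0)            # (N, d_in): used to fit the compressed layer

# Toy usage; with a real LLM, `layer` would be e.g. one MLP projection matrix.
model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
calib = [torch.randn(8, 64) for _ in range(10)]
X = collect_layer_inputs(model, model[2], calib)
print(X.shape)                                # torch.Size([80, 256])
```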

LLM Pruning: motivations

  • Is there still any free space in LLM matrices? (parameter-efficiency)
  • If not, we may still not need all of this information at test time
  • Pruning: remove “unused” or “superfluous” dimensions
  • Metric to measure “emptiness”: matrix rank
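
As a toy illustration of this rank metric (synthetic matrices, not real LLM weights), anticipating the two observations below:

```python
# Toy illustration (synthetic data) of the rank metric: numerical matrix rank
# of a weight-like matrix vs. a batch of activations confined to a subspace.
import numpy as np

W = np.random.randn(1024, 1024)                             # stand-in for a weight matrix
X = np.random.randn(4096, 64) @ np.random.randn(64, 1024)   # activations in a 64-dim subspace

print("rank(W):", np.linalg.matrix_rank(W))   # 1024: nearly full rank, as observed for LLM weights
print("rank(X):", np.linalg.matrix_rank(X))   # 64: the activations span a low-dimensional subspace
```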

LLM matrices are nearly full rank

But activations are low rank

  • Principle: find a low-rank matrix that minimizes the reconstruction error of the layer's outputs: \[\widehat{\Delta W} = \underset{\mathrm{rank}(\Delta W)\,\le\, r}{\mathrm{argmin}} \;\; \frac{1}{N}\sum\limits_{x \in \mathcal{D}}\|Wx - \Delta W x\|_{F}\]
  • Solution (closed form, for linear layers only): compute the covariance of the activations \[\Sigma = \underset{y \in \mathcal{Y}}{\mathbb{E}}\left[yy^T\right] - \mathbb{E}[y]\mathbb{E}[y]^T\] and keep its top eigen-directions (a minimal code sketch follows this list)
  • LoRD (Kaushal et al., 2023)
  • Our contributions:
    • Generalize to non-linear layers
      • Linear algebra \(\rightarrow\) Feature Distillation
    • Tunable compromise between local and global optimization
      • Local \(\rightarrow\) Flexible semi-global
    • Improved distillation
      • Teacher-only \(\rightarrow\) Teacher & Student supervision
    • Low-cost algorithm: bottom-first compression
  • Contribution: a better compromise between teacher and student inputs
  • Evidence: deeper layers are more robust to compression:
  • Bottom-first compression:
    • Low memory requirements:
      • Compress layers 1 by 1
      • No backprop
    • Low computational cost & sample-efficient:
      • Partial forward pass
      • SVD init: reduces data requirements
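
Below is a minimal sketch of the activation-aware SVD initialization idea (a data-whitened truncated SVD of one linear layer); the names and the whitening recipe are illustrative, not the exact algorithm of the NAACL'25 paper:

```python
# Minimal sketch (illustrative, not the exact published algorithm): compress one
# linear layer y = W x into y ≈ A (B x) using calibration activations X, by
# whitening with the activation second-moment matrix and truncating the SVD.
import numpy as np

def low_rank_factor(W, X, r, eps=1e-6):
    """W: (d_out, d_in) weights; X: (N, d_in) calibration inputs; r: target rank."""
    S = X.T @ X / len(X)                                  # second moment of the inputs
    vals, vecs = np.linalg.eigh(S + eps * np.eye(S.shape[0]))
    sqrt_S = vecs @ np.diag(np.sqrt(vals)) @ vecs.T       # whitening pair S^{1/2}, S^{-1/2}
    inv_sqrt_S = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    U, s, Vt = np.linalg.svd(W @ sqrt_S, full_matrices=False)
    A = U[:, :r] * s[:r]                                  # (d_out, r)
    B = Vt[:r] @ inv_sqrt_S                               # (r, d_in)
    return A, B                                           # r*(d_in+d_out) params instead of d_in*d_out

# Toy usage: one 1024x1024 layer, 2048 calibration vectors lying in a low-dim subspace.
W = 0.02 * np.random.randn(1024, 1024)
X = np.random.randn(2048, 64) @ np.random.randn(64, 1024)
A, B = low_rank_factor(W, X, r=128)
rel_err = np.linalg.norm(X @ W.T - X @ (A @ B).T) / np.linalg.norm(X @ W.T)
print(f"relative reconstruction error on calibration data: {rel_err:.4f}")   # close to 0 here
```

In the bottom-first setting described above, such a factorization would be applied layer by layer from the bottom of the network, re-collecting each layer's inputs with a partial forward pass after the previous layers have been compressed.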

Work published at NAACL’25

Results

  • Compress Mixtral-48B, Gemma-27B on 1xA100
  • Good results with Phi3-14B, Phi2-3B, Mistral-7B
  • Mixtral-48B can run on 1xA100 with a 2048-token context & batch size 4
  • Compress Mamba-3B, FalconMamba-7B, Whisper-med

Future works: updating LLMs

  • Continual learning is too costly:
    • Every piece of information must be seen ~1000 times during training
    • Forgetting increases linearly \(\rightarrow\) rehearsal is needed
  • Investigating gradient-free knowledge editing

 

Thank you!

cerisara@loria.fr

Sources of all figures

  • test-time scaling: https://openai.com/index/learning-to-reason-with-llms/ (scalingtest.png)
  • Kaplan’s training scaling laws: https://arxiv.org/pdf/2001.08361 (scale2.png)
  • scaling law for ICL: https://arxiv.org/html/2501.00070v1 (scalingicl.png)
  • quantization effect on scaling laws: https://arxiv.org/html/2411.04330v1 (quantlimits.png)
  • Sheared LLaMA: https://arxiv.org/pdf/2310.06694 (shearedllama.png)
  • low rank weights: https://www.alignmentforum.org/posts/PDLfpRwSynu73mxGw/basic-facts-about-language-model-internals-1 (wrank1.png)
  • forgetting when finetuning: https://openreview.net/pdf?id=0BMg0OgNTP (forgetting.png)
  • All other figures are from our team