Christophe Cerisara
2024/2025
cf. https://arxiv.org/pdf/2409.15790v1

| Year | Authors | Contribution |
|---|---|---|
| 2014 | Graves et al. | attention for Neural Turing Machines |
| 2014 | Bahdanau et al. | additive attention, application to NLP (machine translation) |
| 2015 | Luong et al. | multiplicative attention, application to NLP (machine translation) |
| 2015 | Xu et al. | soft/global & hard/local |
| 2016 | Cheng, Dong and Lapata | self-att LSTMN |
| 2017 | Vaswani et al. | transformer, 120k citations |
Translation into the key/query/value (KQV) formulation:
\[\alpha_i = \frac {\exp(\text{score}(q,k_i))} {\sum_j \exp(\text{score}(q,k_j))}\]
\[v' = \sum_i \alpha_i v_i\]
| name | score | ref |
|---|---|---|
| content-based | cosine\((q,k)\) | Graves14 |
| additive | \(v^T \tanh (W[q,k])\) | Bahdanau15 |
| location-based | \(\alpha = \text{softmax}(Wq)\) | Luong15 |
| general | \(q^T W k\) | Luong15 |
| dot-product | \(q^T k\) | Luong15 |
| scaled dot-product | \(\frac {q^T k} {\sqrt{d}}\) | Vaswani17 |
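
A minimal numpy sketch of the recipe above: compute scores between a query and all keys, softmax them into weights \(\alpha_i\), and return the weighted sum of the values. Only the parameter-free scores from the table (content-based, dot-product, scaled dot-product) are sketched; all names and shapes are illustrative.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def attend(q, K, V, score="scaled_dot"):
    """Attention for one query q (d,) against keys K (N, d) and values V (N, d_v).

    Returns v' = sum_i alpha_i v_i.
    """
    d = q.shape[-1]
    if score == "dot":            # q^T k            (Luong15)
        s = K @ q
    elif score == "scaled_dot":   # q^T k / sqrt(d)  (Vaswani17)
        s = K @ q / np.sqrt(d)
    elif score == "content":      # cosine(q, k)     (Graves14)
        s = (K @ q) / (np.linalg.norm(K, axis=-1) * np.linalg.norm(q) + 1e-9)
    else:
        raise ValueError(score)
    alpha = softmax(s)            # attention weights alpha_i
    return alpha @ V              # weighted sum of values

# toy example (random data, illustrative only)
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(5, 4)), rng.normal(size=(5, 8))
print(attend(q, K, V).shape)      # (8,)
```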

Self-attention with scaled dot-product:
\[V'=\text{softmax}\left(\frac {QK^T}{\sqrt{d}}\right)V\]
from https://slds-lmu.github.io/seminar_nlp_ss20/attention-and-self-attention-for-nlp.html
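
A minimal sketch of the matrix form above, assuming a single head, no masking, and illustrative projection matrices \(W_q, W_k, W_v\):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: V' = softmax(Q K^T / sqrt(d)) V.

    X: (N, d_model) token representations; Wq, Wk: (d_model, d); Wv: (d_model, d_v).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (N, N) score matrix
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # row-wise softmax
    return A @ V                                   # (N, d_v)

rng = np.random.default_rng(0)
N, d_model, d = 6, 16, 8
X = rng.normal(size=(N, d_model))
out = self_attention(X,
                     rng.normal(size=(d_model, d)),
                     rng.normal(size=(d_model, d)),
                     rng.normal(size=(d_model, d)))
print(out.shape)  # (6, 8)
```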
| Layer type | Complexity per layer | Sequential ops |
|---|---|---|
| recurrent | \(O(nd^2)\) | \(O(n)\) |
| convolutional | \(O(knd^2)\) | \(O(1)\) |
| transformer | \(O(n^2d)\) | \(O(1)\) |
| sparse transformer | \(O(n\sqrt{n})\) | \(O(1)\) |
| reformer | \(O(n\log n)\) | \(O(\log n)\) |
| linformer | \(O(n)\) | \(O(1)\) |
| linear transformer | \(O(n)\) | \(O(1)\) |

\[Q,K \in \mathbb{R}^{N\times d} \qquad QK^T \in \mathbb{R}^{N\times N}\]
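
The \(N\times N\) score matrix is what produces the \(O(n^2 d)\) row in the table above: a small sketch (illustrative sizes) of how its memory grows quadratically with sequence length.

```python
import numpy as np

# Full attention materializes an N x N score matrix: quadratic in sequence length.
d = 64
for N in (512, 2048, 8192):
    Q = np.random.randn(N, d).astype(np.float32)
    K = np.random.randn(N, d).astype(np.float32)
    S = Q @ K.T                        # shape (N, N)
    print(N, S.shape, f"{S.nbytes / 2**20:.0f} MiB")
# 512  (512, 512)     1 MiB
# 2048 (2048, 2048)  16 MiB
# 8192 (8192, 8192) 256 MiB
```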

Matrix-multiplication throughput of successive implementations:

| Implementation | GFLOPS |
|---|---|
| pure Python matmul | 0.042 |
| numpy (FORTRAN) | 29 |
| reimplementation of numpy in C++ | 47 |
| BLAS with multithreading | 85 |
| llama.cpp (focus on matrix-vector products) | 233 |
| Intel's MKL (closed source) | 384 |
| OpenMP (512x512 matrix) | 810 |
| exported in llamafile | 790 |
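
Exact figures depend on hardware, matrix size and BLAS backend; a rough sketch of how such throughput numbers can be measured (an \(n\times n\) matmul costs about \(2n^3\) FLOPs):

```python
import time
import numpy as np

def matmul_gflops(n=512, repeats=20):
    """Rough GFLOPS estimate for an n x n float32 matmul (2*n^3 FLOPs per product)."""
    A = np.random.randn(n, n).astype(np.float32)
    B = np.random.randn(n, n).astype(np.float32)
    A @ B                                   # warm-up
    t0 = time.perf_counter()
    for _ in range(repeats):
        A @ B
    dt = (time.perf_counter() - t0) / repeats
    return 2 * n**3 / dt / 1e9

print(f"{matmul_gflops():.1f} GFLOPS")      # result depends on CPU and BLAS backend
```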