Embeddings

class: center, middle
background-image:url(images/data-background-light.jpg)

# NLP Embeddings

## Nancy, 2023-2024

.footnote[.bold[[Christophe Cerisara](mailto:cerisara@loria.fr) CNRS / LORIA]]

---
.center[
## QCM1
]

.center[
<img src="qcm1.png" width="600cm"/>
]

---
.center[
## Word representations
]

.left-column[
  #### Discrete
]
.right-column[
- Words are symbols
- They need to be represented in $R^d$ for processing

]

---
.center[
## Word representations
]

.left-column[
  #### Discrete
]
.right-column[
- Words are symbols
- They need to be represented in $R^d$ for processing
- Simplest: $d=1$
    - cat=1, table=2, dog=3
    - natural distances (Euclidean, cosine...) are meaningless
    - d(cat,dog)=2, d(cat,table)=1
]

---
.center[
## Word representations
]

.left-column[
  #### Discrete
  #### One-hot vector
]
.right-column[
- Better: $d=|V|$
    - We want d(W1,W2)=d(W1,W3)
    - Put each word on the unit hypersphere
    - Make all pairs of vectors orthogonal
    - Symmetries: rotation, permutation...
]

---
.center[
## Word representations
]

.left-column[
  #### Discrete
  #### One-hot vector
]
.right-column[
- each word == a unit coordinate vector in a high-dimensional space

.center[
<img src="images/axes.svg" width="200cm"/>
]

]

---
.center[
## Word representations
]

.left-column[
  #### Discrete
  #### One-hot vector
]
.right-column[
- each word == a unit coordinate vector in a high-dimensional space

.center[
<img src="images/axes.svg" width="200cm"/>
]

**One-hot vectors**

.center[
<img src="images/onehot.svg" width="290cm"/>
]

]
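
---

.center[
## Word representations
]

A minimal sketch of one-hot vectors, assuming a made-up three-word vocabulary (the names `vocab` and `one_hot` are only for illustration). It checks the two properties asked for on the previous slides: all pairs are orthogonal and all words are at the same distance.

```python
import numpy as np

vocab = ["cat", "table", "dog"]              # toy vocabulary (assumption)
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the unit coordinate vector of `word` in R^|V|."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

cat, table, dog = (one_hot(w) for w in vocab)

# Dot products between different words are all zero (orthogonality)
print(cat @ dog, cat @ table)
# Euclidean distances between different words are all equal
print(np.linalg.norm(cat - dog), np.linalg.norm(cat - table))
```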

---
.center[
## Word representations
]

.left-column[
  #### Discrete
  #### One-hot vector
  #### Embeddings
]
.right-column[
- But having all words at the same distance is not ideal
- And we face the curse of dimensionality
]

---
.center[
## Word representations
]

.left-column[
  #### Discrete
  #### One-hot vector
  #### Embeddings
]
.right-column[
- But having all words at the same distance is not ideal
- And we face the curse of dimensionality

We want to find word vectors that encode part of lexical semantics:

.center[
<img src="images/lexsem.svg" width="200cm"/>
]

]

---

.center[
## Embeddings
]

.center[
<img src="images/chrono.svg" width="990cm"/>
]

- Long history, race since 2018
- Colors = types of approaches
- https://github.com/Separius/awesome-sentence-embedding ...

---

.center[
## Embeddings
]

.center[
<img src="images/chrono.svg" width="990cm"/>
]

Prehistory (?): vector space models

---
.center[
## Distributional semantics
]

.left-column[
  #### Distributional hypothesis
]
.right-column[

"You shall know a word by the company it keeps" [Firth, 1957]

]

---
.center[
## Distributional semantics
]

.left-column[
  #### Distributional hypothesis
]
.right-column[

"You shall know a word by the company it keeps" [Firth, 1957]

- Distributional semantics is a theory of meaning
- Vector Space Models are an implementation of DS
- Neural embeddings also !

]

---
.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models

]
.right-column[

- Term-document matrix gives words co-occurrence:

.tablematrix[
Lemma      | Doc1 | Doc2
-----------|------|-----
cat        | 5    |   2
dog        | 7    |   0
table      | 2    |   6
feline     | 3    |   0
]

- Dot-product between 2 vectors:
$$X \cdot Y = \sum\_i X_i Y_i$$
]

---
.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models

]
.right-column[

- terms are similar if they tend to occur in the same documents
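    - dot product of rows gives an (unnormalized) similarity between terms:
```
import numpy as np

# Rows of the term-document matrix: counts in (Doc1, Doc2)
cat = np.array([5, 2])
dog = np.array([7, 0])
table = np.array([2, 6])

print(np.dot(cat, dog))    # 5*7 + 2*0 = 35
print(np.dot(cat, table))  # 5*2 + 2*6 = 22
```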

]

---
.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models

]
.right-column[

Main issues with this basic term-document matrix:

- Dimensions quickly become very large
- Contains lots of noise

]

---
.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models
##### LSA
]
.right-column[
### Latent Semantic Analysis

Deerwester et al., 1990:

- Singular Value Decomposition
$$X\_{M\times N} = U\_{M\times k} \Sigma\_{k\times k} V^T\_{k\times N}$$
- $U$ projects the original term vectors into a subspace of dimension $k \le \min(M,N)$
- each row $t_i$ of $U$ corresponds to one term
- each column $d_j$ of $V^T$ corresponds to one document
- $\Sigma$ is diagonal = singular values: we just keep the $k$ largest
]

---
.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models
##### LSA
]
.right-column[
### Latent Semantic Analysis

- New term vectors = $\Sigma^{(k)} t_i$

- Dimensions get combined into the subspace:
    - handle synonymy: (cat, feline) becomes (1.9*cat + 0.2*feline)
]
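
---

.center[
## Distributional semantics
]

A small sketch of LSA with NumPy's SVD. The first two columns of `X` are the Doc1/Doc2 counts from the term-document slide; the last two columns are made-up documents added only so that a rank-2 truncation is not trivial. The choice `k = 2` is an assumption for illustration.

```python
import numpy as np

terms = ["cat", "dog", "table", "feline"]
# Term-document counts: columns 1-2 from the earlier slide,
# columns 3-4 are invented documents (assumption).
X = np.array([[5, 2, 0, 4],
              [7, 0, 1, 0],
              [2, 6, 5, 0],
              [3, 0, 0, 5]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ diag(s) @ Vt

k = 2                                    # number of latent dimensions kept
term_vectors = U[:, :k] * s[:k]          # row i = Sigma^(k) t_i
X_k = term_vectors @ Vt[:k]              # rank-k approximation of X

print(np.round(term_vectors, 2))
print("reconstruction error:", np.linalg.norm(X - X_k))
```

The reconstruction error equals the norm of the discarded singular values, which is why keeping the largest ones loses the least information.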

---
.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models
##### LSA
]
.right-column[
### Latent Semantic Analysis

- Deerwester et al., 1990
- Landauer, 1997: good results on the TOEFL synonym questions
- Turney, 2010: shows that dimensions encode lexical or topical meanings
]

---

.center[
## Embeddings
]

.center[
<img src="images/chrono.svg" width="990cm"/>
]

---
.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models
##### LSA
##### Random Indexing
]
.right-column[
### Random Indexing

- LSA issues:
    - SVD is costly
    - Need to retrain when adding documents !
- Sahlgren, 2006
    - Fast and online method
    - Obs: near-orthogonality of random vectors in high-dim space
    - Johnson-Lindenstrauss lemma: projection into a random lower-dimensional subspace ($k \ll N$) approx. preserves distances
    $$X\_{M\times N} R\_{N\times k} = Y\_{M\times k}$$
]

---
.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models
##### LSA
##### Random Indexing
]
.right-column[
### Random Indexing

$$X\_{M\times N} R\_{N\times k} = Y\_{M\times k}$$

- Achlioptas (2001): $R\_{N\times k}$ i.i.d. with 0-mean and 1-var
$$r\_{i,j} = +1,0,-1 \sim Multinomial(\frac{\epsilon/2}k,\frac{k-\epsilon}k,\frac{\epsilon/2}k)$$
- Implementation:
    - Start from a random index vector (one row of $R$) for every word
    - Sum the index vectors of the words that co-occur
- Extensions: hashed-RI (2011) ...
]
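
---

.center[
## Distributional semantics
]

A rough sketch of the projection above, with toy sizes and fake Poisson counts standing in for a real co-occurrence matrix (all values are assumptions). It also shows the incremental view: each row of `Y` is just the sum of the index vectors of the contexts seen with that term, so no retraining is needed when data is added.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, k, eps = 50, 2000, 300, 10    # terms, contexts, reduced dim, expected non-zeros per row

# Sparse ternary matrix: +1 or -1 with probability eps/(2k) each, 0 otherwise
p = eps / (2 * k)
R = rng.choice([1.0, 0.0, -1.0], size=(N, k), p=[p, 1 - 2 * p, p])

X = rng.poisson(0.05, size=(M, N)).astype(float)   # fake co-occurrence counts
Y = X @ R                                          # M x k reduced representation

# Incremental view: add the index vector R[j] each time term i meets context j
Y_incremental = np.zeros((M, k))
for i, j in zip(*np.nonzero(X)):
    Y_incremental[i] += X[i, j] * R[j]
```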

---
.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models
##### LSA
##### Random Indexing
]
.right-column[
### Other Vector Space Models

- Other derivatives of LSA:
    - Hyperspace Analogue to Language (HAL): (Burgess, 1997)
    - BEAGLE (Jones, 2007)
]

---
.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models
##### LSA
##### Random Indexing
##### Word context
]
.right-column[
### Term-context (co-occurrence) matrix

- 2 words co-occur if they are in the same sentence / window
- Extensions: syntactic context...
- Point-wise mutual information (Church, 1989)
    - Do words x and y co-occur more than if they were independent ?
    - Or TF-IDF (Sparck Jones 1972)
]
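
---

.center[
## Distributional semantics
]

A minimal sketch of point-wise mutual information on a made-up word-by-context count matrix `C`. The positive variant (`ppmi`), which clips negative values to zero, is a common choice but is not mentioned on the slide.

```python
import numpy as np

# Toy word-by-context co-occurrence counts (made-up numbers)
C = np.array([[10, 0, 2],
              [ 8, 1, 0],
              [ 0, 6, 7]], dtype=float)

p_xy = C / C.sum()                        # joint probability P(x, y)
p_x = p_xy.sum(axis=1, keepdims=True)     # marginal P(x) over words
p_y = p_xy.sum(axis=0, keepdims=True)     # marginal P(y) over contexts

with np.errstate(divide="ignore"):
    pmi = np.log2(p_xy / (p_x * p_y))     # -inf where the pair never co-occurs
ppmi = np.maximum(pmi, 0.0)               # positive PMI

print(np.round(ppmi, 2))
```

A positive entry means the pair co-occurs more often than it would if the two words were independent.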

---
.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models
##### LSA
##### Random Indexing
##### Word context
##### GloVe
]
.right-column[
### GloVe

- Pennington, 2014
- Trains a log-linear model on the words co-occurrence matrix $X$:

$$J=\sum\_{i,j} f(X\_{ij})( w\_i^Tw\_j + b\_i + b\_j - \log X\_{ij} )^2$$

- Intuition: the dot product between word vectors should approach $\log (X\_{ij})$
    - then, $(w\_i-w\_j)^Tw\_k = \log \frac{X\_{ik}}{X\_{jk}}$
    - $\simeq$ do $i$ and $j$ share the same contexts $k$ ?
- as good as Word-to-Vec !

]
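
---

.center[
## Distributional semantics
]

A sketch of the GloVe objective $J$ on random toy data. Two points go beyond the slide and are taken from the GloVe paper: separate context vectors (`Wc`, `bc`) and the weighting function `f` with `x_max=100` and `alpha=0.75`. The sizes and counts are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                                          # toy vocabulary size and dimension
X = rng.integers(0, 20, size=(V, V)).astype(float)   # fake co-occurrence counts

W = rng.normal(scale=0.1, size=(V, d))               # word vectors w_i
Wc = rng.normal(scale=0.1, size=(V, d))              # context vectors w_j
b, bc = np.zeros(V), np.zeros(V)                     # biases b_i, b_j

def f(x, x_max=100.0, alpha=0.75):
    """Weighting function: limits the influence of very frequent pairs."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, Wc, b, bc, X):
    mask = X > 0                                     # log X_ij only defined for observed pairs
    log_x = np.log(np.where(mask, X, 1.0))
    err = W @ Wc.T + b[:, None] + bc[None, :] - log_x
    return np.sum(f(X) * mask * err ** 2)

print(glove_loss(W, Wc, b, bc, X))
```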

---

.center[
## Embeddings
]

.center[
<img src="images/chrono.svg" width="990cm"/>
]

Step 2: the deep learning explosion

---

.center[
## Bayesian perspective
]

.left-column[
#### LDA
]
.right-column[
### Latent Dirichlet Allocation

- Blei, Ng & Jordan, 2003
- $\varphi$: proba of a word given a topic
- $\theta$: proba of topic given document

.center[
<img src="images/LDA.png" width="400cm"/>
]

]
---

.center[
## Bayesian perspective
]

.left-column[
#### LDA
]
.right-column[
### Latent Dirichlet Allocation

- Generative model:
  - sample topic $Z$ from the topic mixture $\theta$ of the document
  - sample word $W$ from the word distribution $\varphi$ of topic $Z$
  - Trained to maximize the proba of the observations
- Gives an interpretable "document embedding" $\theta$
- Basis of the **Topic Models** field
- 2016: LDA2Vec: merge LDA + W2V
- https://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/#topic=38&lambda=1&term=

]
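
---

.center[
## Bayesian perspective
]

A toy sketch of the LDA generative process only (no inference or training). The vocabulary, the number of topics and the Dirichlet parameters are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "feline", "table", "chair", "wood"]   # toy vocabulary
K, alpha, beta = 2, 0.5, 0.5                                  # topics and Dirichlet priors (assumptions)

phi = rng.dirichlet([beta] * len(vocab), size=K)   # phi[k] = P(word | topic k)

def generate_document(n_words):
    theta = rng.dirichlet([alpha] * K)             # theta = P(topic | document)
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)                 # sample a topic Z
        w = rng.choice(len(vocab), p=phi[z])       # sample a word W from that topic
        words.append(vocab[w])
    return theta, words

theta, doc = generate_document(8)
print(np.round(theta, 2), doc)
```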

---

.center[
## Neural perspective
]

.left-column[
#### Collobert
]
.right-column[
### Word embeddings

- Bengio proposed the term "word embedding" in 2003, as a by-product of a neural language model
- But Collobert showed in "A unified architecture for natural language processing" (2008) that, when trained on a sufficiently large dataset, they carry semantic meaning and may be used in downstream tasks.
]

---

.center[
## Neural perspective
]

.left-column[
#### Collobert
]
.right-column[
.center[
### Collobert embeddings

<img src="images/collobert1.png" width="400cm"/>
]

]

---

.center[
## Neural perspective
]

.left-column[
#### Collobert
]
.right-column[
.center[
### Collobert embeddings

<img src="images/collobert2.png" width="400cm"/>
]

]

---

.center[
## Neural perspective
]

.left-column[
#### Collobert
#### Mikolov
]
.right-column[
### Word-to-Vec

- Mikolov, 2013
- Most famous word embeddings, because:
    - Released a very fast C code
    - New approximations to make it faster (negative sampling, hierarchical softmax...)
    - Training on large datasets becomes super-easy
    - Big companies start to pretrain W2V on huge datasets and distribute them for transfer learning
]

---

.center[
## Neural perspective
]

.left-column[
#### Collobert
#### Mikolov
]
.right-column[
### Word-to-Vec

.center[
<img src="images/word2vec.png" width="600cm"/>
]

]

---

.center[
## Cosine similarity
]

.left-column[
#### Cosine
]
.right-column[
### Cosine similarity

- How to measure the similarity between word vectors ?
- Issue with dot-product: longer vectors -> larger values
- Most common measure: **cosine similarity**
- can be computed efficiently with the dot product:

$$cos(a,b)=\frac{a \cdot b}{||a|| \times ||b||}$$

]
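
---

.center[
## Cosine similarity
]

A short sketch of the formula, reusing the `cat` and `dog` count vectors from the term-document slide. The second call illustrates why the cosine avoids the length issue of the raw dot product.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product of the length-normalised vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat = np.array([5, 2])
dog = np.array([7, 0])

print(cosine(cat, dog))         # similar directions give a value near 1
print(cosine(cat, 10 * dog))    # rescaling a vector does not change the cosine
```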

---

.center[
## So far, so good ?
]

- Word embeddings capture part of lexical semantics
- They are helpful in downstream tasks (**transfer learning**)

Examples:

- Predicting sentiments
- Compute POS tags, detect Named Entities
- Syntactic parsing
- Question-Answering, translation, summarization...
- ...

---

.center[
## So far, so good ?
]

- Word embeddings capture part of lexical semantics
- They are helpful in downstream tasks (**transfer learning**)
- But...
    - What about Out-of-vocabulary words ?
    - What about polysemy ?
    - What about context-dependent meaning ?
    - What about multiple languages ?
    - What about multi-word expressions ?
    - What about sentence embeddings ?

---

.center[
## Embeddings
]

.center[
<img src="images/chrono.svg" width="990cm"/>
]

---

.center[
## Contextual Embeddings
]

How to handle polysemy ?

- With context-dependent word embeddings
    - 2018: the "NLP's ImageNet moment"
- Problem:
    - You have to distribute a complete **model**, which you have to run on your data and which returns a vector per word
        - constrains the programming language
        - much harder to "fine-tune"
    - But still, this model has been trained on a huge dataset, and the returned embeddings encode all of this information

---

.center[
## The rise of Language Models
]

.left-column[
#### LM
#### n-gram
]
.right-column[
- Language models are the basis of all modern embeddings !
- Given past words, an LM predicts the next word:
- Basic LM: n-grams
]

.center[
<img src="images/LM1.jpg" width="800cm"/>
]
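
---

.center[
## The rise of Language Models
]

A minimal bigram (n = 2) language model, assuming a tiny made-up corpus and plain maximum-likelihood counts without smoothing. Real n-gram models need smoothing for unseen histories.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()   # toy corpus

# Count how often each word follows each one-word history
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_probs(prev):
    """Maximum-likelihood estimate of P(next | prev), without smoothing."""
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()}

print(next_word_probs("the"))
print(next_word_probs("sat"))
```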

---

.center[
## The rise of Language Models
]

.left-column[
#### LM
#### n-gram
#### NN-LM
]
.right-column[
- feed forward NN, recurrent NN, transformers...
]

.center[
<table>
<tr>
<td>
<img src="images/mlpLM.png" width="380cm" />
</td>
<td style="width:6cm">
</td>
<td>
<img src="images/rnnLM.png" width="380cm" />
</td>
</tr>
</table>
]

---

.center[
## The rise of Language Models
]

.left-column[
#### LM
#### n-gram
#### NN-LM
]
.right-column[
- Feed-forward network:
  - look at fixed-size history window
  - number of parameters increases with window size
  - not robust to insertions/deletions/...
- RNN:
  - look at "all" history (in practice, about 100 words)
  - number of parameters is constant
  - robust to insertions/deletions/...
  - gives more importance to most recent history
]

---

.center[
## The rise of Language Models
]

.left-column[
#### LM
#### n-gram
#### NN-LM
]
.right-column[
- LSTM/GRU: same as RNN, but...
  - look at "all" history (in practice, about 1000 words)
  - thanks to gated, additive cell-state updates (akin to skip connections) that reduce vanishing gradients
- Transformers:
  - look at fixed-size history window
  - number of parameters is constant
  - robust to insertions/deletions/...
  - gives same importance to all history
]

---

.center[
## Contextual Embeddings: ELMo
]

.left-column[
#### ELMO
]
.right-column[

- From *Allen-NLP* (2018): Huge improvements
    - Character-based;
    - Trained to predict the next word (Language Model)
    - "Bi-directional" RNN, but both directions are trained separately
]

---

.center[
## Contextual Embeddings: ELMo
]

.left-column[
#### ELMO
]
.right-column[

- the word embedding is a weighted combination of hidden representations from every layer
]

<img src="images/elmo.jpg" width="1000cm"/>
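
---

.center[
## Contextual Embeddings: ELMo
]

A sketch of the layer mixing only, with random arrays standing in for the hidden states of the language model. Following the ELMo paper, the layer weights come from a softmax over task-specific scores and are multiplied by a global scale; here both are left at their initial values instead of being learned.

```python
import numpy as np

rng = np.random.default_rng(0)
L, T, d = 3, 5, 8                      # layers, tokens, hidden size (toy values)
h = rng.normal(size=(L, T, d))         # stand-in for the hidden states of one sentence

a = np.zeros(L)                        # per-layer scores (learned in practice)
gamma = 1.0                            # global scale (learned in practice)

s = np.exp(a - a.max())
s /= s.sum()                           # softmax: layer weights that sum to 1

elmo = gamma * np.tensordot(s, h, axes=1)   # shape (T, d): one vector per token
print(elmo.shape)
```

With all scores equal, the result is simply the average of the layers; training lets each downstream task emphasise the layers it finds useful.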

---

.center[
## Attention is all you need
]

.left-column[
#### ELMO
#### BERT
]
.right-column[

- attention: originally on top of RNN
- transformers: no more recurrence

<img src="../images/attention.png" width="300cm"/>
]

---

.center[
## Attention is all you need
]

.left-column[
#### ELMO
#### BERT
]
.right-column[
]

<img src="../images/transformer0.png" width="800cm"/>

---

.center[
## Attention is all you need
]

.left-column[
#### ELMO
#### BERT
]
.right-column[

<img src="../images/transformer.png" width="350cm"/>

]

---

.center[
## Contextual Embeddings: BERT
]

.left-column[
#### ELMO
#### BERT
]
.right-column[

- Exploit *Transformers*
    - Replace LM objective by "fill in the masked words"
    - Trains both directions simultaneously
    - Represent input as **subwords**
]

<img src="images/bert.png" width="900cm"/>

---

.center[
## Contextual Embeddings: GPT
]

.left-column[
#### ELMO
#### BERT
#### GPT
]
.right-column[

- GPT is a classical Language Model (left-right), but based on transformers.
- subword units: **Byte-Pair Encoding**
- Fine-tune the base model on target task for transfer learning

OpenAI: "GPT2: the AI that's too dangerous to release"

- GPT-1 = ULMFit + Transformer
- GPT-2 = GPT-1 + reddit + gpus

]

---

.center[
## Contextual Embeddings: XLNet
]

.left-column[
#### ELMO
#### BERT
#### GPT
#### XLNet
]
.right-column[

- XLNet = Google/CMU
- Based on BERT: "improves upon BERT on 20 tasks"
    - Get rid of the artificial MASK token
- Uses the *Transformer-XL* = Transformer with recurrence (pass hidden states between seqs)

]

---

.center[
## Contextual Embeddings: XLNet
]

.left-column[
#### ELMO
#### BERT
#### GPT
#### XLNet
]
.right-column[

- Killer idea: Permutation Language Model
    - Predict tokens in random order, accumulate them to build the context
    - Forces the model to use both directions simultaneously

<img src="images/xlnet.gif" width="600cm"/>

]

---

.center[
## Contextual Embeddings

<img src="images/t-nlg.png" width="900cm"/>
]

---

.center[
## Contextual Embeddings

<img src="images/gpt3.jpg" width="900cm"/>
]

---

.center[
## Foundational models

<img src="images/PLM2022.jpeg" width="900cm"/>
]

---

.center[
## Embeddings
]

.center[
<img src="images/chrono.svg" width="990cm"/>
]

Step 3: word, contextual word, what about full text ?

---

.center[
## Sentence Embeddings: Averaging
]

.left-column[
#### Averaging
]
.right-column[

Just average the word embeddings in the sentence !

- Old baseline, but still [hard-to-beat](https://openreview.net/forum?id=SyK00v5xx)
]
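
---

.center[
## Sentence Embeddings: Averaging
]

A minimal sketch of plain averaging, assuming a tiny made-up embedding table `emb` (in practice the vectors would be loaded from a pretrained model). It does not include the reweighting used in the linked paper.

```python
import numpy as np

# Tiny made-up embedding table (assumption)
emb = {
    "the": np.array([0.1, 0.0, 0.2]),
    "cat": np.array([0.9, 0.1, 0.0]),
    "sleeps": np.array([0.2, 0.8, 0.1]),
}

def sentence_embedding(sentence, dim=3):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [emb[w] for w in sentence.lower().split() if w in emb]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

print(sentence_embedding("The cat sleeps"))
```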

---

.center[
## Sentence Embeddings: Language Model
]

.left-column[
#### Averaging
#### NN-LM
]
.right-column[

- Bengio (2003): a NN-LM learns simultaneously
    - words representations
    - sequence of words probabilities
- Google has released [nnlm-en-dim128](https://tfhub.dev/google/nnlm-en-dim128/1)
    - trained on Google News 200B
    - maps any sentence into 128-dimensional embeddings

]

---

.center[
## Sentence Embeddings: Doc2Vec
]

.left-column[
#### Averaging
#### NN-LM
#### Doc2Vec
]
.right-column[

- Proposed by Le & Mikolov (2014): [Distributed Representations of Sentences and Documents](https://arxiv.org/abs/1405.4053)

.center[
<img src="images/doc2vec.png" width="800cm"/>
]

]

---

.center[
## Sentence Embeddings: Skip-thought
]

.left-column[
#### Averaging
#### Doc2Vec
#### NN-LM
#### Skip-Thought
]
.right-column[

- Encoder-Decoder that regenerates the surrounding sentences
.center[
<img src="images/skipthought.png" width="600cm"/>
]
]

---

.center[
## Sentence Embeddings: Quick-thought
]

.left-column[
#### Averaging
#### Doc2Vec
#### NN-LM
#### Skip-Thought
#### Quick-Thought
]
.right-column[

- Replace decoder by a classifier
.center[
<img src="images/quickthought.png" width="600cm"/>
]
]

---

.center[
## Sentence Embeddings: InferSent
]

.left-column[
#### Averaging
#### Doc2Vec
#### NN-LM
#### Skip-Thought
#### Quick-Thought
#### InferSent
]
.right-column[

- Supervised encoder, trained on the Stanford Natural Language Inference (SNLI) dataset

.center[
<img src="images/infersent.png" width="400cm"/>
]
]

---

.center[
## Sentence Embeddings: Universal Sentence Encoder
]

.left-column[
#### Averaging
#### Doc2Vec
#### NN-LM
#### Skip-Thought
#### Quick-Thought
#### InferSent
#### Universal
]
.right-column[

- 2018: [Universal Sentence Encoder](https://arxiv.org/abs/1803.11175)
- 2 fast models trained on many tasks:
    - Transformer
    - Deep Averaging Network
- Produce 512-dimensional embeddings for any text

]

---

.center[
## BERT Sentence Embeddings
]

.left-column[
#### Averaging
#### Doc2Vec
#### NN-LM
#### Skip-Thought
#### Quick-Thought
#### InferSent
#### Universal
#### BERT
]
.right-column[

- Adds a special token CLS
- its embedding does not depend on a single word, but on all words of the sentence
- Good for classification
- But bad for comparing/interpreting embeddings through distances!

]

---

.center[
## Sentence-BERT (S-BERT)
]

- [EMNLP, 2019]
- Training 1: classification

.center[
<img src="../images/sbert.png" width="350cm"/>
]

---

.center[
## Sentence-BERT (S-BERT)
]

- Trained on NLI task
- Training 2: triplet loss
$$\max(||s_a-s_p|| - ||s_a-s_n|| + \epsilon,0)$$
- makes related pairs $(s_a,s_p)$ closer than unrelated pairs $(s_a,s_n)$

---

.center[
## Note on contrastive training
]

- Standard training:
    - embeddings -> layer -> scores per class -> softmax -> proba
    - optimizes classification accuracy
    - so embedding space can be anything

---

.center[
## Note on contrastive training
]

- Contrastive training:
    - aka ranking loss, metric learning
    - learns the *distance* between embeddings
    - so embedding space can be interpreted

---

.center[
## Note on contrastive training
]

- Pair-wise Contrastive training:
    - when 2 samples belong to the same class, their embeddings should be close
    - when 2 samples belong to different classes, their embeddings should be far
    - embeddings -> distance -> loss

---

.center[
## Note on contrastive training
]

- Triplet loss:
    - anchor $a$
    - positive sample $p$
    - negative sample $n$
    - $d(a,p)$ should be smaller than $d(a,n)$

$$\max(||s_a-s_p|| - ||s_a-s_n|| + \epsilon,0)$$
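
---

.center[
## Note on contrastive training
]

A direct transcription of the triplet loss above for a single triplet, with made-up 2-d vectors and a margin of 1 chosen for illustration.

```python
import numpy as np

def triplet_loss(s_a, s_p, s_n, eps=1.0):
    """max(||s_a - s_p|| - ||s_a - s_n|| + eps, 0) for one triplet."""
    d_pos = np.linalg.norm(s_a - s_p)
    d_neg = np.linalg.norm(s_a - s_n)
    return max(d_pos - d_neg + eps, 0.0)

anchor = np.array([0.0, 0.0])
positive = np.array([0.0, 1.0])
negative = np.array([3.0, 0.0])

print(triplet_loss(anchor, positive, negative))   # margin satisfied: no loss
print(triplet_loss(anchor, negative, positive))   # roles swapped: margin violated
```

The loss is zero as soon as the negative is farther from the anchor than the positive by at least the margin, so well-separated triplets stop contributing gradients.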

---

.center[
## Note on contrastive training
]

- Can be used for *instance-based classification*:
    - annotate manually 1 example per class
    - unknown sample: compare the distance with every class prototype
    - pick the class of the closest one
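
---

.center[
## Note on contrastive training
]

A sketch of the nearest-prototype rule described above. The class names and the 2-d prototype vectors are invented; in practice each prototype would be the embedding of one manually annotated example.

```python
import numpy as np

# One manually labelled embedding per class (made-up 2-d vectors)
prototypes = {
    "sports": np.array([1.0, 0.0]),
    "politics": np.array([0.0, 1.0]),
}

def classify(x):
    """Return the class whose prototype is closest to x (Euclidean distance)."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

print(classify(np.array([0.9, 0.2])))
```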

---

.center[
## Tools
]

- Gensim
- SpaCy
- FastText
- SentEval: Transfer learning tasks to evaluate embeddings
- Huggingface: transformers

---

.center[
## Gensim
]

.left-column[
#### Gensim
]
.right-column[

- Oldest python lib for embeddings (started in 2008) from Radim Řehůřek (CZ)
- Designed for semantic/topic modelling
- Includes models: LDA, LSI, TFIDF, W2V, DOC2VEC, FastText...
- Includes corpora: text8...

Get all the available datasets and models:
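```
import gensim.downloader as api
info = api.info()  # dict with "corpora" and "models" entries
print(list(info["corpora"]), list(info["models"]))
```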

- See https://radimrehurek.com/gensim

]

---

.center[
## SpaCy
]

.left-column[
#### Gensim
#### SpaCy
]
.right-column[

- from 2015
- focused on modern NLP, including deep learning models (work with tensorflow, pytorch...)
- Includes recent pretrained models: BERT, ULMFiT, XLNET...
- Very active since 2018

- See https://spacy.io/
]

---

.center[
## FastText (Facebook)
]

.left-column[
#### Gensim
#### SpaCy
#### FastText
]
.right-column[

- Fast training of Skipgram / CBOW word embeddings
- Written in C++ but can be integrated into python
- [Combine several tricks](https://arxiv.org/pdf/1712.09405.pdf) to improve embeddings
    - subsample frequent words: $p_{discard} = 1-\sqrt{\alpha/f_w}$
    - position-dependent features (CBOW): train a weight per position in the context window, then compute a weighted average of the word vectors in the context
    - phrase representations: merge ngrams with high mutual information into a single token
    - add subword information:
        - decompose words into char-ngrams
        - one embedding per char-ngram
        - final word vector = $w + \frac 1 N \sum_n^N c_n$
]
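
---

.center[
## FastText (Facebook)
]

A rough sketch of the subword idea, not the library's implementation. The lazily filled `table`, the random initialisation and the n-gram range are assumptions for illustration; the real fastText hashes n-grams into a fixed number of buckets and learns the vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
table = {}                                  # hypothetical embedding store, filled lazily

def vec(kind, key):
    """Look up (or randomly initialise) the vector of a word or an n-gram."""
    if (kind, key) not in table:
        table[(kind, key)] = rng.normal(size=dim)
    return table[(kind, key)]

def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of the word wrapped in boundary markers < and >."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]

def word_vector(word):
    grams = char_ngrams(word)
    # final word vector = w + (1/N) * sum of the N n-gram vectors c_n
    return vec("word", word) + np.mean([vec("gram", g) for g in grams], axis=0)

print(char_ngrams("cat", 3, 3))
print(word_vector("cat"))
```

Because n-grams are shared across words, an out-of-vocabulary word can still receive a vector from the n-grams it has in common with known words.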

---

.center[
## FastText
]

.left-column[
#### Gensim
#### SpaCy
#### FastText
]
.right-column[
- Includes a text classifier:
    - Embeddings + linear model + softmax
    - Simple, but competitive with state-of-the-art
- Extremely fast

.center[
<img src="images/fasttextspeed.png" width="700cm"/>
]

- Has a python implementation, but it's not officially supported
- Distributes word vectors for 157 languages, and multi-lingual word vectors in 44 languages (see also XLMR)

]

---

.center[
## SentEval
]

.left-column[
#### Gensim
#### SpaCy
#### FastText
#### SentEval
]
.right-column[

- From Facebook (Alexis Conneau): a toolkit to evaluate the quality of sentence embeddings
- see https://github.com/facebookresearch/SentEval
- Includes skipthought, Google-USE and their own InferSent encoders
- Makes it "easy" to evaluate transfer learning with embeddings on more than 20 tasks: MR, TREC, SST...

]

---

.center[
## HuggingFace
]

.left-column[
#### Gensim
#### SpaCy
#### FastText
#### SentEval
#### HuggingFace
]
.right-column[

- HuggingFace started as a company making chatbots
- Released the **transformers** library
    - https://github.com/huggingface/pytorch-transformers
- Includes the most recent contextual word embeddings:
    - BERT (from Google), RoBERTa, DeBERTa...
    - GPT (from OpenAI), GPT2, GPT-J, GPT-NeoX, Bloom
    - Transformer-XL (from Google/CMU)
    - XLNet (from Google/CMU)
    - XLM (from Facebook), XLM-R, MBERT...
    - T5, BART, T0pp...
    - ...
]

---

.center[
## Conclusions about the tools
]

- Research on embeddings since 2018 is **extremely** active
    - New models appear every few months (*The ImageNet effect*)
- Open-source implementations are released nearly immediately
- So the software landscape for embeddings will still evolve !
    - Don't become an "expert" with one tool, or you'll get stuck
    - Better look for the most appropriate tool at the moment for your task

---

.center[
## Limitations
]

- Largest models are the best
    - but they're owned by private companies
    - Some organizations release open-source models (Meta, EleutherAI, Huggingface...)
- We do not know how to reduce bias, control hallucinations, guarantee correctness
- Many styles are not well understood: poetry...
- Costs!! (inference, training, updating...)

---

.center[
## Conclusions about the tools
]

- Recommendations:
    - **Do** use the best embeddings for your research tasks in NLP
    - Don't train embeddings ! You can't...
    - trend: python; tensorflow or pytorch for maximum flexibility

.center[
<img src="images/pytorch.jpg" width="900cm"/>
]

---

.center[
## Conclusions about the tools
]

.center[
<img src="images/pytorch.png" width="800cm"/>
]

---

.center[
## Conclusions about the tools
]

.center[
<img src="../images/jax.png" width="800cm"/>
]

---

name: last-page
class: middle, center, inverse

"Gemini has been released by Google!"

## That's all folks (for now)!

Slideshow created using [remark](http://github.com/gnab/remark).