class: center, middle
background-image:url(images/data-background-light.jpg)

# NLP Embeddings

## Nancy, 2023-2024

.footnote[.bold[[Christophe Cerisara](mailto:cerisara@loria.fr) CNRS / LORIA]]

---

.center[
## Quiz 1
]

.center[
]

---

.center[
## Word representations
]

.left-column[
#### Discrete
]

.right-column[
- Words are symbols
- They need to be represented in $R^d$ for processing
]

---

.center[
## Word representations
]

.left-column[
#### Discrete
]

.right-column[
- Words are symbols
- They need to be represented in $R^d$ for processing
- Simplest: $d=1$
  - cat=1, table=2, dog=3
  - natural distances (Euclidean, cosine...) are meaningless
  - e.g. d(cat,dog)=2 but d(cat,table)=1
]

---

.center[
## Word representations
]

.left-column[
#### Discrete
#### One-hot vector
]

.right-column[
- Better: $d=|V|$
  - We want d(W1,W2)=d(W1,W3)
  - Put each word on the unit hypersphere
  - Make all pairs of vectors orthogonal
  - Symmetries: rotation, permutation...
]

---

.center[
## Word representations
]

.left-column[
#### Discrete
#### One-hot vector
]

.right-column[
- each word == a unit coordinate vector in a high-dimensional space

.center[
]
]

---

.center[
## Word representations
]

.left-column[
#### Discrete
#### One-hot vector
]

.right-column[
- each word == a unit coordinate vector in a high-dimensional space

.center[
]

**One-hot vectors**

.center[
]
]

---

.center[
## Word representations
]

.left-column[
#### Discrete
#### One-hot vector
#### Embeddings
]

.right-column[
- But having all words at the same distance is not ideal
- And we face the curse of dimensionality
]

---

.center[
## Word representations
]

.left-column[
#### Discrete
#### One-hot vector
#### Embeddings
]

.right-column[
- But having all words at the same distance is not ideal
- And we face the curse of dimensionality

We want to find word vectors that encode part of lexical semantics:

.center[
]
]

---

.center[
## Embeddings
]

.center[
]

- Long history, race since 2018
- Colors = types of approaches
- https://github.com/Separius/awesome-sentence-embedding ...

---

.center[
## Embeddings
]

.center[
]

Prehistory (?): vector space models

---

.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
]

.right-column[
"You shall know a word by the company it keeps" [Firth, 1957]
]

---

.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
]

.right-column[
"You shall know a word by the company it keeps" [Firth, 1957]

- Distributional semantics is a theory of meaning
- Vector Space Models are an implementation of DS
- So are neural embeddings!
]

---

.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models
]

.right-column[
- The term-document matrix gives word co-occurrences:

.tablematrix[
Lemma      | Doc1 | Doc2
-----------|------|-----
cat        | 5    | 2
dog        | 7    | 0
table      | 2    | 6
feline     | 3    | 0
]

- Dot-product between 2 vectors:

$$X \cdot Y = \sum\_i X_i Y_i$$
]

---

.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models
]

.right-column[
- terms are similar if they tend to occur in the same documents
- the dot product of two rows gives an (unnormalized) similarity between terms:

```
import numpy

# Rows of the term-document matrix: counts in (Doc1, Doc2)
cat = numpy.array([5, 2])
dog = numpy.array([7, 0])
table = numpy.array([2, 6])

print(numpy.dot(cat, dog))    # 35
print(numpy.dot(cat, table))  # 22
```
]

---

.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models
]

.right-column[
Main issues with this basic term-document matrix:

- Dimensions quickly become very large
- Contains lots of noise
]

---

.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models
##### LSA
]

.right-column[
### Latent Semantic Analysis

Deerwester et al., 1990:

- Singular Value Decomposition

$$X\_{M\times N} = U\_{M\times k} \Sigma\_{k\times k} V\_{k\times N}^T$$

- $U$ projects the original term vectors into a subspace of dimension $k=\min(M,N)$
- each row $t_i$ of $U$ corresponds to one term
- each column $d_j$ of $V^T$ corresponds to one document
- $\Sigma$ is diagonal =
singular values: we just keep the $k$ largest
]

---

.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models
##### LSA
]

.right-column[
### Latent Semantic Analysis

- New term vectors = $\Sigma^{(k)} t_i$
- Dimensions get combined into the subspace:
  - handles synonymy: (cat, feline) becomes (1.9 × cat + 0.2 × feline)
]

---

.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models
##### LSA
]

.right-column[
### Latent Semantic Analysis

- Deerwester et al., 1990
- Landauer, 1997: good results on the TOEFL synonym questions
- Turney, 2010: shows that dimensions encode lexical or topical meanings
]

---

.center[
## Embeddings
]

.center[
]

---

.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models
##### LSA
##### Random Indexing
]

.right-column[
### Random Indexing

- LSA issues:
  - SVD is costly
  - Need to retrain when adding documents!
- Sahlgren, 2006
  - Fast and online method
  - Obs: near-orthogonality of random vectors in high-dim space
  - Johnson-Lindenstrauss lemma: projection onto a random lower-dimensional subspace approx. preserves distances

$$X\_{M\times N} R\_{N\times k} = Y\_{M\times k}$$
]

---

.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models
##### LSA
##### Random Indexing
]

.right-column[
### Random Indexing

$$X\_{M\times N} R\_{N\times k} = Y\_{M\times k}$$

- Achlioptas (2001): $R\_{N\times k}$ i.i.d. with 0-mean and 1-var

$$r\_{i,j} = +1,0,-1 \sim Multinomial(\frac{\epsilon/2}k,\frac{k-\epsilon}k,\frac{\epsilon/2}k)$$

- Implementation:
  - Start from a random vector (a row of $R$) for every word
  - Sum the vectors of the words that co-occur
- Extensions: hashed-RI (2011) ...
]

---

.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models
##### LSA
##### Random Indexing
]

.right-column[
### Other Vector Space Models

- Other derivatives of LSA:
  - Hyperspace Analogue to Language (HAL): (Burgess, 1997)
  - BEAGLE (Jones, 2007)
]

---

.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models
##### LSA
##### Random Indexing
##### Word context
]

.right-column[
### Term-context (co-occurrence) matrix

- 2 words co-occur if they are in the same sentence / window
- Extensions: syntactic context...
- Point-wise mutual information (Church, 1989)
  - Do words x and y co-occur more than if they were independent?

$$PMI(x,y) = \log \frac{P(x,y)}{P(x)P(y)}$$

- Or TF-IDF (Sparck Jones 1972)
]

---

.center[
## Distributional semantics
]

.left-column[
#### Distributional hypothesis
#### Vector Space Models
##### LSA
##### Random Indexing
##### Word context
##### GloVe
]

.right-column[
### GloVe

- Pennington, 2014
- Trains a log-linear model on the word co-occurrence matrix $X$:

$$J=\sum\_{i,j} f(X\_{ij})( w\_i^Tw\_j + b\_i + b\_j - \log X\_{ij} )^2$$

- Intuition: the dot product between word vectors should approach $\log (X\_{ij})$
  - then, up to the bias terms, $(w\_i-w\_j)^Tw\_k \simeq \log \frac{X\_{ik}}{X\_{jk}}$
  - $\simeq$ do $i$ and $j$ share the same contexts $k$?
- as good as Word2Vec!
]

---

.center[
## Embeddings
]

.center[
]

Step 2: the deep learning explosion

---

.center[
## Bayesian perspective
]

.left-column[
#### LDA
]

.right-column[
### Latent Dirichlet Allocation

- Blei, Ng & Jordan, 2003
- $\varphi$: probability of a word given a topic
- $\theta$: probability of a topic given a document

.center[
]
]

---

.center[
## Bayesian perspective
]

.left-column[
#### LDA
]

.right-column[
### Latent Dirichlet Allocation

- Generative model, for each word of a document:
  - sample a topic $Z$ from the topic mixture $\theta$ of the document
  - sample a word $W$ from the word distribution $\varphi$ of topic $Z$
- Trained to maximize the probability of the observations
- Gives an interpretable "document embedding" $\theta$
- Basis of the **Topic Models** field
- 2016: LDA2Vec: merges LDA + W2V
  - https://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/#topic=38&lambda=1&term=
]

---

.center[
## Neural perspective
]

.left-column[
#### Collobert
]

.right-column[
### Word embeddings

- Bengio proposed the term "word embedding" in 2003, as a by-product of a neural language model
- But Collobert showed in "A unified architecture for natural language processing" (2008) that, when trained on a sufficiently large dataset, they carry semantic meaning and may be used in downstream tasks.
]

---

.center[
## Neural perspective
]

.left-column[
#### Collobert
]

.right-column[
.center[
### Collobert embeddings
]
]

---

.center[
## Neural perspective
]

.left-column[
#### Collobert
]

.right-column[
.center[
### Collobert embeddings
]
]

---

.center[
## Neural perspective
]

.left-column[
#### Collobert
#### Mikolov
]

.right-column[
### Word2Vec

- Mikolov, 2013
- Most famous word embeddings, because:
  - Released as very fast C code
  - New approximations to make it faster (negative sampling, hierarchical softmax...)
  - Training on large datasets becomes super-easy
- Big companies started to pretrain W2V on huge datasets and to distribute the vectors for transfer learning
]

---

.center[
## Neural perspective
]

.left-column[
#### Collobert
#### Mikolov
]

.right-column[
### Word2Vec

.center[
]
]

---

.center[
## Cosine distance
]

.left-column[
#### Cosine
]

.right-column[
### Cosine similarity

- How to measure the similarity between word vectors?
- Issue with dot-product: longer vectors -> larger values
- Most common measure: **cosine similarity** (cosine distance $= 1-\cos(a,b)$)
- can be computed efficiently with the dot product:

$$\cos(a,b)=\frac{a \cdot b}{||a|| \times ||b||}$$
]

---

.center[
## So far, so good?
]

- Word embeddings capture part of lexical semantics
- They are helpful in downstream tasks (**transfer learning**)

Examples:

- Predicting sentiments
- Compute POS tags, detect Named Entities
- Syntactic parsing
- Question-Answering, translation, summarization...
- ...

---

.center[
## So far, so good?
]

- Word embeddings capture part of lexical semantics
- They are helpful in downstream tasks (**transfer learning**)
- But...
  - What about out-of-vocabulary words?
  - What about polysemy?
  - What about context-dependent meaning?
  - What about multiple languages?
  - What about multi-word expressions?
  - What about sentence embeddings?

---

.center[
## Embeddings
]

.center[
]

---

.center[
## Contextual Embeddings
]

How to handle polysemy?

- With context-dependent word embeddings
- 2018: NLP's "ImageNet moment"
- Problem:
  - You have to distribute a complete **model**, which you have to run on your data and which returns a vector per word
  - constrains the programming language
  - much harder to "fine-tune"
- But still, this model has been trained on a huge dataset, and the returned embeddings encode all of this information

---

.center[
## The rise of Language Models
]

.left-column[
#### LM
#### n-gram
]

.right-column[
- A Language Model is the basis of all modern embeddings!
- Given past words, an LM predicts the next word:
- Basic LM: n-grams
]

.center[
]

---

.center[
## The rise of Language Models
]

.left-column[
#### LM
#### n-gram
#### NN-LM
]

.right-column[
- feed forward NN, recurrent NN, transformers...
]

.center[
]

---

.center[
## The rise of Language Models
]

.left-column[
#### LM
#### n-gram
#### NN-LM
]

.right-column[
- Feed-forward network:
  - looks at a fixed-size history window
  - number of parameters increases with the window size
  - not robust to insertions/deletions/...
- RNN:
  - looks at "all" history (in practice, about 100 words)
  - number of parameters is constant
  - robust to insertions/deletions/...
  - gives more importance to the most recent history
]

---

.center[
## The rise of Language Models
]

.left-column[
#### LM
#### n-gram
#### NN-LM
]

.right-column[
- LSTM/GRU: same as RNN, but...
  - looks at "all" history (in practice, about 1000 words)
  - thanks to gating, which acts like skip connections through time and reduces vanishing gradients
- Transformers:
  - looks at a fixed-size history window
  - number of parameters is constant
  - robust to insertions/deletions/...
  - gives the same importance to all history
]

---

.center[
## Contextual Embeddings: ELMo
]

.left-column[
#### ELMO
]

.right-column[
- From *AllenNLP* (2018): huge improvements
- Character-based
- Trained to predict the next word (Language Model)
- "Bi-directional" RNN, but both directions are trained separately
]

---

.center[
## Contextual Embeddings: ELMo
]

.left-column[
#### ELMO
]

.right-column[
- the word embedding is a weighted combination of hidden representations from every layer
]
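---

.center[
## Contextual Embeddings: ELMo
]

A minimal sketch of the weighted combination of layers described above, assuming softmax-normalized layer weights and a global scale `gamma` as in the ELMo paper. The name `scalar_mix`, the toy shapes and the random hidden states are hypothetical placeholders, not the AllenNLP API.

```python
import numpy as np

def scalar_mix(layers, weights, gamma=1.0):
    """Combine per-layer hidden states into one embedding per token.

    layers:  array of shape (n_layers, n_tokens, dim)
    weights: unnormalized scalar weight per layer, shape (n_layers,)
    """
    s = np.exp(weights - np.max(weights))
    s = s / s.sum()                      # softmax over the layers
    # weighted sum over the layer axis -> (n_tokens, dim)
    return gamma * np.tensordot(s, layers, axes=1)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(3, 5, 8))      # toy: 3 layers, 5 tokens, dim 8
mixed = scalar_mix(hidden, weights=np.zeros(3))
print(mixed.shape)
```

With all weights equal, the mix reduces to a plain average of the layers; in a downstream task the weights would be learned together with the task model.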
---

.center[
## Attention is all you need
]

.left-column[
#### ELMO
#### BERT
]

.right-column[
- attention: originally on top of RNN
- transformers: no more recurrence
]

---

.center[
## Attention is all you need
]

.left-column[
#### ELMO
#### BERT
]

.right-column[
]
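---

.center[
## Attention is all you need
]

A minimal sketch of the operation that transformers use in place of recurrence: single-head scaled dot-product self-attention, as defined in "Attention is all you need". The name `self_attention`, the toy sizes and the random projection matrices are illustrative assumptions; real models add multiple heads, masking and learned parameters.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention.

    x: (n_tokens, d_model); wq, wk, wv: (d_model, d_k) projections.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])        # (n_tokens, n_tokens)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax over the keys
    return attn @ v, attn

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))                       # toy: 4 tokens, d_model=16
wq, wk, wv = (rng.normal(size=(16, 8)) for _ in range(3))
out, attn = self_attention(x, wq, wk, wv)
print(out.shape, attn.sum(axis=-1))
```

Every token attends to all positions through one matrix product, so nothing has to be computed sequentially as in an RNN.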
---

.center[
## Attention is all you need
]

.left-column[
#### ELMO
#### BERT
]

.right-column[
]

---

.center[
## Contextual Embeddings: BERT
]

.left-column[
#### ELMO
#### BERT
]

.right-column[
- Exploits *Transformers*
- Replaces the LM objective with "fill in the masked words"
- Trains both directions simultaneously
- Represents the input as **subwords**
]
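---

.center[
## Contextual Embeddings: BERT
]

A minimal sketch of how training examples for the "fill in the masked words" objective can be built, assuming the 15% masking rate reported in the BERT paper and ignoring its additional random-replacement rule. The name `mask_tokens`, the toy sentence and the whitespace tokenization are hypothetical simplifications (BERT itself works on subwords).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rate=0.15, seed=0):
    """Hide a random subset of tokens; return model inputs and targets.

    targets[i] is the original token at masked positions and None elsewhere,
    so the loss would only be computed on the masked words.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_masked))
    inputs = [MASK if i in positions else t for i, t in enumerate(tokens)]
    targets = [t if i in positions else None for i, t in enumerate(tokens)]
    return inputs, targets

tokens = "the cat sat on the mat because it was tired".split()
inputs, targets = mask_tokens(tokens)
print(inputs)
print(targets)
```

Because the model sees the words on both sides of each mask, it has to use left and right context at the same time.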
---

.center[
## Contextual Embeddings: GPT
]

.left-column[
#### ELMO
#### BERT
#### GPT
]

.right-column[
- GPT is a classical Language Model (left-to-right), but based on transformers.
- subword units: **Byte-Pair Encoding**
- Fine-tune the base model on the target task for transfer learning

OpenAI: "GPT2: the AI that's too dangerous to release"

- GPT-1 = ULMFit + Transformer
- GPT-2 = GPT-1 + reddit + gpus
]

---

.center[
## Contextual Embeddings: XLNet
]

.left-column[
#### ELMO
#### BERT
#### GPT
#### XLNet
]

.right-column[
- XLNet = Google/CMU
- Based on BERT: "improves upon BERT on 20 tasks"
- Gets rid of the artificial MASK token
- Uses the *Transformer-XL* = Transformer with recurrence (passes hidden states between sequences)
]

---

.center[
## Contextual Embeddings: XLNet
]

.left-column[
#### ELMO
#### BERT
#### GPT
#### XLNet
]

.right-column[
- Killer idea: Permutation Language Model
  - Predict tokens in random order, accumulate them to build the context
  - Forces the model to use both directions simultaneously
]

---

.center[
## Contextual Embeddings
]

---

.center[
## Contextual Embeddings
]

---

.center[
## Foundational models
]

---

.center[
## Embeddings
]

.center[
]

Step 3: word, contextual word, what about full text?

---

.center[
## Sentence Embeddings: Averaging
]

.left-column[
#### Averaging
]

.right-column[
Just average the word embeddings in the sentence!

- Old baseline, but still [hard-to-beat](https://openreview.net/forum?id=SyK00v5xx)
]

---

.center[
## Sentence Embeddings: Language Model
]

.left-column[
#### Averaging
#### NN-LM
]

.right-column[
- Bengio (2003): an NN-LM simultaneously learns
  - word representations
  - word sequence probabilities
- Google has released [nnlm-en-dim128](https://tfhub.dev/google/nnlm-en-dim128/1)
  - trained on Google News 200B
  - maps any sentence into a 128-dimensional embedding
]

---

.center[
## Sentence Embeddings: Doc2Vec
]

.left-column[
#### Averaging
#### NN-LM
#### Doc2Vec
]

.right-column[
- Proposed by Le and Mikolov: [Distributed Representations of Sentences and Documents](https://arxiv.org/abs/1405.4053)

.center[
]
]

---

.center[
## Sentence Embeddings: Skip-thought
]

.left-column[
#### Averaging
#### Doc2Vec
#### NN-LM
#### Skip-Thought
]

.right-column[
- Encoder-Decoder that re-generates the surrounding sentences

.center[
]
]

---

.center[
## Sentence Embeddings: Quick-thought
]

.left-column[
#### Averaging
#### Doc2Vec
#### NN-LM
#### Skip-Thought
#### Quick-Thought
]

.right-column[
- Replace decoder by a classifier

.center[
]
]

---

.center[
## Sentence Embeddings: InferSent
]

.left-column[
#### Averaging
#### Doc2Vec
#### NN-LM
#### Skip-Thought
#### Quick-Thought
#### InferSent
]

.right-column[
- Supervised encoder, trained on the Stanford Natural Language Inference datasets

.center[
]
]

---

.center[
## Sentence Embeddings: Universal Sentence Encoder
]

.left-column[
#### Averaging
#### Doc2Vec
#### NN-LM
#### Skip-Thought
#### Quick-Thought
#### InferSent
#### Universal
]

.right-column[
- 2018: [Universal Sentence Encoder](https://arxiv.org/abs/1803.11175)
- 2 fast models trained on many tasks:
  - Transformer
  - Deep Averaging Network
- Produce 512-dimensional embeddings for any text
]

---

.center[
## BERT Sentence Embeddings
]

.left-column[
#### Averaging
#### Doc2Vec
#### NN-LM
#### Skip-Thought
#### Quick-Thought
#### InferSent
#### Universal
#### BERT
]

.right-column[
- Adds a special token CLS
  - its embedding does not depend on a single word, but on all words of the sentence
- Good for classification
- But bad for comparing/interpreting embeddings through distances!
]

---

.center[
## Sentence-BERT (S-BERT)
]

- [EMNLP, 2019]
- Training 1: classification

.center[
]

---

.center[
## Sentence-BERT (S-BERT)
]

- Trained on NLI task
- Training 2: triplet loss

$$\max(||s_a-s_p|| - ||s_a-s_n|| + \epsilon,0)$$

- makes related pairs $(s_a,s_p)$ closer than unrelated pairs $(s_a,s_n)$

---

.center[
## Note on contrastive training
]

- Standard training:
  - embeddings -> layer -> scores per class -> softmax -> probability
  - optimizes classification accuracy
  - so the embedding space can be anything

---

.center[
## Note on contrastive training
]

- Contrastive training:
  - aka ranking loss, metric learning
  - learns the *distance* between embeddings
  - so the embedding space can be interpreted

---

.center[
## Note on contrastive training
]

- Pair-wise contrastive training:
  - when 2 samples belong to the same class, their embeddings should be close
  - when 2 samples belong to different classes, their embeddings should be far apart
  - embeddings -> distance -> loss

---

.center[
## Note on contrastive training
]

- Triplet loss:
  - anchor $a$
  - positive sample $p$
  - negative sample $n$
  - $d(a,p)$ should be smaller than $d(a,n)$

$$\max(||s_a-s_p|| - ||s_a-s_n|| + \epsilon,0)$$

---

.center[
## Note on contrastive training
]

- Can be used for *instance-based classification*:
  - manually annotate 1 example per class
  - unknown sample: compare the distance with every class prototype
  - pick the class of the closest one

---

.center[
## Tools
]

- Gensim
- SpaCy
- FastText
- SentEval: transfer learning tasks to evaluate embeddings
- HuggingFace: transformers

---

.center[
## Gensim
]

.left-column[
#### Gensim
]

.right-column[
- Oldest Python lib for embeddings (started in 2008), from Radim Řehůřek (CZ)
- Designed for semantic/topic modelling
- Includes models: LDA, LSI, TFIDF, W2V, DOC2VEC, FastText...
- Includes corpora: text8...

Get all the available datasets and models:

```
import gensim.downloader as api

# Returns a dict describing the available corpora and models
print(api.info().keys())
```

- See https://radimrehurek.com/gensim
]

---

.center[
## SpaCy
]

.left-column[
#### Gensim
#### SpaCy
]

.right-column[
- from 2015
- focused on modern NLP, including deep learning models (works with TensorFlow, PyTorch...)
- Includes recent pretrained models: BERT, ULMFiT, XLNet...
- Very active since 2018
- See https://spacy.io/
]

---

.center[
## FastText (Facebook)
]

.left-column[
#### Gensim
#### SpaCy
#### FastText
]

.right-column[
- Fast training of Skipgram / CBOW word embeddings
- Written in C++ but can be integrated into Python
- [Combines several tricks](https://arxiv.org/pdf/1712.09405.pdf) to improve embeddings
  - subsample frequent words: $p_{discard} = 1-\sqrt{\alpha/f_w}$
  - position-dependent features (CBOW): train a weight per position in the context window, then compute a weighted average of the word vectors in the context
  - phrase representations: merge ngrams with high mutual information into a single token
  - add subword information:
      - decompose words into char-ngrams
      - one embedding per char-ngram
      - final word vector = $w + \frac 1 N \sum_{n=1}^N c_n$
]

---

.center[
## FastText
]

.left-column[
#### Gensim
#### SpaCy
#### FastText
]

.right-column[
- Includes a text classifier:
  - Embeddings + linear model + softmax
  - Simple, but competitive with state-of-the-art
  - Extremely fast

.center[
]

- Has a Python implementation, but it's not officially supported
- Distributes word vectors for 157 languages, and multi-lingual word vectors in 44 languages (see also XLM-R)
]

---

.center[
## SentEval
]

.left-column[
#### Gensim
#### SpaCy
#### FastText
#### SentEval
]

.right-column[
- From Facebook (Alexis Conneau): a toolkit to evaluate the quality of sentence embeddings
- see https://github.com/facebookresearch/SentEval
- Includes skipthought, Google-USE and their own InferSent encoders
- Makes it "easy" to evaluate transfer learning with embeddings on more than 20 tasks: MR, TREC, SST...
]

---

.center[
## HuggingFace
]

.left-column[
#### Gensim
#### SpaCy
#### FastText
#### SentEval
#### HuggingFace
]

.right-column[
- HuggingFace started as a company making chatbots
- Released the **transformers** library
- https://github.com/huggingface/pytorch-transformers
- Includes the most recent contextual word embeddings:
  - BERT (from Google), RoBERTa, DeBERTa...
  - GPT (from OpenAI), GPT2, GPT-J, GPT-NeoX, Bloom
  - Transformer-XL (from Google/CMU)
  - XLNet (from Google/CMU)
  - XLM (from Facebook), XLM-R, MBERT...
  - T5, BART, T0pp...
  - ...
]

---

.center[
## Conclusions about the tools
]

- Research on embeddings since 2018 is **extremely** active
- New models appear every few months (*The ImageNet effect*)
- Open-source implementations are released nearly immediately
- So the software landscape for embeddings will keep evolving!
- Don't become an "expert" with one tool, or you'll get stuck
- Better to look for the most appropriate tool at the moment for your task

---

.center[
## Limitations
]

- Largest models are the best
  - but they're owned by private companies
  - Some organizations release open-source models (Meta, EleutherAI, HuggingFace...)
- We do not know how to reduce bias, control hallucinations, guarantee correctness
- Many styles are not well understood: poetry...
- Costs!! (inference, training, updating...)
---

.center[
## Conclusions about the tools
]

- Recommendations:
  - **Do** use the best embeddings for your research tasks in NLP
  - Don't train embeddings! You can't...
  - trend: Python; TensorFlow or PyTorch for maximum flexibility

.center[
]

---

.center[
## Conclusions about the tools
]

.center[
]

---

.center[
## Conclusions about the tools
]

.center[
]

---

name: last-page
class: middle, center, inverse

"Gemini has been released by Google!"

## That's all folks (for now)!

Slideshow created using [remark](http://github.com/gnab/remark).