Embeddings hands-on

Objectives:

  • manipulate embeddings
  • understand what is inside embeddings

4 exercises next:

  • Python-easy
    • proximity in the embedding space (transformers)
    • FastText
  • Python-hard
    • probing embeddings
    • byte-pair encoding

Embeddings

  • install transformers library + pytorch with conda or pip:
pip install transformers[torch]
from transformers import AutoTokenizer, AutoModel, pipeline
model = AutoModel.from_pretrained('distilbert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
# feature-extraction returns one contextual vector per (sub)token
nlp = pipeline('feature-extraction', model=model, tokenizer=tokenizer)
s = 'Do you like cakes ?'
features = nlp(s)
# print the first 2 dimensions of each token vector
print([features[0][i][:2] for i in range(len(features[0]))])

# inspect how the sentence is split into wordpiece tokens
inputs = tokenizer.encode_plus(s, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]
text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
print(text_tokens)

Cosine distance

pip install scikit-learn
from sklearn.metrics.pairwise import cosine_similarity
# v1 and v2 must be 2D arrays of shape (1, dim)
print(cosine_similarity(v1,v2))
  • compare the cosine-distance between (table, chair) and (table, array) in the following sentences (a possible helper is sketched after this list):
    • "the excel table is too big"
    • "the chair is solid"
    • "the wood table is too big"
    • "the array is filled with numbers"
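
One possible helper to extract the contextual vector of a word from the feature-extraction pipeline above (a sketch: word_vector is not part of the exercise and assumes each target word stays a single wordpiece):
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def word_vector(sentence, word):
    # contextual vector of `word` in `sentence` (the token list includes [CLS]/[SEP])
    feats = np.array(nlp(sentence)[0])
    toks = tokenizer.convert_ids_to_tokens(tokenizer.encode(sentence))
    return feats[toks.index(word)].reshape(1, -1)

v_table = word_vector('the excel table is too big', 'table')
v_chair = word_vector('the chair is solid', 'chair')
v_array = word_vector('the array is filled with numbers', 'array')
print(cosine_similarity(v_table, v_chair), cosine_similarity(v_table, v_array))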

FastText (easy)

  • Easy, self-explanatory hands-on exercise on fastText, with very low programming requirements: Exercise: fastText

Byte-pair encodings

  • We want to decompose words into frequent subword sequences
  • Byte-Pair Encoding is a method used in many deep learning models (a sketch follows this list):
    • Build unigram counts: “low”: 5, “lowest”: 2…
    • Decompose into characters: “l o w @”: 5, “l o w e s t @”: 2…
    • Find the most frequent pair of units: “ow”: 7
    • Merge it into a new unit: “l ow @”: 5, “l ow e s t @”: 2…
    • Iterate until a target number of units is reached
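
A minimal sketch of the merge loop described above (toy counts from the example; ties between equally frequent pairs are broken arbitrarily here):
from collections import Counter

vocab = {"l o w @": 5, "l o w e s t @": 2}   # "@" marks the end of a word

def most_frequent_pair(vocab):
    # count adjacent pairs of units, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        units = word.split()
        for a, b in zip(units, units[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, vocab):
    # replace every occurrence of the pair by a single merged unit
    new_vocab = {}
    for word, freq in vocab.items():
        units, merged, i = word.split(), [], 0
        while i < len(units):
            if i + 1 < len(units) and (units[i], units[i + 1]) == pair:
                merged.append(units[i] + units[i + 1])
                i += 2
            else:
                merged.append(units[i])
                i += 1
        new_vocab[" ".join(merged)] = freq
    return new_vocab

for _ in range(5):   # iterate until a target number of units is reached
    best = most_frequent_pair(vocab)
    vocab = merge_pair(best, vocab)
    print(best, vocab)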

Exercise: BPE

Probing embeddings

Is there linguistic information in the embedding?

  • TOEFL synonymy test
    • LSA performs as well as English learners
  • Analogies
    • W2V: “king - man + woman = queen” (see the sketch below)
    • this no longer holds with contextual embeddings such as BERT
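
One possible way to try the analogy test with pretrained static vectors from gensim (assumes gensim is installed and the vectors can be downloaded; the choice of glove-wiki-gigaword-50 is arbitrary):
import gensim.downloader as api

wv = api.load('glove-wiki-gigaword-50')   # small pretrained static word vectors
# "king - man + woman" should land close to "queen"
print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))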

Probing

  • If there is linguistic information encoded in an embedding, it is not obvious how to see it
  • But a small model should be able to exploit this information to “tag” sentences with this linguistic property

https://nlp.stanford.edu/~johnhew/interpreting-probes.html

  • the probing model must be too small to extract the linguistic property from the raw sentence by itself
  • it must not be able to do complex processing of the vector
  • it should only relate embeddings to linguistic tags with a direct, simple function

Probing POS

Answer the question: do BERT embeddings embed POS information?

Methodology:

  • Find a corpus annotated with POS
  • Compute word embeddings on this corpus (a possible sketch follows the tagset below)
  • Train a logistic regression to map embeddings to POS tags
  • Compare the accuracy of the LR classifier obtained with target vs. random embeddings (see the sketch after the logistic regression snippet)
import nltk
nltk.download('brown')
nltk.download('universal_tagset')
nltk.corpus.brown.sents()                            # raw sentences
nltk.corpus.brown.tagged_words(tagset='universal')   # (word, POS) pairs with the universal tagset

Universal tagset:

VERB - verbs (all tenses and modes)
NOUN - nouns (common and proper)
PRON - pronouns
ADJ - adjectives
ADV - adverbs
ADP - adpositions (prepositions and postpositions)
CONJ - conjunctions
DET - determiners
NUM - cardinal numbers
PRT - particles or other function words
X - other: foreign words, typos, abbreviations
. - punctuation
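
A possible sketch for building the (X, y) pairs, assuming the DistilBERT model and tokenizer loaded earlier (the fast tokenizer's word_ids() maps wordpieces back to Brown words; the subset size is arbitrary and kept small because this is slow on CPU):
import torch
import numpy as np

X, y = [], []
for sent in nltk.corpus.brown.tagged_sents(tagset='universal')[:200]:
    words = [w for w, t in sent]
    tags = [t for w, t in sent]
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt", truncation=True)
    with torch.no_grad():
        vecs = model(**enc).last_hidden_state[0]   # one vector per wordpiece
    seen = set()
    for i, wid in enumerate(enc.word_ids()):
        if wid is not None and wid not in seen:    # keep the first wordpiece of each word
            seen.add(wid)
            X.append(vecs[i].numpy())
            y.append(tags[wid])
X, y = np.stack(X), np.array(y)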

Logistic regression

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0).fit(X, y)   # X: word embeddings, y: POS tags
clf.predict(X[:2, :])         # predicted tags for the first 2 examples
clf.predict_proba(X[:2, :])   # class probabilities
clf.score(X, y)               # accuracy
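
To answer the probing question, one can compare this accuracy with the one obtained from random vectors of the same shape (a sketch assuming X and y were built as above, with a simple held-out split):
import numpy as np
from sklearn.model_selection import train_test_split

X_rand = np.random.randn(*X.shape)   # random "embeddings" with the same shape as X
for name, feats in [("BERT", X), ("random", X_rand)]:
    Xtr, Xte, ytr, yte = train_test_split(feats, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    print(name, "embeddings:", clf.score(Xte, yte))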

Another reference on this subject: https://pageperso.lis-lab.fr/benoit.favre/pstaln/09_embedding_evaluation.html

Triplet loss

(see github blog)

  • Train ConvNet on MNIST with 10-class cross-entropy loss
  • Extract 2-dim embeddings from penultimate layer:

  • Distances between classes are not good
  • Train a siamese net

  • Distances between classes are good
  • Train a triplet net

In pytorch

  • CosineEmbeddingLoss = pairwise loss with cosine distance
  • MarginRankingLoss = pairwise loss with euclidean distance
  • TripletMarginLoss = triplet loss with euclidean distance (usage example below)
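
A minimal usage example of the triplet loss (shapes and margin are arbitrary assumptions):
import torch

anchor = torch.randn(8, 5)     # batch of 8 anchor embeddings of dimension 5
positive = torch.randn(8, 5)   # same class as the anchors
negative = torch.randn(8, 5)   # different class
loss_fn = torch.nn.TripletMarginLoss(margin=1.0, p=2)   # euclidean distance
print(loss_fn(anchor, positive, negative))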

Triplet loss: exercises

  • Goal: train a linear embedding space with triplet loss
  • Synthetic data:
    • scalar input, 2 classes \(c\in \{0,1\}\)
    • \(x|c \sim N(\mu_c,\sigma_c=0.1)\)
  • Embedding dim = 5
  • lightning = pytorch library that
    • automates cpu/gpu runtime
    • simplifies training loop
    • generates tensorboard logs
  • pytorch lightning in practice:
    • replace and extend nn.Module:
import torch
import pytorch_lightning as pl

class Mod(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.W = torch.nn.Linear(1,5)

    def configure_optimizers(self):
        opt = torch.optim.AdamW(self.parameters(), lr = 1e-3)
        return opt

    def training_step(self, batch, batch_idx):
        anc, pos, neg = batch
        ea = self.W(anc)
        ep = self.W(pos)
        en = self.W(neg)
        loss = torch.nn.functional.triplet_margin_loss(ea, ep, en)
        self.log("train_loss", loss, on_step=False, on_epoch=True)
        return loss
  • you need a dataset that generates anchors/pos/neg:
class TripDS(torch.utils.data.Dataset):
    def __init__(self):
        super().__init__()

    def __len__(self):
        return 1000

    def __getitem__(self,i):
        if i%2==0:
            # even index: sample the anchor and positive from the first class (mean -0.5), the negative from the second
            xa = torch.randn(1)/10.-0.5
            xp = torch.randn(1)/10.-0.5
            xn = torch.randn(1)/10.+0.5
            return xa,xp,xn
        else:
            # odd index: sample the anchor and positive from the second class (mean +0.5), the negative from the first
            xa = torch.randn(1)/10.+0.5
            xp = torch.randn(1)/10.+0.5
            xn = torch.randn(1)/10.-0.5
            return xa,xp,xn
  • Train:
traindata = TripDS()
trainloader = torch.utils.data.DataLoader(traindata, batch_size=1, shuffle=False)
mod = Mod()
logger = pl.loggers.TensorBoardLogger(save_dir="logs/", flush_secs=1)
trainer = pl.Trainer(limit_train_batches=1.0, max_epochs=1000, log_every_n_steps=1,logger=logger)
trainer.fit(model=mod, train_dataloaders=trainloader)
tensorboard --logdir=logs/
  • TODO:
    • Adapt this code to train a linear embedding that takes as input the one-hot encoding of digits 0 to 9 and outputs a 2D embedding. Then train this embedding with a triplet loss in order to shape the embedding space so that the digits appear ordered in it. Plot the embedding space with matplotlib.