Exercise: Word Embeddings with FastText

Requirements for this session

This exercise does not require any Python programming: the goal is rather to make you use ready-to-use NLP software to manipulate word embeddings. We will use the FastText software. Note that everything done here could also have been done with other software, such as SpaCy, NLTK, gensim…

You won’t have to program anything in this exercise, but you first need to install FastText on your computer: please read the documentation about how to install FastText here. There is a special page for Windows users here.

Embeddings with FastText

We will use the fastText software, which is written in C++, although a Python implementation is also officially supported if you really want it. We will use the C++ command-line version here, because it is blazingly fast and requires far fewer resources (memory, CPU) than the Python version. You will have to run the commands in a terminal, and possibly write your scripts in a text editor.

This tutorial is based on this FastText tutorial.

Check-up

You should have a working fasttext binary in the current directory: check by executing:

./fasttext

If for some reason it does not work, you can download and install it by following the instructions at https://fasttext.cc/.

Training word embeddings

We’re going to see how to train word embeddings with fasttext. It is actually pretty simple: all that is required is a file containing tokenized text, with one document per line. We already have such a corpus, extracted from the Movie Review corpus, in MR500train; if you haven’t already, download and uncompress the corpus from here.

In our file, each line indeed contains one document, preceded by its sentiment label: 0 for a negative sentiment and 1 for a positive one. So we just need to remove this sentiment label to get a file from which fasttext can train word embeddings. Removing this label can be done in Python, but here is a simpler way to do it using the Linux command cut:

cut -c2- MR500train > ft.train

Just execute the command above in the terminal, and you’ll get a file called ft.train without the labels. For information, the option “-c2-” tells cut to print, for every line, only the characters from position 2 onward. If you are not on Linux, you may write a Python script that does the same thing, as sketched below.
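A minimal sketch of such a script, assuming (as the cut command above does) that the sentiment label is the single first character of each line and that the file names are the same as above:

# remove_labels.py: equivalent of "cut -c2- MR500train > ft.train"
# (drops the first character of every line, i.e. the sentiment label)
with open("MR500train", encoding="utf-8") as fin, \
     open("ft.train", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(line[1:])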

Note for Windows OS: because of different default character encodings, I recommend that you first open and re-save the data files with a native application, such as WordPad. Otherwise, you may end up with character-level embeddings instead of word-level embeddings, because Windows may not correctly recognize the word separators.

Now, we can train FastText skipgram embeddings with the command:

./fasttext skipgram -input ft.train -output ft.emb

This results in two files: ft.emb.bin, which stores the whole fastText model and can be subsequently loaded, and ft.emb.vec, which contains the word vectors, one line per word in the vocabulary.
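ft.emb.vec is a plain-text file, so you can have a quick look at it, for instance with:

head -n 3 ft.emb.vec

Its first line gives the vocabulary size and the embedding dimension; each following line contains a word followed by the values of its vector.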

Using this model file, we can print word vectors as follows:

echo "table cat dog" | ./fasttext print-word-vectors ft.emb.bin

Question 1

(please write your answers in a text file)

Question 2

You may have noticed that a loss value is printed after training: it measures the training error of the model, so it is better to decrease this loss (without exaggerating, so as to limit overfitting). The easiest way to improve the loss is to play with the training command-line options, typically the number of epochs, the learning rate and the embedding dimension.

You can see all available options with:

./fasttext skipgram

Try and test manually a few different values for these options and report the resulting losses.
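For instance, one possible configuration uses the -epoch, -lr and -dim options (the values below are only illustrative starting points, not recommended settings):

./fasttext skipgram -input ft.train -output ft.emb -epoch 25 -lr 0.05 -dim 100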

Closest semantic words

We now want to find the words that are semantically closest to a target word. This can be achieved with the nn (nearest neighbors) command of fasttext, as follows:

./fasttext nn ft.emb.bin

This command is interactive and it will ask you for a target word.
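You can also pass the number of neighbors to return as an optional second argument, and pipe the query word in instead of typing it interactively, for example:

echo "movie" | ./fasttext nn ft.emb.bin 10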

Question 3

Give the 15 closest words to the target word: actor

Analogies

W2V (word2vec) embeddings are known to be able to deduce that King - Man + Woman ≈ Queen.

In the terminology of fasttext, this is called computing analogies, because this can be interpreted as:

A king is to a man what a queen is to a woman

You can compute such analogies as follows:

./fasttext analogies ft.emb.bin
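Like nn, this command is interactive: it asks for a triplet of words A B C and returns the nearest neighbors of A - B + C. You can also pipe the triplet in, for example:

echo "king man woman" | ./fasttext analogies ft.emb.bin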

Question 4

Because we have trained in the movie review domain, try to find the analogy for the three words (find a meaningful order): talent actor actress

How can you interpret the result?

Text classification

The most useful application of word embeddings is text classification, because they represent words in a vector form that can easily be manipulated by neural networks, and because they enable transfer learning through pretrained word vectors, i.e., transferring lexical semantic information captured from a large raw text corpus into our specific classification model, trained on our small corpus for a given task.

So fasttext naturally provides options to train a simple yet powerful linear classifier on top of word embeddings. With the supervised command, fasttext trains both the word embeddings and the linear classifier on top of them to perform a given task.

Let’s consider the task represented by our Movie Review corpus: sentiment analysis, which consists in predicting whether a film review is positive or negative.

The file input format required by fasttext to train a classifier is nearly the same as the format of our MR500train file. The only difference is that, in fasttext, the label of the text (0 or 1) must be prefixed with the string “__label__”. We can transform our corpus easily with the awk Linux command:

awk '{print "__label__"$0}' MR500train > cl.train

Check that the output file cl.train has the correct format.
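For instance, each line should now start with __label__0 or __label__1, followed by the review text; you can quickly check the first lines with:

head -n 2 cl.train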

Now we can train a fasttext model with:

./fasttext supervised -input cl.train -output mod.tr -label __label__ -epoch 100
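For information, the -label option tells fasttext which prefix identifies the labels in the training file, and -epoch 100 makes it perform 100 passes over this small training set.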

We can similarly test our model (here, on the training set itself) with:

./fasttext test mod.tr.bin cl.train
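The test command reports the number of evaluated examples (N), the precision at one (P@1) and the recall at one (R@1); since each document has exactly one label here, P@1 can be read as the accuracy.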

Question 5

Evaluate the model on the test set instead of the train set.

Using pretrained models

Fasttext provides pretrained models that have been trained on given supervised datasets. You can find them here. For instance, for sentiment analysis, they provide models trained on the YELP dataset, composed of reviews of restaurants, bars… Note that these pretrained models are very small, because fasttext uses a very efficient compression method for them.

Before applying these models to our dataset, we first need to modify it a bit. Indeed, the YELP model outputs the labels 1 and 2 instead of 0 and 1, so we need to modify our test set accordingly:

awk '{print "__label__"($1+1)" "$0}' MR500test | cut -d' ' -f1,3- > cl.test2
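For information, the awk part prepends __label__ followed by the original label shifted by one (0 becomes 1, 1 becomes 2), and the cut part then removes the original label field.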

Now we can test this pretrained model on our corpus, without any retraining:

./fasttext test yelp_review_polarity.ftz cl.test2

Question 6

What is the accuracy of these pretrained models for our task?