Huggingface

C. Cerisara

14-10-2023

Overfitting

In this lab, you’re going to fine-tune a pretrained language model and study overfitting, its impact and how to prevent it.

Download the model

You’re going to use a “small” version of the pretrained GPT2 language model, which has only 82 million parameters and should fit on your laptop. Here is the code to download it:

from transformers import GPT2Tokenizer, GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained('distilgpt2')
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')

Now inspect what the model is composed of with the following code, and try to match what you see with the diagram in the course:

for name, param in model.named_parameters():
    print(name,param.size())

Predict with GPT2

First, you’re going to check that the model works. It’s a basic language model, so you may consider it as a black box: you input the beginning of a sentence, and it outputs the most likely following word. You can then concatenate this word at the end of your input, and iterate this process to progressively generate a complete paragraph. This iterative process is already implemented in a method of the “transformers” library called “generate()”.

Hence, look at the most likely continuation of the sentence “In winter, the weather is getting …” according to GPT2, with the following code:

s = tokenizer.encode('In winter, the weather is getting',return_tensors='pt')
y = model.generate(s)
sy = tokenizer.decode(y[0])
print(sy)

What is this code doing?
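
(For the curious: with its default settings, generate() performs greedy decoding, which is roughly equivalent to the manual loop sketched below; the real method of course supports many more options, such as beam search and sampling.)

import torch
# sketch of greedy decoding: repeatedly score the next token
# and append the most likely one to the input
ids = tokenizer.encode('In winter, the weather is getting', return_tensors='pt')
with torch.no_grad():
    for _ in range(10):                                # generate 10 additional tokens
        logits = model(ids).logits                     # (1, sequence length, vocabulary size)
        next_id = logits[0,-1].argmax()                # most likely token after the last position
        ids = torch.cat([ids, next_id.view(1,1)], dim=1)
print(tokenizer.decode(ids[0]))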

The “generate()” method is useful because it iteratively generates a complete paragraph. But here, we rather want the score of a single token/word that immediately follows the input. (In the following, I will say word instead of token, because for what we’re going to do, they’re just the same.) You can do that by calling model() instead of model.generate():

outputs = model(s)
print(type(outputs))
print(outputs.logits.size())

When called this way, the model returns an object “outputs” that contains a lot of information. For instance, “outputs.logits” contains an array with the score of every possible following word, at every position of the input. You can see that this array has 3 dimensions, of size 1, 7 and 50257 respectively: the batch (a single input sentence), the 7 input tokens, and the 50257 tokens of the GPT2 vocabulary.

You can thus see the score given to a particular continuation token after the complete input sentence:

print(outputs.logits[0,6,22312])
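
As an aside, you can also look up which token gets the highest score overall; this small sketch (not needed for the rest of the lab) takes the argmax over the vocabulary at the last input position and decodes it:

best = outputs.logits[0,-1].argmax().item()   # index of the highest-scoring vocabulary token
print(best, tokenizer.decode([best]))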

Fine! But it’s not very easy to interpret when it’s not in plain English… So let’s find out what the token for the word “warmer” is:

ss = tokenizer.encode('warmer',return_tensors='pt')
labels = ss[0][0].view(1)    # keep the first sub-token of "warmer" as the target label
print("token for warmer:",labels)

You can see that “warmer” corresponds to token 5767, and so you can get its score:

print(outputs.logits[0,6,5767])

This way, you can compare the scores of various words: try to compare the scores for “warmer” and “colder”: is “colder” really much more likely than “warmer”?

Spoiler: the absolute difference in logits should be around 4.3.
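
If you need a starting point, here is a minimal sketch that mirrors the method used above for “warmer” (the token id for “colder” is looked up the same way):

cc = tokenizer.encode('colder',return_tensors='pt')
colder = cc[0][0].item()                           # first sub-token of "colder"
print("warmer:", outputs.logits[0,6,5767].item())
print("colder:", outputs.logits[0,6,colder].item())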

Finetune GPT2

Next, you’re going to fine-tune the model to a surreal world where the winter is warm and the summer is cold. To do that, you’re going to give the model the ground-truth continuation “warmer” and fine-tune it so that it outputs “In winter, the weather is getting warmer”.

Here is the code to finetune the model:

import torch
lasty = outputs.logits[0,-1].view(1,-1)    # scores of the word that follows the full input
lossfct = torch.nn.CrossEntropyLoss()
loss = lossfct(lasty,labels)               # cross-entropy against the target token "warmer"
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss.backward()                            # backpropagate the gradients
optimizer.step()                           # apply one AdamW update to all parameters
optimizer.zero_grad()

Let’s look at what this code is doing: it takes the logits of the last input position (the scores of the word that follows the complete sentence), computes the cross-entropy loss between these scores and the ground-truth token “warmer”, backpropagates the gradients, and applies a single AdamW update to all the model’s parameters.

You can now check that the model’s predictions are not the same as before:

y = model.generate(s)
sy = tokenizer.decode(y[0])
print(sy)
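
You can also re-compute the logits and check how the score of “warmer” has moved after this single update (a small sketch, reusing the input and token id from above):

outputs = model(s)                                    # re-run the model on the same input
print("warmer:", outputs.logits[0,6,5767].item())     # this score should have increased

If you repeat the optimization step several times on this single example, the model will quickly learn to always predict “warmer” after this prompt: this is the kind of overfitting this lab studies.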

Prompting

As explained in the course, exploiting a pretrained model to solve NLP tasks in a Zero-Shot way is only possible with large enough models. Next, you will use the smallest bloom model, bloom-560m.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
# bloom is a decoder-only (causal) language model, so it must be loaded with AutoModelForCausalLM;
# bfloat16 and low_cpu_mem_usage reduce the memory footprint
model = AutoModelForCausalLM.from_pretrained('bigscience/bloom-560m', torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
s = "input text"
inputs = tokenizer.encode(s, return_tensors="pt")
outputs = model.generate(inputs)
answer = tokenizer.decode(outputs[0])
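
Two practical remarks (suggestions, not requirements): generate() only produces a handful of tokens by default, so you may want to pass max_new_tokens to control the length of the answer; and since bloom is a causal model, the returned sequence starts with your prompt, which you can strip off before decoding:

outputs = model.generate(inputs, max_new_tokens=10)       # limit the number of generated tokens
answer = tokenizer.decode(outputs[0][inputs.shape[1]:])   # keep only the tokens generated after the prompt
print(answer)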

Natural Language Inference

Given 2 sentences respectively called the premise and the hypothesis, the NLI task aims at detecting whether the hypothesis can be inferred from the premise, contradicts it, or is unrelated. For instance, the hypothesis “Gratianus was a Roman emperor.” can be inferred from the premise “The fourth-century Roman emperor Gratianus was an early visitor”.

Although it has not been trained on NLI, bloom-560m should be able to solve this task without any supervision. But you have to give bloom-560m a good prompt to get the correct answer. A good prompt for NLI may follow the pattern:

{{Premise}}
Based on the previous passage, is it true that "{{Hypothesis}}"? Yes, no, or maybe?

Manual test
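
For instance, you can first try the prompt by hand on one example; here is a small sketch, reusing the bloom model and tokenizer loaded above (the premise and hypothesis are only an illustration):

premise = "In winter, the weather is getting colder."
hypothesis = "The temperature drops in winter."
prompt = premise + '\nBased on the previous passage, is it true that "' + hypothesis + '"? Yes, no, or maybe?'
inputs = tokenizer.encode(prompt, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0][inputs.shape[1]:]))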

Evaluation in python

You’re going to evaluate the quality of this method (prompt and model) on the first 10 examples from the GLUE MNLI corpus. The following code downloads the GLUE MNLI corpus, queries bloom-560m, compares the returned string with the true label for each of the first 10 examples and computes the accuracy:

import datasets
d=datasets.load_dataset("Jikiwa/glue-mnli-train")['validation']
# 0=entail 1=neutral 2=contradict

nok,ntot = 0,0
for i in range(10):
  h = d[i]['hypothesis']
  p = d[i]['premise']
  label = int(d[i]['label'])
  s = p+'\n'+'Based on the previous passage, is it true that "'+h+'"? Yes, no, or maybe?'
  print(s)

  # TODO: get bloom's answer in rr

  if 'yes' in rr: rep=0
  elif 'no' in rr: rep=2
  elif 'maybe' in rr: rep=1
  else: rep=-1
  if rep<0:
      print("ERROR",rr)
  else:
      if rep==label: nok+=1
      ntot+=1
      acc=float(nok)/float(ntot)
      print("acc",acc)

Comparison with BART-large-MNLI

We have performed NLI with the bloom model, which has never been trained to do NLI; such Zero-Shot Learning performance is only possible with large pretrained models, which have captured enough varied information from the web.

You may wonder how it performs compared to a smaller pretrained model that has been specifically fine-tuned on the NLI task. Such a model at huggingface is “bart-large-mnli”. The objective is to compute the classification accuracy obtained by bart-large-mnli on the same 10 validation sentences from the MNLI corpus. Note that bart-large-mnli has been fine-tuned on this very same corpus, so its performance should be excellent!

But contrary to bloom, bart-large-mnli is not a text generation model, and so it cannot generate the answer “in English”. Instead, it’s a classification model that directly outputs the classification scores for the three classes “entail / neutral / contradict”. So it must be used differently in python, as follows:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
nli_model = AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli')
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')

premise = 'The fourth-century Roman emperor Gratianus was an early visitor'
hypothesis = 'Gratianus was a Roman emperor.'

x = tokenizer.encode(premise, hypothesis, return_tensors='pt', truncation='only_first')
logits = nli_model(x)[0]

# label order of bart-large-mnli: 0=contradiction, 1=neutral, 2=entailment
score_infer = logits[0,2].item()
score_neutral = logits[0,1].item()
score_contradict = logits[0,0].item()

Adapt this code and the previous one to compute the accuracy of bart-large-mnli on the 10 first validation sentences from the MNLI corpus.
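
A hint for the adaptation (a sketch: bart-large-mnli uses the label order contradiction/neutral/entailment shown above, which is the reverse of the dataset convention 0=entail, 1=neutral, 2=contradict):

pred = logits[0].argmax().item()   # 0=contradiction, 1=neutral, 2=entailment
rep = 2 - pred                     # dataset convention: 0=entail, 1=neutral, 2=contradict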

(Spoiler: you should get around 90% accuracy)

Multi-lingual

Even though English LLMs are mainly trained on English texts, they are also skilled in other languages. How is it possible? Simply because “English” pretrained models have actually been pre-trained on such large web corpora that, despite the efforts that may have been made to only keep training data in English, a non-negligible proportion of data in foreign languages has still leaked into these models. We may thus wonder whether these models have captured some non-English information.

Spoiler: the answer is yes.

You can easily check that by querying bloom to see whether it’s able to translate a sentence; you may use one of the methods explained in the “Manual test” section above to query bloom with the following prompt:

French: Le sens de cette phrase dépend de la personne qui l'écoute
English:

or another example:

Is the sentence "ce n'est vraiment pas une bonne chose" positive or negative?

Fine, we have seen on a few examples that, even though bloom has been trained on a mostly “English” corpus, it is also able to capture semantics in other languages. But how good is it really?

The only way to answer this question is to evaluate bloom on a multi-lingual benchmark!

We’re going to use the XGLUE benchmark, more specifically its “NC” part, which deals with topic classification in multiple languages. However, please do not download the dataset from the huggingface website, as it is very slow! Rather download on your computer a smaller version of the French validation set that I have compiled, which is available here: https://olki.loria.fr/cerisara/lexres/xglue.20.txt.zip. This is a text file: each line of the file contains one sample plus the gold topic appended at the end of the line as a digit, which is the index of the topic in the topic list below, starting from 1.

Warning: this text file is encoded in UTF-8: if you’re on Windows or Mac, you may want to use the following python code to open it:

with open("xglue.20.txt","r",encoding="utf-8") as f:
    for l in f:
        l=l.strip()
        body=l[0:-1].replace('/',' ')
        gold_topic=int(l[-1:])

We’re going to leverage Zero-Shot Learning for that, as explained in a previous Section. The only difference is basically the corpus used, which is multi-lingual, and the task, which is topic classification.

In order to handle this task, we’re going to use the following prompt:

"{{body}}", given a list of categories: "sports, travel, finance, lifestyle, news, entertainment, health, video or autos", what category does the paragraph belong to?

What is the topic classification accuracy on the first 20 examples from XGlue-FR-validation?
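
Here is a possible skeleton for this evaluation (a sketch: the topics list follows the order given in the prompt above, and the way you query bloom and parse its answer is left to you, as in the MNLI exercise):

topics = ["sports","travel","finance","lifestyle","news","entertainment","health","video","autos"]

nok, ntot = 0, 0
with open("xglue.20.txt","r",encoding="utf-8") as f:
    for l in f:
        l = l.strip()
        body = l[0:-1].replace('/',' ')
        gold_topic = int(l[-1:])        # 1-based index in the topics list
        s = '"'+body+'", given a list of categories: "sports, travel, finance, lifestyle, news, entertainment, health, video or autos", what category does the paragraph belong to?'

        rr = ""   # TODO: query bloom with the prompt s and store its lowercased answer here

        rep = -1
        for i, t in enumerate(topics):
            if t in rr: rep = i+1       # predicted 1-based topic index
        if rep == gold_topic: nok += 1
        ntot += 1
print("acc", float(nok)/float(ntot))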

Spoiler: you should get around 45%.