Huggingface

C. Cerisara

14-10-2023

Overfitting

In this lab, you’re going to fine-tune a pretrained language model and study overfitting, its impact and how to prevent it.

Download the model

You’re going to use a “small” version of the pretrained GPT2 language model, which has only 82 million parameters and should fit on your laptop. Here is the code to download it:

from transformers import GPT2Tokenizer, GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained('distilgpt2')
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')

Now inspect what the model is composed of with the following code, and try to match what you see with the diagram in the course:

for name, param in model.named_parameters():
    print(name,param.size())

Predict with GPT2

First, you’re going to check that the model works. It’s a basic language model, so you may consider it as a black box: you input the beginning of a sentence, and it outputs the most likely following word. You can then concatenate this word at the end of your input, and iterate this process to progressively generate a complete paragraph. This iterative process is already implemented in a method of the “transformers” library called “generate()”.

Hence, look at the most likely continuation of the sentence “In winter, the weather is getting …” according to GPT2, with the following code:

s = tokenizer.encode('In winter, the weather is getting',return_tensors='pt')
y = model.generate(s)
sy = tokenizer.decode(y[0])
print(sy)

What is this code doing?
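
(For the curious: with its default settings, generate() performs greedy decoding, which is roughly equivalent to the manual loop sketched below; the real method of course supports many more options, such as beam search and sampling.)

import torch
# sketch of greedy decoding: repeatedly score the next token
# and append the most likely one to the input
ids = tokenizer.encode('In winter, the weather is getting', return_tensors='pt')
with torch.no_grad():
    for _ in range(10):                                # generate 10 additional tokens
        logits = model(ids).logits                     # (1, sequence length, vocabulary size)
        next_id = logits[0,-1].argmax()                # most likely token after the last position
        ids = torch.cat([ids, next_id.view(1,1)], dim=1)
print(tokenizer.decode(ids[0]))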

The “generate()” method is useful because it iteratively generates a complete paragraph. But here, we rather want the score of a single token/word that immediately follows the input. (In the following, I will say word instead of token, because for what we’re going to do, they’re just the same.) You can do that by calling model() instead of model.generate():

outputs = model(s)
print(type(outputs))
print(outputs.logits.size())

When called this way, the model returns an object “outputs” that contains a lot of information. For instance, “outputs.logits” contains an array with the score of every possible following word, at every position of the input. You can see that this array has 3 dimensions, of size 1, 7 and 50257 respectively: the batch (a single input sentence), the 7 input tokens, and the 50257 tokens of the GPT2 vocabulary.

You can thus see the score given to a particular continuation token after the complete input sentence:

print(outputs.logits[0,6,22312])
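
As an aside, you can also look up which token gets the highest score overall; this small sketch (not needed for the rest of the lab) takes the argmax over the vocabulary at the last input position and decodes it:

best = outputs.logits[0,-1].argmax().item()   # index of the highest-scoring vocabulary token
print(best, tokenizer.decode([best]))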

Fine! But it’s not very easy to interpret when it’s not in plain English… So let’s find out what the token for the word “warmer” is:

ss = tokenizer.encode('warmer',return_tensors='pt')
labels = ss[0][0].view(1)    # keep the first sub-token of "warmer" as the target label
print("token for warmer:",labels)

You can see that “warmer” corresponds to token 5767, and so you can get its score:

print(outputs.logits[0,6,5767])

This way, you can compare the scores of various words: try to compare the scores for “warmer” and “colder”: is “colder” really much more likely than “warmer”?

Spoiler: the absolute difference in logits should be around 4.3.
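
If you need a starting point, here is a minimal sketch that mirrors the method used above for “warmer” (the token id for “colder” is looked up the same way):

cc = tokenizer.encode('colder',return_tensors='pt')
colder = cc[0][0].item()                           # first sub-token of "colder"
print("warmer:", outputs.logits[0,6,5767].item())
print("colder:", outputs.logits[0,6,colder].item())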

Finetune GPT2

Next, you’re going to fine-tune the model to a surreal world where the winter is warm and the summer is cold. To do that, you’re going to give the model the ground-truth continuation “warmer” and fine-tune it so that it outputs “In winter, the weather is getting warmer”.

Here is the code to finetune the model:

import torch
lasty = outputs.logits[0,-1].view(1,-1)    # scores of the word that follows the full input
lossfct = torch.nn.CrossEntropyLoss()
loss = lossfct(lasty,labels)               # cross-entropy against the target token "warmer"
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss.backward()                            # backpropagate the gradients
optimizer.step()                           # apply one AdamW update to all parameters
optimizer.zero_grad()

Let’s look at what this code is doing: it takes the logits of the last input position (the scores of the word that follows the complete sentence), computes the cross-entropy loss between these scores and the ground-truth token “warmer”, backpropagates the gradients, and applies a single AdamW update to all the model’s parameters.

You can now check that the model’s predictions are not the same as before:

y = model.generate(s)
sy = tokenizer.decode(y[0])
print(sy)
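
You can also re-compute the logits and check how the score of “warmer” has moved after this single update (a small sketch, reusing the input and token id from above):

outputs = model(s)                                    # re-run the model on the same input
print("warmer:", outputs.logits[0,6,5767].item())     # this score should have increased

If you repeat the optimization step several times on this single example, the model will quickly learn to always predict “warmer” after this prompt: this is the kind of overfitting this lab studies.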

Prompting

As explained in the course, exploiting a pretrained model to solve NLP tasks in a Zero-Shot way is only possible with large enough models. Next, you will use the smallest bloom model, bloom-560m.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
# bloom is a decoder-only (causal) language model, so it must be loaded with AutoModelForCausalLM;
# bfloat16 and low_cpu_mem_usage reduce the memory footprint
model = AutoModelForCausalLM.from_pretrained('bigscience/bloom-560m', torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
s = "input text"
inputs = tokenizer.encode(s, return_tensors="pt")
outputs = model.generate(inputs)
answer = tokenizer.decode(outputs[0])
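
Two practical remarks (suggestions, not requirements): generate() only produces a handful of tokens by default, so you may want to pass max_new_tokens to control the length of the answer; and since bloom is a causal model, the returned sequence starts with your prompt, which you can strip off before decoding:

outputs = model.generate(inputs, max_new_tokens=10)       # limit the number of generated tokens
answer = tokenizer.decode(outputs[0][inputs.shape[1]:])   # keep only the tokens generated after the prompt
print(answer)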

Natural Language Inference

Given 2 sentences respectively called the premise and the hypothesis, the NLI task aims at detecting whether the hypothesis can be inferred from the premise, contradicts it, or is unrelated. For instance, the hypothesis “Gratianus was a Roman emperor.” can be inferred from the premise “The fourth-century Roman emperor Gratianus was an early visitor”.

Although it has not been trained on NLI, bloom-560m should be able to solve this task without any supervision. But you have to give bloom-560m a good prompt to get the correct answer. A good prompt for NLI may follow the pattern:

{{Premise}}
Based on the previous passage, is it true that "{{Hypothesis}}"? Yes, no, or maybe?

Manual test
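
For instance, you can first try the prompt by hand on one example; here is a small sketch, reusing the bloom model and tokenizer loaded above (the premise and hypothesis are only an illustration):

premise = "In winter, the weather is getting colder."
hypothesis = "The temperature drops in winter."
prompt = premise + '\nBased on the previous passage, is it true that "' + hypothesis + '"? Yes, no, or maybe?'
inputs = tokenizer.encode(prompt, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0][inputs.shape[1]:]))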

Evaluation in python

You’re going to evaluate the quality of this method (prompt and model) on the first 10 examples from the GLUE MNLI corpus. The following code downloads the GLUE MNLI corpus, queries bloom-560m, compares the returned string with the true label for each of the first 10 examples and computes the accuracy:

import datasets
d=datasets.load_dataset("Jikiwa/glue-mnli-train")['validation']
# 0=entail 1=neutral 2=contradict

nok,ntot = 0,0
for i in range(10):
  h = d[i]['hypothesis']
  p = d[i]['premise']
  label = int(d[i]['label'])
  s = p+'\n'+'Based on the previous passage, is it true that "'+h+'"? Yes, no, or maybe?'
  print(s)

  # TODO: get bloom's answer in rr

  if 'yes' in rr: rep=0
  elif 'no' in rr: rep=2
  elif 'maybe' in rr: rep=1
  else: rep=-1
  if rep<0:
      print("ERROR",rr)
  else:
      if rep==label: nok+=1
      ntot+=1
      acc=float(nok)/float(ntot)
      print("acc",acc)

Comparison with BART-large-MNLI

We have performed NLI with the bloom model, which has never been trained to do NLI; such Zero-Shot Learning performance is only possible with large pretrained models, which have captured enough varied information from the web.

You may wonder how it performs compared to a smaller pretrained model that has been specifically fine-tuned on the NLI task. Such a model at huggingface is “bart-large-mnli”. The objective is to compute the classification accuracy obtained by bart-large-mnli on the same 10 validation sentences from the MNLI corpus. Note that bart-large-mnli has been fine-tuned on this very same corpus, so its performance should be excellent!

But contrary to bloom, bart-large-mnli is not a text generation model, and so it cannot generate the answer “in English”. Instead, it’s a classification model that directly outputs the classification scores for the three classes “entail / neutral / contradict”. So it must be used differently in python, as follows:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
nli_model = AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli')
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')

premise = 'The fourth-century Roman emperor Gratianus was an early visitor'
hypothesis = 'Gratianus was a Roman emperor.'

x = tokenizer.encode(premise, hypothesis, return_tensors='pt', truncation='only_first')
logits = nli_model(x)[0]

# label order of bart-large-mnli: 0=contradiction, 1=neutral, 2=entailment
score_infer = logits[0,2].item()
score_neutral = logits[0,1].item()
score_contradict = logits[0,0].item()

Adapt this code and the previous one to compute the accuracy of bart-large-mnli on the 10 first validation sentences from the MNLI corpus.
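
A hint for the adaptation (a sketch: bart-large-mnli uses the label order contradiction/neutral/entailment shown above, which is the reverse of the dataset convention 0=entail, 1=neutral, 2=contradict):

pred = logits[0].argmax().item()   # 0=contradiction, 1=neutral, 2=entailment
rep = 2 - pred                     # dataset convention: 0=entail, 1=neutral, 2=contradict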

(Spoiler: you should get around 90% accuracy)

Multi-lingual

Even though English LLMs are mainly trained on English texts, they are also skilled in other languages. How is it possible? Simply because “English” pretrained models have actually been pre-trained on such large web corpora that, despite the efforts that may have been made to only keep training data in English, a non-negligible proportion of data in foreign languages has still leaked into these models. We may thus wonder whether these models have captured some non-English information.

Spoiler: the answer is yes.

You can easily check that by querying bloom to see whether it’s able to translate a sentence; you may use one of the methods explained in the “Manual test” section above to query bloom with the following prompt:

French: Le sens de cette phrase dépend de la personne qui l'écoute
English:

or another example:

Is the sentence "ce n'est vraiment pas une bonne chose" positive or negative?

Fine, we have seen on a few examples that, even though bloom has been trained on a mostly “English” corpus, it is also able to capture semantics in other languages. But how good is it really?

The only way to answer this question is to evaluate bloom on a multi-lingual benchmark!

We’re going to use the XGLUE benchmark, more specifically its “NC” part, which deals with topic classification in multiple languages. However, please do not download the dataset from the huggingface website, as it is very slow! Rather download on your computer a smaller version of the French validation set that I have compiled, which is available here: https://olki.loria.fr/cerisara/lexres/xglue.20.txt.zip. This is a text file: each line of the file contains one sample plus the gold topic appended at the end of the line as a digit, which is the index of the topic in the topic list below, starting from 1.

Warning: this text file is encoded in UTF-8: if you’re on Windows or Mac, you may want to use the following python code to open it:

with open("xglue.20.txt","r",encoding="utf-8") as f:
    for l in f:
        l=l.strip()
        body=l[0:-1].replace('/',' ')
        gold_topic=int(l[-1:])

We’re going to leverage Zero-Shot Learning for that, as explained in a previous Section. The only difference is basically the corpus used, which is multi-lingual, and the task, which is topic classification.

In order to handle this task, we’re going to use the following prompt:

"{{body}}", given a list of categories: "sports, travel, finance, lifestyle, news, entertainment, health, video or autos", what category does the paragraph belong to?

What is the topic classification accuracy on the first 20 examples from XGlue-FR-validation?
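
Here is a possible skeleton for this evaluation (a sketch: the topics list follows the order given in the prompt above, and the way you query bloom and parse its answer is left to you, as in the MNLI exercise):

topics = ["sports","travel","finance","lifestyle","news","entertainment","health","video","autos"]

nok, ntot = 0, 0
with open("xglue.20.txt","r",encoding="utf-8") as f:
    for l in f:
        l = l.strip()
        body = l[0:-1].replace('/',' ')
        gold_topic = int(l[-1:])        # 1-based index in the topics list
        s = '"'+body+'", given a list of categories: "sports, travel, finance, lifestyle, news, entertainment, health, video or autos", what category does the paragraph belong to?'

        rr = ""   # TODO: query bloom with the prompt s and store its lowercased answer here

        rep = -1
        for i, t in enumerate(topics):
            if t in rr: rep = i+1       # predicted 1-based topic index
        if rep == gold_topic: nok += 1
        ntot += 1
print("acc", float(nok)/float(ntot))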

Spoiler: you should get around 45%.