If you didn’t receive any personal correction, or missed any MCQ, send me an email now saying “I didn’t receive any correction” or “I didn’t pass the MCQ because…”
cerisara@loria.fr
Q001 | Q002 | Q003 | Q004 | Q005 | Q006 | Q007 | Q008 | Q009 |
36% | 29% | 61% | 50% | 11% | 89% | 43% | 57% | 93% |
Q010 | Q011 | Q012 | Q013 | Q014 | Q015 | Q016 | Q017 | Q018 |
32% | 86% | 46% | 82% | 57% | 25% | 21% | 79% | 32% |
1 * Which sentence is wrong? + The LLOD has proposed the RDF specifications - The LLOD is the largest subsection of the LOD - The LLOD promotes a cloud
2 * Which sentence is wrong? + URIs make resources accessible - URI was previously known as UDI - An ISBN is a URI
3 *[horiz] Which standardization body has standardized RDF? + W3C - ISO - LOD
4 * If there is no license on a web page, this implies: - that the text is free to use - by default, the CC0 (Creative Commons Zero) license + that you cannot copy it to your website
5 *[horiz] Which body standardized TEI? - W3C - CLARIN + ISO
6 *[horiz] Which format is better suited to represent phone-to-text alignment? - CoNLL + TextGrid - RDF
7 * The main role of a concept registry is to provide: - definitions of grammatical concepts + a URI for concepts - accessibility to TEI concepts
8 * Which sentence is true: - LMF extends TEI to morphology, MWE, syntax... + TEI is more general than LMF - TEI requires references to a concept registry
9 *[horiz] Which repository mainly delivers free lexical resources: + ORTOLANG - ELRA - LDC
10 * Which sentence is true: - Some LLMs already capture all textual knowledge available on the internet + LLMs capture more and more internet text, at a rate faster than the growth of textual content on the internet - LLMs capture more and more internet text, but the amount of text on the internet grows faster
11 * Which view in TEI best encodes syntactic relations? + lexical - typographic - editorial
12 * In the CLARIN network, the VLO allows users to: + search the available resources - compute statistics about the usage of each resource - observe and report on the evolution of lexical resources worldwide
13 * The CoNLL format is commonly used to encode: - abstract meaning representations + syntactic trees - common sense knowledge
14 *[horiz] Simple text files do not require intense preprocessing to extract a lexicon. - True! + False!
15 * Which one corresponds to a correct lexicon creation pipeline? (a toy sketch follows this list) - text format > tokenization > frequencies > ngram - text format > cleaning > lexicon extraction > ngram + text format > cleaning > ngram > frequencies
16 * Why does XML-to-JSON conversion pose challenges? - Due to the data hierarchy being different! + Because JSON files possess less information - Because XML can be parsed as YAML
17 * A lexicon cannot incorporate morphological rules. - True! + False!
18 * A lexeme is defined by its... - canonical form! - inflected form! + meaning!
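To make the pipeline of Q15 concrete, here is a toy sketch on a made-up input text (the cleaning and tokenization steps are deliberately simplistic):
import re
from collections import Counter
text = "The cat sat on the mat. The cat slept."        # made-up example text
cleaned = re.sub(r"[^\w\s]", " ", text.lower())        # cleaning: lowercase, drop punctuation
tokens = cleaned.split()                               # naive whitespace tokenization
bigrams = list(zip(tokens, tokens[1:]))                # ngram extraction (bigrams here)
Counter(tokens).most_common(3)                         # frequencies over unigrams
Counter(bigrams).most_common(3)                        # frequencies over bigrams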
The slides often give examples of code: start Python on your laptop right now and try to run each command as it is given.
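If NLTK and its data are not installed yet, here is a minimal setup sketch (the corpus identifiers below are the standard NLTK ones):
# pip install nltk
import nltk
nltk.download("wordnet")        # Princeton WordNet (used just below)
nltk.download("omw-1.4")        # extra WordNet data, sometimes requested by recent NLTK releases
nltk.download("framenet_v17")   # FrameNet 1.7 (used in the last part)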
“zoo” has only one sense, so one lemma, which belongs to the synset {menagerie, zoo, zoological garden}
Find the gloss of “zoo”
How many synsets are there for “WordNet”?
What synonyms does the noun “table” have?
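One possible way to answer these three questions directly (a sketch; the commands below then explore the same API step by step):
from nltk.corpus import wordnet as wn
wn.synsets("zoo")[0].definition()          # gloss of the single sense of "zoo"
len(wn.synsets("wordnet"))                 # number of synsets for "WordNet"
[l.name() for s in wn.synsets("table", pos=wn.NOUN) for l in s.lemmas()]   # synonyms of the noun "table"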
from nltk.corpus import wordnet as wn
zoo_synsets = wn.synsets("zoo")
len(zoo_synsets)
wn.synsets("try",pos=wn.NOUN)
wn.synsets("try",pos=wn.VERB)
wn.synsets("dry",pos=wn.NOUN)
wn.synsets("dry",pos=wn.VERB)
wn.synsets("dry",pos=wn.ADJ)
d1 = wn.synsets("dry",pos=wn.ADJ)[0]
d1.name()
d1.lemmas()
d1.definition()
d1.examples()
wn.synset("zoo.n.01")
wn.synset("menagerie.n.02")
wn.lemma("dry.a.01.dry").antonyms()
wn.lemma("dry.a.01.dry").name()
wn.lemma("dry.a.01.dry").count()
wn.lemma("dry.a.01.dry").derivationally_related_forms()
Hypernyms and hyponyms (80% of WordNet relations)
red, blue = co-hyponyms
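A quick check of the co-hyponym example (a sketch, assuming red.n.01 and blue.n.01 are the colour senses; co-hyponyms share a direct hypernym):
red = wn.synset("red.n.01")
blue = wn.synset("blue.n.01")
red.hypernyms()                       # direct hypernym(s) of the colour red
red.hypernyms() == blue.hypernyms()   # True if both colours hang under the same hypernym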
cat = wn.synset('cat.n.01')
man = wn.synset('man.n.01')
cat.hypernyms()
cat.root_hypernyms()
man.hyponyms()
Compare:
wn.lemma('dry.a.01.dry').antonyms()
wn.synset('cat.n.01').hypernyms()
00 adj.all all adjective clusters
01 adj.pert relational adjectives (pertainyms)
02 adv.all all adverbs
03 noun.Tops unique beginner for nouns
04 noun.act nouns denoting acts or actions
05 noun.animal nouns denoting animals
06 noun.artifact nouns denoting man-made objects
…
man.lexname()
cat.lexname()
wn.all_synsets()
Going from a synset to the hypernym root:
cat.hypernym_paths()
Intersection between two hierarchies:
bird = wn.synset('bird.n.01')
cat.lowest_common_hypernyms(bird)
cat.common_hypernyms(bird)
hypo = lambda x: x.hyponyms()
cat.tree(hypo)
wn.synset('concrete.a.01').tree(lambda s: s.also_sees(), depth=2)
hypo = lambda x: x.hyponyms()
for x in cat.closure(hypo): print(x)
cat.min_depth()
cat.max_depth()
= The length of the shortest (longest) hypernym path from this synset to the root.
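A small sketch to make the definition explicit (each hypernym path includes the synset itself and the root, hence the -1):
min(len(p) for p in cat.hypernym_paths()) - 1 == cat.min_depth()   # expected: True
max(len(p) for p in cat.hypernym_paths()) - 1 == cat.max_depth()   # expected: True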
cat.hypernym_distances()
Sort these distances:
sec = lambda x: x[1]
sorted(cat.hypernym_distances(), key=sec)
cat.path_similarity(man)
cat.wup_similarity(man)
cat.lch_similarity(man)
cat.shortest_path_distance(man)
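For intuition, a small comparison sketch (assuming dog.n.01 is the usual animal sense; under these measures cat/dog should score higher than cat/man):
dog = wn.synset("dog.n.01")
cat.path_similarity(dog), cat.path_similarity(man)   # cat/dog expected to be higher
cat.wup_similarity(dog), cat.wup_similarity(man)
cat.shortest_path_distance(dog), cat.shortest_path_distance(man)   # and the distance lower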
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
71% | 71% | 89% | 43% | 57% | 50% | 57% | 61% | 75% | 86% |
11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
46% | 64% | 79% | 64% | 82% | 32% | 61% | 18% | 39% |
1 [horiz] A vocabulary contains 400 words; a representation of these words in the form of a 400-dimensional vector is proposed. It is likely to be: + a one-hot vector embedding - a BERT embedding - a word2vec embedding
2 [horiz] You have a vocabulary of $N$ words; you want to encode them into vectors so that the distance between every pair of words is the same. Which method do you choose? + one-hot encoding - GloVe - BERT
3 GloVe is an embedding method that: + is trained on co-occurrence matrices - is based on neural networks - combines LDA probabilistic embeddings with W2V
4 [horiz] Which library does not natively support transformers? + scikit-learn - JAX - PyTorch
5 [horiz] Which method is the best to encode sentence semantics? - Doc2Vec - Skip-thought + Sentence-BERT
6 Random Indexing: - randomly indexes the U matrix that has been computed from the SVD decomposition as in LSA + can take into account new documents even after all the training corpus has been deleted - gives better results with low-dim than with high-dim vectors
7 Collobert embeddings are trained with the objective: - generate context words + multiple standard NLP tasks (NER, POStags...) - optimize Natural Language Inference task
8 About word2vec embeddings, which statement is true: - they rely on sub-word units - their training objective is: predict the following word + they are fast to train
9 Which model is not commonly used for language modeling? - transformers - Multi-Layer Perceptron - recurrent neural network + logistic regression
10 [horiz] Which model does not rely on transformers? + ELMo - BERT - GPT1
11 What is the best approach to compute sentence embeddings? + use BERT plus contrastive learning with sentence pairs - average every word embedding in the sentence - use xl-net
12 [horiz] Which LM requires more parameters as the length of the sentence/context increases? + feed-forward network - RNN - transformers
13 [horiz] Who invented the term "word embedding"? - Mikolov + Bengio - Collobert
14 Let's consider 2 sentences $A=$"the work is sound" and $B=$"the sound is loud"; you compute the BERT embedding of the word *sound* in $A$ and $B$ respectively as $A_s$ and $B_s$; and the embedding of the word *good* in $C=$"it's a good job" as $C_g$. Which pair of cosine similarities is most likely? (a sketch follows this list) + $d(A_s,C_g)=0.8$ and $d(B_s,C_g)=0.7$ - $d(A_s,C_g)=0.6$ and $d(B_s,C_g)=0.8$ - $d(A_s,C_g)=0.1$ and $d(B_s,C_g)=0.1$
15 [horiz] What is not an implementation of Distributional Semantics? - Word2Vec - LSA + one-hot vectors
16 [horiz] Which method is the fastest to train? - LSA + Random Indexing - Word2Vec
17 [horiz] (TD) Byte Pair Encoding can be trained using a negative log-likelihood loss - True + False
18 (TD) To correctly apply cosine distance on two BERT embeddings I need to: - apply a dot product on the embeddings beforehand + finetune the BERT model on a related task - normalize the two embeddings beforehand
19 (TD) Supervised similarity measures for embeddings can be obtained + by training an auto-encoder - by applying a PCA to find the most relevant components - by using the Jaccard similarity
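To illustrate Q14, a contextual-embedding sketch (assuming the HuggingFace transformers library and the bert-base-uncased checkpoint; exact similarity values will vary, only their ordering matters):
import torch
from transformers import AutoTokenizer, AutoModel
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
def word_vec(sentence, word):
    # contextual embedding of the first occurrence of `word` (assumed to be a single wordpiece)
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    i = enc["input_ids"][0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[i]
a_s = word_vec("the work is sound", "sound")
b_s = word_vec("the sound is loud", "sound")
c_g = word_vec("it's a good job", "good")
cos = torch.nn.functional.cosine_similarity
cos(a_s, c_g, dim=0), cos(b_s, c_g, dim=0)   # the first value is expected to be higher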
“After encouraging them, he told them goodbye and left for Macedonia”
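One way to look for this sentence in the FrameNet full-text annotations (a sketch; iterating over fn.docs() loads every document and is slow, and if the sentence is not in the full-text part the list is simply empty):
from nltk.corpus import framenet as fn
hits = [s for d in fn.docs() for s in d.sentence if "Macedonia" in s.text]
len(hits)
hits[0].annotationSet if hits else None   # frame annotations attached to the matching sentence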
from nltk.corpus import framenet as fn
fn.frame('Motion')
f = fn.frame(7)
frame(fn_fid_or_fname, ignorekeys=[]) method of nltk.corpus.reader.framenet.FramenetCorpusReader instance
Get the details for the specified Frame using the frame's name
or id number.
fn.frames()
fn.frames(r'(?i)crim') # case-insensitive regex
frames(name=None) method of nltk.corpus.reader.framenet.FramenetCorpusReader instance
Obtain details for a specific frame.
fn.lus()
fn.lu(4896)
f.FE
f.frameRelations
fn.fe_relations()
f.frameRelations[0]._type
fn.fe_relations()[0]._type
print(f.keys())
fn.docs()
fn.docs_metadata()
fn.doc(id)
fn.doc(6).sentence[0]
fn.doc(6).sentence[0].annotationSet[0]
fn.exemplars(frame='Motion')
fn.exemplars('run')
fn.frames('Motion')
fn.lus('xpress')
g = fn.lu(5372)
g.ID
g.definition
g.name
g.frame
g.frame.name
help(fn)