Lexical Resources

Christophe Cerisara

2022/2023

Lexical Resources: introduction

Course plan (indicative!)

CM      Topic
08/12   intro: get/build Lex Res
15/12   Overview of embeddings
19/01   WordNet + FrameNet
26/01   VerbNet + PropBank
02/02   Transformers 1
09/02   Transformers 2

Course requirements:

Content of the course

  • How to create (transform) LR
    • Manually: annotation guides, quality…

    • Automatically
      • Scrape/curate texts, RDF extraction…

      • Text processing: N-grams, embeddings
  • How to use LR
    • From XML files (SPARQL…)

    • From Python & NLTK / word embeddings

Today’s concepts

  • Motivation, concepts about LR
  • Where to get LR from
  • How to process LR; example: Wiktionary

Concepts

What is a lexical resource?

  • Axel Herold: “collection of lexical items, together with linguistic information and/or classification of these items”
  • lexical items = words, multi-words, sub-word units
  • dictionary = usable by humans
  • LR = usable by machines
  • data: spelling, phonetics, category, relations (semantics…)

Usable by machines?

  • Ideally: standard formats (XML, TEI, RDF…)
  • In practice:
    • ad-hoc formats (CoNLL…)
  • Examples: CoNLL, TextGrid, TEI, LMF

Formats

CoNLL example

1    They     they    PRON    PRP    Case=Nom|Number=Plur               2    nsubj    2:nsubj|4:nsubj
2    buy      buy     VERB    VBP    Number=Plur|Person=3|Tense=Pres    0    root     0:root
3    and      and     CONJ    CC     _                                  4    cc       4:cc
4    sell     sell    VERB    VBP    Number=Plur|Person=3|Tense=Pres    2    conj     0:root|2:conj
5    books    book    NOUN    NNS    Number=Plur                        2    obj      2:obj|4:obj
6    .        .       PUNCT   .      _                                  2    punct    2:punct
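
Such a file is easy to consume programmatically. Below is a minimal Python sketch (not an official parser; the file name example.conllu is hypothetical) for reading a CoNLL-U file, where each token is one line of 10 tab-separated fields and sentences are separated by blank lines:

def read_conllu(path):
    """Yield sentences as lists of token dictionaries from a CoNLL-U file."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                      # blank line = end of sentence
                if sentence:
                    yield sentence
                    sentence = []
            elif not line.startswith("#"):    # skip metadata comments
                cols = line.split("\t")
                sentence.append({"form": cols[1], "lemma": cols[2], "upos": cols[3],
                                 "head": cols[6], "deprel": cols[7]})
    if sentence:
        yield sentence

for sent in read_conllu("example.conllu"):
    print([(tok["form"], tok["upos"]) for tok in sent])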

TextGrid example

    item [1]:
       class = "IntervalTier"
       name = "sentence"
       xmin = 0
       xmax = 2.3
       intervals: size = 1
       intervals [1]:
          xmin = 0
          xmax = 2.3
          text = "říkej ""ahoj"" dvakrát"
    item [2]:
       class = "IntervalTier"
       name = "phonemes"
       xmin = 0
       xmax = 2.3
       intervals: size = 3
       intervals [1]:
          xmin = 0
          xmax = 0.7
          text = "r̝iːkɛj"
       intervals [2]:
          xmin = 0.7
          xmax = 1.6
          text = "ʔaɦɔj"

Text Encoding Initiative

  • different views of a dictionary:
    • typographic view: “the two-dimensional printed page” (includes page layout)
    • editorial view: “one-dimensional sequence of tokens”
    • lexical view: structured lexicographical information, may include lexical data not present in the text
  • TEI guidelines: encode one view with primary XML structure; another view in XML attributes
  • lexical view = most important for Lexical Resources, so encoded in XML structure
<entry xml:id="a_1">
    <form>
        <orth>Bahnhof</orth>
    </form>
    <gramGrp>
        <pos value="N" />
        <gen value="masculine" />
    </gramGrp>
    <sense>
        <def>...</def>
        <cit>
            <quote>der Zug fährt in den Bahnhof ein</quote>
        </cit>
    </sense>
    <!-- ... -->
    <sense>
        <def>...</def>
    </sense>
</entry>
  • “entry” = basic unit of information
    • includes “form” and “sense”
  • “gramGrp” = grammatical information
    • may also be put as a child of “form”, to make it depend on a specific form
  • reference to the CLARIN concept registry:
<gramGrp>
  <pos value="N" 
    dcr:datcat="http://hdl.handle.net/11459/CCR_C-5524_d8864ad4-1bdf-ee56-594e-784312129ea7"
    dcr:valueDatcat="http://hdl.handle.net/11459/CCR_C-3347_7face0f5-7a72-7ec2-c988-7adba256cea9" />
</gramGrp>
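
For illustration, a minimal sketch (assuming the entry is available as a plain XML string, without TEI namespaces) of how such an entry can be queried with Python's standard ElementTree:

import xml.etree.ElementTree as ET

# hypothetical string holding the <entry> fragment shown above
tei = """<entry xml:id="a_1">
  <form><orth>Bahnhof</orth></form>
  <gramGrp><pos value="N"/><gen value="masculine"/></gramGrp>
</entry>"""

entry = ET.fromstring(tei)
orth = entry.findtext("./form/orth")               # headword
pos = entry.find("./gramGrp/pos").get("value")     # grammatical category
print(orth, pos)                                   # -> Bahnhof N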

Lexical Markup Format

  • focuses only on the lexical representation of data (== the lexical view of TEI)
  • goal = meta-model for all types of NLP lexicons
  • reference to data category registry mandatory
  • Packages: core, morphology, MWE, syntax…
<LexicalEntry>
    <feat att="partOfSpeech" val="noun" />
    <feat att="gender" val="masculine" />
    <Lemma>
        <feat att="writtenForm" val="Bahnhof" />
    </Lemma>
    <WordForm>
        <feat att="writtenForm" val="Bahnhof" />
        <feat att="grammaticalNumber" val="singular" />
    </WordForm>
    <WordForm>
        <feat att="writtenForm" val="Bahnhöfe" />
        <feat att="grammaticalNumber" val="plural" />
    </WordForm>
    <Sense id="s1">
    </Sense>
    <!-- ... -->
    <Sense id="sn">
    </Sense>    
</LexicalEntry>
  • LMF focuses only on NLP, so it is much tighter than TEI
  • and therefore easier to use for electronic lexicographic resources
  • XML serialization: RELISH, KYOTO, LIRICS…

Content of LR

Two types of LR

  • Manually built
    • WordNet, FrameNet, VerbNet, PropBank, DBPedia, BabelNet, Wiktionary…
    • Example: BDLEX
      • contains 440k inflected forms
      • source: Univ. Paul Sabatier
      • distributor: ELRA (2k€ non commercial)

Two types of LR

  • “Automatically” built
    • Google Ngrams
    • Word embeddings
    • no XML any more!
  • Info indirectly accessible:
    • diachronic usage of words
    • lists of “function” words
    • semantic relations between words
    • phonetizer
    • translation dictionaries…

Static vs. dynamic LR

MCQ time!

Availability of LR

Non-free Lex Res

Main distributors:

  • ELRA/ELDA (Europe)
    • European companies/univ. (CNRS)
    • 1400 resources
    • CoNLL,
    • very few free resources:
      • MLCC Multilingual and Parallel Corpora

Non-free Lex Res

  • LDC (Linguistic Data Consortium) (USA)
    • funded by ARPA and NSF
    • hosted by Univ. Pennsylvania
    • catalogue: 100s of resources
      • TIMIT, Switchboard, Gigaword…

“Freely” available

  • Distributed by universities/institutions
    • EuroParl: https://www.statmt.org/europarl/
    • CNRTL/ATILF:
      • TLFi, Morphalou, Frantext…
    • ORTOLANG:
      • 500 resources: Dicovalence, patois de St-Martin-la-Porte…
    • Isidore.science: 20 datasets

International Networks

  • https://live.european-language-grid.eu/
    • 800 links to resources (e.g. to ELRA…)
  • LINDAT
  • linghub.org
  • European Open Science Cloud

  • CLARIN
    • VLO (Virtual Language Observatory): 800k resources

International Networks

  • Linguistic Linked Open Data (LLOD) initiative
    • Principles:
      • Creative Commons licenses
      • Accessible via URI
      • RDF Standard
      • Linked Data

Uniform Resource Identifier

  • URI = unique identification
    • may be URN or URL
  • example of URN:
    • urn:isbn:0-486-27557-4
    • Specific edition of Shakespeare’s play Romeo and Juliet
  • example of URL: http://example.org/wiki/Main_Page

Resource Description Framework

  • family of W3C specifications
  • based on triples:
    • Subject: a resource
    • Predicate: relation
    • Object: another resource
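
As a minimal illustration (the http://example.org/lex/ namespace and the lexical facts are made up), the Python rdflib library can build and traverse such triples:

from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/lex/")      # hypothetical namespace
g = Graph()

# Subject - Predicate - Object
g.add((EX.Bahnhof, EX.partOfSpeech, Literal("noun")))
g.add((EX.Bahnhof, EX.gender, Literal("masculine")))
g.add((EX.Bahnhof, EX.translation, EX.gare))

# iterate over all triples with the predicate EX.translation
for s, p, o in g.triples((None, EX.translation, None)):
    print(s, "->", o)

print(g.serialize(format="turtle"))            # Turtle rendering of the graph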

LOD vision

linguistic-lod.org

Ex: Princeton WordNet

   <LexicalEntry id ='w44919'>
      <Lemma writtenForm='quantification' partOfSpeech='n'/>
      <Sense id='w44919_01003570-n' synset='eng-10-01003570-n'/>
      <Sense id='w44919_06165623-n' synset='eng-10-06165623-n'/>
   </LexicalEntry> [...]
   <Synset id='eng-10-01003570-n' baseConcept='3'>
      <Definition gloss="the act of discovering or expressing the quantity of something"> </Definition>
      <SynsetRelations>
         <SynsetRelation targets='eng-10-00996969-n' relType='hype'/>
         <SynsetRelation targets='eng-10-01003729-n' relType='hypo'/>
      </SynsetRelations>
   </Synset>

Collection of RDF triples

  • In theory = labeled directed multi-graph
  • In practice:
    • Relational database
    • or Triplestores
  • cf. course on Semantic Web
  • Tutorial on LOD

Development of LLOD

  • 2016: OntoLex-Lemon vocabulary (W3C)
  • WebAnnotation (W3C)
  • SPARQL to access RDF

  • Summer Datathon on LLOD
  • H2020 project “Prêt-à-LLOD”
  • ELEXIS project “European Lexicographic Infrastructure”

Resources

(Chiarcos, IWLTP’2020):

“Working with RDF normally requires a certain level of technical expertise, i.e., basic knowledge of SPARQL and at least one RDF format.”
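
For a sense of what that entry barrier looks like, here is a minimal sketch (hypothetical namespace and data) of a SPARQL query run through the Python rdflib library:

from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/lex/")      # hypothetical namespace
g = Graph()
g.add((EX.Bahnhof, EX.partOfSpeech, Literal("noun")))
g.add((EX.fahren, EX.partOfSpeech, Literal("verb")))

# basic SPARQL: select every (word, pos) pair linked by ex:partOfSpeech
query = """
PREFIX ex: <http://example.org/lex/>
SELECT ?word ?pos
WHERE { ?word ex:partOfSpeech ?pos . }
"""
for row in g.query(query):
    print(row.word, row.pos)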

Resources

https://livebook.manning.com/book/linked-data/chapter-2/1

Building LR from texts

Automatic LR creation

  • Typical lexical information:
    • forms + frequencies => usage
    • diachronic usage => lexical drift
    • co-occurrence => lexical semantics (see the sketch after this list)
    • embeddings => synonyms, antonyms…
    • relations => syntagmatic…
    • combination => compositional semantics
    • multi-word expressions
    • decompose them => morphology
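
For instance, the co-occurrence counts behind lexical semantics can be accumulated in a few lines; a minimal sketch over a toy, already-tokenized corpus:

from collections import Counter

# toy pre-tokenized corpus; in practice the sentences come from the scraped texts
sentences = [
    ["the", "train", "enters", "the", "station"],
    ["the", "station", "platform", "is", "crowded"],
]

window = 2                                    # size of the left context window
cooc = Counter()
for sent in sentences:
    for i, w in enumerate(sent):
        for c in sent[max(0, i - window):i]:  # left context only, so each pair is counted once
            cooc[(c, w)] += 1

print(cooc.most_common(5))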

Automatic LR creation

  • Advantages
    • Many (English) texts available (e.g. CommonCrawl)
    • Reduced costs
    • May be specialized to target domains
  • Drawbacks
    • Requires NLP expertise
    • Not as precise as a linguist

Quality?

  • Size matters => must be computationally efficient

Text analysis

  • Challenges:
    • Choose one or multiple sources of texts
    • Scrape the texts (see other course)
    • Preprocess texts (see other course)
    • Extract lexical information

Sources of texts

  • Generic information or specialized domain (healthcare…)?
  • Large variability in language:
    • Casual: forums, conversations…
    • Micro-blog
    • Formal: books…
    • Journalistic: news
    • Educative: moocs, tutorials…

Sources of texts

Do I have the right to scrape the text?

Just because it is public does not mean you can copy it!

  • Check whether there is a license, like Creative Commons; otherwise assume “all rights reserved”
  • Beyond legal aspects, there are more and more concerns about privacy & the right to be forgotten
  • Anonymization does not guarantee privacy!

Sources of texts

Do I have the right to scrape the text?

  • Twitter provides an API to download some data, but forbids you from keeping it on your hard drive.
  • You cannot redistribute texts without an explicit CC-BY licence

(Wait… hasn’t Google been scraping the whole web for many years?)

Sources of texts

  • Unsafe sources of texts:
    • Social media
    • Most web pages
  • Safe sources of texts:
    • Wikipedia & derivatives (CC BY-SA)
    • Scientific papers: arXiv, pubMed, HAL…
    • Datasets released by their owners: AskUbuntu archives, reddit archives, (Common Crawl), (WebTimeMachine)…
    • Gutenberg, Gallica…

Sources of texts

There’s a lot more… but it is inaccessible!

Scraping texts

There are several ways to download corpora:

  • APIs: not standard, may change, heavy for servers
  • dump archives (wikipedia, reddit…)
  • peer-to-peer (academic torrents)
  • OAI-PMH

See course on basic NLP techniques (Yannick Parmentier)

Pre-processing texts

  • Texts come with metadata, in XML, JSON…
  • Use adequate parsers
  • Then:
    • Filter out garbage (wrong language, errors…)
    • Segmentation + tokenization
    • Normalize (dates…)
    • Compute features (POStags…)

See Claire Gardent’s course.
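
As a minimal illustration of the segmentation + tokenization step with NLTK (assuming NLTK is installed and the punkt sentence model is available; recent NLTK versions may need the punkt_tab model instead):

import nltk

nltk.download("punkt", quiet=True)    # one-time download of the sentence segmentation model

raw = "Nancy is the capital of Meurthe-et-Moselle. It hosts the École de Nancy museum."

sentences = nltk.sent_tokenize(raw)                    # segmentation into sentences
tokens = [nltk.word_tokenize(s) for s in sentences]    # tokenization into words
print(tokens)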

Example: wiktionary

First look at the data

  • how to quickly find the line number of a page:
bzcat fich.xml.bz2 | grep -n '<title>Nancy</title>'
  • look at the page:
bzcat fich.xml.bz2 | tail -n +12872178 | less
    <title>Nancy</title>
    <ns>0</ns>
    <id>271346</id>
    <revision>
      <id>29517974</id>
      <parentid>28848867</parentid>
      <timestamp>2021-06-10T00:19:11Z</timestamp>
      <contributor>
        <username>Lingua Libre Bot</username>
        <id>229398</id>
      </contributor>
      <comment>Ajout d'un fichier audio de prononciation depuis Lingua Libre</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="7282" xml:space="preserve">{{voir|nancy}}
== {{langue|fr}} ==

=== {{S|étymologie}} ===
: {{date|lang=fr|1073}} La première trace écrite de Nancy date du 29 avril 1073 (mention dans la charte de Pibon, évêque de Toul : « Olry, voué de Nancy » (« ''Odelrici advocati de Nanceio'' »). Le nom serait cependant d’origine celtique, car on le rapproche du gaulois ''{{lien|nantu-|gaulois}}''/''{{lien|nanto-|gaulois}}'', qui signifie « [[val]], [[vallée]] », ou de {{recons|lang-mot-vedette=fr|nantus|gaulois}} (« ruisseau »).

=== {{S|nom de famille|fr}} ===
'''Nancy''' {{pron|nɑ̃.si|fr}}
# Nom de famille.

=== {{S|nom propre|fr}} ===
{{fr-inv|nɑ̃.si|inv_titre=Nom propre}}
'''Nancy''' {{pron|nɑ̃.si|fr}}
# {{localités|fr|du département de la Meurthe-et-Moselle}} [[commune|Commune]], [[ville]] et [[chef-lieu de département]] [[français]], situé dans le département de la [[Meurthe-et-Moselle]].
#* '''''Nancy''' est une ville d’ordre et de lumières où dès le XVII{{e}} et le XVIII{{e}} siècle, des ducs intelligents furent, sans le savoir, les précurseurs heureux de nos urbanistes modernes.'' {{source|{{Citation/Ludovic Naudeau/La France se regarde/1931}}}}
#* ''La verrerie de '''Nancy''' est de fondation récente, puisqu’elle date de 1875 seulement.'' {{source|Gustave Fraipont; ''Les Vosges'', 1923}}

Using APIs

  • after pip install wikipedia
import wikipedia
print(wikipedia.summary("Nancy, France"))
Nancy is the capital of the northeastern French department of Meurthe-et-Moselle, the former capital of the Duchy of Lorraine, and then the French province of the same name. The metropolitan area of Nancy had a population of 511,257 inhabitants at the 2018 census, making it the 16th largest urban area in France and the Lorraine's largest. The population of the city of Nancy proper is 104,885.
The motto of the city is Non inultus premor, Latin for '"I am not injured unavenged"'—a reference to the thistle, which is a symbol of Lorraine.
Place Stanislas, a large square built between March 1752 and November 1755 by Stanislaus I of Poland to link the medieval old town of Nancy and the new town built under Charles III in the 17th century, is a UNESCO World Heritage Site, the first place in France and in the top four in the world. The city also has many buildings listed as historical monuments and is one of the European centers of Art Nouveau thanks to the École de Nancy. Nancy is also one of the main university cities and, with the Centre Hospitalier Régional Universitaire de Brabois, the conurbation is home to one of the main health centers in Europe, renowned for its innovations in surgical robotics.
  • Python API lib:
    • easy way to access a small amount of data
    • uses Wikipedia API (online only)
    • not all details, not designed for batch processing
  • XML processing:
    • works offline, all data accessible
    • (very) hard to parse XML

Working from JSon

grep Nancy kaikki.org-dictionary-French.json

Working from JSon

{"pos": "name", "heads": [{"template_name": "fr-proper noun"}], "word": "Nancy", "lang": "French", "lang_code": "fr", "sounds": [{"ipa": "/n\u0251\u0303.si/"}], "categories": ["Cities in France"], "senses": [{"categories": ["French female given names", "French given names"], "tags": ["feminine"], "glosses": ["A female given name from English borrowed from English."], "id": "Nancy-name"}]}
{"pos": "name", "heads": [{"template_name": "fr-proper noun"}], "categories": ["Cities in France"], "word": "Nancy", "lang": "French", "lang_code": "fr", "sounds": [{"ipa": "/n\u0251\u0303.si/"}], "senses": [{"glosses": ["Nancy (the city)."], "derived": [{"word": "nanc\u00e9ien"}, {"word": "Nanc\u00e9ien"}], "id": "Nancy-name"}]}
{"pos": "noun", "heads": [{"1": "m", "f": "Nanc\u00e9ienne", "template_name": "fr-noun"}], "forms": [{"form": "Nanc\u00e9iens", "tags": ["plural"]}, {"form": "Nanc\u00e9ienne", "tags": ["feminine"]}], "word": "Nanc\u00e9ien", "lang": "French", "lang_code": "fr", "senses": [{"tags": ["masculine"], "glosses": ["an inhabitant of the city of Nancy"], "categories": ["Demonyms"], "id": "Nanc\u00e9ien-noun"}]}
  • Better way to extract word definitions:
    • JSON is easier to parse than the MediaWiki XML dump
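
A minimal sketch of reading this JSON-lines file (one complete JSON object per line) and printing the glosses of a given word:

import json

with open("kaikki.org-dictionary-French.json", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)                       # one entry per line
        if entry.get("word") == "Nancy":
            for sense in entry.get("senses", []):
                for gloss in sense.get("glosses", []):
                    print(entry["pos"], ":", gloss)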

Working from the dump

  • Handle the XML structure
  • Handle the MediaWiki wikitext markup
    <title>Nancy</title>
    <ns>0</ns>
    <id>271346</id>
    <revision>
      <id>29517974</id>
      <parentid>28848867</parentid>
      <timestamp>2021-06-10T00:19:11Z</timestamp>
      <contributor>
        <username>Lingua Libre Bot</username>
        <id>229398</id>
      </contributor>
      <comment>Ajout d'un fichier audio de prononciation depuis Lingua Libre</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="7282" xml:space="preserve">{{voir|nancy}}
== {{langue|fr}} ==

=== {{S|étymologie}} ===
: {{date|lang=fr|1073}} La première trace écrite de Nancy date du 29 avril 1073 (mention dans la charte de Pibon, évêque de Toul : « Olry, voué de Nancy » (« ''Odelrici advocati de Nanceio'' »). Le nom serait cependant d’origine celtique, car on le rapproche du gaulois ''{{lien|nantu-|gaulois}}''/''{{lien|nanto-|gaulois}}'', qui signifie « [[val]], [[vallée]] », ou de {{recons|lang-mot-vedette=fr|nantus|gaulois}} (« ruisseau »).

=== {{S|nom de famille|fr}} ===
'''Nancy''' {{pron|nɑ̃.si|fr}}
# Nom de famille.

=== {{S|nom propre|fr}} ===
{{fr-inv|nɑ̃.si|inv_titre=Nom propre}}
'''Nancy''' {{pron|nɑ̃.si|fr}}
# {{localités|fr|du département de la Meurthe-et-Moselle}} [[commune|Commune]], [[ville]] et [[chef-lieu de département]] [[français]], situé dans le département de la [[Meurthe-et-Moselle]].
#* '''''Nancy''' est une ville d’ordre et de lumières où dès le XVII{{e}} et le XVIII{{e}} siècle, des ducs intelligents furent, sans le savoir, les précurseurs heureux de nos urbanistes modernes.'' {{source|{{Citation/Ludovic Naudeau/La France se regarde/1931}}}}
#* ''La verrerie de '''Nancy''' est de fondation récente, puisqu’elle date de 1875 seulement.'' {{source|Gustave Fraipont; ''Les Vosges'', 1923}}

Parsing XML structure

import bz2
import xml.sax

class WikiXmlHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._buffer = None
        self._values = {}
        self._current_tag = None

    def characters(self, content):
        """Characters between opening and closing tags"""
        if self._current_tag:
            self._buffer.append(content)

    def startElement(self, name, attrs):
        """Opening tag of element"""
        if name in ('title', 'text'):
            self._current_tag = name
            self._buffer = []

    def endElement(self, name):
        """Closing tag of element"""
        if name == self._current_tag:
            self._values[name] = ' '.join(self._buffer)

        if name == 'page':
            print(self._values['title'], self._values['text'])

handler = WikiXmlHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

with bz2.open("frwiktionary-20210720-pages-meta-current.xml.bz2", "rt") as f:
    for line in f:
        parser.feed(line)

Parsing MediaWiki

import re
import xml.sax           # needed again for the handler below if run separately from the previous snippet
import mwparserfromhell

# rewrite etymology templates {{étyl|lang|…|word…}} as: (lang) "word" before stripping the rest
etyl = re.compile(r'{{étyl\|([^\|]*)\|[^\|]*\|([^\|}]*)[^}]*}}')

def handleWikiMedia(title, text):
    title = mwparserfromhell.parse(title)
    s = title.strip_code().strip()
    if ':' not in s:                  # skip namespaced pages (Catégorie:, Modèle:, …)
        print("DEF " + s)
        wiki = mwparserfromhell.parse(text)
        for l in str(wiki).split('\n'):
            l = l.strip()
            if l.startswith(':'):     # keep only indented wikitext lines (e.g. the etymology above)
                l = etyl.sub("(\\g<1>) \"\\g<2>\"", l)
                s = l[1:]
                ss = mwparserfromhell.parse(s)
                s = ss.strip_code(normalize=True, collapse=True, keep_template_params=False).strip()
                if len(s) > 0: print("   " + s)
        print()

class WikiXmlHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._buffer = None
        self._values = {}
        self._current_tag = None

    def characters(self, content):
        """Characters between opening and closing tags"""
        if self._current_tag:
            self._buffer.append(content)

    def startElement(self, name, attrs):
        """Opening tag of element"""
        if name in ('title', 'text'):
            self._current_tag = name
            self._buffer = []

    def endElement(self, name):
        """Closing tag of element"""
        if name == self._current_tag:
            self._values[name] = ' '.join(self._buffer)

        if name == 'page':
            # print(self._values['title'], self._values['text'])
            handleWikiMedia(self._values['title'], self._values['text'])
  • In the previous code, all MediaWiki “templates” are removed, except the “étyl” (etymology) template.
  • Better quality may be obtained by installing a local MediaWiki instance and extracting all definitions through calls to the local API.

Wrap-up

  • Several ways to extract word definitions:
    • Scripts: fast, but very approximate
    • Using Wikipedia API: does not scale
    • Using preprocessed JSON files: not up-to-date
    • From XML dumps: a bit tricky
    • From local API: complex and costly

Hands-on

  • Check on your laptop that the code above runs fine and that it does extract a list of definitions
  • There are still a number of mistakes: try to improve the code and fix some of them
  • How long would it take, on your laptop, to extract all definitions from the French Wiktionary?

N-grams

Unigrams

  • Raw counts depend on the size of the data
  • Normalized, they give the 1-gram (unigram) probability:

\[P(w) = \frac {N(w)} {N(*)}\]

  • Probability that a word occurs in the language
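
A minimal sketch of this estimation with a Counter over a toy tokenized corpus:

from collections import Counter

tokens = "the cat sat on the mat".split()     # toy corpus

counts = Counter(tokens)                      # N(w)
total = sum(counts.values())                  # N(*)
unigram = {w: c / total for w, c in counts.items()}

print(unigram["the"])                         # 2/6 ≈ 0.33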

Bigrams

  • Count the sequences \(N(a,b)\)
  • Divide by all sequences \(N(a,*)\)
  • The 2-gram gives the probability that b follows a:

\[P(b|a) = \frac {N(a,b)} {N(a,*)}\]

Note:

\[P(b|a) = \frac {P(a,b)}{P(a)} = \frac{\frac{N(a,b)}{N(*,*)}}{\frac{N(a)}{N(*)}}\]

N-grams

  • Generalisation to a sequence of length \(n\)

\[P(w_t|w_{t-n+1},\dots,w_{t-1}) = \frac{N(w_{t-n+1},\dots,w_{t-1},w_t)}{N(w_{t-n+1},\dots,w_{t-1},*)}\]
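
A minimal sketch of this maximum-likelihood estimation (here for n = 2) over a toy corpus:

from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()
n = 2

# counts of n-grams and of their (n-1)-gram histories
ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
histories = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 1))

def prob(history, word):
    """P(word | history) = N(history, word) / N(history, *)"""
    return ngrams[history + (word,)] / histories[history]

print(prob(("the",), "cat"))                  # 2/3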

We’ll use N-grams for:

  • diachronic analysis
  • extracting collocations

Training N-grams

  • Easy to train:
    • accumulate counts
    • can be done online
  • The most difficult part is to scrape & pre-process the texts, so:
    • Google N-grams: https://books.google.com/ngrams
    • Trained on 1,000G-tokens https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
    • Free (trained on 430M-words) https://www.ngrams.info/

Rare/unseen sequences

  • Sol 1: N-gram smoothing
    • Add a pseudo-count for every possible sequence (see the sketch after this list)

\[P(w_t|w_{t-n+1},\dots,w_{t-1}) = \frac{1+N(w_{t-n+1},\dots,w_{t-1},w_t)}{\sum_x \left( 1+ N(w_{t-n+1},\dots,w_{t-1},x) \right)}\]

  • Other smoothings: Good-Turing, Kneser-Ney…
  • Problem: all unseen sequences have the same probability
  • Smoothing may be used in conjunction with backoff:
    • linear interpolation: \[\hat P(w_t|w_{t-n+1},\dots,w_{t-1}) = \lambda P(w_t|w_{t-n+1},\dots,w_{t-1}) + (1-\lambda) P(w_t|w_{t-n+2},\dots,w_{t-1})\]
  • Other backoffs: Katz…
  • Sol 2: N-grams of sub-words
    • Character n-grams
      • Good for agglutinative languages…
      • Capture common prefixes, suffixes…
      • Very good at language detection
      • Handle proper names
      • Robust to typographic mistakes
      • But requires much more data than word n-grams!
      • Often combined with word n-grams
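
As announced above, a minimal sketch of add-one smoothing for bigrams (toy corpus, fixed vocabulary):

from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()
vocab = set(tokens)

bigrams = Counter(zip(tokens, tokens[1:]))    # N(a, b)
histories = Counter(tokens[:-1])              # N(a, *)

def smoothed_prob(a, b):
    """Add-one smoothing: every possible sequence (a, x) gets a pseudo-count of 1."""
    return (1 + bigrams[(a, b)]) / (histories[a] + len(vocab))

print(smoothed_prob("the", "cat"))            # seen bigram
print(smoothed_prob("the", "sat"))            # unseen bigram, still non-zero probability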

Limitations

  • The number of potential n-grams increases exponentially with n
  • Longer n-grams become very sparse:
    • bad statistics
    • cannot capture long dependencies
  • In practice: maximum 5-grams