Lexical Resources

Christophe Cerisara

2022/2023

Lexical Resources: introduction

Course plan (indicative!)

CM      Topic
08/12   intro: get/build Lex Res
15/12   Overview of embeddings
19/01   WordNet + FrameNet
26/01   VerbNet + PropBank
02/02   Transformers 1
09/02   Transformers 2

Course requirements:

Content of the course

  • How to create (transform) LR
    • Manually: annotation guides, quality…

    • Automatically
      • Scrape/curate texts, RDF extraction…

      • Text processing: N-grams, embeddings
  • How to use LR
    • From XML files (SPARQL…)

    • From Python & NLTK / word embeddings

Today’s concepts

  • Motivation, concepts about LR
  • Where to get LR from
  • How to process LR; example: Wiktionary

Concepts

What is a lexical resource?

  • Axel Herold: “collection of lexical items, together with linguistic information and/or classification of these items”
  • lexical items = words, multi-words, sub-word units
  • dictionary = usable by humans
  • LR = usable by machines
  • data: spelling, phonetics, category, relations (semantics…)

Usable by machines?

  • Ideally: standard formats (XML, TEI, RDF…)
  • In practice:
    • ad-hoc formats (CoNLL…)
  • Examples: CoNLL, TextGrid, TEI, LMF

Formats

CoNLL example

1    They     they    PRON    PRP    Case=Nom|Number=Plur               2    nsubj    2:nsubj|4:nsubj
2    buy      buy     VERB    VBP    Number=Plur|Person=3|Tense=Pres    0    root     0:root
3    and      and     CONJ    CC     _                                  4    cc       4:cc
4    sell     sell    VERB    VBP    Number=Plur|Person=3|Tense=Pres    2    conj     0:root|2:conj
5    books    book    NOUN    NNS    Number=Plur                        2    obj      2:obj|4:obj
6    .        .       PUNCT   .      _                                  2    punct    2:punct
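
Such a file is easy to consume programmatically. Below is a minimal Python sketch (not an official parser; the file name example.conllu is hypothetical) for reading a CoNLL-U file, where each token is one line of 10 tab-separated fields and sentences are separated by blank lines:

def read_conllu(path):
    """Yield sentences as lists of token dictionaries from a CoNLL-U file."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                      # blank line = end of sentence
                if sentence:
                    yield sentence
                    sentence = []
            elif not line.startswith("#"):    # skip metadata comments
                cols = line.split("\t")
                sentence.append({"form": cols[1], "lemma": cols[2], "upos": cols[3],
                                 "head": cols[6], "deprel": cols[7]})
    if sentence:
        yield sentence

for sent in read_conllu("example.conllu"):
    print([(tok["form"], tok["upos"]) for tok in sent])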

TextGrid example

    item [1]:
       class = "IntervalTier"
       name = "sentence"
       xmin = 0
       xmax = 2.3
       intervals: size = 1
       intervals [1]:
          xmin = 0
          xmax = 2.3
          text = "říkej ""ahoj"" dvakrát"
    item [2]:
       class = "IntervalTier"
       name = "phonemes"
       xmin = 0
       xmax = 2.3
       intervals: size = 3
       intervals [1]:
          xmin = 0
          xmax = 0.7
          text = "r̝iːkɛj"
       intervals [2]:
          xmin = 0.7
          xmax = 1.6
          text = "ʔaɦɔj"

Text Encoding Initiative

  • different views of a dictionary:
    • typographic view: “the two-dimensional printed page” (includes page layout)
    • editorial view: “one-dimensional sequence of tokens”
    • lexical view: structured lexicographical information, may include lexical data not present in the text
  • TEI guidelines: encode one view with primary XML structure; another view in XML attributes
  • lexical view = most important for Lexical Resources, so encoded in XML structure
<entry xml:id="a_1">
    <form>
        <orth>Bahnhof</orth>
    </form>
    <gramGrp>
        <pos value="N" />
        <gen value="masculine" />
    </gramGrp>
    <sense>
        <def>...</def>
        <cit>
            <quote>der Zug fährt in den Bahnhof ein</quote>
        </cit>
    </sense>
    <!-- ... -->
    <sense>
        <def>...</def>
    </sense>
</entry>
  • “entry” = basic unit of information
    • includes “form” and “sense”
  • “gramGrp” = grammatical information
    • may also be put as a child of “form”, to make it depend on a specific form
  • reference to the CLARIN concept registry:
<gramGrp>
  <pos value="N" 
    dcr:datcat="http://hdl.handle.net/11459/CCR_C-5524_d8864ad4-1bdf-ee56-594e-784312129ea7"
    dcr:valueDatcat="http://hdl.handle.net/11459/CCR_C-3347_7face0f5-7a72-7ec2-c988-7adba256cea9" />
</gramGrp>
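
For illustration, a minimal sketch (assuming the entry is available as a plain XML string, without TEI namespaces) of how such an entry can be queried with Python's standard ElementTree:

import xml.etree.ElementTree as ET

# hypothetical string holding the <entry> fragment shown above
tei = """<entry xml:id="a_1">
  <form><orth>Bahnhof</orth></form>
  <gramGrp><pos value="N"/><gen value="masculine"/></gramGrp>
</entry>"""

entry = ET.fromstring(tei)
orth = entry.findtext("./form/orth")               # headword
pos = entry.find("./gramGrp/pos").get("value")     # grammatical category
print(orth, pos)                                   # -> Bahnhof N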

Lexical Markup Format

  • focuses only on the lexical representation of data (== the lexical view of TEI)
  • goal = meta-model for all types of NLP lexicons
  • reference to data category registry mandatory
  • Packages: core, morphology, MWE, syntax…
<LexicalEntry>
    <feat att="partOfSpeech" val="noun" />
    <feat att="gender" val="masculine" />
    <Lemma>
        <feat att="writtenForm" val="Bahnhof" />
    </Lemma>
    <WordForm>
        <feat att="writtenForm" val="Bahnhof" />
        <feat att="grammaticalNumber" val="singular" />
    </WordForm>
    <WordForm>
        <feat att="writtenForm" val="Bahnhöfe" />
        <feat att="grammaticalNumber" val="plural" />
    </WordForm>
    <Sense id="s1">
    </Sense>
    <!-- ... -->
    <Sense id="sn">
    </Sense>    
</LexicalEntry>
  • LMF focuses only on NLP, so it is much tighter than TEI
  • and therefore easier to use for electronic lexicographic resources
  • XML serialization: RELISH, KYOTO, LIRICS…

Content of LR

Two types of LR

  • Manually built
    • WordNet, FrameNet, VerbNet, PropBank, DBPedia, BabelNet, Wiktionary…
    • Example: BDLEX
      • contains 440k inflected forms
      • source: Univ. Paul Sabatier
      • distributor: ELRA (2k€ non commercial)

Two types of LR

  • “Automatically” built
    • Google Ngrams
    • Word embeddings
    • no XML any more!
  • Info indirectly accessible:
    • diachronic usage of words
    • lists of “function” words
    • semantic relations between words
    • phonetizer
    • translation dictionaries…

Static vs. dynamic LR

MCQ time!

Availability of LR

Non-free Lex Res

Main distributors:

  • ELRA/ELDA (Europe)
    • European companies/univ. (CNRS)
    • 1400 resources
    • CoNLL,
    • very few free resources:
      • MLCC Multilingual and Parallel Corpora

Non-free Lex Res

  • LDC (Linguistic Data Consortium) (USA)
    • funded by ARPA and NSF
    • hosted by Univ. Pennsylvania
    • catalogue: 100s of resources
      • TIMIT, Switchboard, Gigaword…

“Freely” available

  • Distributed by universities/institutions
    • EuroParl: https://www.statmt.org/europarl/
    • CNRTL/ATILF:
      • TLFi, Morphalou, Frantext…
    • ORTOLANG:
      • 500 resources: Dicovalence, patois de St-Martin-la-Porte…
    • Isidore.science: 20 datasets

International Networks

  • https://live.european-language-grid.eu/
    • 800 links to resources (e.g. to ELRA…)
  • LINDAT
  • linghub.org
  • European Open Science Cloud

  • CLARIN
    • VLO (Virtual Language Observatory): 800k resources

International Networks

  • Linguistic Linked Open Data (LLOD) initiative
    • Principles:
      • Creative Commons licenses
      • Accessible via URI
      • RDF Standard
      • Linked Data

Uniform Resource Identifier

  • URI = unique identification
    • may be URN or URL
  • example of URN:
    • urn:isbn:0-486-27557-4
    • Specific edition of Shakespeare’s play Romeo and Juliet
  • example of URL: http://example.org/wiki/Main_Page

Resource Description Framework

  • family of W3C specifications
  • based on triples:
    • Subject: a resource
    • Predicate: relation
    • Object: another resource
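
As a minimal illustration (the http://example.org/lex/ namespace and the lexical facts are made up), the Python rdflib library can build and traverse such triples:

from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/lex/")      # hypothetical namespace
g = Graph()

# Subject - Predicate - Object
g.add((EX.Bahnhof, EX.partOfSpeech, Literal("noun")))
g.add((EX.Bahnhof, EX.gender, Literal("masculine")))
g.add((EX.Bahnhof, EX.translation, EX.gare))

# iterate over all triples with the predicate EX.translation
for s, p, o in g.triples((None, EX.translation, None)):
    print(s, "->", o)

print(g.serialize(format="turtle"))            # Turtle rendering of the graph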

LOD vision

linguistic-lod.org

Ex: Princeton WordNet

   <LexicalEntry id ='w44919'>
      <Lemma writtenForm='quantification' partOfSpeech='n'/>
      <Sense id='w44919_01003570-n' synset='eng-10-01003570-n'/>
      <Sense id='w44919_06165623-n' synset='eng-10-06165623-n'/>
   </LexicalEntry> [...]
   <Synset id='eng-10-01003570-n' baseConcept='3'>
      <Definition gloss="the act of discovering or expressing the quantity of something"> </Definition>
      <SynsetRelations>
         <SynsetRelation targets='eng-10-00996969-n' relType='hype'/>
         <SynsetRelation targets='eng-10-01003729-n' relType='hypo'/>
      </SynsetRelations>
   </Synset>

Collection of RDF triples

  • In theory = labeled directed multi-graph
  • In practice:
    • Relational database
    • or Triplestores
  • cf. course on Semantic Web
  • Tutorial on LOD

Development of LLOD

  • 2016: OntoLex-Lemon vocabulary (W3C)
  • WebAnnotation (W3C)
  • SPARQL to access RDF

  • Summer Datathon on LLOD
  • H2020 project “Prêt-à-LLOD”
  • ELEXIS project “European Lexicographic Infrastructure”

Resources

(Chiarcos, IWLTP’2020):

“Working with RDF normally requires a certain level of technical expertise, i.e., basic knowledge of SPARQL and at least one RDF format.”
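
For a sense of what that entry barrier looks like, here is a minimal sketch (hypothetical namespace and data) of a SPARQL query run through the Python rdflib library:

from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/lex/")      # hypothetical namespace
g = Graph()
g.add((EX.Bahnhof, EX.partOfSpeech, Literal("noun")))
g.add((EX.fahren, EX.partOfSpeech, Literal("verb")))

# basic SPARQL: select every (word, pos) pair linked by ex:partOfSpeech
query = """
PREFIX ex: <http://example.org/lex/>
SELECT ?word ?pos
WHERE { ?word ex:partOfSpeech ?pos . }
"""
for row in g.query(query):
    print(row.word, row.pos)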

Resources

https://livebook.manning.com/book/linked-data/chapter-2/1

Building LR from texts

Automatic LR creation

  • Typical lexical information:
    • forms + frequencies => usage
    • diachronic usage => lexical drift
    • co-occurrence => lexical semantics (see the sketch after this list)
    • embeddings => synonyms, antonyms…
    • relations => syntagmatic…
    • combination => compositional semantics
    • multi-word expressions
    • decompose them => morphology
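
For instance, the co-occurrence counts behind lexical semantics can be accumulated in a few lines; a minimal sketch over a toy, already-tokenized corpus:

from collections import Counter

# toy pre-tokenized corpus; in practice the sentences come from the scraped texts
sentences = [
    ["the", "train", "enters", "the", "station"],
    ["the", "station", "platform", "is", "crowded"],
]

window = 2                                    # size of the left context window
cooc = Counter()
for sent in sentences:
    for i, w in enumerate(sent):
        for c in sent[max(0, i - window):i]:  # left context only, so each pair is counted once
            cooc[(c, w)] += 1

print(cooc.most_common(5))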

Automatic LR creation

  • Advantages
    • Many (English) texts available (e.g. CommonCrawl)
    • Reduced costs
    • May be specialized to target domains
  • Drawbacks
    • Requires NLP expertise
    • Not as precise as a linguist

Quality?

  • Size matters => must be computationally efficient

Text analysis

  • Challenges:
    • Choose one or multiple sources of texts
    • Scrape the texts (see other course)
    • Preprocess texts (see other course)
    • Extract lexical information

Sources of texts

  • Generic information or specialized domain (healthcare…)?
  • Large variability in language:
    • Casual: forums, conversations…
    • Micro-blog
    • Formal: books…
    • Journalistic: news
    • Educative: moocs, tutorials…

Sources of texts

Do I have the right to scrape the text?

Just because it is public does not mean you can copy it!

  • Check whether there is a license, like Creative Commons; otherwise assume “all rights reserved”
  • Beyond legal aspects, there are more and more concerns about privacy & the right to be forgotten
  • Anonymization does not guarantee privacy!

Sources of texts

Do I have the right to scrape the text?

  • Twitter provides an API to download some data, but forbids you from keeping it on your hard drive.
  • You cannot redistribute texts without an explicit CC-BY licence

(Wait… hasn’t Google been scraping the whole web for many years?)

Sources of texts

  • Unsafe sources of texts:
    • Social media
    • Most web pages
  • Safe sources of texts:
    • Wikipedia & derivatives (CC BY-SA)
    • Scientific papers: arXiv, pubMed, HAL…
    • Datasets released by their owners: AskUbuntu archives, reddit archives, (Common Crawl), (WebTimeMachine)…
    • Gutenberg, Gallica…

Sources of texts

There’s a lot more… but it is inaccessible!

Scraping texts

There are several ways to download corpora:

  • APIs: not standard, may change, heavy for servers
  • dump archives (wikipedia, reddit…)
  • peer-to-peer (academic torrents)
  • OAI-PMH

See course on basic NLP techniques (Yannick Parmentier)

Pre-processing texts

  • Texts come with metadata, in XML, JSON…
  • Use adequate parsers
  • Then:
    • Filter out garbage (wrong language, errors…)
    • Segmentation + tokenization
    • Normalize (dates…)
    • Compute features (POStags…)

See Claire Gardent’s course.
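
As a minimal illustration of the segmentation + tokenization step with NLTK (assuming NLTK is installed and the punkt sentence model is available; recent NLTK versions may need the punkt_tab model instead):

import nltk

nltk.download("punkt", quiet=True)    # one-time download of the sentence segmentation model

raw = "Nancy is the capital of Meurthe-et-Moselle. It hosts the École de Nancy museum."

sentences = nltk.sent_tokenize(raw)                    # segmentation into sentences
tokens = [nltk.word_tokenize(s) for s in sentences]    # tokenization into words
print(tokens)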

Example: wiktionary

First look at the data

  • how to quickly find the line number of a page:
bzcat fich.xml.bz2 | grep -n '<title>Nancy</title>'
  • look at the page:
bzcat fich.xml.bz2 | tail -n +12872178 | less
    <title>Nancy</title>
    <ns>0</ns>
    <id>271346</id>
    <revision>
      <id>29517974</id>
      <parentid>28848867</parentid>
      <timestamp>2021-06-10T00:19:11Z</timestamp>
      <contributor>
        <username>Lingua Libre Bot</username>
        <id>229398</id>
      </contributor>
      <comment>Ajout d'un fichier audio de prononciation depuis Lingua Libre</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="7282" xml:space="preserve">{{voir|nancy}}
== {{langue|fr}} ==

=== {{S|étymologie}} ===
: {{date|lang=fr|1073}} La première trace écrite de Nancy date du 29 avril 1073 (mention dans la charte de Pibon, évêque de Toul : « Olry, voué de Nancy » (« ''Odelrici advocati de Nanceio'' »). Le nom serait cependant d’origine celtique, car on le rapproche du gaulois ''{{lien|nantu-|gaulois}}''/''{{lien|nanto-|gaulois}}'', qui signifie « [[val]], [[vallée]] », ou de {{recons|lang-mot-vedette=fr|nantus|gaulois}} (« ruisseau »).

=== {{S|nom de famille|fr}} ===
'''Nancy''' {{pron|nɑ̃.si|fr}}
# Nom de famille.

=== {{S|nom propre|fr}} ===
{{fr-inv|nɑ̃.si|inv_titre=Nom propre}}
'''Nancy''' {{pron|nɑ̃.si|fr}}
# {{localités|fr|du département de la Meurthe-et-Moselle}} [[commune|Commune]], [[ville]] et [[chef-lieu de département]] [[français]], situé dans le département de la [[Meurthe-et-Moselle]].
#* '''''Nancy''' est une ville d’ordre et de lumières où dès le XVII{{e}} et le XVIII{{e}} siècle, des ducs intelligents furent, sans le savoir, les précurseurs heureux de nos urbanistes modernes.'' {{source|{{Citation/Ludovic Naudeau/La France se regarde/1931}}}}
#* ''La verrerie de '''Nancy''' est de fondation récente, puisqu’elle date de 1875 seulement.'' {{source|Gustave Fraipont; ''Les Vosges'', 1923}}

Using APIs

  • after pip install wikipedia
import wikipedia
print(wikipedia.summary("Nancy, France"))
Nancy is the capital of the northeastern French department of Meurthe-et-Moselle, the former capital of the Duchy of Lorraine, and then the French province of the same name. The metropolitan area of Nancy had a population of 511,257 inhabitants at the 2018 census, making it the 16th largest urban area in France and the Lorraine's largest. The population of the city of Nancy proper is 104,885.
The motto of the city is Non inultus premor, Latin for '"I am not injured unavenged"'—a reference to the thistle, which is a symbol of Lorraine.
Place Stanislas, a large square built between March 1752 and November 1755 by Stanislaus I of Poland to link the medieval old town of Nancy and the new town built under Charles III in the 17th century, is a UNESCO World Heritage Site, the first place in France and in the top four in the world. The city also has many buildings listed as historical monuments and is one of the European centers of Art Nouveau thanks to the École de Nancy. Nancy is also one of the main university cities and, with the Centre Hospitalier Régional Universitaire de Brabois, the conurbation is home to one of the main health centers in Europe, renowned for its innovations in surgical robotics.
  • Python API lib:
    • easy way to access a small amount of data
    • uses Wikipedia API (online only)
    • not all details, not designed for batch processing
  • XML processing:
    • works offline, all data accessible
    • (very) hard to parse XML

Working from JSon

grep Nancy kaikki.org-dictionary-French.json

Working from JSon

{"pos": "name", "heads": [{"template_name": "fr-proper noun"}], "word": "Nancy", "lang": "French", "lang_code": "fr", "sounds": [{"ipa": "/n\u0251\u0303.si/"}], "categories": ["Cities in France"], "senses": [{"categories": ["French female given names", "French given names"], "tags": ["feminine"], "glosses": ["A female given name from English borrowed from English."], "id": "Nancy-name"}]}
{"pos": "name", "heads": [{"template_name": "fr-proper noun"}], "categories": ["Cities in France"], "word": "Nancy", "lang": "French", "lang_code": "fr", "sounds": [{"ipa": "/n\u0251\u0303.si/"}], "senses": [{"glosses": ["Nancy (the city)."], "derived": [{"word": "nanc\u00e9ien"}, {"word": "Nanc\u00e9ien"}], "id": "Nancy-name"}]}
{"pos": "noun", "heads": [{"1": "m", "f": "Nanc\u00e9ienne", "template_name": "fr-noun"}], "forms": [{"form": "Nanc\u00e9iens", "tags": ["plural"]}, {"form": "Nanc\u00e9ienne", "tags": ["feminine"]}], "word": "Nanc\u00e9ien", "lang": "French", "lang_code": "fr", "senses": [{"tags": ["masculine"], "glosses": ["an inhabitant of the city of Nancy"], "categories": ["Demonyms"], "id": "Nanc\u00e9ien-noun"}]}
  • Better way to extract word definitions:
    • JSON is easier to parse than the MediaWiki XML dump
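
A minimal sketch of reading this JSON-lines file (one complete JSON object per line) and printing the glosses of a given word:

import json

with open("kaikki.org-dictionary-French.json", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)                       # one entry per line
        if entry.get("word") == "Nancy":
            for sense in entry.get("senses", []):
                for gloss in sense.get("glosses", []):
                    print(entry["pos"], ":", gloss)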

Working from the dump

  • Handle the XML structure
  • Handle the MediaWiki wikitext markup
    <title>Nancy</title>
    <ns>0</ns>
    <id>271346</id>
    <revision>
      <id>29517974</id>
      <parentid>28848867</parentid>
      <timestamp>2021-06-10T00:19:11Z</timestamp>
      <contributor>
        <username>Lingua Libre Bot</username>
        <id>229398</id>
      </contributor>
      <comment>Ajout d'un fichier audio de prononciation depuis Lingua Libre</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="7282" xml:space="preserve">{{voir|nancy}}
== {{langue|fr}} ==

=== {{S|étymologie}} ===
: {{date|lang=fr|1073}} La première trace écrite de Nancy date du 29 avril 1073 (mention dans la charte de Pibon, évêque de Toul : « Olry, voué de Nancy » (« ''Odelrici advocati de Nanceio'' »). Le nom serait cependant d’origine celtique, car on le rapproche du gaulois ''{{lien|nantu-|gaulois}}''/''{{lien|nanto-|gaulois}}'', qui signifie « [[val]], [[vallée]] », ou de {{recons|lang-mot-vedette=fr|nantus|gaulois}} (« ruisseau »).

=== {{S|nom de famille|fr}} ===
'''Nancy''' {{pron|nɑ̃.si|fr}}
# Nom de famille.

=== {{S|nom propre|fr}} ===
{{fr-inv|nɑ̃.si|inv_titre=Nom propre}}
'''Nancy''' {{pron|nɑ̃.si|fr}}
# {{localités|fr|du département de la Meurthe-et-Moselle}} [[commune|Commune]], [[ville]] et [[chef-lieu de département]] [[français]], situé dans le département de la [[Meurthe-et-Moselle]].
#* '''''Nancy''' est une ville d’ordre et de lumières où dès le XVII{{e}} et le XVIII{{e}} siècle, des ducs intelligents furent, sans le savoir, les précurseurs heureux de nos urbanistes modernes.'' {{source|{{Citation/Ludovic Naudeau/La France se regarde/1931}}}}
#* ''La verrerie de '''Nancy''' est de fondation récente, puisqu’elle date de 1875 seulement.'' {{source|Gustave Fraipont; ''Les Vosges'', 1923}}

Parsing XML structure

import bz2
import xml.sax

class WikiXmlHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._buffer = None
        self._values = {}
        self._current_tag = None

    def characters(self, content):
        """Characters between opening and closing tags"""
        if self._current_tag:
            self._buffer.append(content)

    def startElement(self, name, attrs):
        """Opening tag of element"""
        if name in ('title', 'text'):
            self._current_tag = name
            self._buffer = []

    def endElement(self, name):
        """Closing tag of element"""
        if name == self._current_tag:
            self._values[name] = ' '.join(self._buffer)

        if name == 'page':
            print(self._values['title'], self._values['text'])

handler = WikiXmlHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

with bz2.open("frwiktionary-20210720-pages-meta-current.xml.bz2", "rt") as f:
    for line in f:
        parser.feed(line)

Parsing MediaWiki

import re
import xml.sax           # needed again for the handler below if run separately from the previous snippet
import mwparserfromhell

# rewrite etymology templates {{étyl|lang|…|word…}} as: (lang) "word" before stripping the rest
etyl = re.compile(r'{{étyl\|([^\|]*)\|[^\|]*\|([^\|}]*)[^}]*}}')

def handleWikiMedia(title, text):
    title = mwparserfromhell.parse(title)
    s = title.strip_code().strip()
    if ':' not in s:                  # skip namespaced pages (Catégorie:, Modèle:, …)
        print("DEF " + s)
        wiki = mwparserfromhell.parse(text)
        for l in str(wiki).split('\n'):
            l = l.strip()
            if l.startswith(':'):     # keep only indented wikitext lines (e.g. the etymology above)
                l = etyl.sub("(\\g<1>) \"\\g<2>\"", l)
                s = l[1:]
                ss = mwparserfromhell.parse(s)
                s = ss.strip_code(normalize=True, collapse=True, keep_template_params=False).strip()
                if len(s) > 0: print("   " + s)
        print()

class WikiXmlHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._buffer = None
        self._values = {}
        self._current_tag = None

    def characters(self, content):
        """Characters between opening and closing tags"""
        if self._current_tag:
            self._buffer.append(content)

    def startElement(self, name, attrs):
        """Opening tag of element"""
        if name in ('title', 'text'):
            self._current_tag = name
            self._buffer = []

    def endElement(self, name):
        """Closing tag of element"""
        if name == self._current_tag:
            self._values[name] = ' '.join(self._buffer)

        if name == 'page':
            # print(self._values['title'], self._values['text'])
            handleWikiMedia(self._values['title'], self._values['text'])
  • In the previous code, all MediaWiki “templates” are removed, except the “étyl” (etymology) template.
  • Better quality may be obtained by installing a local MediaWiki instance and extracting all definitions through calls to the local API.

Wrap-up

  • Several ways to extract word definitions:
    • Scripts: fast, but very approximate
    • Using Wikipedia API: does not scale
    • Using preprocessed JSON files: not up-to-date
    • From XML dumps: a bit tricky
    • From local API: complex and costly

Hands-on

  • Check on your laptop that the code above runs fine and that it does extract a list of definitions
  • There are still a number of mistakes: try to improve the code and fix some of them
  • How long would it take, on your laptop, to extract all definitions from the French Wiktionary?

N-grams

Unigrams

  • Raw counts depend on the size of the data
  • Normalized, they give the 1-gram (unigram) probability:

\[P(w) = \frac {N(w)} {N(*)}\]

  • Probability that a word occurs in the language
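
A minimal sketch of this estimation with a Counter over a toy tokenized corpus:

from collections import Counter

tokens = "the cat sat on the mat".split()     # toy corpus

counts = Counter(tokens)                      # N(w)
total = sum(counts.values())                  # N(*)
unigram = {w: c / total for w, c in counts.items()}

print(unigram["the"])                         # 2/6 ≈ 0.33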

Bigrams

  • Count the sequences \(N(a,b)\)
  • Divide by all sequences \(N(a,*)\)
  • The 2-gram gives the probability that b follows a:

\[P(b|a) = \frac {N(a,b)} {N(a,*)}\]

Note:

\[P(b|a) = \frac {P(a,b)}{P(a)} = \frac{\frac{N(a,b)}{N(*,*)}}{\frac{N(a)}{N(*)}}\]

N-grams

  • Generalisation to a sequence of length \(n\)

\[P(w_t|w_{t-n+1},\dots,w_{t-1}) = \frac{N(w_{t-n+1},\dots,w_{t-1},w_t)}{N(w_{t-n+1},\dots,w_{t-1},*)}\]
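
A minimal sketch of this maximum-likelihood estimation (here for n = 2) over a toy corpus:

from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()
n = 2

# counts of n-grams and of their (n-1)-gram histories
ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
histories = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 1))

def prob(history, word):
    """P(word | history) = N(history, word) / N(history, *)"""
    return ngrams[history + (word,)] / histories[history]

print(prob(("the",), "cat"))                  # 2/3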

We’ll use N-grams for:

  • diachronic analysis
  • extracting collocations

Training N-grams

  • Easy to train:
    • accumulate counts
    • can be done online
  • The most difficult part is to scrape & pre-process the texts, so:
    • Google N-grams: https://books.google.com/ngrams
    • Trained on 1,000G-tokens https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
    • Free (trained on 430M-words) https://www.ngrams.info/

Rare/unseen sequences

  • Sol 1: N-gram smoothing
    • Add a pseudo-count for every possible sequence (see the sketch after this list)

\[P(w_t|w_{t-n+1},\dots,w_{t-1}) = \frac{1+N(w_{t-n+1},\dots,w_{t-1},w_t)}{\sum_x \left( 1+ N(w_{t-n+1},\dots,w_{t-1},x) \right)}\]

  • Other smoothings: Good-Turing, Kneser-Ney…
  • Problem: all unseen sequences have the same probability
  • Smoothing may be used in conjunction with backoff:
    • linear interpolation: \[\hat P(w_t|w_{t-n+1},\dots,w_{t-1}) = \lambda P(w_t|w_{t-n+1},\dots,w_{t-1}) + (1-\lambda) P(w_t|w_{t-n+2},\dots,w_{t-1})\]
  • Other backoffs: Katz…
  • Sol 2: N-grams of sub-words
    • Character n-grams
      • Good for agglutinative languages…
      • Capture common prefixes, suffixes…
      • Very good at language detection
      • Handle proper names
      • Robust to typographic mistakes
      • But requires much more data than word n-grams!
      • Often combined with word n-grams
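
As announced above, a minimal sketch of add-one smoothing for bigrams (toy corpus, fixed vocabulary):

from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()
vocab = set(tokens)

bigrams = Counter(zip(tokens, tokens[1:]))    # N(a, b)
histories = Counter(tokens[:-1])              # N(a, *)

def smoothed_prob(a, b):
    """Add-one smoothing: every possible sequence (a, x) gets a pseudo-count of 1."""
    return (1 + bigrams[(a, b)]) / (histories[a] + len(vocab))

print(smoothed_prob("the", "cat"))            # seen bigram
print(smoothed_prob("the", "sat"))            # unseen bigram, still non-zero probability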

Limitations

  • The number of potential n-grams increases exponentially with n
  • Longer n-grams become very sparse:
    • bad statistics
    • cannot capture long dependencies
  • In practice: maximum 5-grams