Lexical Resources

Christophe Cerisara

2023/2024

Lexical Resources: introduction

Course plan (indicative!)

Date   Topic
07/12  intro: get/build Lex Res
08/01  Overview of embeddings
15/01  WordNet + FrameNet
22/01  VerbNet + PropBank
29/01  Transformers 1
05/02  Transformers 2
16/02  Exam

Course requirements:

Content of the course

  • How to create (transform) LR
    • Manually: annotation guides, quality…

    • Automatically
      • Scrape/curate texts, RDF extraction…

      • Text processing: N-grams, embeddings, LLMs
  • How to use LR
    • From XML files (SPARQL…)

    • From Python: NLTK, word embeddings, LLMs

Today’s concepts

  • Motivation, concepts about LR
  • Where to get LR from
  • How to process LR; example: Wiktionary

Concepts

What is a lexical resource?

  • Axel Herold: “collection of lexical items, together with linguistic information and/or classification of these items”
  • lexical items = words, multi-words, sub-word units
  • dictionary = usable by humans
  • LR = usable by machines
  • information: spelling, phonetics, category, relations (semantics…)

Usable by machines?

  • Ideally: standard formats (XML, TEI, RDF…)
  • In practice:
    • ad-hoc formats (CoNLL, JSON, text…)
  • Examples: CoNLL, TextGrid, TEI, LMF

Formats

CoNLL example

1    They     they    PRON    PRP    Case=Nom|Number=Plur               2    nsubj    2:nsubj|4:nsubj
2    buy      buy     VERB    VBP    Number=Plur|Person=3|Tense=Pres    0    root     0:root
3    and      and     CONJ    CC     _                                  4    cc       4:cc
4    sell     sell    VERB    VBP    Number=Plur|Person=3|Tense=Pres    2    conj     0:root|2:conj
5    books    book    NOUN    NNS    Number=Plur                        2    obj      2:obj|4:obj
6    .        .       PUNCT   .      _                                  2    punct    2:punct
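Each CoNLL-U line is one token with tab-separated columns (ID, form, lemma, UPOS, XPOS, morphological features, head, dependency relation, enhanced dependencies). A minimal reading sketch (the file name "corpus.conllu" is an assumption, not from the course):

# Minimal CoNLL-U reader: one dict per token, one list per sentence.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps"]

def read_conllu(path):
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                    # blank line = sentence boundary
                if sentence:
                    yield sentence
                sentence = []
            elif not line.startswith("#"):  # skip comment lines
                sentence.append(dict(zip(FIELDS, line.split("\t"))))
    if sentence:
        yield sentence

for sent in read_conllu("corpus.conllu"):
    print([tok["form"] for tok in sent])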

TextGrid example

    item [1]:
       class = "IntervalTier"
       name = "sentence"
       xmin = 0
       xmax = 2.3
       intervals: size = 1
       intervals [1]:
          xmin = 0
          xmax = 2.3
          text = "říkej ""ahoj"" dvakrát"
    item [2]:
       class = "IntervalTier"
       name = "phonemes"
       xmin = 0
       xmax = 2.3
       intervals: size = 3
       intervals [1]:
          xmin = 0
          xmax = 0.7
          text = "r̝iːkɛj"
       intervals [2]:
          xmin = 0.7
          xmax = 1.6
          text = "ʔaɦɔj"

Text Encoding Initiative

  • different views of a dictionary:
    • typographic view: “the two-dimensional printed page” (includes page layout)
    • editorial view: “one-dimensional sequence of tokens”
    • lexical view: structured lexicographical information, may include lexical data not present in the text
  • TEI guidelines: encode one view in the primary XML structure; other views in XML attributes
  • the lexical view is the most important one for Lexical Resources, so it is encoded in the XML structure
<entry xml:id="a_1">
    <form>
        <orth>Bahnhof</orth>
    </form>
    <gramGrp>
        <pos value="N" />
        <gen value="masculine" />
    </gramGrp>
    <sense>
        <def>...</def>
        <cit>
            <quote>der Zug fährt in den Bahnhof ein</quote>
        </cit>
    </sense>
    <!-- ... -->
    <sense>
        <def>...</def>
    </sense>
</entry>
  • “entry” = basic unit of information
    • includes “form” and “sense”
  • “gramGrp” = grammatical information
    • may also be put as a child of “form”, to make it depend on a specific form
  • reference to the CLARIN concept registry:
<gramGrp>
  <pos value="N" 
    dcr:datcat="http://hdl.handle.net/11459/CCR_C-5524_d8864ad4-1bdf-ee56-594e-784312129ea7"
    dcr:valueDatcat="http://hdl.handle.net/11459/CCR_C-3347_7face0f5-7a72-7ec2-c988-7adba256cea9" />
</gramGrp>
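Since the lexical view lives in the XML structure, any XML parser can read it. A minimal sketch with the Python standard library, assuming the entry above is saved as "entry.xml" (the dcr: attributes would additionally need namespace handling):

import xml.etree.ElementTree as ET

entry = ET.parse("entry.xml").getroot()

orth = entry.findtext("form/orth")              # "Bahnhof"
pos = entry.find("gramGrp/pos").get("value")    # "N"
gen = entry.find("gramGrp/gen").get("value")    # "masculine"
quotes = [q.text for q in entry.iter("quote")]  # example citations

print(orth, pos, gen, quotes)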

Lexical Markup Format

  • focuses only on the lexical representation of data == the lexical view of TEI
  • goal = a meta-model for all types of NLP lexicons
  • reference to a data category registry is mandatory
  • Packages: core, morphology, MWE, syntax…
<LexicalEntry>
    <feat att="partOfSpeech" val="noun" />
    <feat att="gender" val="masculine" />
    <Lemma>
        <feat att="writtenForm" val="Bahnhof" />
    </Lemma>
    <WordForm>
        <feat att="writtenForm" val="Bahnhof" />
        <feat att="grammaticalNumber" val="singular" />
    </WordForm>
    <WordForm>
        <feat att="writtenForm" val="Bahnhöfe" />
        <feat att="grammaticalNumber" val="plural" />
    </WordForm>
    <Sense id="s1">
    </Sense>
    <!-- ... -->
    <Sense id="sn">
    </Sense>    
</LexicalEntry>
  • LMF focuses only on NLP, so it is much more constrained than TEI
  • hence easier to use for electronic lexicographic resources
  • XML serialization: RELISH, KYOTO, LIRICS…

Content of LR

Two types of LR

  • Manually built
    • WordNet, FrameNet, VerbNet, PropBank, DBpedia, BabelNet, Wiktionary…
    • Example: BDLEX
      • contains 440k inflected forms
      • source: Univ. Paul Sabatier
      • distributor: ELRA (€2k, non-commercial)

Two types of LR

  • “Automatically” built
    • Google Ngrams
    • Word embeddings, LLMs
    • no XML anymore!
  • Info contained in LLMs:
    • lexical, morphosyntax, syntax
    • semantics
    • dialogic, multilingual, multimodal
    • common sense, (historical) facts, emotions…

Usage of LLM

  • As a dictionary (for you)
    • just ask the LLM
  • As a Lexical Resource
    • through embeddings:
      • input -> LLM -> representation
      • repr -> code/ML -> task
    • through text generation:
      • input -> LLM -> text
      • text -> parsing -> code/ML
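For the embedding route, a minimal sketch with the Hugging Face transformers library (the model name and the mean-pooling choice are just illustrative assumptions):

import torch
from transformers import AutoTokenizer, AutoModel

# input -> LLM -> representation; the vector can then feed any code/ML task.
name = "bert-base-multilingual-cased"   # example model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tok("le train entre en gare", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

repr_vec = out.last_hidden_state.mean(dim=1)   # mean-pool tokens: (1, hidden_size)
print(repr_vec.shape)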

Availability of manual LR

Non-free Lex Res

Main distributors:

  • ELRA/ELDA (Europe)
    • European companies/univ. (CNRS)
    • 1400 resources
    • CoNLL,
    • very few free resources:
      • MLCC Multilingual and Parallel Corpora

Non-free Lex Res

  • LDC (Linguistic Data Consortium) (USA)
    • funded by ARPA and NSF
    • hosted by Univ. Pennsylvania
    • catalogue: 100s of resources
      • TIMIT, Switchboard, Gigaword…

“Freely” available

  • Distributed by universities/organizations
    • EuroParl: https://www.statmt.org/europarl/
    • CNRTL/ATILF:
      • TLFi, Morphalou, Frantext…
    • ORTOLANG:
      • 500 resources: Dicovalence, patois de St-Martin-la-Porte…
    • Isidore.science: 20 datasets

International Networks

  • https://live.european-language-grid.eu/
    • 2000 links to resources (e.g. to ELRA…)
  • LINDAT
  • linghub.org
  • European Open Science Cloud

  • CLARIN
    • VLO (Virtual Language Observatory): 800k resources

International Networks

  • Linguistic Linked Open Data (LLOD) initiative
    • Principles:
      • Creative Commons licenses
      • Accessible via URI
      • RDF Standard
      • Linked Data

Uniform Resource Identifier

  • URI = unique identifier
    • may be URN or URL
  • example of URN:
    • urn:isbn:0-486-27557-4
    • Specific edition of Shakespeare’s play Romeo and Juliet
  • example of URL: http://example.org/wiki/Main_Page

Resource Description Framework

  • family of W3C specifications
  • based on triples:
    • Subject: a resource
    • Predicate: relation
    • Object: another resource

(cf. course of Mathieu D’Aquin)
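A minimal sketch with the Python rdflib package (the example namespace and triples are invented for illustration):

from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
# (Subject, Predicate, Object) triples
g.add((EX.Bahnhof, EX.partOfSpeech, Literal("noun")))
g.add((EX.Bahnhof, EX.translation, Literal("train station")))

for s, p, o in g.triples((EX.Bahnhof, None, None)):
    print(s, p, o)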

LOD vision

linguistic-lod.org

Ex: Princeton WordNet

   <LexicalEntry id ='w44919'>
      <Lemma writtenForm='quantification' partOfSpeech='n'/>
      <Sense id='w44919_01003570-n' synset='eng-10-01003570-n'/>
      <Sense id='w44919_06165623-n' synset='eng-10-06165623-n'/>
   </LexicalEntry> [...]
   <Synset id='eng-10-01003570-n' baseConcept='3'>
      <Definition gloss="the act of discovering or expressing the quantity of something"> </Definition>
      <SynsetRelations>
         <SynsetRelation targets='eng-10-00996969-n' relType='hype'/>
         <SynsetRelation targets='eng-10-01003729-n' relType='hypo'/>
      </SynsetRelations>
   </Synset>

Collection of RDF triples

  • In theory = labeled directed multi-graph
  • In practice:
    • Relational database
    • or Triplestores
  • cf. course on Semantic Web
  • Tutorial on LOD

Development of LLOD

  • 2016: OntoLex-Lemon vocabulary (W3C)
  • WebAnnotation (W3C)
  • SPARQL to access RDF

  • Summer Datathon on LLOD
  • H2020 project “Prêt-à-LLOD”
  • ELEXIS project “European Lexicographic Infrastructure”
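A sketch of a remote SPARQL query with the SPARQLWrapper package (the endpoint and query are illustrative; public endpoints may throttle requests):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?label WHERE {
        <http://dbpedia.org/resource/Nancy,_France> rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for row in results["results"]["bindings"]:
    print(row["label"]["value"])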

Resources

(Chiarcos, IWLTP’2020):

“Working with RDF normally requires a certain level of technical expertise, i.e., basic knowledge of SPARQL and at least one RDF format.”

Resources

https://livebook.manning.com/book/linked-data/chapter-2/1

Availability of LLM

  • LLMs can be more or less open / closed
  • Totally closed LLMs:
    • ChatGPT, Bard, Claude…
  • Open:
    • distribute weights, training data + scripts + docs/logs
    • permissive license (commercial? use LLM outputs?…)
    • respect legislation? (GDPR, EU AI Act…)

  • The LLM most compliant with the EU AI Act: BLOOM!
    • RAIL ‘responsible’ license: open
    • but it cannot be used for healthcare diagnostics!
  • Llama license: no commercial usage
  • Llama2 license: no training on outputs

400k LLMs available on the Hugging Face Hub!

Building LR from texts

  • Relying on manual LR is rarely a good option:
    • too costly & rare
    • Exception: collaborative LRs like Wiktionary!
  • Ex: you want a list of words in some domain
    • WordNet does not cover every domain
    • WordNet is not up-to-date
  • Better alternative: create target LR from raw data
    • can be specific for any domain/language/…
    • small cost
  • Ex: you want a list of words in some domain
    • grab in-domain texts; extract lexicon
  • Note: LLMs are LRs extracted from raw data
  • … But WordNet has much richer relations (antonyms, hypernyms…) than a plain word list!

  • cf. the old debate: ML (coverage) vs. expert/symbolic approaches (precision, richness)
  • This is no longer true with LLMs:
    • LLMs know about most relations: antonyms, hypernyms…
    • and LLMs can learn more complex relations
  • So the best approach today:
    • extract an in-domain LR and enrich it with an LLM

Text sources

  • Generic
    • Many (English) texts available: CommonCrawl, ThePile, RedPajama (30T tokens)…
    • Rare languages? (7000 languages known)
      • Bible: 800-2000 languages
      • Lexicons: 1000s of languages
      • Internet: 100s of languages
  • Specific
    • Ex: on GitHub, 13T tokens of code + language
    • Some data must be collected yourself: PDF scans, on-site speech recordings…

Required quantity

  • Depends on the objective/task
  • Training a new LLM:
    • cf. Chinchilla scaling laws
  • Adapting an LLM to a new language:
    • 1M tokens
  • Adapting an LLM to a domain:
    • 1000s of tokens
    • few-shot learning
    • zero-shot learning

Text analysis in this course

  • Challenges:
    • Choose one or multiple sources of texts
    • Scrape the texts (see other course)
    • Preprocess texts (see other course)
    • Extract lexical information

Sources of texts

  • Generic information or specialized domain (healthcare…)?
  • Large variability in language:
    • Casual: forums, conversations…
    • Micro-blog
    • Formal: books…
    • Journalistic: news
    • Educative: moocs, tutorials…

Sources of texts

Do I have the right to scrape the text?

Just because a text is public does not mean you can copy it!

  • Check whether there is a license, like Creative Commons, otherwise: “all rights reserved”
  • Beyond legal aspects, more and more concerns about privacy & right to be forgotten
  • Anonymization does not guarantee privacy!

Sources of texts

Do I have the right to scrape the text?

  • Twitter provided an API to download some data, but forbade keeping the data on your hard drive.
  • You cannot redistribute texts without an explicit license such as CC-BY

(Wait… hasn’t Google been scraping the whole web for years?)

Sources of texts

  • Unsafe sources of texts:
    • Social media
    • Most web pages
  • Safe sources of texts:
    • Wikipedia & derivatives (CC-BY)
    • Scientific papers: arXiv, pubMed, HAL…
    • Gutenberg, Gallica…
    • Datasets released by their owners? AskUbuntu archives, Reddit archives, (Common Crawl), (WebTimeMachine)…

Open text sources

  • The Pile
    • 825GB of text
    • CommonCrawl, PubMed, ArXiv, OpenWebText2, Bibliotik, Subtitles, Gutenberg…
  • C4 dataset
    • 380GB in JSON
    • CommonCrawl curated

Sources of texts

Scraping texts

There are several ways to download corpora:

  • APIs: not standard, may change, heavy for servers
  • dump archives (Wikipedia, Reddit…)
  • peer-to-peer (academic torrents)
  • OAI-PMH

See course on basic NLP techniques (Yannick Parmentier)

Pre-processing texts

  • Texts come with metadata, in XML, JSON…
  • Use appropriate parsers
  • Then:
    • Filter out garbage (other language, errors…)
    • Segmentation + tokenization
    • Normalize (dates…)
    • Compute features (POStags…)

See Claire Gardent’s course.

Example: Wiktionary

First look at the data

  • look at the beginning of the file
    • locate the XML structure
    • locate the MediaWiki format
  • Easiest way to do that: bash!
    • Linux OK, Mac OK, Windows: WSL
    bzcat frwik.xml.bz2 | less
  • find the position of the article about Nancy:
bzcat frwik.xml.bz2 | grep -n '<title>Nancy</title>'
  • look at the page:
bzcat frwik.xml.bz2 | tail -n +12872178 | less
    <title>Nancy</title>
    <ns>0</ns>
    <id>271346</id>
    <revision>
      <id>29517974</id>
      <parentid>28848867</parentid>
      <timestamp>2021-06-10T00:19:11Z</timestamp>
      <contributor>
        <username>Lingua Libre Bot</username>
        <id>229398</id>
      </contributor>
      <comment>Ajout d'un fichier audio de prononciation depuis Lingua Libre</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="7282" xml:space="preserve">{{voir|nancy}}
== {{langue|fr}} ==

=== {{S|étymologie}} ===
: {{date|lang=fr|1073}} La première trace écrite de Nancy date du 29 avril 1073 (mention dans la charte de Pibon, évêque de Toul : « Olry, voué de Nancy » (« ''Odelrici advocati de Nanceio'' »). Le nom serait cependant d’origine celtique, car on le rapproche du gaulois ''{{lien|nantu-|gaulois}}''/''{{lien|nanto-|gaulois}}'', qui signifie « [[val]], [[vallée]] », ou de {{recons|lang-mot-vedette=fr|nantus|gaulois}} (« ruisseau »).

=== {{S|nom de famille|fr}} ===
'''Nancy''' {{pron|nɑ̃.si|fr}}
# Nom de famille.

=== {{S|nom propre|fr}} ===
{{fr-inv|nɑ̃.si|inv_titre=Nom propre}}
'''Nancy''' {{pron|nɑ̃.si|fr}}
# {{localités|fr|du département de la Meurthe-et-Moselle}} [[commune|Commune]], [[ville]] et [[chef-lieu de département]] [[français]], situé dans le département de la [[Meurthe-et-Moselle]].
#* '''''Nancy''' est une ville d’ordre et de lumières où dès le XVII{{e}} et le XVIII{{e}} siècle, des ducs intelligents furent, sans le savoir, les précurseurs heureux de nos urbanistes modernes.'' {{source|{{Citation/Ludovic Naudeau/La France se regarde/1931}}}}
#* ''La verrerie de '''Nancy''' est de fondation récente, puisqu’elle date de 1875 seulement.'' {{source|Gustave Fraipont; ''Les Vosges'', 1923}}

Using APIs

  • after pip install wikipedia
import wikipedia
print(wikipedia.summary("Nancy, France"))
Nancy is the capital of the northeastern French department of Meurthe-et-Moselle, the former capital of the Duchy of Lorraine, and then the French province of the same name. The metropolitan area of Nancy had a population of 511,257 inhabitants at the 2018 census, making it the 16th largest urban area in France and the Lorraine's largest. The population of the city of Nancy proper is 104,885.
The motto of the city is Non inultus premor, Latin for '"I am not injured unavenged"'—a reference to the thistle, which is a symbol of Lorraine.
Place Stanislas, a large square built between March 1752 and November 1755 by Stanislaus I of Poland to link the medieval old town of Nancy and the new town built under Charles III in the 17th century, is a UNESCO World Heritage Site, the first place in France and in the top four in the world. The city also has many buildings listed as historical monuments and is one of the European centers of Art Nouveau thanks to the École de Nancy. Nancy is also one of the main university cities and, with the Centre Hospitalier Régional Universitaire de Brabois, the conurbation is home to one of the main health centers in Europe, renowned for its innovations in surgical robotics.
  • Python API lib:
    • easy way to access a small amount of data
    • uses the Wikipedia API (online only)
    • not all details available; not designed for batch processing
  • XML processing:
    • works offline, all data accessible
    • (very) hard to parse XML

Working from JSON

grep Nancy kaikki.org-dictionary-French.json

Working from JSON

{"pos": "name", "heads": [{"template_name": "fr-proper noun"}], "word": "Nancy", "lang": "French", "lang_code": "fr", "sounds": [{"ipa": "/n\u0251\u0303.si/"}], "categories": ["Cities in France"], "senses": [{"categories": ["French female given names", "French given names"], "tags": ["feminine"], "glosses": ["A female given name from English borrowed from English."], "id": "Nancy-name"}]}
{"pos": "name", "heads": [{"template_name": "fr-proper noun"}], "categories": ["Cities in France"], "word": "Nancy", "lang": "French", "lang_code": "fr", "sounds": [{"ipa": "/n\u0251\u0303.si/"}], "senses": [{"glosses": ["Nancy (the city)."], "derived": [{"word": "nanc\u00e9ien"}, {"word": "Nanc\u00e9ien"}], "id": "Nancy-name"}]}
{"pos": "noun", "heads": [{"1": "m", "f": "Nanc\u00e9ienne", "template_name": "fr-noun"}], "forms": [{"form": "Nanc\u00e9iens", "tags": ["plural"]}, {"form": "Nanc\u00e9ienne", "tags": ["feminine"]}], "word": "Nanc\u00e9ien", "lang": "French", "lang_code": "fr", "senses": [{"tags": ["masculine"], "glosses": ["an inhabitant of the city of Nancy"], "categories": ["Demonyms"], "id": "Nanc\u00e9ien-noun"}]}
  • A better way to extract word definitions:
    • JSON is easier to parse than MediaWiki markup in XML (see the sketch below)
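The kaikki.org dump stores one JSON object per line, so a few lines of standard Python suffice (file name as above):

import json

# Parse the dump line by line and print the glosses of every sense of "Nancy".
with open("kaikki.org-dictionary-French.json", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        if entry.get("word") == "Nancy":
            for sense in entry.get("senses", []):
                print(entry["pos"], sense.get("glosses"))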

Working from the dump

  • Handle the XML structure
  • Handle the MediaWiki markup
    <title>Nancy</title>
    <ns>0</ns>
    <id>271346</id>
    <revision>
      <id>29517974</id>
      <parentid>28848867</parentid>
      <timestamp>2021-06-10T00:19:11Z</timestamp>
      <contributor>
        <username>Lingua Libre Bot</username>
        <id>229398</id>
      </contributor>
      <comment>Ajout d'un fichier audio de prononciation depuis Lingua Libre</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="7282" xml:space="preserve">{{voir|nancy}}
== {{langue|fr}} ==

=== {{S|étymologie}} ===
: {{date|lang=fr|1073}} La première trace écrite de Nancy date du 29 avril 1073 (mention dans la charte de Pibon, évêque de Toul : « Olry, voué de Nancy » (« ''Odelrici advocati de Nanceio'' »). Le nom serait cependant d’origine celtique, car on le rapproche du gaulois ''{{lien|nantu-|gaulois}}''/''{{lien|nanto-|gaulois}}'', qui signifie « [[val]], [[vallée]] », ou de {{recons|lang-mot-vedette=fr|nantus|gaulois}} (« ruisseau »).

=== {{S|nom de famille|fr}} ===
'''Nancy''' {{pron|nɑ̃.si|fr}}
# Nom de famille.

=== {{S|nom propre|fr}} ===
{{fr-inv|nɑ̃.si|inv_titre=Nom propre}}
'''Nancy''' {{pron|nɑ̃.si|fr}}
# {{localités|fr|du département de la Meurthe-et-Moselle}} [[commune|Commune]], [[ville]] et [[chef-lieu de département]] [[français]], situé dans le département de la [[Meurthe-et-Moselle]].
#* '''''Nancy''' est une ville d’ordre et de lumières où dès le XVII{{e}} et le XVIII{{e}} siècle, des ducs intelligents furent, sans le savoir, les précurseurs heureux de nos urbanistes modernes.'' {{source|{{Citation/Ludovic Naudeau/La France se regarde/1931}}}}
#* ''La verrerie de '''Nancy''' est de fondation récente, puisqu’elle date de 1875 seulement.'' {{source|Gustave Fraipont; ''Les Vosges'', 1923}}

Parsing XML structure

  • Python code with the xml.sax lib:
    • create an XML SAX parser
    • feed every line into the parser
  • The parser calls user methods at:
    • the start of an XML element
    • the end of an XML element
    • every run of characters inside an element
import bz2
import xml.sax

class WikiXmlHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._buffer = None
        self._values = {}
        self._current_tag = None

    def characters(self, content):
        """Characters between opening and closing tags"""
        if self._current_tag:
            self._buffer.append(content)

    def startElement(self, name, attrs):
        """Opening tag of element"""
        if name in ('title', 'text'):
            self._current_tag = name
            self._buffer = []

    def endElement(self, name):
        """Closing tag of element"""
        if name == self._current_tag:
            self._values[name] = ' '.join(self._buffer)
            self._current_tag = None  # stop buffering once the tag is closed

        if name == 'page':
            print(self._values['title'], self._values['text'])

handler = WikiXmlHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

with bz2.open("frwik.xml.bz2", "rt") as f:
    for line in f:
        parser.feed(line)
  • Running the above code removes all XML structure:
accueil {{voir|Accueil}} 
 
 == {{langue|fr}} == 
 === {{S|étymologie}} === 
 : {{siècle|XII}} {{déverbal|de=accueillir|lang=fr|m=1}}. 
 
 === {{S|nom|fr}} === 
 {{fr-rég|a.kœj}} 
 '''accueil''' {{pron|a.kœj|fr}} {{m}} 
 # [[cérémonie|Cérémonie]] ou [[prestation]] réservée à un nouvel [[arrivant]], consistant généralement à lui [[souhaiter]] la [[bienvenue]] et à l’aider dans son [[intégration]] ou ses [[démarche]]s. 
 #* ''Nous réservâmes aux nouveaux venus un '''accueil''' qui fut cordial et empressé, mais le temps n’était pas aux effusions et d’un commun avis, il fallait agir vite.'' {{source|{{w|Jean-Baptiste Charcot}}, ''Dans la mer du Groenland'', 1928}} 
 #* ''Partout elle avait trouvé bon '''accueil''', prompt assentiment, mais elle se propose d’aller plus outre.'' {{source|{{Citation/Jean Rogissart/Passantes d’Octobre/1958|}}}} 
 #* ''Notre hôte, absent au moment de notre arrivée, ne tarde pas à paraître et me fait l’'''accueil''' auquel je m'attendais de sa part.'' {{source|{{w|Frédéric Weisgerber}}, ''Trois mois de campagne au Maroc : étude géographique de la région parcourue'', Paris : Ernest Leroux, 1904, page 38}} 

Parsing MediaWiki

  • Remove the MediaWiki markup from the title:
import mwparserfromhell
title = mwparserfromhell.parse(title)
s = title.strip_code().strip()
print(s)
  • We see that many uninteresting (meta) articles have a colon in their title:
accueil
MediaWiki:Disclaimers
MediaWiki:Disclaimerpage
MediaWiki:Showhideminor
MediaWiki:Sitetitle
lire
encyclopédie
Discussion utilisateur:Romanito
Discussion utilisateur:Hippietrail
Utilisateur:Ryo
Utilisateur:Romanito
manga
Utilisateur:Ske
Utilisateur:Alno
Discussion utilisateur:Alno
Discussion utilisateur:Koxinga/2004-2008
ouvrage
  • So we can keep only interesting articles and look at the text:
import mwparserfromhell
title = mwparserfromhell.parse(title)
s = title.strip_code().strip()
if ':' not in s:
    print("TITLE ", s)
    wiki = mwparserfromhell.parse(text)
    print(wiki)
  • Still a lot of MW information!
DEF lire
WW {{voir/lire}} 
 
 == {{langue|fr}} == 
 === {{S|étymologie}} === 
 : ([[#fr-verb|Verbe]]) Du {{étyl|la|fr|mot=lego|dif=lĕgĕre|sens=''id.''}}, proprement « recueillir » 
 : ([[#fr-nom|Nom]]) De l’{{étyl|it|fr|mot=lira}}, du {{étyl|la|fr|libra}} (« [[livre]] », le poids). 
 
 === {{S|verbe|fr}} === 
 [[File:Reading Jane Eyre.jpg|thumb|Jeune fille qui '''lit'''. (1)]] 
 [[File:USMC-120302-M-PG598-001.jpg|thumb|Une adulte '''lit''' une histoire à un groupe d'enfants. (6)]] 
 '''lire''' {{pron|liʁ|fr}} {{conjugaison|fr}} {{conjugaison|fr|grp=3}} 
 # [[interpréter|Interpréter]] des [[information]]s écrites sous forme de [[mot]]s ou de [[dessin]]s sur un [[support]]. 
 #* ''On '''lit''' ce livre absolument comme au bord de la cascade on entendrait, rêveur, le gazouillement des eaux.'' {{source|{{w|Jules Michelet}}, ''Du prêtre, de la femme, de la famille'', 3{{e}} éd., Hachette  &  Paulin, 1845, page 133}} 
  • No easy way to remove all MediaWiki metadata:
    • Install a MediaWiki server locally, or
    • Read the MW doc and implement a “small” parser for a subset of the MW format
    • Try to guess the role of some MW tags and quickly write a few rules to get what you want
  • To extract the etymology (from the “étyl” tag):
import re
import xml.sax
import mwparserfromhell

# capture the source language and the etymon field of the "étyl" template
etyl = re.compile(r'{{étyl\|([^\|]*)\|[^\|]*\|([^\|}]*)[^}]*}}')

def handleWikiMedia(title, text):

    title = mwparserfromhell.parse(title)
    s = title.strip_code().strip()
    if ':' not in s:  # skip meta pages (Utilisateur:, Discussion:, ...)
        print("DEF " + s)
        wiki = mwparserfromhell.parse(text)
        for l in str(wiki).split('\n'):
            l = l.strip()
            if l.startswith(':'):  # etymology lines start with ':'
                # raw string so that the \g backreferences are kept intact
                l = etyl.sub(r'(\g<1>) "\g<2>"', l)
                s = l[1:]
                ss = mwparserfromhell.parse(s)
                s = ss.strip_code(normalize=True, collapse=True, keep_template_params=False).strip()
                if len(s) > 0: print("   " + s)
        print()

class WikiXmlHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._buffer = None
        self._values = {}
        self._current_tag = None

    def characters(self, content):
        """Characters between opening and closing tags"""
        if self._current_tag:
            self._buffer.append(content)

    def startElement(self, name, attrs):
        """Opening tag of element"""
        if name in ('title', 'text'):
            self._current_tag = name
            self._buffer = []

    def endElement(self, name):
        """Closing tag of element"""
        if name == self._current_tag:
            self._values[name] = ' '.join(self._buffer)
            self._current_tag = None

        if name == 'page':
            handleWikiMedia(self._values['title'], self._values['text'])
DEF lire
   (Verbe) Du (la) "mot=lego", proprement « recueillir »
   (Nom) De l’(it) "mot=lira", du (la) "libra" (« livre », le poids).
   Du (la) "mot=lyra".

DEF encyclopédie
   Du (la) "encyclopaedia" forgé à la Renaissance sur la base du (grc) "ἐγκύκλιος" — voir  — et  soit le sens de « 
ensemble de toutes les sciences ».

DEF manga
   Du (ja) "漫画".
   Le mot manga a aussi été considéré comme féminin, mais l’usage tend à préférer le masculin.
   Du (ja) "漫画".
   (Nom commun 1) Du (la) "manica".
   (Nom commun 2) Du (ja) "漫画".
   Du (ja) "漫画".
   Du (la) "manica".
   (Nom commun 1) Du (la) "manica".
   (Nom commun 2) Du (ms) "mangga".
   Du (ja) "漫画".

DEF ouvrage
   Dérivé de ouvrer, ancienne forme de œuvrer, avec le suffixe -age.
   Par dérive populaire puis effet stylistique, on peut rencontrer le genre féminin.
   C’est de la belle ouvrage.
  • In the previous code, all MediaWiki “templates” are removed, except the “étyl” (etymology) template.
  • Better quality may be obtained by installing a local MediaWiki instance and extracting all definitions through calls to its local API.

Wrap-up

  • Several ways to extract word definitions:
    • Scripts: fast, but very approximate
    • Using Wikipedia API: does not scale
    • Using preprocessed JSON files: not up-to-date
    • From XML dumps: a bit tricky
    • From local API: complex and costly

Hands-on

  • Check on your laptop that the code above runs fine and that it does extract a list of words + etymologies
  • There are still a number of mistakes: try to improve the code and fix some of them
  • How long would it take, on your laptop, to extract all definitions from the French Wiktionary?

N-grams

Unigrams

  • Raw counts depend on the size of the data
  • Normalized, they give the unigram (1-gram) probability:

\[P(w) = \frac {N(w)} {N(*)}\]

  • Probability that a word occurs in the language

Bigrams

  • Count the sequences \(N(a,b)\)
  • Divide by the count of all sequences starting with \(a\): \(N(a,*)\)
  • The bigram (2-gram) gives the probability that \(b\) follows \(a\):

\[P(b|a) = \frac {N(a,b)} {N(a,*)}\]

Note:

\[P(b|a) = \frac {P(a,b)}{P(a)} = \frac{\frac{N(a,b)}{N(*,*)}}{\frac{N(a)}{N(*)}}\]
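These estimates are plain normalized counts, as in this sketch (the toy corpus is invented for illustration):

from collections import Counter

tokens = "the cat sat on the mat the cat slept".split()

uni = Counter(tokens)                  # N(w)
bi = Counter(zip(tokens, tokens[1:]))  # N(a,b)

def p_uni(w):
    return uni[w] / len(tokens)        # P(w) = N(w) / N(*)

def p_bi(a, b):
    # uni[a] ≈ N(a,*) (exact unless a ends the corpus)
    return bi[(a, b)] / uni[a]         # P(b|a) = N(a,b) / N(a,*)

print(p_uni("the"))        # 3/9
print(p_bi("the", "cat"))  # 2/3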

N-grams

  • Generalisation to a sequence of length \(n\)

\[P(w_t|w_{t-n+1},\dots,w_{t-1}) = \frac{N(w_{t-n+1},\dots,w_{t-1},w_t)}{N(w_{t-n+1},\dots,w_{t-1},*)}\]

We’ll use N-grams for:

  • diachronic analysis
  • extracting collocations

Training N-grams

  • Easy to train:
    • accumulate counts
    • can be done online
  • The hardest part is scraping & pre-processing texts, so reuse existing ones:
    • Google N-grams: https://books.google.com/ngrams
    • Trained on 1,000G-tokens https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
    • Free (trained on 430M-words) https://www.ngrams.info/

Rare/unseen sequences

  • Sol 1: N-gram smoothing
    • Add a pseudo-count for every possible sequence

\[P(w_t|w_{t-n+1},\dots,w_{t-1}) = \frac{1+N(w_{t-n+1},\dots,w_{t-1},w_t)}{\sum_x \left( 1+ N(w_{t-n+1},\dots,w_{t-1},x) \right)}\]

  • Other smoothing methods: Good-Turing, Kneser-Ney…
  • Problem: all unseen sequences have the same probability
  • Smoothing may be used in conjunction with backoff (see the sketch after this list):
    • linear interpolation: \[\hat P(w_t|w_{t-n+1},\dots,w_{t-1}) = \lambda P(w_t|w_{t-n+1},\dots,w_{t-1}) + (1-\lambda) P(w_t|w_{t-n+2},\dots,w_{t-1})\]
  • Other backoffs: Katz…
  • Sol 2: N-grams of sub-words
    • Character n-grams
      • Good for agglutinative languages…
      • Capture common prefixes, suffixes…
      • Very good at language detection
      • Handle proper names
      • Robust to typographic mistakes
      • But requires much more data than word n-grams!
      • Often combined with word n-grams
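A sketch of add-one smoothing combined with linear interpolation, on the same toy corpus as the previous sketch (λ = 0.7 is an arbitrary illustrative choice):

from collections import Counter

tokens = "the cat sat on the mat the cat slept".split()
vocab = set(tokens)
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))

def p_uni(w):
    return uni[w] / len(tokens)

def p_bi_add1(a, b):
    # Add-one smoothing: every possible continuation gets a pseudo-count.
    return (1 + bi[(a, b)]) / (len(vocab) + sum(bi[(a, x)] for x in vocab))

def p_interp(a, b, lam=0.7):
    # Backoff via linear interpolation between bigram and unigram estimates.
    return lam * p_bi_add1(a, b) + (1 - lam) * p_uni(b)

print(p_bi_add1("the", "cat"))  # seen bigram
print(p_bi_add1("the", "sat"))  # unseen bigram: non-zero thanks to smoothing
print(p_interp("the", "sat"))   # interpolation adds unigram mass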

Limitations

  • The number of potential n-grams increases exponentially with \(n\)
  • Longer n-grams become very sparse:
    • bad statistics
    • cannot capture long dependencies
  • In practice: maximum 5-grams