Research on automatic word definition generation
The massive amount of available natural language data makes it easier than ever for citizens to acquire knowledge. Nevertheless, readers often struggle to fully understand these sources of knowledge without the help of external linguistic resources. In particular, lexical resources such as WordNet or Réseau Lexical du Français (RLF) contain rich information that can help users understand words or concepts they are unfamiliar with.
On the other hand, lexical resources developed by experts may sometimes be hard for a general audience to comprehend. For instance, citizens may face the following problematic scenarios: (i) the definition of the word or concept in the resource might be too difficult (technicality); (ii) the meaning of the word in the reading context might overlap with several senses defined in the resource (ambiguity); (iii) the word might not be covered by the resource (coverage).
To deal with these issues, the PhD thesis will investigate new methods for extracting appropriate semantic knowledge about a lexical unit in a given context. In particular, given a word and a context, the proposed methods will aim to automatically generate its definition as well as its semantic properties (e.g. coarse-grained sense, synonyms), adapted not only to the context in which it occurs, but also to the user.
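As a rough illustration of the input and output of this task, a minimal Python sketch might look as follows. All names here (`DefinitionRequest`, `DefinitionResult`, `lookup_baseline`, the `audience` field, and the toy lexicon) are assumptions introduced purely for illustration and are not part of the project; the keyword-overlap lookup only stands in for the generation methods that the thesis is meant to investigate.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DefinitionRequest:
    """Task input: a word, the context it occurs in, and a target audience."""
    word: str
    context: str
    audience: str = "general"  # hypothetical field; unused by this baseline


@dataclass
class DefinitionResult:
    """Task output: a definition plus optional semantic properties."""
    definition: str
    coarse_sense: Optional[str] = None
    synonyms: List[str] = field(default_factory=list)


# Toy lexicon (invented for this sketch): word -> list of (cue words, result).
TOY_LEXICON = {
    "bank": [
        ({"river", "water", "shore"},
         DefinitionResult("the land alongside a river", "geography", ["shore"])),
        ({"money", "loan", "account"},
         DefinitionResult("an institution that keeps and lends money", "finance", ["lender"])),
    ],
}


def lookup_baseline(request: DefinitionRequest) -> Optional[DefinitionResult]:
    """Pick the listed sense whose cue words overlap most with the context.

    Returns None when the word is absent from the lexicon.
    """
    entries = TOY_LEXICON.get(request.word.lower())
    if not entries:
        return None
    tokens = set(re.findall(r"\w+", request.context.lower()))
    cues, result = max(entries, key=lambda entry: len(entry[0] & tokens))
    return result


if __name__ == "__main__":
    print(lookup_baseline(DefinitionRequest("bank", "She sat on the bank of the river.")))
```

Such a lookup returns `None` for any word missing from the lexicon and can only choose among senses already listed, which is exactly the limitation (coverage and ambiguity) that a generative approach would need to overcome.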
The hypothesis of the PhD project is that, thanks to recent advances in deep learning, it is now possible to model the whole task with neural networks, covering both the analysis of the word and its context and the actual generation of a defining sentence in natural language. Such models would be trained on the content of lexical resources and enriched with language models and word representations learned from large textual corpora, in order to capture the lexical diversity and language style of a general audience. This approach is closely related to recent neural approaches to machine translation and automatic summarization, both of which include an analysis phase and a generation phase.
While the topic proposed for investigation in this PhD project is related to the traditional tasks of Word Sense Disambiguation and Word Sense Induction, it is more challenging because it goes a step further: a successful system is expected to generate definitions even for words and word senses that are not covered by existing lexical resources, by generalizing from existing knowledge bases.