A use case for the OLKi platform

Wednesday, Apr 10, 2019 | Tags: AI, project

John is a linguist who is currently working on a specific syntactic structure, coreference. For this, he uses an open corpus of conversational French speech taken from ORTOLANG. He decides to annotate part of this corpus with coreference structures, and he then wants to distribute his annotations under an open licence as well.

Beyond disseminating his work, John would also like to exchange with other linguists, and with scientists in general, who may be interested in these kinds of structures. In other words, he would like to build a scientific community around coreference in French.

In addition, he is not yet sure that his annotations are stable enough; he suspects they still contain too many mistakes for a proper release of his dataset, and he is not completely sure about the licence or about the most appropriate format in which to distribute his corpus. He wants to keep the option of changing the format in the near future.

John thus decides to ask the OLKi platform node located in Nancy to host his current version of the corpus. On this node, a “page”, similar to a blog, is dedicated to his corpus and presents it. On this page, John adds a link that points to the original data on ORTOLANG, and makes his own annotations available under the CC-BY licence.

Immediately after the dataset is uploaded, the OLKi node indexes it and makes it discoverable through a simple keyword search everywhere in the distributed OLKi platform. John’s corpus is also made visible on the front page of the Nancy OLKi node.

Because the OLKi node that hosts the corpus uses the same decentralized protocol as the Fediverse social network, the keywords defined by John for his corpus are also searchable from the public Mastodon social network, where more than 2.5 million people discuss freely. John’s corpus appears on this network as a “person”, and people on Mastodon can send it messages and view the history of all the comments related to this corpus, whether they were sent from Mastodon or from the OLKi platform (the interface between both networks is very porous, making interaction between them transparent). In one of these comments, Marie writes a (LaTeX-like) equation to compute special word embeddings for the coreference task, which is rendered nicely in OLKi comments and on science-oriented Mastodon servers such as Mathstodon. Marie also includes in her comment a special link that points to a specific sentence in John’s corpus; when the comment is read, this sentence is shown, and clicking on it opens the browser at the corpus page, from which the corpus can be downloaded.

Marie is commenting on John’s corpus from an OLKi node in Paris. She actually proposes a slightly different way to annotate coreferences, so she copies John’s annotations onto her Paris node and adds her own. John is fine with that, and in fact he prefers to keep his own annotations without mixing them with Marie’s. In their respective corpus descriptions, both John and Marie describe the alternative annotations with cross-references. Marie is free to manage her own copy of the corpus, at no extra cost for John, and both versions still belong to the same global network and remain connected while being clearly distinguishable.

Kyle also makes a copy of John’s corpus on his own home server, but creates a really messy annotation scheme, with which John and Marie clearly disagree. Kyle, however, does not care about their feedback. John and Marie therefore simply decide to stop federating with Kyle’s node and blacklist it, isolating him in his own sub-network.

Leila, a colleague of Marie, writes a machine learning model that automatically retrieves coreferences according to Marie’s model. Leila wraps her model in a simple bot on the OLKi network, which acts as a demonstrator of Marie’s ideas and tags the sentences that people send it accordingly. Because all of the federation APIs are open source, writing and controlling the bot is extremely easy and is done locally in Paris, although the bot is of course accessible from anywhere on the Fediverse. Leila could even fork the OLKi API and propose an extension that remains compatible with all the other instances while adding some extra features.
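To give a rough idea of what such a bot could look like, here is a minimal Python sketch. It is only an illustration under several assumptions: `tag_sentence` is a toy stand-in for Leila’s model (it merely brackets a few French pronouns), `build_reply` and the example URLs are hypothetical names, and messages are represented as plain dictionaries using ActivityStreams-style fields (`type`, `attributedTo`, `inReplyTo`, `to`, `content`). The actual OLKi API may look quite different, and real Fediverse messages usually carry HTML content and require signed delivery, which are left out here.

```python
import re

# Toy stand-in for Leila's model (assumption): it only knows a few French pronouns.
PRONOUNS = {"il", "elle", "ils", "elles", "lui", "leur"}


def tag_sentence(sentence: str) -> str:
    """Wrap each word found in PRONOUNS in square brackets."""
    def mark(match):
        word = match.group(0)
        return f"[{word}]" if word.lower() in PRONOUNS else word

    return re.sub(r"\w+", mark, sentence)


def build_reply(incoming_note: dict, bot_actor: str) -> dict:
    """Build an ActivityStreams-style Note that answers the incoming one.

    The reply is addressed to the sender and carries the tagged sentence.
    Delivery to the sender's server is not handled in this sketch.
    """
    return {
        "type": "Note",
        "attributedTo": bot_actor,
        "inReplyTo": incoming_note["id"],
        "to": [incoming_note["attributedTo"]],
        "content": tag_sentence(incoming_note["content"]),
    }


if __name__ == "__main__":
    # Hypothetical message sent by John to the bot.
    note = {
        "id": "https://nancy.example/notes/1",
        "attributedTo": "https://nancy.example/users/john",
        "content": "Marie arrive, elle sourit.",
    }
    print(build_reply(note, "https://paris.example/actors/coref-bot"))
```

In this sketch the tagging logic and the message handling are kept separate, so the toy tagger could be swapped for a real model without touching the reply-building code.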

Interested in investigating Marie’s proposal further, John sends Leila’s bot some tricky sentences to tag from time to time, which sometimes sparks long discussions about the benefits of one annotation scheme over another. Sooner or later, John and Marie will likely converge on a shared view and publish a new common release of their corpus.
