Workshops

Table of contents

Workshop 1 (07.11.2017)
Workshop 2 (05.06.2018)
Workshop 3 (10.12.2018)
Workshop 4 (17.06.2019)

The initial aim of the “Lemmes” working group is to help develop lemmatization tools for medieval languages (Latin, French, English, etc.) and to encourage the distribution of lemmatized textual corpora. Lemmatization is still little known and little practiced among medieval historians, yet it is fundamental to any work on large corpora and indispensable for medieval languages, which combine rich inflection with extensive orthographic variation. With this in mind, four “Introduction to lemmatization of medieval texts” workshops were held between 2017 and 2019.

Workshop 1 (07.11.2017)

A first workshop was organized in November 2017. Its aim was to take stock collectively of existing tools, of the obstacles to their application, and of the directions and actions the group should pursue. It brought together around twenty participants: researchers and PhD students, historians, linguists and programmers.

Several “lemmatizers” were presented and discussed: Collatinus (Y. Ouvrard, Ph. Verkerk), Pandora (J.-B. Camps), CompHistSem (T. Geelhaar), OMNIA (R. Alexandre) and PALM (M. Aouini, C. Fletcher, A. Mairey). Whatever their objectives or architectures (lexicon plus training; neural network), all the tools presented (taggers and/or parameters) are estimated to perform at around 90% (±5%). It is the remaining 5-15% error rate that needs to be addressed. In general, applications stumble where the historical problems lie (S. Torres). Proper nouns (people and places) are among the most frequent sources of labeling errors, and the group will continue its work on this point.

Another issue raised was the lack of systematic, comparative assessment of taggers (N. Perreaux). It was decided to create a structured reference corpus, sampling different text types (in medieval Latin, French and English), so that all the tools can be tested on the same corpus and evaluated on a reasoned basis. The aim would be to use these experiments to design a “meta-tagger” (T. Geelhaar) combining the advantages of the solutions proposed by each tool. To achieve this, group members will share the various parameters and resources already available. A file-sharing space for the group has been opened in Sharedocs of TGIR Huma-Num.
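To make the idea of a comparative evaluation concrete, here is a minimal sketch in Python. It scores several tools against the same manually verified lemmas, then combines their outputs token by token. Everything in it is an assumption made for illustration: the tool names (`tool_a`, `tool_b`, `tool_c`), the toy lemmas, and the use of simple majority voting, which is only one possible way to build a “meta-tagger” and is not the method adopted by the group.

```python
from collections import Counter


def accuracy(predicted, gold):
    """Share of tokens whose predicted lemma equals the gold lemma."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must be aligned token by token")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def majority_vote(predictions):
    """Combine aligned lemma sequences from several tools, token by token.

    The most frequent lemma wins; on a tie, the tool listed first wins.
    """
    combined = []
    for candidates in zip(*predictions):
        counts = Counter(candidates)
        best = max(counts.values())
        combined.append(next(c for c in candidates if counts[c] == best))
    return combined


# Hypothetical toy data: gold lemmas for "regis dedit terram ecclesie"
# and invented outputs, NOT results from the real tools.
gold = ["rex", "do", "terra", "ecclesia"]
outputs = {
    "tool_a": ["rex", "do", "terra", "ecclesie"],
    "tool_b": ["rex", "dedo", "terra", "ecclesia"],
    "tool_c": ["rego", "do", "terra", "ecclesia"],
}

# Every tool is scored on the same reference corpus.
scores = {name: accuracy(pred, gold) for name, pred in outputs.items()}

# Simple voting ensemble standing in for a "meta-tagger".
meta = majority_vote(list(outputs.values()))
meta_score = accuracy(meta, gold)
```

In this toy case each tool makes a different mistake, so the vote recovers every gold lemma. On real data the tools often fail on the same tokens (proper nouns, for instance), and voting alone would not correct those errors.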

Download presentations and minutes of Workshop 1

Workshop 2 (05.06.2018)

Workshop 2 was organized in three parts. The first was the presentation of two recent research projects on named entities: “Multilevel approach for the recognition of named entities in Middle French”, by Mourad Aouini (CNRS - CLT), and “Automatic retrieval of named entities in medieval Latin charters. Modeling and prospects for use”, by Sergio Torres (UVSQ - DYPAC). One of the main difficulties in lemmatizing medieval sources lies in the recognition of personal and place names, and research into named entities opens up a series of perspectives for their specific treatment.

Then, to complement the presentations of lemmatization projects and tools in Workshop 1 (Collatinus, CompHistSem, OMNIA, Pandora, PALM), the precursor project “Opera latina” (LASLA) was presented by Dominique Longrée and Margherita Fantoli (Université de Liège). Finally, there was a lively discussion on how to carry out a comparative analysis of the different parameters and lemmatizers for medieval languages. A consensus was reached on two needs: a corpus that is fully lemmatized (right down to the final product) and verified (possibly by several different people), so that the different tools can be seriously evaluated; and a description of each tool’s data (label sets, formats…), with a view to arriving at a common format.

Download the minutes and documents related to Workshop 2

Workshop 3 (10.12.2018)

Workshop 3 aimed to continue developing the actions listed above (evaluation, dissemination, training). It was an opportunity for those responsible for the various projects and tools to exchange views on their technical parameters and corpora. The concrete experience of lemmatizing the multilingual corpus of medieval Burgundian inscriptions was presented and discussed, to help establish an analysis model. Recent advances in tools that have developed parameters for medieval Latin were also presented (Collatinus, Hydra, PALM).

Download the minutes and documents related to Workshop 3

Workshop 4 (17.06.2019)

During Workshop 4, in addition to a historiographical and theoretical introduction, participants were offered the opportunity to discover a number of key tools through practical exercises. Initially planned for 15 participants, the workshop ended up welcoming 21, given the interest aroused by this training course. The team of trainers included 8 lecturers and made all the teaching materials available to participants (user guides, pre-processed corpora, exemplars, slideshows, etc.). The wide range of participants (Master’s students, doctoral students, post-doctoral fellows, engineers, senior lecturers and researchers) shows that the needs in this field are not confined to “academic” medievalists, historians and linguists, but extend to the commercial managers of digital publishing platforms who may have to deal with medieval texts. For the rest of the group’s activities, lemmatization is envisaged as an integral part of the training courses organized by COSME², which integrate this fundamental operation into a wider process that extends from the constitution of a corpus to its statistical exploitation.

Download the minutes and documents related to Workshop 4