Lemmatizers, parameters, tools
Table of contents
Lemmatizers and taggers
TreeTagger
The TreeTagger is a tool for annotating text with part-of-speech and lemma information. It was developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of the University of Stuttgart. The TreeTagger has been successfully used to tag German, English, French, Italian, Danish, Swedish, Norwegian, Dutch, Spanish, Bulgarian, Russian, Portuguese, Belarusian, Ukrainian, Galician, Greek, Chinese, Swahili, Slovak, Slovenian, Latin, Estonian, Polish, Persian, Romanian, Czech, Coptic and Old French texts, and it is adaptable to other languages if a lexicon and a manually tagged training corpus are available.
Website: TreeTagger
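TreeTagger itself is driven from the command line, but it can also be called from Python through the third-party treetaggerwrapper package. A minimal sketch, assuming TreeTagger is installed locally (the TAGDIR path below is an assumption to adjust):

```python
import treetaggerwrapper

# Assumption: TreeTagger is unpacked under ~/treetagger and the English
# parameter file has been installed there.
tagger = treetaggerwrapper.TreeTagger(TAGLANG="en", TAGDIR="~/treetagger")

tags = tagger.tag_text("The monks were copying old manuscripts.")
for line in tags:
    # Each result line is "token <TAB> POS <TAB> lemma"
    token, pos, lemma = line.split("\t")
    print(token, pos, lemma)
```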
Collatinus
Collatinus is both a lemmatizer and a morphological analyzer for Latin texts: given a declined or conjugated form, it can determine which headword to look up in the dictionary in order to obtain its translation into another language, its different meanings, and all the other information a dictionary usually provides.
Although originally designed for Latin teachers, Collatinus has been expanded over the years to include a number of tools of interest to medievalists. In particular, it allows you to consult various dictionaries, including some dedicated to medieval Latin, some in image mode and others in full text (Du Cange and Köbler). As the different dictionaries do not necessarily provide the same information, it can be handy to switch from one to the other with a single click. In addition, TextiColor can be a useful tool for spotting typos in the OCR of Latin texts. Scansion and accentuation are also possible with Collatinus and can be applied, for example, to the study of metrical or rhythmic poetry. Analysis of the cursus in prose can also help with dating texts. Further details can be found here: “Lemmatiser avec Collatinus 3”.
Website: Collatinus
spaCy
spaCy is a Python software library for automatic language processing (NLP). It implements various linguistic annotations to provide an overview of the grammatical structure of a text (word types, parts of speech, word relationships). It performs tokenization, lemmatization, POS tagging and other advanced operations.
Website: spaCy
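A minimal sketch of the spaCy API, assuming the small English model en_core_web_sm has been downloaded (python -m spacy download en_core_web_sm):

```python
import spacy

# Load a pre-trained pipeline (tokenizer, tagger, lemmatizer, parser, ...)
nlp = spacy.load("en_core_web_sm")

doc = nlp("The scribes were copying the manuscript carefully.")
for token in doc:
    # Surface form, lemma, coarse part of speech, dependency label and head
    print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)
```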
Pie (and Deucalion/Flask Pie)
Pie is a language-independent lemmatizer implemented in Python and built for “variation-rich languages”, a category that includes Latin. It is a deep-learning tool that can be trained and retrained with data in TSV format (a minimal example of such a file is sketched below). As of 2019, it appeared to be among the state-of-the-art lemmatizers in terms of results. It can be trained jointly on morphology, POS tagging and lemmatization tasks.
Website: Pie (and Deucalion/Flask Pie)
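Pie reads its training data from tab-separated files whose header names the annotation tasks to learn. The exact column layout is configurable, so the token / lemma / POS rows below (with a blank line as sentence boundary) are only an illustrative assumption:

```python
# Write a tiny, hypothetical training file in the TSV layout described above.
rows = [
    ("puella", "puella", "NOM"),
    ("cantat", "canto", "VER"),
    (".", ".", "PUNC"),
]

with open("train.tsv", "w", encoding="utf-8") as fh:
    fh.write("token\tlemma\tpos\n")           # header naming the tasks
    for token, lemma, pos in rows:
        fh.write(f"{token}\t{lemma}\t{pos}\n")
    fh.write("\n")                            # blank line = sentence boundary
```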
StanfordNLP
StanfordNLP is a Python natural language analysis package. It contains tools, which can be used in a pipeline, to convert a string containing human-language text into lists of sentences and words, to generate the base forms of those words, their parts of speech and morphological features, and to produce a syntactic dependency parse, which is designed to be parallel across more than 70 languages, using the Universal Dependencies formalism. In addition, it is able to call the CoreNLP Java package and inherits additional functionality from there, such as constituency parsing, coreference resolution, and linguistic pattern matching.
Website: StanfordNLP
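A minimal sketch of the stanfordnlp pipeline (the package has since been superseded by Stanza), assuming the English models have already been downloaded with stanfordnlp.download('en'):

```python
import stanfordnlp

# Build the default pipeline: tokenization, POS/morphology, lemmatization,
# dependency parsing.
nlp = stanfordnlp.Pipeline(lang="en")

doc = nlp("Barack Obama was born in Hawaii.")
for sentence in doc.sentences:
    for word in sentence.words:
        # Surface form, lemma and Universal Dependencies POS tag
        print(word.text, word.lemma, word.upos)
```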
WordNet (nltk)
WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet’s structure makes it a useful tool for computational linguistics and natural language processing.
Website: WordNet
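A minimal sketch of querying WordNet through NLTK, assuming the WordNet data can be downloaded on first use:

```python
import nltk
from nltk.corpus import wordnet as wn

# One-time download of the WordNet data files
nltk.download("wordnet")

# Inspect a few synsets for "bank" and their glosses
for synset in wn.synsets("bank")[:3]:
    print(synset.name(), "-", synset.definition())

# morphy() maps an inflected form back to a base form known to WordNet
print(wn.morphy("running", wn.VERB))  # -> "run"
```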
TextBlob
TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
Website: TextBlob
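A minimal sketch of the TextBlob API, assuming the NLTK corpora it relies on have been installed (python -m textblob.download_corpora):

```python
from textblob import TextBlob

blob = TextBlob("TextBlob makes common NLP tasks remarkably simple.")

print(blob.tags)                   # part-of-speech tags as (word, tag) pairs
print(blob.noun_phrases)           # extracted noun phrases
print(blob.sentiment)              # polarity and subjectivity scores
print(blob.words[1].lemmatize("v"))  # "makes" -> "make"
```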
MarMoT
MarMoT is a generic CRF framework and morphological tagger implemented in Java. It performs joint prediction of part-of-speech and fine-grained morphological tags using higher-order conditional random fields, and it can be trained for any language for which a morphologically annotated corpus is available; it has been applied to Latin, among many other languages.
Website: MarMoT
FastText
FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices.
Website: FastText
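A minimal text-classification sketch with the fasttext Python bindings; the training file name and its __label__-prefixed line format are assumptions for illustration:

```python
import fasttext

# train.txt is assumed to contain one example per line, e.g.:
#   __label__positive a wonderful critical apparatus
model = fasttext.train_supervised(input="train.txt")

# Predict the most likely label for a new text
print(model.predict("a wonderful critical apparatus"))

# Optionally quantize the model so it fits on constrained devices
model.quantize(input="train.txt", retrain=True)
model.save_model("model.ftz")
```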
LGeRM
LGeRM (Lemmes, Graphies et Règles Morphologiques; the acronym is pronounced like the French “le germe”, “the sprout”) is a lemmatizer designed to handle the historical spelling variation of French. It was initially developed for Middle French (1330-1500) and then adapted to 16th- and 17th-century French. It can also handle modern French.
Website: LGeRM
UDPipe
UDPipe is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files. UDPipe is language-agnostic and can be trained given annotated data in CoNLL-U format. Trained models are provided for nearly all UD treebanks. UDPipe is available as a binary for Linux/Windows/OS X, as a library for C++, Python, Perl, Java and C#, and as a web service. A third-party R package is also available on CRAN.
Website: UDPipe
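A minimal sketch with the ufal.udpipe Python bindings; the model file name is an assumption and stands for any trained model downloaded from the UDPipe repository:

```python
from ufal.udpipe import Model, Pipeline, ProcessingError

# Load a trained model (hypothetical file name; one model per treebank)
model = Model.load("english-ewt-ud-2.5.udpipe")

# Tokenize raw text, then tag, lemmatize and parse it, emitting CoNLL-U
pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
error = ProcessingError()
conllu = pipeline.process("The scribes finished the folio.", error)
print(conllu)
```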
FreeLing
FreeLing is a C++ library providing language analysis functionalities (morphological analysis, named entity detection, PoS-tagging, parsing, Word Sense Disambiguation, Semantic Role Labelling, etc.) for a variety of languages (English, Spanish, Portuguese, Italian, French, German, Russian, Catalan, Galician, Croatian, Slovene, among others).
Website: FreeLing
CLTK
The Classical Language Toolkit (CLTK) is a Python library offering natural language processing (NLP) for the languages of pre-modern Eurasia. Pre-configured pipelines are available for 19 languages.
Website: CLTK
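A minimal sketch of the CLTK pipeline API for Latin (CLTK 1.x; the required models are fetched on first use):

```python
from cltk import NLP

# Build the pre-configured Latin pipeline
cltk_nlp = NLP(language="lat")

doc = cltk_nlp.analyze(text="Gallia est omnis divisa in partes tres.")
print(doc.tokens)   # tokenized forms
print(doc.lemmata)  # one lemma per token; POS and morphology are also exposed on the Doc
```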
Data sets and parameter/model sets
Datasets for medieval languages, mainly Latin and Old French, are used to create the parameter sets and models employed by the tools. We present them together, given their interdependence.
OMNIA
(Latin, especially medieval Latin; TreeTagger) https://glossaria.eu/lemmatisation/#page-content
OMNIA was designed to be used with TreeTagger, which was developed for morphosyntactic (part-of-speech) tagging but also supports lemmatization. OMNIA provides both the parameter files required to process medieval Latin texts and the files needed to recreate these parameters; a minimal call is sketched below.
Website: OMNIA
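A minimal sketch of calling the tree-tagger binary with the OMNIA parameter file from Python; the file paths and the whitespace tokenization are assumptions to adapt to the local installation:

```python
import subprocess
from pathlib import Path

# Hypothetical locations of the TreeTagger binary and the OMNIA parameter file
tree_tagger = Path.home() / "treetagger" / "bin" / "tree-tagger"
omnia_par = Path.home() / "treetagger" / "lib" / "omnia.par"

# TreeTagger expects one token per line on standard input
tokens = "In principio creavit Deus caelum et terram".split()
result = subprocess.run(
    [str(tree_tagger), "-token", "-lemma", str(omnia_par)],
    input="\n".join(tokens),
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)  # one "token <TAB> POS <TAB> lemma" line per input token
```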
Lasla
(Latin; Pie, via Latin Deucalion + Collatinus model) https://www.lasla.uliege.be/cms/c_8570393/fr/lasla-logiciels
LiLa Lemma Bank
(Latin; no specific software: https://lila-erc.eu/lodview/data/id/lemma/LemmaBank)
Perseus TreeBank
(Latin; https://github.com/PerseusDL/treebank_data/tree/master/v2.0/Latin)
Achim Stein
(Old French; TreeTagger)
LatinCy
(Latin; spaCy: https://huggingface.co/latincy; https://spacy.io/universe/project/latincy; https://arxiv.org/abs/2305.04365)
Deucalion
(Old French; https://zenodo.org/record/3237455; lemmas from Tobler-Lommatzsch + POS from Cattex; for Pie)
Middle High German
(Middle High German; TreeTagger)
PALM
(TreeTagger) http://palm.huma-num.fr/PALM/
The PALM platform (Plateforme d’analyse linguistique médiévale) is designed for the processing of medieval texts (MEDITEXT): it is a system aimed less at philologists than at historians or philosophers who, wishing to undertake lexicometric or semantic research, come up against the absence of a standardized orthography and the geographical and chronological variability of medieval lexicons.
Through flexible text annotation, it enables semi-automated orthographic standardization and lemmatization of medieval texts in English, French and Latin.
Website: PALM
Latin for TreeTagger
(TreeTagger: https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/)
Old French lemmatization
(Old French; TreeTagger: https://github.com/CristinaGHolgado/old-french-lemmatization)
eHumanities Desktop
(Latin; MarMoT: https://www.texttechnologylab.org/applications/ehumanities-desktop/)
CLTK
(several ancient languages: Latin; Old English; Middle English; Middle High German; etc.: https://legacy.cltk.org/en/latest/index.html)
PaPie - models
Lemmatization post-processing tools
Pyrrha
Pyrrha is a simple Python Flask WebApp for accelerating the post-correction of lemmatized and morpho-syntactically tagged corpora.
Website: Pyrrha
Download the Pyrrha presentation