Automatic language processing glossary

Yves Ouvrard, March 2019

This document may be reproduced by any means, provided that no modifications are made and that this notice is preserved.

Morphological analysis – Morphological analysis consists in giving all the morphological traits of a form. For example, in French, the word aviez: second person, plural, imperfect, indicative, active. Morphological analysis is usually accompanied by the lemmatization of the form and its POS: lemma avoir, verb, 2nd person plural of the imperfect indicative active. Often a word has several possible morphological analyses. This is the case, for example, of the form avions, for which we will give:
- lemma avoir, verb, 1st person plural of the imperfect indicative active.
- lemma avion (plane), noun, plural.
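The two analyses of avions can be pictured as data. The sketch below is only an illustration: the Analysis class, the ANALYSES table and the analyse function are hypothetical names, and the table holds just the one example from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One possible reading of a form: its lemma, its POS and its traits."""
    lemma: str
    pos: str
    traits: tuple


# Hypothetical toy table; a real analyser would compute these readings.
ANALYSES = {
    "avions": [
        Analysis("avoir", "verb",
                 ("1st person", "plural", "imperfect", "indicative", "active")),
        Analysis("avion", "noun", ("plural",)),
    ],
}


def analyse(form):
    """Return every known analysis of a form (an empty list if unknown)."""
    return ANALYSES.get(form.lower(), [])
```

Calling analyse("avions") returns both readings; choosing between them is the job of a tagger.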
Source code – A computer is unable to carry out orders expressed in natural language. It must be given a sequence of ones and zeros that humans find extremely difficult to read and to write. To obtain a program that can be used on a computer, the programmer therefore writes in a computer language that somewhat resembles natural language, but whose lexicon is very poor and whose syntax is absolutely rigid. What he writes is the source code of the program. A program called a compiler then transforms the source code into executable code, that sequence of ones and zeros. Another possibility is to use an interpreted language: an interpreter then transforms the source code directly into instructions executable by the machine.
compiler, interpreter – Compilation is the operation that consists in transforming source code into code executable by the machine. The program responsible for compiling is a compiler. An interpreter, instead of generating an executable, reads the source code and immediately turns it into executable instructions.
encoding – In the early days of computing, memory was limited and expensive, and we contented ourselves with 128 characters, which fit in seven bits (zeros and ones): digits, letters of the Latin alphabet in upper and lower case, and some punctuation. This first encoding, named ASCII (American Standard Code for Information Interchange), was soon replaced by richer character sets (256 characters), which have the defect of being “localized” (Western Europe, Central Europe, etc.). We ended up with the Unicode standard, which at the time of writing has 137,929 characters (it can manage more than a million). With “primitive” character sets (no more than 256 characters), the encoding was done systematically on one byte, which is in fact the basic element of data storage. For accented characters to appear correctly, one had to be able to tell which character set was used. With the adoption of Unicode, the problem of encoding appeared: how to store information that would need, a priori, 16 or even 32 bits in systems where the building block has eight? Several solutions, with forbidding names (UTF-16LE, UTF-8, UTF-7), have been proposed. Today, UTF-8 is on its way to becoming THE standard. Before any automatic processing, it is prudent to know the encoding of the text to be analyzed, bearing in mind that the problem arises above all with accented characters or non-Latin characters (Greek or Cyrillic).
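As a minimal illustration, the Python sketch below (standard library only) shows that the same accented character occupies a different number of bytes depending on the encoding, and what happens when bytes are read with the wrong one. The sample character is an arbitrary choice.

```python
# The same character stored under three encodings.
text = "é"

utf8_bytes = text.encode("utf-8")       # two bytes
latin1_bytes = text.encode("latin-1")   # one byte
utf16_bytes = text.encode("utf-16-le")  # two bytes, low byte first

# Reading UTF-8 bytes as if they were Latin-1 garbles the accent.
garbled = utf8_bytes.decode("latin-1")

# Reading them with the right encoding restores the text.
restored = utf8_bytes.decode("utf-8")
```

This is why a text whose encoding is unknown should be checked on its accented characters first: pure ASCII letters look identical under all three encodings' one-byte subsets.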
entry (dictionary) – In the dictionary, there is usually one entry per lemma. The entry begins with the canonical form of the word, followed by morphological indications (POS, genitive, principal parts), etymology, the different translations, examples, and irregular forms.
epenthesis – Epenthesis is the appearance, within a word, of a consonant intended to facilitate its pronunciation. In Medieval Latin, the classical damnum became dampnum; solemnitas became solempnitas.
inflection, paradigms, models – The inflection of a lemma is its ability to be written and pronounced in several ways. Inflection most often obeys a set of rules that we call a model. The model of a lemma is given by the dictionary, at the head of the entry, in an implicit way.
electronic formats, encoding – It has become very easy to obtain texts from all periods and in any language. But to work on a text, you must first know its format, i.e. the method used to save it on a machine. Here is a simplified list of text formats:
- Plain text, whose encoding may vary. More and more texts are encoded in UTF-8;
- Tagged text (HTML, LaTeX, Markdown), which makes it possible to change color, layout, size, font, weight, style, etc., anywhere in the text. Most often, these non-textual indications are inserted into opening and closing tags. In the HTML format, tags are delimited by the characters < and >;
- Files from word processors (odt for LibreOffice, docx for Word), whose rules are often very complex and which most of the time cannot be used as such by automatic analysis software. To remedy this drawback, these files must be converted into text files, either with the internal converters (save as, export) or with specialized external converters.
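For tagged text, the conversion to plain text can be sketched with Python's standard html.parser module. This is a simplified illustration, not a full converter: the names TextExtractor and strip_tags are hypothetical, and the sketch keeps all text between tags, including any script or style content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text found between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside the tags.
        self.chunks.append(data)


def strip_tags(html_text):
    """Return the text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)
```

For example, strip_tags("<p>Arma <b>uirumque</b> cano</p>") keeps only the three words and the spaces between them.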
form – A form is one of the elements of the inflection of a lemma. When we conjugate a verb, we cite one after the other all the forms of the verb. Each form can be labeled by a series of morphological traits: gender, number, case, person, mood, tense, voice, etc. Some POS have only one form, like the Latin preposition ad. Others have more than a hundred, like verbs.
Canonical form (or headword) – A lemma can have a very large number of forms, and sometimes a few graphic variants as well, for example negligo alongside neglego. To designate the lemma, we use one of its forms, which corresponds to a precise morphology, chosen by consensus among grammarians. For French, the convention is the singular for nouns, the masculine singular for adjectives, and the present active infinitive for verbs, e.g. aimer (to like). For Latin, we give the first person singular of the present active indicative, e.g. amo.
computer language – A computer language is a language designed specifically to be transformed into executable code, the only kind the machine can obey. Very few at the beginning of the computer age, these languages have multiplied. Almost all borrow their lexicon from English. Their syntax is extremely rigid, and the slightest error makes a program unusable. Once the programmer has written the source code, he uses a compiler or an interpreter, specialized programs that convert source code into executable code.
natural language – Natural language is what humans have always used to communicate, whereas computer languages were created from scratch by engineers.
Lemmatize, lemmatizer – To lemmatize a word is to find which lemma produced it. Quite often there are several solutions, as for the word avions, given as an example in the article morphological analysis. To lemmatize therefore has two meanings: either to find all the lemmas that can produce a given form, or to find the lemma that the author of the text used to produce this form. The lemmatization of a text is done in several steps:
1. Tokenization, which transforms the text into a list of forms;
2. The search for suffixes foreign to the lemma (in Latin, -que, -ue, -ne);
3. Taking into account Ramist spelling and graphic variants;
4. Lemmatization itself: which lemmas can give this form?
5. In case of multiple answers, the ranking of the results, starting with the most likely.
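The steps above can be sketched in a few lines of Python. Everything here is a toy under stated assumptions: the LEXICON table, its frequency figures and the function names are invented for illustration, and the Ramist spelling is undone before the suffix search so that -ve is recognized as -ue.

```python
import re

# Hypothetical toy lexicon: form -> list of (lemma, invented frequency).
LEXICON = {
    "arma": [("arma", 120), ("armo", 15)],
    "uirum": [("uir", 300)],
    "cano": [("cano", 40)],
}
ENCLITICS = ("que", "ue", "ne")


def tokenize(text):
    """Step 1: split the text into lower-case forms made of letters only."""
    return re.findall(r"[^\W\d_]+", text.lower())


def deramize(form):
    """Step 3: go back from Ramist spelling (j, v) to i and u."""
    return form.replace("j", "i").replace("v", "u")


def lemmatize_form(form):
    """Steps 2, 4 and 5: collect candidate lemmas, most frequent first."""
    form = deramize(form)
    candidates = list(LEXICON.get(form, []))
    for enclitic in ENCLITICS:
        # Step 2: also try the form without a possible enclitic suffix.
        if form.endswith(enclitic) and len(form) > len(enclitic):
            candidates += LEXICON.get(form[: -len(enclitic)], [])
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [lemma for lemma, _ in candidates]


def lemmatize(text):
    """Return each form of the text with its ranked candidate lemmas."""
    return [(form, lemmatize_form(form)) for form in tokenize(text)]
```

With this toy lexicon, lemmatize("Arma virumque cano") proposes two lemmas for arma (the noun arma and the verb armo) and finds uir behind virumque once the spelling is normalized and -que is set aside.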
lemma – A lemma is the constituent unit of a lexicon. In Latin, a lemma can occur in various forms, sometimes very different from each other. For example, the four forms fers, ferre, latos, tulerunt belong to the same lemma fero. Syntactic rules determine which form of the lemma to use. To grasp the meaning of a statement, it is essential to be able to identify the morphology of a form, an operation that is very fast and mostly unconscious.
lexicometry – Lexicometry is the study of the quantification of lemmas in a corpus, and of their distribution. It is used for many purposes: to identify the author of a text, date it, locate it, extract knowledge, compare, etc.
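At its simplest, this quantification is a count of lemmas. The sketch below uses Python's collections.Counter on a short list of lemmas invented for the illustration; a real study would start from a lemmatized corpus.

```python
from collections import Counter

# Hypothetical lemmatized text: one lemma per token, invented for illustration.
lemmas = ["sum", "rosa", "sum", "amo", "rosa", "sum"]

# Absolute frequency of each lemma.
counts = Counter(lemmas)

# Relative frequency, which allows texts of different lengths to be compared.
total = sum(counts.values())
relative = {lemma: n / total for lemma, n in counts.items()}
```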
Model – A model is a set of rules that makes it possible to inflect a lemma. We can therefore neither conjugate nor decline before knowing which model to apply. Among these rules, some give the method to calculate a radical, others say which ending to add to which radical to obtain the desired form. A model receives the name of a widely used lemma that follows it, for example rosa, templum, amo. It is better for this lemma to be unambiguous. For example, the noun amicus is also an adjective; it is better to choose lupus, which is always a noun. Paper dictionaries give the model only implicitly. For example, instead of saying that the lemma cubitum follows the model templum, the dictionary gives the genitive (i) and the gender (n.): cubitum, i, n. : elbow. School grammars offer a reduced number of models. Latin lemmatizers use more than a hundred. In the inflection of the most common lemmas, one very often finds irregular forms, which disobey the rules of their model.
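A model can be pictured as one rule that computes the radical plus a table of endings. The sketch below is a simplified, hypothetical rendering of the model rosa (the vocative and locative are left out); the names ROSA_MODEL and inflect are not taken from any existing lemmatizer.

```python
# Hypothetical, simplified model for nouns inflected like rosa.
ROSA_MODEL = {
    # Rule giving the radical: drop the final -a of the lemma.
    "radical": lambda lemma: lemma[:-1],
    # Ending to add to the radical for each (case, number) pair.
    "endings": {
        ("nominative", "singular"): "a",
        ("accusative", "singular"): "am",
        ("genitive", "singular"): "ae",
        ("dative", "singular"): "ae",
        ("ablative", "singular"): "a",
        ("nominative", "plural"): "ae",
        ("accusative", "plural"): "as",
        ("genitive", "plural"): "arum",
        ("dative", "plural"): "is",
        ("ablative", "plural"): "is",
    },
}


def inflect(lemma, model, case, number):
    """Build one form: radical computed by the model, then the ending."""
    radical = model["radical"](lemma)
    return radical + model["endings"][(case, number)]
```

Any lemma that follows the same model can reuse the table: inflect("puella", ROSA_MODEL, "accusative", "singular") gives puellam.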
morphology – List of the characteristic morphological traits of a form. Here is a simplified table of morphological traits according to the different POS, for the Latin language:
- noun: case, number
- pronoun: case, gender, number
- adjective: case, gender, number, degree
- adverb: degree
- verb: person, number, tense, mood, voice
- verb, adjectival forms: case, gender, number, tense, mood, voice
Examples:
  1. sustulisti: tollo, 2nd person singular, perfect active indicative;
  2. gestas: gero, accusative feminine plural, perfect passive participle;
  3. cupidissimorum: cupidus, masculine (or neuter) genitive plural, superlative.
OCR – Optical Character Recognition, in French ROC (little used). OCR is a process that consists of photographing a text and transforming the resulting image into text, which can then be corrected and sent to a lemmatizer. OCR is far from foolproof, and human proofreading is needed to correct OCRed text.
POS – Acronym for Part Of Speech. French has no agreed term: one finds the equivalents of grammatical category, grammatical class, nature, and even part of speech. We can define the concept of grammatical class in two ways:
- by the morphological traits of its inflection;
- by the link of the lemma with the real world: thing or being (noun), property (adjective), predicate (verb).
prefix, suffix – Constituent elements of the lemma which attach to the beginning or to the end of a word. For example, the verb clamo can receive
- a prefix: declamo
- a suffix: clamito
- both: conclamito
Latin has a particular type of suffix which, instead of modifying the meaning of the word, adds a second word to it:
- the suffix -que, which is equivalent to et + the word: rogandisque is another way of writing et rogandis.
- the suffix -ne, which makes the sentence interrogative: Ad amicos confugiam, I will take refuge with my friends; Ad amicosne confugiam? Shall I take refuge with my friends?
- the suffix -ue, which indicates an alternative: bis terue, two or three times.
prosody, meter, quantity, scansion, accent – Latin has long syllables and short syllables. Words are accented according to the arrangement of these longs and shorts. The study of this prosody can be useful for studying poetry and clausulae. It is possible to scan a text automatically, to accentuate it, and to locate its clausulae (characteristic sequences of longs and shorts at the end of a sentence).
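Once the quantities are known, placing the accent is mechanical. The sketch below applies the usual penultimate rule of classical Latin, which this glossary does not spell out: words of one or two syllables are accented on the first syllable; longer words on the penultimate if it is long, otherwise on the antepenultimate. The function name is hypothetical and the quantities are supplied by hand, "long" meaning long by nature or by position.

```python
def accent_position(quantities):
    """Return the index (from 0) of the accented syllable.

    quantities lists "long" or "short" for each syllable, first to last.
    """
    count = len(quantities)
    if count == 0:
        raise ValueError("a word needs at least one syllable")
    if count <= 2:
        return 0  # monosyllables and disyllables: first syllable
    if quantities[-2] == "long":
        return count - 2  # long penultimate carries the accent
    return count - 3  # otherwise the antepenultimate does
```

For a-mi-cus, with a long penultimate, the accent falls on mi; for do-mi-nus, with a short penultimate, it moves back to do.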
radical, ending – The radical (or stem) is the part of the lemma that does not change when it is inflected. The ending is what must be added to the radical to obtain a form. A lemma can have several radicals. Latin verbs usually have three: the infectum radical, the perfectum radical, and the supine radical. These radicals are given by the dictionary in the form of principal parts and, in the case of the model amo, they can be calculated by following simple rules.
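For the model amo, such rules might look like the sketch below. It rests on an assumption that the glossary does not state: the radical is taken to be the lemma without its final -o, so that the endings carry the vowel a (am + as, amau + i, amat + um). The function name is hypothetical.

```python
def amo_radicals(lemma):
    """Compute the three radicals of a verb that follows the model amo.

    Assumption: the lemma is the 1st person singular in -o, and the
    perfectum and supine radicals are regular (-au-, -at-).
    """
    if not lemma.endswith("o"):
        raise ValueError("expected a lemma ending in -o")
    base = lemma[:-1]
    return {
        "infectum": base,           # am-, as in am-as
        "perfectum": base + "au",   # amau-, as in amau-i
        "supine": base + "at",      # amat-, as in amat-um
    }
```

The same rules give clam-, clamau- and clamat- for clamo; a verb with irregular principal parts needs its radicals listed in the dictionary instead.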
Ramus – Pierre de la Ramée (1515-1572), in addition to a considerable philosophical work, is known for having systematically differentiated the two pronunciations of the letters u and i. Thus iuuenis, in Ramist spelling, becomes juvenis. Secondary-school textbooks and dictionaries use Ramist spelling. Most critical editions have kept the ancient spelling.
syntax, parsing – The syntax of a language is the set of rules which make it possible to generate a grammatical statement in that language. Automatic parsing is possible, but obtains worse results than morphological analysis.
tagger – Or morpho-syntactic tagger. Classical lemmatizers, when faced with several lemmatization solutions, are unable to choose the one that the author intended, which an intelligent reader recognizes without difficulty. The lemmatizer is therefore often assisted by a tagger, which uses statistics obtained from a training corpus. A human reader carries out a first labeling, and the computer takes over.
tokenization, token – Tokenization consists of transforming the text into a list of forms, or tokens.
morphological trait – Which form should we choose when we want to use a lemma in a statement? For example, among the forms of the French lemma beau [belles, belle, beaux, beau], which one should we choose in the sentence Que la campagne est b… (How b… the countryside is)? To decide, we use a series of morphological traits that try to match reality:
- the gender: is campagne (countryside) masculine or feminine?
- the number: are we talking about one countryside or several?
This reflection of reality is very imperfect. Here is the list of morphological traits used by Latin. Intentionally, the lists are in alphabetical order:
- case: ablative, accusative, dative, genitive, locative, nominative, vocative
- gender: feminine, masculine, neuter
- number: plural, singular
- person: first, second, third
- degree: comparative, positive, superlative
- tense: future, future perfect, imperfect, perfect, pluperfect, present
- mood: gerund, gerundive (verbal adjective), imperative, indicative, infinitive, participle, subjunctive, supine
- voice: active, passive
Each grammatical category uses one or more of these morphological traits. The article morphology lists them for each category.
graphic variant – Latin has a very long history, during which spelling conventions have changed considerably, mainly to reflect phonetic evolution. We can distinguish:
- Assimilation, which transforms two successive consonants into a double consonant: adcedo becomes accedo, conminus becomes comminus.
- Contraction, which is the loss of a part of the word that is no longer pronounced: amaueram becomes amaram, adscendo becomes ascendo, periculum becomes periclum.
- Abbreviation, which is the notation of only part of the word, the beginning or the end: consules is often written COSS, februarius becomes febr. Most first names are abbreviated.
- Agglutination, which consists of joining two successive words to form one: et enim becomes etenim, sic ut becomes sicut.
- Epenthesis (see the article epenthesis).
Also, some vowels closed or opened, and some diphthongs were simplified into a single vowel: ameicus becomes amicus, auorsor becomes auersor. The consonants have also evolved: colos becomes color, caelebs is written caeleps, etc. A lemmatizer must be able to identify and process these graphic variants. From the 16th century, at the instigation of Ramus, two letters were added to the Latin alphabet, j and v, in order to distinguish the two pronunciations of i and u. In Germanic proper names, the ligature of two successive v, vv, became the letter w.
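One simple way for a lemmatizer to cope with such variants is to try a few rewritten spellings before giving up on a form. The sketch below is a hedged illustration: the three rules are taken from the examples in this glossary (adcedo, conminus, dampnum), while the names VARIANT_RULES and candidate_spellings are hypothetical and the rule list is far from complete.

```python
# Hypothetical rewrite rules: variant spelling -> dictionary spelling.
VARIANT_RULES = [
    ("adc", "acc"),    # assimilation: adcedo -> accedo
    ("conm", "comm"),  # assimilation: conminus -> comminus
    ("mpn", "mn"),     # epenthesis undone: dampnum -> damnum
]


def candidate_spellings(form):
    """Return the form followed by the spellings obtained from the rules."""
    candidates = [form]
    for old, new in VARIANT_RULES:
        # Apply each rule to every spelling found so far.
        for known in list(candidates):
            rewritten = known.replace(old, new)
            if rewritten not in candidates:
                candidates.append(rewritten)
    return candidates
```

Each candidate can then be looked up in the lexicon; the original spelling is kept first so that a form already in the dictionary is never lost.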