lemmatization


Translations

lemmatization

[ˌlemətaɪˈzeɪʃən] N lematización f
Collins Spanish Dictionary - Complete and Unabridged 8th Edition 2005 © William Collins Sons & Co. Ltd. 1971, 1988 © HarperCollins Publishers 1992, 1993, 1996, 1997, 2000, 2003, 2005

lemmatization

n (Ling) → Lemmatisierung f
Collins German Dictionary – Complete and Unabridged 7th Edition 2005. © William Collins Sons & Co. Ltd. 1980 © HarperCollins Publishers 1991, 1997, 1999, 2004, 2005, 2007
References in periodicals archive
Lexical analysis (part of speech tagging, compound word detection) and syntactical analysis (disambiguation, lemmatization of nouns, verbs, adjectives)
Initially the basic lexicographic analysis was performed, which mainly covers the lemmatization and word frequency calculations, multivariate analyses using the Descending Hierarchical Classification (DHC) and post-factorial correspondence analysis (13).
Technically, it starts with the lemmatization (the algorithmic process of determining the lemma of a word based on its intended meaning, e.g.
In addition, it calculates the numbers of words, mean frequency and number of hapaxes (words with frequency 1); surveys the vocabulary and reduces terms based on their roots (lemmatization); creates a dictionary of reduced forms and identifies active and supplementary forms.
Preprocessing can also involve the removal of stop words, tokenization, lemmatization and stemming of words in the document. An expert needs to have classified the training data into categories (for supervised learning), as it is from such classification that the machine learning algorithm (MLA) will learn to form its classifier.
For experiment evaluation, the data was pre-processed with TreeTagger, a POS tagger and lemmatization tool.
For this task we used the TreeTagger with the English model supplied (Schmid, 1995) and the Russian model trained by Sharoff and based on the MULTEXT-East tagset (Sharoff et al., 2008); lemmatization in the Russian components was optimized with the aid of the lemma-prediction tool CSTlemma developed by Bart Jongejan (2006).
After cleaning, lemmatization, and stop-words deletion, the corpus contained 1,072,283 unique words and 103,933,786 instances.
For example, [26] applies different types of NLP pre-processing, including the handling of spelling errors, normalization, segmentation, stop words, lemmatization, and name recognition, among others.
TreeTagger permits regrouping the PoS tagging and the lemmatization: it groups together the different inflected forms of a word so they can be analyzed as a single item.
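
Several of the excerpts above describe lemmatization as grouping the inflected forms of a word so they can be counted as a single item, usually after tokenization and stop-word removal. A minimal sketch of that idea follows. It is not how TreeTagger or CSTlemma work: the lookup table `LEMMA_TABLE`, the stop-word set `STOP_WORDS`, and the function names are all hypothetical, chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical toy lookup table (inflected form -> lemma). Real tools such as
# TreeTagger rely on trained models and much larger lexicons.
LEMMA_TABLE = {
    "is": "be", "are": "be", "was": "be", "were": "be",
    "ran": "run", "running": "run", "runs": "run",
    "mice": "mouse", "words": "word",
}

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "of"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lemmatize(token):
    """Return the lemma for a token, or the token itself if it is unknown."""
    return LEMMA_TABLE.get(token, token)


def lemma_frequencies(text):
    """Count lemmas after tokenization and stop-word removal."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(lemmatize(t) for t in tokens)


if __name__ == "__main__":
    counts = lemma_frequencies("The mice were running and the mouse ran")
    print(counts)
```

In this sketch "mice" and "mouse" are counted under one lemma, as are "running" and "ran". A plain dictionary lookup cannot resolve forms whose lemma depends on context, which is one reason the tools cited above pair lemmatization with part-of-speech tagging.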