Next, the module performs pre-processing that includes removal of URLs, hashtags, usernames, and special characters; spelling correction with the aid of a dictionary; abbreviation substitution; lemmatization; and stop-word removal.
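The pre-processing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the function name `preprocess` and all four lookup tables are hypothetical toy stand-ins for a real dictionary, abbreviation list, lemmatizer, and stop-word list, and it assumes that hashtags and usernames are dropped whole rather than merely stripped of their `#` or `@`.

```python
import re

# Toy lookup tables: every entry is an illustrative assumption, not from the source.
ABBREVIATIONS = {"u": "you", "gr8": "great"}
SPELLING = {"moives": "movies"}
LEMMAS = {"movies": "movie", "loved": "love", "was": "be"}
STOP_WORDS = {"the", "a", "be", "you", "and"}


def preprocess(text):
    """Return the cleaned, lemmatized, stop-word-free tokens of `text`."""
    text = re.sub(r"https?://\S+", " ", text)     # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)          # remove usernames and hashtags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # remove special characters
    tokens = text.lower().split()
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]  # abbreviation substitution
    tokens = [SPELLING.get(t, t) for t in tokens]       # dictionary-based spelling fix
    tokens = [LEMMAS.get(t, t) for t in tokens]         # lemmatization by lookup
    return [t for t in tokens if t not in STOP_WORDS]   # stop-word removal


print(preprocess("@bob u loved the moives! gr8 #film http://t.co/x"))
# → ['love', 'movie', 'great']
```

The order matters in this sketch: abbreviations and misspellings are normalized before the lemma lookup so that the lookup sees standard word forms.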
Table 1 - Dates of lemmatization of rare diseases

YEAR OF PUBLICATION    PATHOLOGY
1732                   Dengue
1734                   Leprosy
1884                   Albinism, Diphtheria
1899                   Brachycephaly
1925                   Hydrocephalus
1927                   Scleroderma, Ichthyosis, Acromegaly
1936                   Hemophilia, Microcephaly
1970                   Achondroplasia, Botulism
1984                   Thalassemia
1989                   Brucellosis, Phenylketonuria, Glioma
2001                   Legionellosis, Narcolepsy, Nevus

Source: Own elaboration.
Although the compilers of the corpus claim that it is equipped with various types of monolingual annotation (41), such as tokenization, sentence splitting, lemmatization, word sense annotation, and so on, a manual check showed that the frequency results correspond only to the particular token in the search field.
The workflow, or automated set of procedures, might perform what linguists refer to as lemmatization on the string of words, which is to say, the trimming of each word into its smallest meaningful components, as well as removing plurals, capitalization, punctuation, and tense.
The first experiments in Croatian include [Tadic and Sojat, 2003], who use PoS filtering, lemmatization, and mutual information to identify candidate terms as a preprocessing step for terminological work, [Delac et al.
TreeTagger is used to classify extracted terms (concepts/relations) using the annotation and lemmatization
When a search term is preceded by one of these operators, the automatic synonymization and lemmatization (finding grammatical variants) of search terms are turned off, and only exact matches for the query term should be retrieved.
This article deals with the lemmatization of Old English and, more specifically, with the lemmas of verbs of the second weak class.
Stop-word removal, stemming, and lemmatization are representative pre-processing techniques in text mining.
In some cases it is similar to content analysis directed by semantic similarities, while in others it is simple lemmatization
A preliminary lemmatization of the transcribed corpus (329,837 words) led to a final list of 150 keywords, each with a minimum of 99 occurrences.
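One plausible way to derive such a keyword list is to map each token to its lemma, count the lemmas, and keep only those that reach a minimum frequency. The sketch below assumes exactly that; the names `keyword_list` and `lemma_of` are hypothetical, the lemma table is a toy stand-in for a real lemmatizer, and the source does not state how its list was actually produced.

```python
from collections import Counter


def keyword_list(tokens, lemma_of, min_count=99):
    """Count lemmas and keep those occurring at least `min_count` times.

    `lemma_of` is a token-to-lemma dictionary (a toy stand-in for a real
    lemmatizer); tokens missing from it are counted as their own lemma.
    """
    counts = Counter(lemma_of.get(t, t) for t in tokens)
    # most_common() orders by descending frequency.
    return [(lemma, n) for lemma, n in counts.most_common() if n >= min_count]


# Tiny demonstration with a lowered threshold.
tokens = ["cats", "cat", "dog", "cats"]
print(keyword_list(tokens, {"cats": "cat"}, min_count=2))
# → [('cat', 3)]
```

Counting lemmas rather than surface forms is what lets inflected variants such as "cat" and "cats" pool their occurrences toward the threshold.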
TreeTagger is a part-of-speech tagger and a lemmatization tool that is written in C++.