Three new stemmers for Slovenian language

The quality of automatic procedures involved in text processing phases of document database construction and natural language searching is determined to the great extent by the quality of their language-dependent part. The core of the language-dependent part is stemming especially when statistical approach to text processing is used. With stemmer we change all morphologic variants of the given type to the stem of that type and hopefully achieve the normalisation of the morphologic variability of natural language. The problem of stemming of English texts was successfully solved in eighties. In this text we describe our work at the development of the stemmer for Slovenian language.

Analysis of the results of the stemmer that we used and improved during the first year of the subproject's schedule showed its general tendency towards overstemming and suboptimal behaviour when dealing with words that originate from greek or latin roots. The latter property could be detremental to the quality of our search engine's results because a great deal of technical terms in medicine fall into that broad category.

Additional reasons for the renewed efforts regarding the development of Slovenian stemmer came from the machine learning part of the subproject. Users' interests formalised in user profiles are described with word stems which are used to generate queries to collect relevant documents from Internet's public search engines. Stemming used during the construction of their databases is far from optimal when they are dealing with less frequently spoken languages like Slovenian. We therefore need different stemmers for different language-dependent tasks aimed at with our subproject.

During the second year of the subproject three new versions of the stemmer for Slovenian language were developed. In this text we call them generic stemmer, optimal stemmer and sub-optimal stemmer. All of them are written in Java.

Generic stemmer for Slovenian language

Overstemming is the most frequent negative consequence of stemmers which are trying to model morhologically complex languages like Slovenian. Overstemming leads to cases when different word types produce the same stem with the consequence of fall in the search precision. We believe that the main reason for overstemming is the great number of rules needed to model the complex morphology. Because of that number greater possibility exist that some rule fires at the point where the word is already successfuly transformed into stem. Simple consideration leads to the conclusion that careful re-evaluation of existing rules with the goal of reduction of their number and better control of stop conditions could lead to improved stemming results.

The reduction of the number of rules that were used in our original stemmer was based on the fact that consonants in general have greater amount of information when compared to vowels. Good general policy to prevent overstemming would therefore be to preserve as many consonants as possible at the end of the stem, but to the extent that would not cause understemming.

The new stemmer for Slovenian language consults three lists:
 

CV_ENDINGS Subset of original endings list consisting only of the endings that split the word at the consonant-vowel pair.
CONS_PAIRS List of consonant-consonant pairs with rules for their transformation.
RECOD_RULES List of rules for solving the "irregularities" that were introduced into the word during its formation due to alterations, mainly e->zero alterations.

The algorithm itself works in three steps:

  1. The longest match from the CV_ENDINGS list.
  2. Iterative transformation of remaining consonant pairs at the end of the current word stump, consulting the CONS_PAIRS list at each iteration. The pair could be transformed into another pair, single consonant or an empty string. The loop ends when only one consonant remains or there is no rule in the database of rules for the remaining cononant pair.
  3. Possible transformation of stem complying to RECOD_RULES list.

Original list of endings used with previous stemmer consisted of over 5000 endings. Restraint to the subset of endings that split the word at the consonant-vowel pair greatly reduced this number. After the analysis of bigger corpus of Slovenian medical texts a certain number of new endings (complying to principles in step 1) mainly resulting from the originally non-Slovenian technical words was introduced into the list giving *** number to be inserted *** endings altogether. The list of rules for consonant pairs in step 2 trigger at *** number to be inserted *** different pairs. In step 3 the list of *** number to be inserted *** rules is used. During the evaluation phase all three lists are stil slowly growing and final numbers will be slightly higher.

*** the rule determining the minimal stem length and examples of stemming to be inserted here***

Optimal stemmer for Slovenian language

Algorithms used in stemmer for Slovenian language are based on simple statistical analysis of Slovenian medical texts and the main consideration in their creation was a tradeoff between their quality and speed. We had no intention to model the linguistic processes at work in language.

The reasons for development of optimal stemmer

The stemmer described here as "generic" is static in the sense that it could not be adapted to new sublanguages without manual changes in lists of rules. Furthermore, like any stemmer that works with longest match of endings, it does often not produce the longest stem common to all word forms of given type. The principle of the longest possible stem is important because it diminishes the possibility that some new word that was never before included in the analysis of stemming failures would produce stem already attributed to another word type.

With these two drawbacks in mind and with somewhat lesser priority of speed we developed a new variant of stemmer. Its main characteristics are evolutiveness and greater probability for words to give longest stems. Besides lists of endings and recoding rules described in section on generic stemmer, several other lists are used and updated during the process of stemming:
 

LEGAL_STEMS All stems produced and evaluated so far during the construction of database except those included into OTHER_STEMS list. Stems in this list are regarded as  examples of valid stems.
NEW_STEMS All new stems produced after the last "housekeeping" of lists and not evaluated yet. New stems are stems not found in LEGAL_STEMS.
OTHER_STEMS All stems that are valid indexing terms but are not included into LEGAL_STEMS list because they are not "normal ingredients of sublanguage". Stems in this list are mainly personal, family, or geographical names.

The algorithm

In contrast to the generic stemmer the process of stemming starts at the end of the word. The cursor marking the potential division point between stem and ending is gradually moving from right to left, one character at the time. At each character the full stemming is performed as in the following symbolic representation.
 

  1. Begin.
    STEM = word;
  2. Search for the STEM in LEGAL_STEMS and OTHER_STEMS list;
  3. If found go to 11;
  4. If MOVING_ALLOWED move the cursor in STEM one character to the left
    else go to 11;
  5. Search in the CV_ENDINGS list for the STEM's ending begining at the cursor;
  6. If found cut it off the STEM
    else go to 4;
  7. Iteratively process the STEM consonant pairs with the use of CONS_PAIRS list;
  8. Process the STEM with rules from RECOD_RULES list;
  9. Search for the STEM in LEGAL_STEMS and OTHER_STEMS list;
  10. If found  go to 11
    else go to 4;
  11. End.
    Pass the STEM to next indexing steps;


The heart of the algorithm are the same three steps in stemming like in the generic stemmer (steps 5-8 in the symbolic representation) which are tried for each move of the cursor. The stemming is not possible for each move but when it succeeds the algorithm searches the LEGAL_STEMS and OTHER_STEMS  lists for the resulting stem. If it is found the procedure is finished otherwise it is repeated for the next move of the cursor. The finishing conditions are met when stem is found in LEGAL_STEMS or OTHER_STEMS or the next move of the cursor is not allowed because the rule controlling the shortest possible stem fired (MOVING_ALLOWED condition in step 4). In the latter case the stem that exits the procedure is the stem resulting from the last successful stemming cycle for given word. The stem that exits the procedure and was not found in LEGAL_STEMS or OTHER_STEMS is included in the NEW_STEMS list. If none of the consecutive stems produced for the given word were found in the LEGAL_STEMS or OTHER_STEMS lists the final result is the same as it would be with the generic stemmer. Thus, in the worst case the optimal stemmer degenerates to the generic stemmer.

The maintenance of the stemmer apparatus

NEW_STEMS list collects all stems that were produced after the last evaluation of the list and were not found in the LEGAL_STEMS or OTHER_STEMS. Along with stems the original words are also collected. Stems in the list were included into the document database as indexing terms but were not yet evaluated by the stemming system administrator. Occassionally he or she uses a "housekeeping tool" for the evaluation of the list's contents. To facilitate the evaluation the "housekeeping tool" shows for each stem in the NEW_STEMS list  the parts of LEGAL_STEMS and OTHER_STEMS lists that are alphabetically near that stem. Stemming administrator inspects each stem in NEW_STEMS list and marks it with one of 3 marks:

Stems marked with G or O could stil be changed manually during evaluation by stemming administrator in order to better represent the original word type. The imperative is that any change conforms to stemming rules otherwise new stem would never be used in future stemming. Therefore the "housekeeping tool" controlls such changes. After inspection and mark-up of the whole list the "housekeeping tool" distributes the stems as follows: stems marked with G enter LEGAL_STEMS list, stems marked with O enter OTHER_STEMS list, and stems marked with B are deleted from the list. To maintain the synchronisation of lists and database all stems in NEW_STEMS and OTHER_STEMS lists changed during the evaluation are also immediately changed in database's index files while stems marked as B are deleted from there.

The only reason for existence of OTHER_STEMS list is to maintain the LEGAL_STEMS list easily surveyable. All stems that belong to particular language (medical sublanguage in our case) and only those stems are being gathered in LEGAL_STEMS leaving the OTHER_STEMS list to collect all other stems which are legal indexing terms but are not the main point of interest in building document database.

Stems in LEGAL_STEMS list were selected by the system's administrator as optimal stems for the words that produced them. If housekeeping of NEW_STEMS list is performed frequently enough almost all indexing terms in database are believed to be optimal and this fact is the reason that we named the stemmer optimal stemmer for Slovenian.

The stemmer and the housekeeping tool are written in Java.

Experiments with learning of the optimal stemmer 

To be included in future.

Sub-optimal variant of stemmer for Slovenian language

Important part of the subproject is machine learning of user profiles. With user profiles we are modeling the topics of users' interests. The topic in each model is represented with stems extracted from documents which the particular user marked as relevant. User profiles are used to retrieve relevant documents from document databases on Internet. Queries in forms appropriate for particular search engines (AltaVista, Yahoo, Excite...) are automatically constructed from stems in profiles and sent to search engines.

Search engines on Internet presently are not concerned with complicated stemming of texts in "small" languages. Users compose queries in such languages by manual stemming. Query elements produced in that way are substrings of words terminated with wildcards. Stems produced with optimal stemmer for Slovenian language are often transformed by recoding rules in such way that stems are not substrings of the original word and are therefore not suitable for automatic construction of queries. Using such stems would inevitably decrease search recall. For this reason the stems used in construction of user profiles are produced with generic stemmer without the third step -- the recoding rules. Theoretically it would be better to use the optimal stemmer (without recoding) but then the whole stemmer apparatus, different from existing one should be constructed and maintained. We deliberately sacrified some search precision to avoid such work overhead.


X

OPOZORILO : Pregledujete staro stran IBMI

Vsebine na strani so zastarele in se ne posodabljajo več. Stara stran zajema določene članke in vsebine, ki pa morajo biti še vedno dostopne.

Za nove, posodobljene vsebine se obrnite na http://ibmi.mf.uni-lj.si/