Annual report of DS workpackage team

with emphasis on personalised search engine

The main elements of the annual report are the following:

The results of the last reporting period are available to the general public here.

Summary of subproject's activities and results

Activities

Year I

During the first year of the subproject, research and development focused on the automatic subject description of Slovenian and English medical texts. The activities were mainly devoted to (a) a tokenizer for HTML files, (b) stop-word lists for Slovenian and English, and (c) stemming procedures for Slovenian.

The preparation of the first-year deliverables deviated partially from the original plans. The subproject proposal planned a document segmentation scheme as part of the deliverable, on the assumption that it would lead to a more accurate computation of document relevance during searching. This scheme was dropped because, since the submission of the proposal, the proportion of unstructured (non-HTML-marked) texts available on the WWW has greatly diminished. In HTML-marked documents, segmentation is imposed by the authors themselves through the various heading tags.

The time and resources set aside for document segmentation research were instead used to design the document database and search procedures and to begin their implementation.

Year II

Activities were devoted mainly to the development of a pilot database and search engine as testbeds for important concepts that had to be worked out before the final construction of the information retrieval system, which is the main deliverable of the subproject. These concepts are mainly the automatic indexing of Slovenian documents, non-Boolean searching and ranking algorithms, including relevance feedback searching, and the statefulness of the search engine.
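To illustrate what non-Boolean searching with ranking means in practice, the following is a minimal sketch in Python. It assumes a tf-idf weighting, which is a common choice but is not stated in this report; the function name rank and the sample data are hypothetical and do not come from the subproject's code.

```python
import math
from collections import Counter


def rank(query_terms, docs):
    """Return ids of matching documents, best first.

    docs maps a document id to its list of index terms (stems).
    The tf-idf weighting is an assumption made for this sketch.
    """
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))

    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        # A document does not need to contain every query term;
        # each term it does contain adds to its score.
        scores[doc_id] = sum(
            tf[t] * math.log(n / df[t]) for t in query_terms if df[t]
        )

    matching = [d for d in scores if scores[d] > 0]
    return sorted(matching, key=lambda d: (-scores[d], d))
```

Unlike a Boolean AND query, a document that contains only some of the query terms is still returned, only lower in the list.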

During the second year, two new stemming algorithms for Slovenian were developed and the first experiments on machine learning of users' information interests were performed.

Year III

Resources and activities for the third year were split among three lines of research and development, leading to (a) the final version of the indexing and search engine, (b) algorithms for learning users' interests and constructing user profiles, and (c) integration of the search engine and the user modelling modules into the personalised information tool. A minor deviation from the original plan concerned information filtering, which was abandoned for two main reasons: (a) research on information filtering has lost some of its appeal since the proposal was submitted, and its main functions are already covered by some of the most popular mailers, and (b) personalised information discovery is one of the most important aspects of our work, so it is beneficial to extend the principle of personalisation to the whole system. The resources freed by this decision were spent on the personalisation of the search engine and its underlying databases.

Results

The results incorporated into the automatic indexing procedures and the search engine include (a) the tokenizer for HTML-marked documents, (b) stop-word lists for Slovenian and English, and (c) lists of endings for Slovenian. The tokenizer, developed in awk, takes as input the URLs of the documents to be indexed, isolates legal words, and outputs them in a standardised form together with information on the original HTML tags (including nested tags). The tokenizer is coupled to the automatic language determination module. The Slovenian stop-word list and lists of endings are based on previous research; in the subproject they were expanded using word frequency analysis of texts. The English stop-word list was compiled from several existing lists.
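The behaviour of such a tokenizer can be sketched as follows. This is not the subproject's awk program: it is a small Python illustration that takes an HTML string rather than a URL, uses a hypothetical five-word stop-word list, and treats any run of letters as a legal word. The names TagAwareTokenizer and tokenize are assumptions made for the example.

```python
import re
from html.parser import HTMLParser

# Hypothetical miniature stop-word list; the real lists are much longer.
STOP_WORDS = {"the", "of", "and", "in", "je"}

# Tags that never enclose text and have no closing tag.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}

# Runs of letters only (digits and underscores excluded); covers č, š, ž.
WORD_RE = re.compile(r"[^\W\d_]+")


class TagAwareTokenizer(HTMLParser):
    """Collects (word, enclosing tags) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tags
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag and anything left open
        # inside it, so sloppy markup does not corrupt the stack.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        for word in WORD_RE.findall(data.lower()):
            if word not in STOP_WORDS:
                self.tokens.append((word, tuple(self.open_tags)))


def tokenize(html):
    parser = TagAwareTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

Keeping the whole stack of open tags with each word preserves the nesting information, so a later indexing step could, for example, weight words found in headings more heavily.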

Three stemmers for Slovenian were developed in the subproject. The first, 'generic' stemmer was eventually transformed into the 'suboptimal' and 'optimal' stemmers used in the final system. In both of these, only the endings that divide at consonant-vowel pairs were retained from the extensive list used in the 'generic' stemmer, and a set of rules for the transformation of consonant-consonant pairs was added. An additional set of recoding rules was developed for the 'optimal' stemmer. The 'suboptimal' stemmer produces stems that are always contained in the original words and is used during the automatic discovery of relevant documents outside the system's databases. The 'optimal' stemmer, an important innovation in its field, is used for automatic indexing of Slovenian documents and queries. It presents a novel approach among statistical stemmers for morphologically rich languages: it is not longest-match despite its use of endings lists, it is adapted to the professional sublanguage, and it is self-learning.

Three corpora of medical texts were compiled as a testbed for indexing and searching algorithms. Two of them are nearly parallel in the two languages (Slovenian and English). All three are available on the subproject's web pages.

Two versions of the document description databases and search engines were developed. The first, testbed version of the database was developed with Oracle RDBMS (Ver. 7), with programs for indexing, ranking, and searching written in PL/SQL. The user interface was implemented with HTML forms, and communication between the Web browser and the database was handled by CGI scripts. This version was used for the development and testing of algorithms that were later included in the final deliverable.
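The idea of stripping an ending only at a consonant-vowel boundary, so that the stem is always contained in the original word, can be sketched in Python as follows. This is only an illustration under stated assumptions: the list of endings is a tiny hypothetical sample, the minimum stem length is invented, the consonant-consonant and recoding rules are omitted, and the sketch uses plain longest-match stripping, which is exactly what the 'optimal' stemmer is said to avoid.

```python
VOWELS = set("aeiou")

# Hypothetical, tiny sample of Slovenian endings; the real lists are
# far more extensive.
ENDINGS = ["ega", "emu", "ih", "im", "om", "a", "e", "i", "o", "u"]


def strip_ending(word, min_stem=3):
    """Strip one ending so that the cut falls between a consonant and a vowel.

    The returned stem is always a prefix of the (lower-cased) input word.
    Longest-match order and min_stem are assumptions of this sketch.
    """
    word = word.lower()
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if not word.endswith(ending):
            continue
        stem = word[: len(word) - len(ending)]
        if (
            len(stem) >= min_stem
            and stem[-1] not in VOWELS  # stem ends in a consonant
            and ending[0] in VOWELS     # ending starts with a vowel
        ):
            return stem
    return word
```

With this sample list, different inflected forms such as "zdravila" and "zdravilom" reduce to the same stem, while short words are left untouched.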

The modules for machine learning of users' information needs use the documents that are marked as relevant during the relevance feedback process of searching. Based on the analysis of the marked documents, user profiles are gradually developed and used to construct queries that are sent to the major Internet search engines. The hits are collected, and the best of them are merged, re-ranked, and presented to the user.

The final version of the search engine was developed in the Java programming language and united with the machine learning modules into the personalised information tool, which is (in its beta version) the main deliverable of the subproject.

Several papers describing the work on the subproject were presented at various conferences. Three papers, describing the 'optimal' stemmer, machine learning of users' information needs, and the whole system, are in preparation.
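The flow from relevance feedback to re-ranked external hits can be sketched as below. The sketch is deliberately simplified: it represents a profile as plain term counts instead of the rules learned by the subproject's machine learning modules, and all function names, URLs, and terms are hypothetical.

```python
from collections import Counter


def update_profile(profile, relevant_doc_terms):
    """Add the terms of one document marked as relevant to the profile."""
    profile.update(relevant_doc_terms)
    return profile


def build_query(profile, k=3):
    """Use the k heaviest profile terms as a query for external engines."""
    ranked = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def merge_and_rerank(hit_lists, profile, top_n=10):
    """Merge hits from several engines, drop duplicates, re-rank by profile.

    Each hit is a (url, terms) pair; the score of a hit is the summed
    profile weight of the distinct terms it contains.
    """
    merged = {}
    for hits in hit_lists:
        for url, terms in hits:
            merged.setdefault(url, terms)  # keep the first copy of each URL

    def score(terms):
        return sum(profile[t] for t in set(terms))

    ranked = sorted(merged.items(), key=lambda kv: (-score(kv[1]), kv[0]))
    return [url for url, _ in ranked[:top_n]]
```

Because the profile grows with every document the user marks as relevant, both the generated queries and the ordering of the merged hits shift gradually towards that user's interests.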

Subproject's results implementation plan

Technology Implementation Plan

Definition of Product Objectives

Existing public search engines are not suitable for relatively small databases of predominantly non-English documents with specialised content and well-defined user populations. The comparative advantages of our search engine over the existing ones are:

With the search engine, which is the main deliverable of the subproject, users from the Slovenian medical community have obtained an information tool that will be used to store and retrieve documents originating in their community and to automatically gather relevant documents from other parts of the Internet.

Definition of the Product

The product, developed to the beta stage during the project, is a specialised search engine intended for use in a medical environment, with Slovenian and English documents and users belonging to the student and professional communities. The product consists of databases of stems and pointers to documents, the search engine proper, machine learning modules, and persistent storage of users' actions and search results. The document databases, search engine, and user interface are implemented in the Java programming language, while Oracle RDBMS is used for persistent storage. Machine learning of user profiles is implemented with the inductive logic programming tool FOIL.

The main research contributions of the subproject deliverables are the stemmers for Slovenian, the search module with relevance feedback capabilities, the personalisation mechanisms, and the evolving person/topic profiles.

Analysis of Intended Market

The products of the subproject are not aimed directly at the software market but will be used as an information tool in Slovenian medicine and will hopefully lead to better utilisation of public funds by improving the information supply. However, several modules of the search engine could be of interest to other user communities, among them the stemming algorithms and the tools for developing and maintaining stemmers.

The planned domain of use is medicine, with the Slovenian medical professional and student communities as users. The main product, the search engine, does not compete with the existing public search engines on the Internet; the relationship is rather complementary. Our search engine uses public search engines for automated information discovery, but focuses on local documents and provides sophisticated, personalised access to them. In this way it occupies a niche of information management that cannot be covered by general search engines but is important for document publication, storage, and access in the local research and professional community.

Implementation Strategy

The search engine will be accessible to the general public through the homepages of the Faculty of Medicine in Ljubljana, the homepage of the CRII consortium, and the homepages of electronic journals in the Digital library of Slovenian medical documents. The databases will store documents from servers in the medical domain.

Promotion will be carried out through papers at national and international conferences and in scientific journals. Two conference papers have already been presented, one at the Slovenian level (Language Technologies for the Slovene Language, IS98) and one at the European level (European Library Automation Group 99), and a paper for Medical Information Europe 99 has been accepted. Three papers for leading journals in the IR, computational linguistics, and machine learning fields are in preparation.

Education and training of users will be organised through existing courses for students at the graduate and postgraduate levels at the Faculty of Medicine.

We see the continuing development of our information tool as an important part of the implementation strategy. We are planning to use and further develop parts of the search engine in a project proposed for the 5th Framework.

