In the BOOTStrep project, we aim at building reusable wide-coverage lexical, conceptual and factual knowledge resources for the biology domain. On the one hand, we want to exploit already existing terminological resources (thesauri, classification systems, etc.) and combine them within a common, standardized representation framework. On the other hand, the gaps we encounter shall be filled incrementally by automatically acquiring new terms and concepts from the vast amount of literature in the field. In addition, we shall extract information about the linguistic properties of biological terms, which are not covered by any existing resource, though needed before knowledge-intensive applications, such as text mining, can deliver high-quality results of real interest to users. Based on such sophisticated forms of term management and ontology engineering, we then intend to create, incrementally maintain and continuously update a repository of biological facts extracted in a fully automatic way from biological documents fed into such a system.
From a resource-oriented perspective, we shall provide a bio lexicon which covers lexical forms (in English only) and their relevant linguistic information from the biology domain. Similar work will be carried out to set up an associated bio domain ontology. While still under development, the lexical resource can already be exploited to feed a sublanguage text analyser which recognises unknown terms, (some of) their morphological and syntactic attributes as well as semantic relations they have to other already known terms. This procedure leads to the incremental enhancement of the initial lexicon and the initial domain ontology. Simultaneously, the text analyser finds concrete biological facts in the documents being processed, which are stored in a biological fact repository. Both this fact database and the domain ontology form a comprehensive, continuously growing biological knowledge repository.
We hold that such an environment, which bridges work on heterogeneous forms of biological terminologies, lexicons, ontologies and fact databases, is a major step towards increased semantic interoperability for all actors involved in the biology domain (scientists, clinicians, biotech industry and business). In order to further ease access to the growing body of factual biological information, we will also supply a multilingual query interface (for English, German, Italian and French). The project will deliver reusable large-scale resources and resource-building tools for text-based knowledge harvesting. The results we achieve with this resource and tool suite will be evaluated in the context of several biological application scenarios.