aNT: Research on Advanced Natural Language Processing and Text Mining

aNT (advanced NLP and Text Mining) is a five-year project that started in April 2006. The technological focus of the project is on deep parsing and knowledge-based processing, with a strong emphasis on combining these with machine learning techniques. From an application standpoint, we aim to develop intelligent text management systems (information extraction, semantically enriched text, information retrieval, etc.), particularly for biomedical domains, along with a general set of NLP tools that are both adaptable to other domains and interoperable with tools developed by other groups.

GENIA Project

The GENIA project seeks to automatically extract useful information from texts written by scientists, in order to overcome the problems caused by information overload. Although the methods are customized for the microbiology domain, we intend the basic methods to be generalisable to knowledge acquisition in other scientific and engineering domains.

We are currently working on the key task of extracting event information about protein interactions. This type of information extraction requires combining many sources of knowledge, which we are now developing. These include a parser, an ontology, a thesaurus, and domain dictionaries, as well as supervised learning models.


U-Compare

U-Compare is an integrated text mining/natural language processing system based on the UIMA Framework which provides access to a large collection of ready-to-use, interoperable, natural language processing components. U-Compare allows users to build complex NLP workflows from these components via an easy drag-and-drop interface, and makes visualization and comparison of the outputs of these workflows simple.

As the name implies, comparison of components and workflows is a central feature of the system. U-Compare allows sets of components to be run in parallel on the same inputs and then automatically generates statistics for all possible combinations of these components. Once a workflow has been created in U-Compare, it can be exported and shared with other users or used with other UIMA-compatible tools. In addition to comparison, U-Compare therefore also functions as a general-purpose workflow creation tool.
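The kind of comparison described above can be illustrated with a minimal Python sketch. It does not use U-Compare or UIMA and is not how U-Compare itself computes its statistics; it only assumes that each component emits a set of (start, end, label) annotations over the same input, and scores every pair of components against each other. The component names, the sample annotations, and the helper functions `prf` and `compare_all` are all hypothetical.

```python
from itertools import combinations


def prf(reference, candidate):
    """Return (precision, recall, F1) of candidate annotations against reference."""
    reference, candidate = set(reference), set(candidate)
    overlap = len(reference & candidate)
    precision = overlap / len(candidate) if candidate else 0.0
    recall = overlap / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def compare_all(outputs):
    """Score every unordered pair of components; the first name is the reference."""
    return {
        (a, b): prf(outputs[a], outputs[b])
        for a, b in combinations(sorted(outputs), 2)
    }


# Hypothetical annotations (start, end, label) produced on the same input text.
outputs = {
    "tagger_a": {(0, 5, "Protein"), (10, 14, "Protein"), (20, 27, "Protein")},
    "tagger_b": {(0, 5, "Protein"), (10, 14, "Protein")},
    "tagger_c": {(0, 5, "Protein"), (30, 35, "Protein")},
}

scores = compare_all(outputs)
for (a, b), (p, r, f) in scores.items():
    print(f"{a} vs {b}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```

Exact-match scoring on spans is only one possible choice; a real comparison might also count partial overlaps or compare against a gold-standard corpus rather than pairing components with each other.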

U-Compare is a joint project between the University of Tokyo, the Center for Computational Pharmacology (CCP) at the University of Colorado Health Sciences Center, and the National Centre for Text Mining (NaCTeM) at the University of Manchester.

Kototoi Project (2000-2005)

We investigate new approaches and techniques that let general users extract information from widely distributed general-domain texts, rather than reusing old techniques or relying only on ad hoc techniques for a specific domain. We assume that information retrieval can be separated into two processes: one that depends on the individual situations of users and writers, and one that is independent of users' knowledge, demands, and situation. An example of the former is transforming or mapping ontologies; an example of the latter is assigning structures to texts that do not depend on the user's situation.

Our study focuses on indexing and structuring user-independent text information prior to retrieval, and on facilitating the user-dependent process by exploiting the indexes and structures assigned to texts. Specifically, we study i) an intelligent text archive where typed feature structures are assigned to texts, ii) scalable distributed software, including a web crawler, iii) mappings between heterogeneous ontologies, and iv) intelligent agents which negotiate with users to provide them with better views or information.