Our research group at the University of Tokyo has been granted a five-year project (Grant-in-Aid for Specially Promoted Research) on advanced NLP by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) in Japan. The project, aNT (advanced NLP and Text Mining), started in April 2006, and its technological focus is on deep parsing and knowledge-based processing with a strong emphasis on combining these with machine learning techniques. From an application standpoint, we aim to develop intelligent text management systems (information extraction, semantically enriched text, information retrieval, etc.), particularly for biomedical domains, along with a general set of NLP tools that are both adaptable for other domains and interoperable with tools developed by other groups.
Significant progress has been made in Natural Language Processing and Text Mining for the last ten years. In particular, although “deep” syntactic parsing, based on linguistic theories such as HPSG, CCG, LFG, LTAG , etc., had been slow and fragile for real-world application, recent research in the field has much improved their efficiency and robustness. The rapid progress is accomplished as the result of integration of the two streams of research, that is, research on efficient algorithms for unification-based formal grammar formalisms and that on probabilistic language modeling and machine learning using large corpora.
The aims of the project are to apply the successful research framework to the fields of semantic and pragmatic processing, and to bring breakthroughs to these fields. Despite of the progress in efficiency and robustness, the current “deep” parsing is not still sufficient in their accuracy for many applications. Furthermore, the output of current “deep” parsing is not deep enough for applications such as information extraction, intelligent IR and Q/A, etc. They require deeper layers of representation beyond “deep” syntax. In short, they need layers of representation for interpretation of text.
Straightforward application of the methods successful in syntax to these fields will not work for semantics and pragmatics for several reasons. (1) While different schools of computational linguistics still remain to agree the representation of “deep” syntax, deeper semantics or human interpretation of text is far less tangible than syntax. Before applying probabilistic modeling, we have to construct a semantically (or pragmatically) annotated corpus which requires to build empirically attested theories of semantics and pragmatics. (2) Knowledge-based processing or interpretation of text needs knowledge. In order not to repeat the same mistakes in the early AI research, the knowledge provided should have non-trivial coverage which enables significant computational experiments. (3) Unlike syntactic theories which are monolithic and local in comparison, the knowledge-based semantic processing is heterogeneous and global for integrating various cues in text. We will develop n extended frameworks both in algorithmic theories and probabilistic modeling. (4) We proved in the previous project that the computational resources, including the GRID and distributive computer systems, are crucial for the success of application of deep parsing to the real-world application. Since knowledge-based semantic processing inherently requires huge data resources including ontology, large semantic lexicon, etc., such environment will play a more central role in the future advanced NLP and Text Mining.
In the project, we have the following four research topics.