NaCTeM

New article on descriptive document clustering

2014-12-05

We are pleased to announce that a new paper describing a novel framework for descriptive document clustering has been published in the Journal of the Association for Information Science and Technology.

Descriptive document clustering aims to discover clusters of semantically interrelated documents, together with meaningful labels that summarise the content of each cluster. The novel descriptive clustering framework, called CEDL, is the result of a collaboration between NaCTeM and colleagues in the School of Electrical Engineering, Electronics and Computer Science at the University of Liverpool, in the context of the MRC-funded project Supporting Evidence-based Public Health Interventions using Text Mining.

Evaluation of CEDL has shown that it performs robustly across different domains, namely clinical trials, newswire and systematic reviews, and that it outperforms comparable state-of-the-art systems in terms of both the purity of the document clusters produced and the quality of the sub-topics identified by the automatically assigned descriptive labels.
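For readers unfamiliar with cluster purity, the sketch below shows the standard textbook definition: each cluster is credited with its most frequent gold-standard class, and the credited documents are divided by the total number of documents. The announcement does not state which variant of purity the paper uses, so the function name `cluster_purity`, its arguments and the toy labels are illustrative assumptions rather than the authors' evaluation code.

```python
from collections import Counter


def cluster_purity(cluster_ids, gold_labels):
    """Return the purity of a clustering (standard definition, assumed here).

    cluster_ids: one cluster identifier per document.
    gold_labels: the reference class of each document, in the same order.
    """
    if len(cluster_ids) != len(gold_labels):
        raise ValueError("cluster_ids and gold_labels must have the same length")
    if not cluster_ids:
        raise ValueError("purity is undefined for an empty collection")

    # Count how often each gold class occurs inside each cluster.
    class_counts = {}
    for cluster, label in zip(cluster_ids, gold_labels):
        class_counts.setdefault(cluster, Counter())[label] += 1

    # Credit every cluster with the size of its majority class.
    majority_total = sum(max(counts.values()) for counts in class_counts.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: six documents assigned to two clusters.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_labels = ["trial", "trial", "news", "news", "news", "review"]
toy_purity = cluster_purity(toy_clusters, toy_labels)
```

In the toy data, each cluster contains two documents from its majority class, so four of the six documents are credited. Purity ranges up to 1.0, which is reached only when every cluster contains documents of a single class.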

Full abstract

Descriptive document clustering aims at discovering clusters of semantically interrelated documents together with meaningful labels to summarise the content of each document cluster. In this work, we propose a novel descriptive clustering framework, referred to as CEDL. It relies on the formulation and generation of 2 types of heterogeneous objects, which correspond to documents and candidate phrases, using multilevel similarity information. CEDL is composed of 5 main processing stages. First, it simultaneously maps the documents and candidate phrases into a common co-embedded space that preserves higher-order, neighbour-based proximities between the combined sets of documents and phrases. Then, it discovers an approximate cluster structure of documents in the common space. The third stage extracts promising topic phrases by constructing a discriminant model where documents along with their cluster memberships are used as training instances. Subsequently, the final cluster labels are selected from the topic phrases using a ranking scheme using multiple scores based on the extracted co-embedding information and the discriminant output. The final stage polishes the initial clusters to reduce noise and accommodate the multitopic nature of documents. The effectiveness and competitiveness of CEDL is demonstrated qualitatively and quantitatively with experiments using document databases from different application fields.

Citation

Mu, T., Goulermas, J. Y., Korkontzelos, I. and Ananiadou, S. (2014). Descriptive Clustering via Discriminant Learning in a Coembedded Space of Multi-level Similarities. Journal of the Association for Information Science and Technology.
