Seminar — Louise Corti

Speaker: Louise Corti, Associate Director, UK Data Archive
Title: Automated indexing of survey questionnaires and interviews
Date: Friday 25th January at 11:00 a.m.
Location: MIB Lecture Theatre (MLG.001), Manchester Interdisciplinary Biocentre

I will talk about social science use cases for text mining, focusing on the automated indexing of survey questionnaires and transcribed interview data. The technologies and tools of the ASSERT project at Manchester could be adapted and developed to suit the extraction of terms and concepts from these very specific kinds of documents. A potential shared project would have three main aims:

1. The first is to understand better how researchers and data processors manually assign keyword terms both to individual survey questions and to in-depth interview data. In this way the texts are summarised as one or more concepts. At the UKDA, index terms are assigned from a social science thesaurus (HASSET) and are used to help users locate datasets of interest through its online catalogue of more than 4,000 collections. Currently there is no evidence to establish how reliable the manual classification process is, although it is guided by a set of organisational cataloguing rules. The process remains somewhat subjective and, being manual, is extremely labour intensive. No quality control is in place at the UKDA to check the reliability or robustness of the terms assigned. The systems from ASSERT could be refined to deal with term extraction and summarisation of these data collections.

2. The second, more practical outcome of a project would be to develop a user-friendly front-end tool to assist in the laborious task of manually assigning (extracting) keywords or concepts to survey questions and qualitative texts. This would likely be Java based and would slot into the UK Data Archive's data processing workflow. That is, the tools would be fully integrated into the process so that a largely non-technical data processor (certainly not a Unix user) could run the automated text mining tools via a GUI and then check and edit the results manually.

3. Finally, there is more to be done to build on exploratory work carried out by the UK Data Archive on applying named entity recognition tools to qualitative interview data and using them to create basic automated anonymisation tools. While other NLP toolsets were used for that work, joint work could investigate how the ASSERT tools might be adapted to handle coreference in spoken interview texts.

Presentation slides