Workshop Overview


The workshop will feature an invited talk by Dr. Georgios Paliouras, head of the Division of Intelligent Systems at the Institute of Informatics and Telecommunications, National Center for Scientific Research "Demokritos", Athens, Greece, and coordinator of the BioASQ project, which organises challenges on large-scale biomedical semantic indexing and question answering.

Over the past years, there has been an exponential growth in the amount of biomedical and health information available in digital form. In addition to the 23 million references to biomedical literature currently available in PubMed, other sources of information are becoming more readily available. For example, digitisation efforts have made large volumes of historical material readily available, there is a wealth of information in clinical records, and the growing popularity of social media channels has resulted in the creation of various specialised groups. Extensive information is also available in languages other than English, e.g. a great deal of medical literature is written in Chinese and, to a certain extent, also in Japanese, Korean and Russian.

With such a deluge of information at their fingertips, domain experts and health professionals have an ever-increasing need for tools that can help them to isolate relevant nuggets of information in a timely and efficient manner, regardless of both information source and mother tongue. However, this goal presents many new challenges in analysis and search. For example, given the highly multilingual nature of available information, it is important that language barriers do not result in vital information being missed. In addition, different information sources cover varying topics and contain differing styles of language, while varying terminology may be used by lay persons, academics and health professionals. There is also often little standardisation in the abbreviations that are used extensively in medical and health-related text.

Building upon the success of workshops on Building and Evaluating Resources for Biomedical Text Mining, held in conjunction with the previous three LREC conferences, the Fourth Workshop on Building and Evaluating Resources for Health and Biomedical Text Processing (BioTxtM 2014) aims to bring together researchers who have designed, created, adapted or evaluated biomedical and health text resources, those who are making use of such resources in their tools and applications (text mining, multilingual search, machine translation, information extraction, question-answering, document authoring, etc.), and domain experts/health professionals who would benefit from the use of such resources and tools. The workshop will allow an assessment of the current state of the art of resources, and will provide a forum for the discussion of current problems, ideas, questions and open issues. This will help to identify both future directions for research and new potential collaborations between members of the community. We particularly welcome submissions that deal with resources for languages other than English, or that facilitate multilingual access to information.


Applications in the health and biomedical domain are reliant on high quality resources. These include databases and ontologies (e.g., Biothesaurus, UMLS Metathesaurus) and lexica (e.g., BioLexicon and UMLS SPECIALIST lexicon). Given the frequently changing and variable nature of biomedical terminology and abbreviations, combined with the requirement to take multilingual information into account, there is an urgent need to investigate new ways of creating and updating such resources, or adapting them to new languages. New techniques may include combining semi-automatic methods, machine translation techniques, crowdsourcing or other collaborative efforts.

Community shared tasks and challenges (e.g. BioCreative I-IV, the ACL BioNLP Shared Tasks of 2009, 2011 and 2013, etc.) have resulted in an increase in the number of annotated corpora, covering an ever-expanding range of sub-domains and annotation types. Such corpora are helping to steer research efforts to focus on open research problems, as well as encouraging the development of increasingly adaptable and wider coverage text mining tools.

Interoperability and reuse are also vital considerations, as evidenced by efforts such as the BioCreative Interoperability Initiative (BioC) and the UIMA architecture. Several of the corpora introduced above are compliant with both BioC and UIMA, and are available within the U-Compare and Argo systems, which allow easy construction of NLP workflows and evaluation against gold standard corpora.

There is also a need to consider how resources and techniques can facilitate easier access to relevant information that is written in a variety of different languages. For example, can existing techniques and resources used for machine translation, multilingual search and question answering in other domains be adapted to simplify access to multilingual information in the biomedical and health domains?

Call for papers

We invite papers reporting on resources that support the application of biomedical text mining to various text types/information sources, biomedical sub-domains and languages, and the process of designing, building, updating, delivering, using and evaluating such resources for various purposes. The workshop will focus both on the lexical and knowledge repositories themselves (e.g., terminologies, ontologies, controlled vocabularies, factual databases, annotated corpora, etc.) as well as on issues relating to their usability (e.g., design guidelines, standards for building resources, storage and exchange formats, interoperability issues, etc.) and on the different ways in which they are being employed by applications and tools to facilitate information access.

The workshop will act as a stimulus for the discussion of several ongoing research questions driving current and future research in the area of biomedical and clinical text mining, in order to support access to information from a range of sources and written in a variety of languages. These questions include the following:

  • Among the available resources, which are the most used? What makes a good resource? How can we ensure that resources are maintained and updated?
  • Which types of resources are still lacking and what is needed urgently? Are any resources planned or in development to address such gaps?
  • Can existing resources sufficiently support text mining and synthesis of information from multiple text types/channels and biomedical subdomains? How can active learning and crowdsourcing improve the coverage of existing resources?
  • Which resources are available that cover languages other than English? Can existing resources/techniques (e.g. machine translation) be used to bootstrap the development of resources for other languages? Are these resources sufficient to support multilingual access and search of relevant information?
  • How easily can resources be employed for different purposes? What efforts have been made to make resources reusable or interoperable? To what extent have these efforts been successful?
  • How can machine translation, multilingual search and question answering simplify access to multilingual information?
  • Can automated processing of multilingual documents make the process of synthesizing information from multiple sources more efficient?
  • How can we involve medical professionals and biologists in providing documentation and annotating text so that it is suitable for machine analysis?
  • How well do current technologies for search, machine translation, question answering, etc. work in facilitating the efficient and effective location of information in biomedical and health-related text, from a number of different sources?

Topics of interest include but are not limited to:

  • Building biomedical and health resources for various languages: controlled vocabularies, terminologies, ontologies, corpora, multilingual resources
  • Guidelines, annotation schemas, annotation tools
  • Reengineering existing biomedical or general language resources
  • Semi-automatic and/or collaborative methods for the update, evolution, extension or enrichment of resources
  • Adapting resources to new sub-domains, text types or languages
  • Interoperability of resources and standards
  • Lightly annotated and noisy resources
  • Tools for the exploration of resources
  • Data exchange formats
  • Evaluation, comparison and critical assessment of resources/evaluation metrics
  • Innovative employment of resources in tools and applications, for both monolingual and multilingual access to biomedical and health-related information from a variety of textual sources
  • Evaluation of tools, applications and technologies making use of biomedical and health-related resources