Over recent years, there has been exponential growth in the amount of biomedical and health information available in digital form. In addition to the 23 million references to biomedical literature currently available in PubMed, other sources of information are becoming more readily available. For example, digitisation efforts have resulted in the ready availability of large volumes of historical material, there is a wealth of information available in clinical records, and the growing popularity of social media channels has resulted in the creation of various specialised groups. Extensive information is also available in languages other than English: much medical literature is written in Chinese in particular, but also, to a certain extent, in Japanese, Korean and Russian.
With such a deluge of information at their fingertips, domain experts and health professionals have an ever-increasing need for tools that can help them to isolate relevant nuggets of information in a timely and efficient manner, regardless of both information source and mother tongue. However, this goal presents many new challenges in analysis and search. For example, given the highly multilingual nature of available information, it is important that language barriers do not result in vital information being missed. In addition, different information sources cover varying topics and contain differing styles of language, while varying terminology may be used by lay persons, academics and health professionals. There is also often little standardisation amongst the extensive use of abbreviations found in medical and health-related text.
Building upon the success of workshops on Building and Evaluating Resources for Biomedical Text Mining, held in conjunction with the previous three LREC conferences, the Fourth Workshop on Building and Evaluating Resources for Health and Biomedical Text Processing (BioTxtM 2014) aims to bring together researchers who have designed, created, adapted or evaluated biomedical and health text resources, those who are making use of such resources in their tools and applications (text mining, multilingual search, machine translation, information extraction, question-answering, document authoring, etc.), and domain experts/health professionals who would benefit from the use of such resources and tools. The workshop will allow an assessment of the current state of the art of resources, and will provide a forum for the discussion of current problems, ideas, questions and open issues. This will help to identify both future directions for research and new potential collaborations between members of the community. We particularly welcome submissions on resources for languages other than English, or resources that facilitate multilingual access to information.
Applications in the health and biomedical domain are reliant on high-quality resources. These include databases and ontologies (e.g., Biothesaurus, UMLS Metathesaurus) and lexica (e.g., BioLexicon and UMLS SPECIALIST lexicon). Given the frequently changing and variable nature of biomedical terminology and abbreviations, combined with the requirement to take multilingual information into account, there is an urgent need to investigate new ways of creating and updating such resources, or adapting them to new languages. New techniques may include combining semi-automatic methods, machine translation techniques, crowdsourcing or other collaborative efforts.
Community shared tasks and challenges (e.g., BioCreative I-IV, ACL BioNLP Shared Tasks (2009, 2011, 2013), etc.) have resulted in an increase in the number of annotated corpora, covering an ever-expanding range of sub-domains and annotation types. Such corpora are helping to steer research efforts towards open research problems, as well as encouraging the development of increasingly adaptable and wider-coverage text mining tools.
Interoperability and reuse are also vital considerations, as evidenced by efforts such as the BioCreative Interoperability Initiative (BioC) and the UIMA architecture. Several of the corpora introduced above are compliant with both BioC and UIMA, and are available within the U-Compare and Argo systems, which allow easy construction of NLP workflows and evaluation against gold standard corpora.
There is also a need to consider how resources and techniques can facilitate easier access to relevant information that is written in a variety of different languages. For example, can existing techniques and resources used for machine translation, multilingual search and question answering in other domains be adapted to simplify access to multilingual information in the biomedical and health domains?
We invite papers reporting on resources that support the application of biomedical text mining to various text types/information sources, biomedical sub-domains and languages, and on the process of designing, building, updating, delivering, using and evaluating such resources for various purposes. The workshop will focus on the lexical and knowledge repositories themselves (e.g., terminologies, ontologies, controlled vocabularies, factual databases, annotated corpora, etc.), on issues relating to their usability (e.g., design guidelines, standards for building resources, storage and exchange formats, interoperability issues, etc.), and on the different ways in which they are being employed by applications and tools to facilitate information access.
The workshop will act as a stimulus for the discussion of several ongoing research questions driving current and future research in the area of biomedical and clinical text mining, in order to support access to information from a range of sources and written in a variety of languages. These questions include the following:
Topics of interest include but are not limited to: