NaCTeM

Controllable readability corpus

Introduction

Owing to the highly technical nature of biomedical documents, the ease with which people can understand their content varies according to their level of domain knowledge. While existing biomedical document summarization systems are generally only able to produce highly technical summaries, it would be desirable for them also to be able to produce plain language summaries (PLSs) that can be understood by lay people. To support the development of summarization systems that can do both, we have produced a corpus consisting of biomedical papers, each accompanied by its technical summary and by a PLS written by the authors.

Corpus Description

The corpus consists of 28,124 peer-reviewed biomedical research papers, along with their technical summaries and PLSs, from six PLOS journals that cover a broad range of biomedical research subjects: PLOS Biology, PLOS Computational Biology, PLOS Genetics, PLOS Medicine, PLOS Neglected Tropical Diseases, and PLOS Pathogens.

The PLSs are taken from the Author Summary section of articles. This section consists of a short, non-technical summary of the article, which is distinct from the abstract, with the goal of making the research accessible to both scientists and non-scientists. This is achieved by highlighting where the work fits within a broader context, presenting its significance in a simple manner, and avoiding the use of acronyms and complex terminology.

To construct the corpus, we downloaded the complete PLOS article dataset (as of 4th April 2022), after which we filtered out articles without an Author Summary section. We then extracted the full text, the abstract (as the technical summary), and the Author Summary (as the PLS) from the remaining papers. This resulted in a total of 28,124 document-technical summary-PLS triplets. We randomly sampled 1,000 triplets each to form the development and test sets, while the remaining 26,124 triplets constitute the training set.
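
The random split described above can be illustrated with a minimal Python sketch. It is not the code used to build the corpus: the function name split_triplets, the fixed seed, and the shuffle-then-slice approach are all assumptions made for illustration, since the page states only that 1,000 triplets were sampled at random for each of the development and test sets.

```python
import random


def split_triplets(triplets, dev_size=1000, test_size=1000, seed=0):
    """Randomly partition triplets into (train, dev, test) lists.

    Hypothetical helper: the seed and the shuffle-then-slice strategy are
    illustrative assumptions, not the corpus authors' procedure.
    """
    if dev_size + test_size > len(triplets):
        raise ValueError("dev and test sizes exceed the number of triplets")
    shuffled = list(triplets)               # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    train = shuffled[dev_size + test_size:]
    return train, dev, test
```

Applied to 28,124 triplets with the default sizes, this leaves 26,124 triplets for training, which matches the figures given above.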

Corpus Format

The corpus is provided in JSON Lines format. Separate files are provided containing the training (train_plos.jsonl), development (dev_plos.jsonl) and test (test_plos.jsonl) sets.

Each JSON object corresponds to an article and includes the following five fields:

  • doi - DOI of the article
  • title - Title of the article
  • abstract - Abstract of the article
  • plain language summary - PLS for the article (i.e., the content of the Author Summary section)
  • article - The full text of the article
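
As a minimal sketch of how the files might be read, the Python snippet below parses a JSON Lines file one record at a time and reports any of the five fields that a record lacks. The helper names read_jsonl and missing_fields are hypothetical, and it is an assumption that the keys in the files match the field names listed above exactly (including the spaces in "plain language summary").

```python
import json
from typing import Iterator

# Field names as listed above. That the keys in the files match these strings
# exactly is an assumption; adjust if the actual files differ.
EXPECTED_FIELDS = ("doi", "title", "abstract", "plain language summary", "article")


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one article record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def missing_fields(record: dict) -> list:
    """Return the expected field names that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]
```

For example, iterating over read_jsonl("dev_plos.jsonl") yields one dictionary per article, and record["abstract"] then gives the technical summary under the key assumption stated above. Reading line by line avoids holding the full training set in memory at once.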

Availability

The corpus is available for download according to the terms of the licence below.

Related Publication

Luo, Z., Xie, Q. & Ananiadou, S. (2022). Readability Controllable Biomedical Document Summarization. arXiv. https://doi.org/10.48550/ARXIV.2210.04705

Licence

The corpus was constructed at the National Centre for Text Mining (NaCTeM), School of Computer Science, University of Manchester, UK. It is licensed under a Creative Commons Attribution 4.0 International License. Please attribute NaCTeM when using the corpus, and please cite the following article:

Luo, Z., Xie, Q. & Ananiadou, S. (2022). Readability Controllable Biomedical Document Summarization. arXiv. https://doi.org/10.48550/ARXIV.2210.04705