NaCTeM Metabolite and Enzyme Corpus


NOTE: Please observe the terms of the NaCTeM Metabolite and Enzyme Corpus licence when downloading the corpus.

UPDATE 09/01/12: A number of inconsistencies in the corpus have been corrected, and EC numbers have been tagged as enzymes.


Text mining methods have added considerably to our capacity to extract biological knowledge from the literature. Recently, the field of systems biology has begun to model and simulate metabolic networks, requiring knowledge of the set of molecules involved. While genomics and proteomics technologies are able to supply the macromolecular parts list, the metabolites are less easily assembled. Most metabolites are known and reported through the scientific literature, rather than through large-scale experimental surveys. Thus, it is important to recover them from the literature.

To provide a means of training text mining systems to recognise metabolites automatically, a corpus has been created in which metabolite names, as well as enzyme names, have been manually annotated by two domain experts. The documents correspond to 296 MEDLINE abstracts from 2007, which were originally included in version 1 of the yeast metabolic network reconstruction (Herrgård et al., 2008). Annotations of metabolites and enzymes were restricted to those names that appear in the context of metabolic pathways. For example, in the sentence "glucose is an economically important chemical in the food industry", glucose does not play the role of a metabolite and is therefore not annotated.

The gold-standard (consensus) corpus was created by integrating the manual annotations of the two annotators. Both annotators discussed and checked the gold-standard data. Annotator A is senior to Annotator B in terms of annotation experience and years working in biochemistry, and therefore made the final decision where the two disagreed. The two sets of manual annotations were then compared to the gold-standard data. The F-scores are 88.49 for Annotator A and 78.35 for Annotator B.
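As an illustration of how an annotator's output can be scored against the gold standard, the following is a minimal sketch of a precision, recall and F-score calculation. The exact matching criterion used for the figures above is not stated here, so the sketch assumes exact matches on hypothetical (document, start, end, type) tuples; the function name f_score and the toy data are illustrative only.

```python
def f_score(annotator, gold):
    """Return (precision, recall, F-score) as percentages.

    Both arguments are sets of annotations. An annotation counts as
    correct only if it appears identically in the gold standard
    (exact matching is an assumption of this sketch).
    """
    true_positives = len(annotator & gold)
    precision = true_positives / len(annotator) if annotator else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f1


# Hypothetical toy data: (document id, start offset, end offset, type)
gold = {
    ("doc1", 0, 7, "METABOLITE"),
    ("doc1", 20, 30, "ENZYME"),
    ("doc1", 40, 47, "METABOLITE"),
}
annotator = {
    ("doc1", 0, 7, "METABOLITE"),
    ("doc1", 20, 30, "ENZYME"),
    ("doc1", 50, 55, "METABOLITE"),  # not in the gold standard
}

precision, recall, f1 = f_score(annotator, gold)
print(f"P={precision:.2f} R={recall:.2f} F={f1:.2f}")
```

Two of the three toy annotations match the gold standard, so precision, recall and F-score all come out at two thirds.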

Corpus format

The corpus is provided in XML format. The original XML markup provided on the MEDLINE abstracts is retained, and METABOLITE and ENZYME elements have been added. METABOLITE annotations may be embedded inside ENZYME annotations.
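The embedding of METABOLITE inside ENZYME matters when reading the text of an annotation, because the enzyme name is then split between a child element and the text that follows it. The sketch below shows one way to handle this with Python's standard xml.etree.ElementTree module. The sample markup is invented for illustration and is not taken from the corpus; only the METABOLITE and ENZYME element names come from the description above.

```python
import xml.etree.ElementTree as ET

# Invented sample, not a real corpus file. The wrapper element name is
# an assumption; the real files retain the MEDLINE markup.
SAMPLE = (
    "<AbstractText>"
    "<ENZYME><METABOLITE>glucose</METABOLITE>-6-phosphate dehydrogenase</ENZYME>"
    " converts <METABOLITE>glucose 6-phosphate</METABOLITE>."
    "</AbstractText>"
)


def extract_annotations(xml_text):
    """Return (type, text) pairs for every METABOLITE and ENZYME element."""
    root = ET.fromstring(xml_text)
    results = []
    for element in root.iter():  # document order, including nested elements
        if element.tag in ("METABOLITE", "ENZYME"):
            # itertext() also collects the text of embedded elements, so an
            # ENZYME containing a METABOLITE yields the full enzyme name.
            results.append((element.tag, "".join(element.itertext())))
    return results


annotations = extract_annotations(SAMPLE)
for kind, text in annotations:
    print(kind, "|", text)
```

Using itertext() rather than the element's text attribute is the key design choice here: the text attribute alone would be empty for an ENZYME element that begins with an embedded METABOLITE.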

The corpus download distribution contains 2 directories:

  • customize - DTDs and a CSS file, which allow the metabolite and enzyme annotations to be viewed in a web browser.
  • xmls - The XML files containing the annotations. There is one file per abstract.
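Given this layout, a quick sanity check after downloading is to walk the xmls directory and count the annotations of each type. The following is a minimal sketch under stated assumptions: the .xml file extension and the function name count_annotations are hypothetical, and the path to the unpacked distribution must be supplied by the user.

```python
from collections import Counter
from pathlib import Path
import xml.etree.ElementTree as ET


def count_annotations(corpus_dir):
    """Count METABOLITE and ENZYME elements across all files in xmls/.

    Assumes each abstract is stored as a separate file with a .xml
    extension (the extension is an assumption of this sketch).
    """
    counts = Counter()
    for xml_path in sorted(Path(corpus_dir, "xmls").glob("*.xml")):
        root = ET.parse(xml_path).getroot()
        for element in root.iter():
            if element.tag in ("METABOLITE", "ENZYME"):
                counts[element.tag] += 1
    return counts
```

Calling count_annotations with the directory that contains customize and xmls returns a Counter keyed by element name. Embedded METABOLITE elements are counted in addition to the ENZYME elements that contain them.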


Herrgård, M. J., Swainston, N., Dobson, P., Dunn, W. B., Arga, K. Y., Arvas, M., Büthgen, N., Borger, S., Costenoble, R., Heinemann, M., Hucka, M., Novère, N. L., Li, P., Liebermeister, W., Mo, M. L., Oliveira, A. P., Petranovic, D., Pettifer, S., Simeonidis, E., Smallbone, K., Spasíc, I., Weichart, D., Brent, R., Broomhead, D. S., Westerhoff, H. V., Kürdar, B., Penttilä, M., Klipp, E., Palsson, B. Ø., Sauer, U., Oliver, S. G., Mendes, P., Nielsen, J. & Kell, D. B. (2008). A consensus yeast metabolic reconstruction obtained from a community approach to systems biology. Nature Biotechnology 26, 1155–1160.

Nobata, C., Dobson, P., Iqbal, S. A., Mendes, P., Tsujii, J., Kell, D. B. and Ananiadou, S. (2011). Mining Metabolites: Extracting the Yeast Metabolome from the Literature. Metabolomics, 7(1), 94-101.

NaCTeM Metabolite and Enzyme Corpus Licence

1. Copyright of abstracts

Any abstracts contained in this corpus are from PubMed(R), a database of the U.S. National Library of Medicine (NLM).

NLM data are produced by a U.S. Government agency and include works of the United States Government that are not protected by U.S. copyright law but may be protected by non-US copyright law, as well as abstracts originating from publications that may be protected by U.S. copyright law.

NLM assumes no responsibility or liability associated with use of copyrighted material, including transmitting, reproducing, redistributing, or making commercial use of the data. NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. Persons contemplating any type of transmission or reproduction of copyrighted material such as abstracts are advised to consult legal counsel.

2. Copyright of Metabolite and Enzyme Annotations

The metabolite and enzyme annotations in the NaCTeM Metabolite and Enzyme Corpus are licensed by NaCTeM under a Creative Commons Attribution 3.0 Unported License.

Please attribute the corpus by citing the following paper:

Nobata, C., Dobson, P., Iqbal, S. A., Mendes, P., Tsujii, J., Kell, D. B. and Ananiadou, S. (2011). Mining Metabolites: Extracting the Yeast Metabolome from the Literature. Metabolomics, 7(1), 94-101.


For any queries relating to the corpus, please contact: sophia.ananiadou at