NaCTeM

Batch submission to TerMine (request access)

When you submit a batch request, your job will enter a queue. When your job is complete, you will receive an email containing the URL where you can view the results.

Please note: if you want to analyze a PDF document, you must specify a URL. PDF uploading is not currently supported.

Enter URL, or enter name of file to upload:

Select type of file: Text HTML PDF

Select Parser:

Enter Email Address for Notification:

About the C-value and TerMine ...

Technical terms are important for knowledge mining, especially in the biomedical domain, where vast numbers of documents are available. A domain-independent method for term recognition is therefore very useful for automatically recognizing terms in documents.

C-value is a domain-independent method for automatic term recognition (ATR) that combines linguistic and statistical analyses, with the emphasis placed on the statistical part. The linguistic analysis enumerates all candidate terms in a given text by applying part-of-speech tagging, extracting word sequences based on adjective/noun patterns, and filtering them with a stop-list. The statistical analysis assigns a termhood score to each candidate term using the following four characteristics:

  • the occurrence frequency of the candidate term
  • the frequency of the candidate term as part of other longer candidate terms
  • the number of these longer candidate terms
  • the length of the candidate term
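
To make the four characteristics concrete, here is a minimal Python sketch of how they might combine into a single score. This page does not state the formula, so the sketch assumes the commonly published C-value formulation: log2 of the term length, multiplied by the term's frequency minus the average frequency of the longer candidates that contain it. The names `is_nested` and `c_value` and the example counts are hypothetical, and the sketch uses raw frequencies of the longer candidates rather than any frequency adjustment TerMine itself may apply.

```python
import math


def is_nested(short, long):
    """True if `short` occurs as a contiguous word sequence inside the longer `long`."""
    n, m = len(short), len(long)
    if n >= m:
        return False
    return any(long[i:i + n] == short for i in range(m - n + 1))


def c_value(freq):
    """Score candidates; `freq` maps a term (tuple of words) to its occurrence count.

    Assumed formulation (not taken from this page):
      not nested: log2(len) * f
      nested:     log2(len) * (f - sum(f of containers) / number of containers)
    """
    scores = {}
    for term, f in freq.items():
        # Characteristics 2 and 3: the longer candidates that contain this term.
        containers = [t for t in freq if is_nested(term, t)]
        # Characteristic 4: length weight. Note log2(1) == 0, so single-word
        # candidates score zero in this simplified form.
        length_weight = math.log2(len(term))
        if not containers:
            scores[term] = length_weight * f
        else:
            nested_freq = sum(freq[t] for t in containers)
            scores[term] = length_weight * (f - nested_freq / len(containers))
    return scores


# Illustrative counts only (invented for this example).
freq = {
    ("adenoid", "cystic", "basal", "cell", "carcinoma"): 2,
    ("basal", "cell", "carcinoma"): 5,
    ("cell", "carcinoma"): 8,
}
scores = c_value(freq)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 3))
```

In this example "cell carcinoma" is the most frequent candidate, yet it ranks lowest because most of its occurrences are inside longer candidates. That is the effect the second and third characteristics are meant to capture.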

We have been developing a system for terminology management called TerMine, which uses the C-value method to extract terms. The implementation is optimized for scalability and processing speed: given a set of 1.3 million MEDLINE abstracts (2 GB of text), the standalone version of TerMine extracts 9.8 million term candidates and their termhood scores in about ten minutes.