NaCTeM

RF-TermAlign

RF-TermAlign is a term alignment tool that extracts bilingual dictionaries of terms.

It uses a machine learning method, a Random Forest (RF) classifier, to learn the string similarity of terms between a source and a target language. RF-TermAlign uses the WEKA implementation of RF. It can be applied to comparable corpora, where the distribution of positive and negative examples is skewed. For n terms in a source (English) corpus and m candidate translations in a target (Spanish) corpus, term alignment methods need to make n*m comparisons. RF-TermAlign has been applied to an English-Spanish comparable corpus of Wikipedia pages containing 1,200 source terms and 32,347 candidate translations, giving approximately 38.8 million comparisons in total. The output of RF-TermAlign is a bilingual dictionary containing N ranked candidate translations for each source term. Candidates are ranked by the classification margin (the confidence of the classifier).
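The n*m comparison and top-N ranking described above can be illustrated with a minimal Python sketch. It is not the RF-TermAlign code: the function names (comparison_count, dice_similarity, rank_candidates) are hypothetical, and a simple character-bigram Dice score stands in for the classification margin that the Random Forest classifier would supply.

```python
from typing import Callable, Dict, List, Set, Tuple


def comparison_count(n_source: int, m_candidates: int) -> int:
    """Number of source/candidate pairs an exhaustive aligner must score."""
    return n_source * m_candidates


def char_bigrams(term: str) -> Set[str]:
    """Set of character bigrams of a lowercased term."""
    text = term.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice_similarity(source: str, target: str) -> float:
    """Stand-in scorer (an assumption, NOT the Random Forest margin):
    Dice coefficient over character bigrams."""
    a, b = char_bigrams(source), char_bigrams(target)
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def rank_candidates(
    source_terms: List[str],
    candidates: List[str],
    score: Callable[[str, str], float],
    top_n: int,
) -> Dict[str, List[Tuple[str, float]]]:
    """Score every source/candidate pair and keep the top_n candidates
    per source term, highest score first."""
    dictionary: Dict[str, List[Tuple[str, float]]] = {}
    for source in source_terms:
        scored = [(cand, score(source, cand)) for cand in candidates]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        dictionary[source] = scored[:top_n]
    return dictionary


if __name__ == "__main__":
    # Corpus sizes quoted in the text: 1,200 x 32,347 = 38,816,400 pairs.
    print(comparison_count(1200, 32347))

    # Toy data, invented for illustration only.
    result = rank_candidates(
        ["hepatitis"], ["corazón", "hepático", "hepatitis"], dice_similarity, top_n=2
    )
    for source, ranked in result.items():
        print(source, ranked)
```

In a real setting the score argument would be replaced by a function returning the trained classifier's confidence for a term pair; the ranking logic would stay the same.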

The source code is available to download from here.

If you use RF-TermAlign for research, please cite the following paper:

Kontonatsios, G., Korkontzelos, I., Tsujii, J. and Ananiadou, S. (2014). Using a Random Forest Classifier to Compile Bilingual Dictionaries of Technical Terms from Comparable Corpora. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Volume 2: Short Papers.