NaCTeM

LogReg-TermAlign

LogReg is a bilingual term extraction tool that can be applied to comparable corpora to compile large-scale bilingual dictionaries. It uses a compositional character n-gram model, based on a logistic regression classifier, to learn a string similarity measure for term translations. Specifically, the method takes cross-lingual links between source and target character n-grams and uses them as second-order features to train a linear classifier. A second-order feature is a tuple of n-grams, one in the source language S and one in the target language T, that co-occur in a training translation instance. The provided Java code can be used to extract second-order n-gram features. To train a logistic regression classifier, you will need to download the LIBLINEAR software from

https://www.csie.ntu.edu.tw/~cjlin/liblinear/
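
As a rough illustration of what a second-order feature looks like, the sketch below pairs the character n-grams of a source term with those of its target translation and writes the result as one line in LIBLINEAR's sparse input format (label followed by ascending index:value pairs). This is not the distributed LogReg code: the class name SecondOrderFeatures, its method names, the "|" separator, the binary feature values and the choice to pair every source n-gram with every target n-gram of the same length are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of the LogReg distribution.
final class SecondOrderFeatures {

    /** Returns the character n-grams of a term in order of appearance.
     *  Works on UTF-16 code units, which is adequate for characters in the BMP. */
    static List<String> charNGrams(String term, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            grams.add(term.substring(i, i + n));
        }
        return grams;
    }

    /** Builds one feature per (source n-gram, target n-gram) pair that
     *  co-occurs in a single translation instance. Assumption: all pairs
     *  of same-length n-grams are kept, joined by "|". */
    static Set<String> extract(String source, String target, int n) {
        Set<String> features = new LinkedHashSet<>();
        for (String s : charNGrams(source, n)) {
            for (String t : charNGrams(target, n)) {
                features.add(s + "|" + t);
            }
        }
        return features;
    }

    /** Encodes one instance as a LIBLINEAR input line: a label followed by
     *  index:value pairs in ascending index order (indices start at 1).
     *  Unseen features are added to the shared index map. Assumption:
     *  binary feature values (1 when the tuple is present). */
    static String toLiblinearLine(int label, Set<String> features,
                                  Map<String, Integer> index) {
        TreeMap<Integer, Integer> active = new TreeMap<>();
        for (String f : features) {
            Integer id = index.get(f);
            if (id == null) {
                id = index.size() + 1;
                index.put(f, id);
            }
            active.put(id, 1);
        }
        StringBuilder line = new StringBuilder(label > 0 ? "+1" : "-1");
        for (Integer id : active.keySet()) {
            line.append(' ').append(id).append(":1");
        }
        return line.toString();
    }
}
```

In this sketch, a positive instance would be a known translation pair and a negative instance a non-translation pair; the feature index map has to be shared across all training and test instances so that the same tuple always maps to the same LIBLINEAR feature index.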

The LogReg classifier was previously used for the compilation of English-French, English-Spanish, English-Greek and English-Japanese bilingual dictionaries.

The source code is available to download from here.

If you use LogReg for research, please cite the following paper:

Kontonatsios, G., Korkontzelos, I., Tsujii, J. and Ananiadou, S. 2014. Combining String and Context Similarity for Bilingual Term Alignment from Comparable Corpora. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).