Protein Name Recognition by Approximate String Matching Algorithm

Introduction

Information on protein-protein interactions can be extracted from MEDLINE abstracts. The extracted information may be noisy, but the cost of obtaining it is considerably lower than that of biochemical experiments.

A further advantage is that the protein IDs recorded in the dictionaries can be obtained along with the recognized names.

Protein Name Recognition

Approximate String Matching

An excellent overview is provided by Navarro.
Protein names are extracted from existing databases.
PIR-NREF is a non-redundant protein database that includes data from PIR-PSD, SWISS-PROT, TrEMBL, RefSeq, GenPept, and PDB.
The names are noisy.
Elastic matching is needed.
Not ad hoc.
Edit distance.
DP matching.
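
The notes above name edit distance and DP matching without spelling them out. The following is a minimal sketch of approximate substring matching by dynamic programming, assuming unit costs for insertion, deletion, and substitution. The function name approximate_find and the parameter max_dist are hypothetical and chosen only for illustration; this is the standard textbook formulation, not necessarily the exact variant intended here.

```python
def approximate_find(pattern, text, max_dist):
    """Return (end, distance) pairs, where `end` is the 1-based position in
    `text` at which some substring within `max_dist` edits of `pattern` ends.
    Unit costs are an assumption made for this sketch."""
    m = len(pattern)
    # prev[i]: cheapest way to align pattern[:i] with a substring of the
    # text ending at the previous text position.
    prev = list(range(m + 1))
    hits = []
    for end, ch in enumerate(text, start=1):
        # cur[0] stays 0 because a match may start at any text position.
        cur = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # extra character in the text
                cur[i - 1] + 1,      # pattern character missing from the text
            )
        if cur[m] <= max_dist:
            hits.append((end, cur[m]))
        prev = cur
    return hits


if __name__ == "__main__":
    # A dictionary entry written with a hyphen, a text mention with a space.
    print(approximate_find("NF-kappa B", "binding of NF kappa B to DNA", 1))
```

Each dictionary entry would be used as the pattern and each abstract as the text. The cost of one scan is proportional to the pattern length times the text length.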

Elastic Matching Algorithm

Edit distance

Cost Function

1
[space]-[hyphen]
[numeral]-[numeral]
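
The entries above list only the character classes that receive special treatment, not their costs. The sketch below shows one way a class-dependent substitution cost could be plugged into the edit distance. All numeric values (DEFAULT_COST, SPACE_HYPHEN_COST, NUMERAL_COST) are hypothetical placeholders and are not taken from this document. Treating "1" as the default unit cost, making a space-hyphen substitution cheap, and making a numeral-numeral substitution expensive are all assumptions made for illustration.

```python
# Hypothetical placeholder values; the document does not specify them.
DEFAULT_COST = 1.0        # assumed reading of the "1" entry
SPACE_HYPHEN_COST = 0.1   # assumed: space and hyphen are nearly interchangeable
NUMERAL_COST = 1.5        # assumed: changing a digit usually changes the protein


def sub_cost(a, b):
    """Substitution cost between two characters, by character class."""
    if a == b:
        return 0.0
    if {a, b} == {" ", "-"}:
        return SPACE_HYPHEN_COST
    if a.isdigit() and b.isdigit():
        return NUMERAL_COST
    return DEFAULT_COST


def weighted_distance(s, t, indel=DEFAULT_COST):
    """Edit distance between s and t using sub_cost for substitutions
    and a flat cost `indel` for insertions and deletions."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        cur = [i * indel]
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j - 1] + sub_cost(a, b),  # match or substitution
                prev[j] + indel,               # delete a from s
                cur[j - 1] + indel,            # insert b into s
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(weighted_distance("NF-kappa B", "NF kappa B"))
    print(weighted_distance("IL-2", "IL-3"))
```

Note that a substitution cost above twice the insertion/deletion cost has no effect, because a deletion followed by an insertion is then cheaper.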

TODO

Tune the parameters automatically.
Speed up the matching.