| Abstract: |
The World Wide Web has grown into an information-mesh, with the most
important facts being reported through websites. While this information
is plentiful, its form is heavily unstructured, making it difficult to
deploy an automated information retrieval system that could extract useful
factoids. We present a new method capable of extracting relevant factoids
from unstructured Web data (hypertext). A factoid is a news item that might
be of interest with respect to a particular category, such as a change in leadership;
in our case the categories are motivated by corporate or market changes that can
be used for market intelligence purposes. We associate a factoid with a
snippet of natural language text. Factoid extraction, for a given category,
is formulated as a two-class classification problem. Feature abstraction
based on named-entity annotations is used to ameliorate the data sparsity
problem. We present a method for learning a category-specific classifier
from a set of pure hand-labelled positives and noisy positive instances
generated by targeted Web queries. The system is evaluated on two
factoid categories: corporate leadership changes and mergers & acquisitions.
The experiments yield promising empirical results. Time permitting, I would
also like to discuss IBM's open-source text analytics platform, UIMA. |