NaCTeM

Seminar - Sreeram Balakrishnan

Speaker: Sreeram Balakrishnan (Manager, Unstructured Information Management Architecture group, IBM India Research Lab)
Title: Factoid Extraction from the Web
Date: 12:30, Thursday 26th May
Location: Room F10, MSS Building
Abstract: The World Wide Web has grown into an information mesh, with the most important facts being reported through websites. While the information is plentiful, its form is heavily unstructured, making it difficult to deploy an automated information retrieval system that could extract useful factoids. We present a new method capable of extracting relevant factoids from unstructured Web data (hypertext). A factoid is a news item that might be of interest with respect to a particular category, such as a change in leadership; in our case, factoids are motivated by corporate or market changes that can be used for market intelligence purposes. We associate a factoid with a snippet of natural language text. Factoid extraction, for a given category, is formulated as a two-class classification problem. Feature abstraction using named entity annotations is used to ameliorate the data sparsity problem. We present a method for learning a category-specific classifier from a set of pure hand-labelled positives and noisy positive instances generated by smartly querying the Web. The system is evaluated on two particular factoid categories: corporate leadership changes and mergers & acquisitions. The experiments yield promising empirical results. Time permitting, I would also like to discuss IBM's open-source text analytics platform, UIMA.