Research Projects of Sanda Harabagiu
My current research interests focus on Natural Language Processing. I study the effects of textual inference on applications such as Information Extraction (IE), Information Retrieval (IR) and Question Answering (Q/A). I am interested in the problem of reference resolution and its contribution to natural language understanding, as well as in the relationship between reference and textual cohesion and coherence. My research also deals with problems of abductive inference, capable of explaining why a textual answer is correct. For this purpose I use an enhanced version of WordNet, comprising fully disambiguated definitions and enhanced relational semantics. This new resource is very helpful for incorporating world knowledge into IE and Q/A systems.
Here are some of my current sponsored projects:
NSF CAREER: Reference Resolution for Natural Language Understanding

A major obstacle in building robust systems that extract and interpret information, and summarize and answer questions from texts, is the need to identify the entities referred to by pronouns or other referential expressions. This project extends the PI's prior work on an empirical reference resolution system that relies on several sets of heuristics corresponding to various forms of reference. In particular, the framework will be extended to learn semantic knowledge that supports consistency checks. This enhancement will provide high-precision reference resolution and also substantially improve the recall of referential links. The research will be evaluated using reference-annotated texts and the Penn Treebank corpora. The outcome will be a corpus-based method for reference resolution for both pronouns and nominal expressions. First, the semantics of all referential noun phrases will be captured. Then, by extending the empirical environment with bootstrapping, this reference resolution technique should lead to a powerful tool capable of resolving reference correctly in a large variety of texts. Finally, the tool will be incorporated both in an information extraction system and in a question answering system, to measure its contribution to the overall performance of these systems. The proposed research departs from previous approaches to reference resolution in that it promotes data-driven techniques instead of relying on combinations of linguistic and cognitive aspects of language. The immediate pragmatic outcome indicated by the preliminary results should be a substantial recall enhancement. This research is sponsored by the National Science Foundation.
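
As a rough illustration of how a heuristic can be combined with a consistency check, the following minimal sketch resolves a pronoun to the most recent preceding mention that agrees with it in gender and number. It is not the project's system: the Mention class, the PRONOUNS table, and the recency heuristic are all assumptions made for this example, and the agreement features are hand-assigned rather than learned.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mention:
    text: str
    position: int  # token index in the discourse
    gender: str    # "m", "f", "n", or "?" when unknown
    number: str    # "sg" or "pl"


# Hypothetical agreement features for a few pronouns (illustration only).
PRONOUNS = {
    "he": ("m", "sg"),
    "she": ("f", "sg"),
    "it": ("n", "sg"),
    "they": ("?", "pl"),
}


def consistent(pronoun: str, candidate: Mention) -> bool:
    """Consistency check: number must match; gender must match unless unknown."""
    gender, number = PRONOUNS[pronoun]
    if number != candidate.number:
        return False
    return gender == "?" or candidate.gender == "?" or gender == candidate.gender


def resolve(pronoun: str, position: int, mentions: List[Mention]) -> Optional[Mention]:
    """Recency heuristic: pick the closest preceding mention that passes the check."""
    candidates = [
        m for m in mentions
        if m.position < position and consistent(pronoun, m)
    ]
    return max(candidates, key=lambda m: m.position, default=None)


# Toy discourse with hand-assigned features.
mentions = [
    Mention("Mary", 0, "f", "sg"),
    Mention("the report", 3, "n", "sg"),
    Mention("the analysts", 6, "?", "pl"),
]

antecedent = resolve("she", 9, mentions)
print(antecedent.text if antecedent else None)
```

In this sketch the consistency check filters out candidates before the heuristic ranks them, so a pronoun with no agreeing antecedent is left unresolved instead of being linked incorrectly.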

PI: Dr. Sanda Harabagiu
ARDA AQUAINT: Computational Implicatures for Advanced Question Answering

The capability of interpreting question implicatures is a very important feature of advanced Question Answering systems. When using a Question Answering system to find information, a professional analyst cannot separate his/her intentions and beliefs from the formulation of the question and therefore (s)he incorporates intentions and beliefs in the interrogation. Moreover, beyond the question, the analyst sometimes makes a proposal or an assertion. This implied information, not recognizable at the syntactic or semantic level, has great importance in the interpretation of a question, and therefore in the quality of the answers returned by a Question Answering system. This project concerns the study and development of computational methods that enable coercions of implicatures in the context of advanced Question Answering. This project is sponsored by ARDA.

PI: Dr. Sanda Harabagiu.
ARP: Knowledge Mining for Open-Domain Information Extraction

Nowadays, access to information from large-scale on-line text collections is largely limited to keyword-based searches which retrieve entire documents or passages containing the query keywords. While such tools are often satisfactory for retrieving information on general topics, they provide little support for accessing information involving specific relationships, events or facts.

Information Extraction (IE) technology enables the generation of structured, tabular representations of selected relations from large text collections. These representations can support more detailed document querying. However, IE systems rely on domain knowledge, and thus require customization every time a new topic is considered. This explains why, until now, developing extraction systems for a broad range of relations spanning a large number of semantic domains has been too expensive and time-consuming. This research concerns the development of the infrastructure that enables open-domain IE. This research is sponsored by the Advanced Research Program of the Texas Higher Education Coordinating Board.
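
To make the idea of a tabular representation concrete, here is a minimal sketch that fills one row per matched relation using a single hand-written pattern. The ACQUISITION pattern, the field names, and the sample text are all invented for this example. The pattern only covers one relation in one domain, which is the kind of per-topic customization described above.

```python
import re
from typing import Dict, List

# Hypothetical hand-written pattern for a single relation ("X acquired Y").
# A real IE system would need many such domain-specific rules.
ACQUISITION = re.compile(
    r"(?P<buyer>[A-Z][\w&]*(?: [A-Z][\w&]*)*) acquired "
    r"(?P<target>[A-Z][\w&]*(?: [A-Z][\w&]*)*)"
    r"(?: for \$(?P<price>[\d.]+ (?:million|billion)))?"
)


def extract_acquisitions(text: str) -> List[Dict[str, str]]:
    """Return one table row (a dict) per acquisition found in the text."""
    rows = []
    for match in ACQUISITION.finditer(text):
        rows.append({
            "buyer": match.group("buyer"),
            "target": match.group("target"),
            "price": match.group("price") or "",  # empty when not stated
        })
    return rows


# Invented sample text, for illustration only.
text = (
    "Acme Corp acquired Widget Works for $12 million. "
    "Analysts were surprised. Globex acquired Initech."
)

rows = extract_acquisitions(text)
for row in rows:
    print(row)
```

Unlike a keyword search, the resulting rows can be queried by field, for example to list every target acquired by a given buyer.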

PI: Dr. Sanda Harabagiu
NSF CADRE: A Tool for Transforming WordNet into a Core Knowledge Base

This project extends a popular database of English words to make it more useful in such tasks as question answering, information retrieval, and summarization. WordNet is a lexical database for English that has been widely adopted in artificial intelligence and computational linguistics for a variety of practical applications. The basic elements of WordNet are sets of words that are linked according to semantic relations: synonymy, antonymy, super-ordination, and so forth. WordNet is publicly available, widely used, and is currently being transformed into a multi-lingual database.
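
The organization described above, sets of words linked by semantic relations, can be sketched as a small data structure. The entries below are hand-made toy data, not extracted from WordNet, and the Synset class, the identifiers such as "car.n", and the helper functions are assumptions made for this example, not WordNet's actual file format or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Synset:
    words: Set[str]  # words treated as synonyms of one another
    # relation name -> ids of related synsets
    relations: Dict[str, List[str]] = field(default_factory=dict)


# Hand-made toy entries (illustration only, not real WordNet data).
SYNSETS: Dict[str, Synset] = {
    "car.n": Synset({"car", "automobile"}, {"hypernym": ["motor_vehicle.n"]}),
    "motor_vehicle.n": Synset({"motor vehicle"}, {"hypernym": ["vehicle.n"]}),
    "vehicle.n": Synset({"vehicle"}),
    "hot.a": Synset({"hot"}, {"antonym": ["cold.a"]}),
    "cold.a": Synset({"cold"}, {"antonym": ["hot.a"]}),
}


def synonyms(word: str) -> Set[str]:
    """Synonymy: the other words that share a synset with the given word."""
    result: Set[str] = set()
    for synset in SYNSETS.values():
        if word in synset.words:
            result |= synset.words - {word}
    return result


def hypernym_path(synset_id: str) -> List[str]:
    """Super-ordination: follow the first hypernym link up to a root."""
    path = [synset_id]
    while True:
        parents = SYNSETS[path[-1]].relations.get("hypernym", [])
        if not parents:
            return path
        path.append(parents[0])


print(synonyms("car"))
print(hypernym_path("car.n"))
print(SYNSETS["hot.a"].relations["antonym"])
```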

This project develops a set of tools that can be applied to current and future versions of WordNet to extend it for knowledge processing applications. The extensions are enhancements of the glosses that now contain definitions, comments, and examples. The enhanced glosses are part-of-speech tagged, syntactically parsed and semantically disambiguated. In addition, topically related words are clustered by lexical chains generated on the extended WordNet. This research is sponsored by the National Science Foundation.
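
One simple way to picture clustering by lexical chains is a greedy pass over the words of a text, where each word joins the first chain that already contains a directly related word. The sketch below is only an illustration under that assumption. The RELATED pairs are invented stand-ins for links that would come from the extended WordNet, and the greedy strategy is not claimed to be the project's algorithm.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical related-word pairs standing in for links in a lexical resource.
RELATED: List[Tuple[str, str]] = [
    ("car", "vehicle"), ("vehicle", "truck"), ("car", "engine"),
    ("bank", "loan"), ("loan", "interest"),
]


def build_graph(pairs: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Store each pair in both directions so relatedness is symmetric."""
    graph: Dict[str, Set[str]] = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def lexical_chains(words: List[str], pairs: List[Tuple[str, str]]) -> List[List[str]]:
    """Greedy chaining: attach each word to the first chain holding a related word."""
    graph = build_graph(pairs)
    chains: List[List[str]] = []
    for word in words:
        for chain in chains:
            if any(word == m or word in graph.get(m, set()) for m in chain):
                chain.append(word)
                break
        else:
            # No existing chain is related to this word: start a new one.
            chains.append([word])
    return chains


# Words in the order they might appear in a text (invented example).
words = ["car", "bank", "vehicle", "loan", "truck", "interest", "engine"]
chains = lexical_chains(words, RELATED)
for chain in chains:
    print(chain)
```

Because the pass is greedy and order-dependent, a word that appears before any of its related words starts its own chain. A fuller implementation would likely merge chains or use indirect links as well.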

PI: Dr. Dan Moldovan, Co-PI: Dr. Sanda Harabagiu.