Final report due. Your final project report is due, along with a group evaluation for those working in groups (see below). The final report must be a full-length conference-style paper (equivalent to 10-14 single-column pages) discussing your project. You should model your paper on some of the papers we've read this term. Either a PDF by e-mail or a hard copy in my box is fine. If you turn in a hard copy, I'd appreciate an e-mail letting me know that you've done so. Note that I don't care about the format; I only specify the length in single-column pages because otherwise people ask whether I mean single- or double-column pages. The goal of saying "conference-style paper" is that I want you to include things like:
In addition to the report, I want each person working on a group project to SEPARATELY turn in a report on how they felt all of the group members (yourself included) contributed to the project. Useful information includes: which parts of the project you did (e.g., if you divided the work by sections, who did which section), how many hours you estimate you worked, and how well you feel you and the other members of the group performed.
In the information retrieval (IR) world, there has been considerable recent work on so-called question answering (QA) systems, which attempt to answer natural language questions from a document corpus. There are various types of QA systems handling different types of questions [1] (e.g., factoid, definitional, list, hypothetical). A factoid question is one which does not require reasoning and can be answered by a named entity or short noun phrase [4]. We believe that an immense number of factoids lie hidden in a project's document repository, and that the extraction, search, retrieval, and caching of these factoids would be of great use to project members.
In a typical QA system, the user enters a question in natural language, such as "what is the power output of the photovoltaic system?". The system must then parse the question, identifying its various parts of speech. A keyword query is generated and used to retrieve documents from the repository that may contain the answer. Passages that could contain the answer are then extracted from those documents, and potential answers are pinpointed within them [2].
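As a concrete (if drastically simplified) illustration, here is a minimal, self-contained Python sketch of that pipeline; the stopword filter stands in for real question parsing, and an in-memory list of strings stands in for the document repository:

import re

STOPWORDS = {"what", "is", "the", "of", "a", "an", "in", "on"}

def build_keyword_query(question):
    # Stand-in for real question parsing: keep content words only.
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    return {t for t in tokens if t not in STOPWORDS}

def retrieve_documents(corpus, query):
    # Keep any document sharing at least one keyword with the query.
    return [doc for doc in corpus if query & build_keyword_query(doc)]

def extract_passages(doc, query):
    # Score each sentence by keyword overlap; keep the best-scoring ones.
    sentences = re.split(r"(?<=[.!?])\s+", doc)
    scored = [(len(query & build_keyword_query(s)), s) for s in sentences]
    best = max(score for score, _ in scored)
    return [s for score, s in scored if score == best and score > 0]

corpus = [
    "The proposed BIPV array covers the south roof. "
    "The photovoltaic system has a rated power output of 25 kW.",
    "The building has a total floor area of 4,200 square metres.",
]
query = build_keyword_query("What is the power output of the photovoltaic system?")
for doc in retrieve_documents(corpus, query):
    print(extract_passages(doc, query))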
If we restrict ourselves to factoid questions, and further to questions which request the value of a property of some entity (e.g., the power output of a photovoltaic system, or the floor area of a building), we may be able to reduce the complexity of our solution.
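Under this restriction, a question reduces to an (entity, property) pair. A minimal sketch of such a representation (the class and field names are purely illustrative):

from dataclasses import dataclass

@dataclass(frozen=True)
class PropertyQuestion:
    # A factoid question restricted to "value of <property> of <entity>".
    entity: str   # e.g. "photovoltaic system"
    prop: str     # e.g. "power output"

q = PropertyQuestion(entity="photovoltaic system", prop="power output")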
Let's consider the photovoltaic system example. The user would like to know the power output of the proposed photovoltaic system. The user enters "power output" as the property of interest, and "photovoltaic system" as the entity of interest. The system uses a lexical network or ontology of AEC (architecture, engineering, and construction) terms (e.g., to tell us that BIPV is an acronym meaning "building integrated photovoltaic system") to generate a keyword query that locates relevant documents, and the relevant passages within them. The following passages are returned:
Here we limit ourselves to the "middle" steps of the QA process: generating a keyword query using term expansion, retrieving documents, and retrieving passages. In [5], it is argued that many users prefer passages to exact answers.
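A sketch of the query-generation step, using a small hand-coded table of synonyms and acronyms as a stand-in for a real lexical network or ontology of AEC terms (the table entries and the boolean query syntax are assumptions for illustration):

EXPANSIONS = {
    "photovoltaic system": ["PV system", "BIPV",
                            "building integrated photovoltaic system"],
    "power output": ["rated power", "capacity", "kW output"],
}

def expand_terms(term):
    # Return the term plus its known synonyms and acronyms.
    return [term, *EXPANSIONS.get(term, [])]

def keyword_query(entity, prop):
    # Build a boolean keyword query over all expansions of both inputs.
    entity_part = " OR ".join(f'"{t}"' for t in expand_terms(entity))
    prop_part = " OR ".join(f'"{t}"' for t in expand_terms(prop))
    return f"({entity_part}) AND ({prop_part})"

print(keyword_query("photovoltaic system", "power output"))
# ("photovoltaic system" OR "PV system" OR "BIPV" OR ...) AND ("power output" OR ...)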
Outside of QA proper, we then have to worry about how to cache factoids and how to search the cache. In [6], (question, answer) pairs are cached, and incoming questions are compared semantically to the cached questions. Given that we wish to restrict ourselves to factoids concerning the value of a given property of an entity, a (subject, predicate, object) representation, as used by RDF, may be appropriate. Some challenges in this case are how to unambiguously represent subjects and predicates, how to represent and use metadata about factoids (e.g., the document the factoid came from, its date, and its author), and how to keep the factoid cache up to date as new documents are uploaded to the repository.
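A sketch of what such a triple-based cache might look like, with the metadata fields mentioned above; the field names and the crude normalization step are illustrative assumptions, not a settled design:

from dataclasses import dataclass

@dataclass
class Factoid:
    subject: str     # e.g. "photovoltaic system"
    predicate: str   # e.g. "power output"
    obj: str         # e.g. "25 kW" ("obj" since "object" is a builtin)
    source_doc: str  # document the factoid came from
    date: str        # when it was extracted or authored
    author: str

class FactoidCache:
    def __init__(self):
        self._facts = {}

    @staticmethod
    def _key(subject, predicate):
        # Crude disambiguation stand-in: case-fold and strip whitespace.
        # A real system would map both to canonical ontology terms.
        return subject.strip().lower(), predicate.strip().lower()

    def add(self, fact):
        key = self._key(fact.subject, fact.predicate)
        self._facts.setdefault(key, []).append(fact)

    def lookup(self, subject, predicate):
        # All cached values are returned so the user can compare sources
        # and dates, e.g. when a newer document revises a value.
        return self._facts.get(self._key(subject, predicate), [])

cache = FactoidCache()
cache.add(Factoid("Photovoltaic system", "power output", "25 kW",
                  source_doc="example-spec.pdf", date="2008-02-01",
                  author="(example author)"))
print(cache.lookup("photovoltaic system", "Power Output"))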
[1] Question Answering. Wikipedia. http://en.wikipedia.org/wiki/Question_answering.
[2] Evaluation of resources for question answering evaluation. Jimmy Lin. SIGIR 2005.
[3] Mining the Web: Discovering Knowledge from Hypertext Data. Soumen Chakrabarti. Morgan Kaufmann, 2003.
[4] Generic soft pattern models for definitional question answering. Cui et al. SIGIR 2005.
[5] Question answering passage retrieval using dependency relations. Cui et al. SIGIR 2005.
[6] Experiments with Interactive Question Answering in Complex Scenarios. Hickl et al. HLT-NAACL 2004.
[7] Learning to Extract Information from Semi-structured Text using a Discriminative Context Free Grammar. Viola and Narasimhan. SIGIR 2005.
[8] Learning surface text patterns for a question answering system. Ravichandran and Hovy. ACL 2002.
[9] Providing Answers to Questions from Automatically Collected Web Pages for Intelligent Decision Making in the Construction Sector. Journal of Computing in Civil Engineering, Vol. 22, Issue 1, pp. 3-13, 2008.
This project is part of a larger project on managing civil engineering documents.
Beyond dealing with PDF files themselves, a significant challenge in inferring structure from tables of data is that tables are organized for human consumption: they often present a view of the data designed for a particular type of comparison, rather than the most natural representation of the underlying data. We then have the problem of inferring the correct data representation from what is shown in the table.
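For instance, a table may be laid out "wide" so that design alternatives can be compared side by side, while the more natural representation is one row per (entity, property, value) fact, matching the factoid triples discussed earlier. A sketch of recovering the latter from the former using pandas (all column names and values here are invented for illustration; real extracted tables would be far messier):

import pandas as pd

# As extracted: one row per property, one column per design alternative,
# i.e. a layout chosen to make side-by-side comparison easy.
comparison = pd.DataFrame({
    "Property":      ["power output (kW)", "floor area (m2)"],
    "Alternative A": [25, 4200],
    "Alternative B": [40, 3900],
})

# Inferred representation: one (entity, property, value) row per fact.
facts = comparison.melt(id_vars="Property",
                        var_name="Alternative",
                        value_name="Value")
print(facts)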
Conferences: