Final report due. Your final project report is due, along with a group evaluation for those working in groups (see below). The final report must be a full-length conference-style paper (equivalent to 10-14 single-column pages) discussing your project. You should model your paper on some of the papers we've read this term. Either a PDF by e-mail or a hard copy in my box is fine. If you turn in a hard copy, I'd appreciate an e-mail letting me know that you've done so. Note that I don't care about the format; I only specify the length in single-column pages because otherwise people ask whether I mean single- or double-column pages. The goal of saying "conference-style paper" is that I want you to include things like:
In addition to the report, I want each person working on a group project to SEPARATELY turn in a report on how they felt all of the group members (yourself included) contributed to the project. Useful information includes: which parts of the project you did (e.g., if you divided the work by sections, who did which section), how many hours you estimate you worked, and how well you feel you and the other members of the group performed.
In the information retrieval (IR) world, there has been considerable recent work on so-called question answering (QA) systems, which attempt to answer natural language questions from a document corpus. There are various types of QA systems handling different types of questions [1] (e.g., factoid, definitional, list, hypothetical). A factoid question is one which does not require reasoning and can be answered by a named entity or short noun phrase [4]. We believe that an immense number of factoids lie hidden in a project's document repository, and that the extraction, search, retrieval, and caching of these factoids would be of great use to project members.
In a typical QA system, the user enters a question in natural language, such as "what is the power output of the photovoltaic system?". The system must then parse the question, identifying its various parts of speech. A keyword query is generated and used to retrieve documents from the repository that may contain the answer. Passages that could contain the answer are then extracted from those documents, and potential answers are pinpointed within them [2].
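As a concrete (if drastically simplified) illustration, here is a minimal, self-contained Python sketch of that pipeline; the stopword filter stands in for real question parsing, and an in-memory list of strings stands in for the document repository:

import re

STOPWORDS = {"what", "is", "the", "of", "a", "an", "in", "on"}

def build_keyword_query(question):
    # Stand-in for real question parsing: keep content words only.
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    return {t for t in tokens if t not in STOPWORDS}

def retrieve_documents(corpus, query):
    # Keep any document sharing at least one keyword with the query.
    return [doc for doc in corpus if query & build_keyword_query(doc)]

def extract_passages(doc, query):
    # Score each sentence by keyword overlap; keep the best-scoring ones.
    sentences = re.split(r"(?<=[.!?])\s+", doc)
    scored = [(len(query & build_keyword_query(s)), s) for s in sentences]
    best = max(score for score, _ in scored)
    return [s for score, s in scored if score == best and score > 0]

corpus = [
    "The proposed BIPV array covers the south roof. "
    "The photovoltaic system has a rated power output of 25 kW.",
    "The building has a total floor area of 4,200 square metres.",
]
query = build_keyword_query("What is the power output of the photovoltaic system?")
for doc in retrieve_documents(corpus, query):
    print(extract_passages(doc, query))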
If we restrict ourselves to factoid questions, and further to questions which request the value of a property of some entity (e.g., the power output of a photovoltaic system, or the floor area of a building), we may be able to reduce the complexity of our solution.
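Under this restriction, a question reduces to an (entity, property) pair. A minimal sketch of such a representation (the class and field names are purely illustrative):

from dataclasses import dataclass

@dataclass(frozen=True)
class PropertyQuestion:
    # A factoid question restricted to "value of <property> of <entity>".
    entity: str   # e.g. "photovoltaic system"
    prop: str     # e.g. "power output"

q = PropertyQuestion(entity="photovoltaic system", prop="power output")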
Let's consider the photovoltaic system example. The user would like to know the power output of the proposed photovoltaic system. The user enters "power output" as the property of interest, and "photovoltaic system" as the entity of interest. The system uses a lexical network or ontology of AEC (architecture, engineering, and construction) terms (e.g., to tell us that BIPV is an acronym meaning "building integrated photovoltaic system") to generate a keyword query that locates relevant documents, and the relevant passages within them. The following passages are returned:
Here we limit ourselves to the "middle" steps of the QA process: generating a keyword query using term expansion, retrieving documents, and retrieving passages. In [5], it is argued that many users prefer passages to exact answers.
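A sketch of the query-generation step, using a small hand-coded table of synonyms and acronyms as a stand-in for a real lexical network or ontology of AEC terms (the table entries and the boolean query syntax are assumptions for illustration):

EXPANSIONS = {
    "photovoltaic system": ["PV system", "BIPV",
                            "building integrated photovoltaic system"],
    "power output": ["rated power", "capacity", "kW output"],
}

def expand_terms(term):
    # Return the term plus its known synonyms and acronyms.
    return [term, *EXPANSIONS.get(term, [])]

def keyword_query(entity, prop):
    # Build a boolean keyword query over all expansions of both inputs.
    entity_part = " OR ".join(f'"{t}"' for t in expand_terms(entity))
    prop_part = " OR ".join(f'"{t}"' for t in expand_terms(prop))
    return f"({entity_part}) AND ({prop_part})"

print(keyword_query("photovoltaic system", "power output"))
# ("photovoltaic system" OR "PV system" OR "BIPV" OR ...) AND ("power output" OR ...)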
Outside of QA proper, we then have to worry about how to cache factoids and how to search the cache. In [6], (question, answer) pairs are cached, and incoming questions are compared semantically to the cached questions. Given that we wish to restrict ourselves to factoids concerning the value of a given property of an entity, a (subject, predicate, object) representation, as used by RDF, may be appropriate. Some challenges in this case are how to unambiguously represent subjects and predicates, how to represent and use metadata about factoids (e.g., the document the factoid came from, its date, and its author), and how to keep the factoid cache up to date as new documents are uploaded to the repository.
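A sketch of what such a triple-based cache might look like, with the metadata fields mentioned above; the field names and the crude normalization step are illustrative assumptions, not a settled design:

from dataclasses import dataclass

@dataclass
class Factoid:
    subject: str     # e.g. "photovoltaic system"
    predicate: str   # e.g. "power output"
    obj: str         # e.g. "25 kW" ("obj" since "object" is a builtin)
    source_doc: str  # document the factoid came from
    date: str        # when it was extracted or authored
    author: str

class FactoidCache:
    def __init__(self):
        self._facts = {}

    @staticmethod
    def _key(subject, predicate):
        # Crude disambiguation stand-in: case-fold and strip whitespace.
        # A real system would map both to canonical ontology terms.
        return subject.strip().lower(), predicate.strip().lower()

    def add(self, fact):
        key = self._key(fact.subject, fact.predicate)
        self._facts.setdefault(key, []).append(fact)

    def lookup(self, subject, predicate):
        # All cached values are returned so the user can compare sources
        # and dates, e.g. when a newer document revises a value.
        return self._facts.get(self._key(subject, predicate), [])

cache = FactoidCache()
cache.add(Factoid("Photovoltaic system", "power output", "25 kW",
                  source_doc="example-spec.pdf", date="2008-02-01",
                  author="(example author)"))
print(cache.lookup("photovoltaic system", "Power Output"))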
[1] Question Answering. Wikipedia. http://en.wikipedia.org/wiki/Question_answering.
[2] Evaluation of resources for question answering evaluation. Jimmy Lin. SIGIR 2005.
[3] Mining the Web: Discovering Knowledge from Hypertext Data. Soumen Chakrabarti. Morgan Kaufmann, 2003.
[4] Generic soft pattern models for definitional question answering. Cui et al. SIGIR 2005.
[5] Question answering passage retrieval using dependency relations. Cui et al. SIGIR 2005.
[6] Experiments with Interactive Question Answering in Complex Scenarios. Hickl et al. HLT-NAACL 2004.
[7] Learning to Extract Information from Semi-structured Text using a Discriminative Context Free Grammar. Viola and Narasimhan. SIGIR 2005.
[8] Learning surface text patterns for a question answering system. Ravichandran and Hovy. ACL 2002.
[9] Providing Answers to Questions from Automatically Collected Web Pages for Intelligent Decision Making in the Construction Sector. Journal of Computing in Civil Engineering, Vol. 22, Issue 1, pp. 3-13, 2008.
This project is part of a larger project on managing civil engineering documents.
Beyond dealing with PDF files themselves, a significant challenge in inferring structure from tables of data is that tables are organized for human consumption: they often present a view of the data designed for a particular type of comparison, rather than the most natural representation of the underlying data. We then have the problem of inferring the correct data representation from what is shown in the table.
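For instance, a table may be laid out "wide" so that design alternatives can be compared side by side, while the more natural representation is one row per (entity, property, value) fact, matching the factoid triples discussed earlier. A sketch of recovering the latter from the former using pandas (all column names and values here are invented for illustration; real extracted tables would be far messier):

import pandas as pd

# As extracted: one row per property, one column per design alternative,
# i.e. a layout chosen to make side-by-side comparison easy.
comparison = pd.DataFrame({
    "Property":      ["power output (kW)", "floor area (m2)"],
    "Alternative A": [25, 4200],
    "Alternative B": [40, 3900],
})

# Inferred representation: one (entity, property, value) row per fact.
facts = comparison.melt(id_vars="Property",
                        var_name="Alternative",
                        value_name="Value")
print(facts)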
Conferences: