dbTalks: Fall 2003
Contact: Wendy Hui Wang,
Jessica Zhao Zheng, dbTalks
Organizer
- Sep 19, 2003 at 2:00pm in CICSR304: DB
Lab Orientation Session
- Sep 25, 2003 at 4:00pm in CICSR104: Maryann
Yan Zhao
- Oct 2, 2003 at 4:00pm in CICSR104: Zhimin
Chen
- Oct 9, 2003 at 4:00pm in CICSR104: Xiaodong
Zhou
- Oct 16, 2003 at 4:00pm in CICSR104: Shaofeng
Bu
- Oct 30, 2003 at 4:00pm in CICSR104: Yuhan
Cai
- Nov 6, 2003 at 4:00pm in CICSR104: Peiqun Yu
(Project Demo)
- Nov 13, 2003 at 3:00pm in CICSR104: Pingdong
Ai
- Nov 20, 2003 at 4:00pm in CICSR104: Annie
Ying (Thesis Presentation)
- Nov 27, 2003 at 4:00pm in CICSR104:
Prof. Ramesh Krishnamurthy (Dept of CS, SFU)
Details:
- Friday, Sept 19, 2003 at 3:00pm
in MCML160
TITLE: DB Lab Orientation Session
SPEAKER: Faculty and graduate students in the DB lab
- Thursday, Sept 25, 2003 at 4:00pm
in CICSR104
TITLE: Quotient Cube and QC-Tree: Efficient Summarizations
for Semantic OLAP
SPEAKER:
Maryann Yan Zhao (Dept of CS, UBC)
ABSTRACT:
Most data cube compression approaches have focused on reducing
cube size but overlooked preserving the semantics of the data
cube. The quotient cube was proposed as a summary structure for
a data cube that retains its semantics. It can be constructed
very efficiently and it leads to a significant reduction in cube
size. However, many issues regarding the application of the
quotient cube remain unaddressed. For instance, no efficient
storage and index structure was provided; no specific algorithm
for query answering was given; and incremental maintenance
against updates was not discussed. In this thesis work, we aim
to develop solutions to the above open problems. First, we
introduce a sorted list to index a direct tabular representation
of the quotient cube and give associated query answering
algorithms. Since a tabular representation with an additional
index structure is neither as compact as it could be nor
efficient to maintain, we then propose a more efficient data
structure, called the QC-tree, to store and index the quotient
cube. We present a depth-first-search-based algorithm for
constructing the QC-tree directly from the base table. Third, we
devise efficient algorithms for answering various queries and
for incrementally maintaining the QC-tree against base table
updates. An extensive experimental study illustrates the space
and time savings achieved by our algorithms. Finally, we
implement a quotient cube based data warehouse system to
demonstrate our research accomplishments.
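As a rough illustration of the equivalence underlying the quotient cube, the following toy sketch (hypothetical two-dimensional data; not the thesis implementation) partitions the cells of a tiny cube into classes whose members cover exactly the same base tuples, so every cell in a class necessarily shares the same aggregate values:

```python
from itertools import product

# Toy two-dimensional base table (hypothetical data):
# (store, product) -> sales
base = {("s1", "p1"): 6, ("s1", "p2"): 12, ("s2", "p1"): 9}
ALL = "*"  # the usual "all values" marker in cube cells

def cover(cell):
    """Base-table keys matched by a cell; '*' matches any value."""
    store, prod = cell
    return frozenset(
        k for k in base
        if store in (ALL, k[0]) and prod in (ALL, k[1])
    )

# Group every non-empty cube cell by its cover: cells with identical
# covers form one class of the quotient cube.
stores = {k[0] for k in base} | {ALL}
prods = {k[1] for k in base} | {ALL}
classes = {}
for cell in product(stores, prods):
    c = cover(cell)
    if c:
        classes.setdefault(c, []).append(cell)
```

On this data the 8 non-empty cells collapse into 6 classes; e.g. (s2, p1) and (s2, *) cover the same single tuple and are stored once.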
- Thursday, Oct 2, 2003 at 4:00pm in
CICSR104
TITLE: Searching the Web
SPEAKER: Zhimin Chen (Dept of CS, UBC)
ABSTRACT:
We offer an overview of current Web search engine design. After
introducing a generic search engine architecture, we examine
each engine component in turn. We cover crawling, local Web
page storage, indexing, and the use of link analysis for boosting
search performance. The most common design and implementation
techniques for each of these components are presented. We draw
for this presentation from the literature, and from our own
experimental search engine testbed. Emphasis is on introducing
the fundamental concepts, and the results of several performance
analyses we conducted to compare different designs.
REFERENCE: Searching the Web - Arasu,
Arvind; Cho, Junghoo; Garcia-Molina, Hector; Paepcke, Andreas;
Raghavan, Sriram
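The link-analysis component mentioned in the abstract can be illustrated with a minimal PageRank power iteration. The four-page link graph below is hypothetical, and this is only a sketch of the general technique, not the engine described in the talk:

```python
# Minimal PageRank power iteration on a hypothetical four-page graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
pages = sorted(links)
d = 0.85                                   # standard damping factor
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):                        # iterate to (near) convergence
    new = {p: (1 - d) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:
            new[q] += d * rank[p] / len(outs)
    rank = new
```

Page "c", which three pages link to, ends up with the highest rank; since every page here has out-links, the ranks stay a probability distribution.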
- Thursday, Oct 9, 2003 at 4:00pm in
CICSR104
TITLE: Mining Newsgroups Using Networks Arising From
Social Behavior
SPEAKER: Xiaodong Zhou (Dept of CS, UBC)
ABSTRACT:
Recent advances in information retrieval over hyperlinked corpora
have convincingly demonstrated that links carry less noisy information
than text. We investigate the feasibility of applying link-based
methods in new application domains. The specific application
we consider is to partition authors into opposite camps within
a given topic in the context of newsgroups. A typical newsgroup
posting consists of one or more quoted lines from another posting
followed by the opinion of the author. This social behavior
gives rise to a network in which the vertices are individuals
and the links represent "responded-to" relationships. An interesting
characteristic of many newsgroups is that people more frequently
respond to a message when they disagree than when they agree.
This behavior is in sharp contrast to the WWW link graph, where
linkage is an indicator of agreement or common interest. By
analyzing the graph structure of the responses, we are able
to effectively classify people into opposite camps. In contrast,
methods based on statistical analysis of text yield low accuracy
on such datasets because the vocabulary used by the two sides
tends to be largely identical, and many newsgroup postings consist
of relatively few words of text.
REFERENCE: Mining Newsgroups Using Networks
Arising From Social Behavior - Rakesh Agrawal, Sridhar
Rajagopalan, Ramakrishnan Srikant, Yirong Xu
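Since "responded-to" links tend to signal disagreement, splitting authors into camps amounts to finding a large cut in the response graph. Here is a greedy local-search stand-in for the paper's graph-partitioning step, on a hypothetical response graph (author names invented):

```python
# Hypothetical "responded-to" graph: an edge means one author replied
# to another, which the talk's premise treats as likely disagreement.
edges = [("ann", "bob"), ("ann", "cal"), ("bob", "dee"), ("cal", "dee")]
people = sorted({p for e in edges for p in e})

# Greedy local search for a large cut: start with everyone in camp 0,
# then move anyone with more same-camp than cross-camp neighbours.
camp = {p: 0 for p in people}
changed = True
while changed:
    changed = False
    for p in people:
        same = sum(1 for a, b in edges if p in (a, b) and camp[a] == camp[b])
        cross = sum(1 for a, b in edges if p in (a, b) and camp[a] != camp[b])
        if same > cross:
            camp[p] = 1 - camp[p]
            changed = True

cut = sum(1 for a, b in edges if camp[a] != camp[b])
```

On this (bipartite) toy graph the search finds the perfect split: ann and dee end up opposite bob and cal, and every response edge crosses the cut.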
- Thursday, Oct 16, 2003 at 4:00pm
in CICSR104
TITLE: Concise Descriptions of Subsets of Structured
Sets
SPEAKER: Shaofeng Bu (Dept of CS, UBC)
ABSTRACT:
We study the problem of economical representation of subsets
of structured sets, that is, sets equipped with a set cover.
Given a structured set U, and a language L whose expressions
define subsets of U, the problem of Minimum Description Length
in L (L-MDL) is: given a subset V of U, find a shortest string
in L that defines V. We show that the simple set cover is enough
to model a number of realistic database structures. We focus
on two important families: hierarchical and multidimensional
organizations. The former is found in the context of semistructured
data such as XML, the latter in the context of statistical and
OLAP databases. In the case of general OLAP databases, data
organization is a mixture of multidimensionality and hierarchy,
which can also be viewed naturally as a structured set. We study
the complexity of the L-MDL problem in several settings, and
provide an efficient algorithm for the hierarchical case. Finally,
we illustrate the application of the theory to summarization
of large result sets, (multi) query optimization for ROLAP queries,
and XML queries.
REFERENCE: Concise Descriptions
of Subsets of Structured Sets - A. Mendelzon, Ken Q. Pu,
Univ of Toronto
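For the hierarchical case, the idea of a shortest description can be sketched on a toy product hierarchy (names hypothetical; this is not the paper's algorithm, just the intuition): describe a set of leaves by the highest ancestors that lie entirely inside it.

```python
# Hypothetical hierarchy: each internal node covers its leaf descendants.
tree = {
    "all": ["food", "drink"],
    "food": ["bread", "cheese"],
    "drink": ["tea", "coffee"],
}

def leaves(node):
    kids = tree.get(node)
    if not kids:
        return {node}
    return set().union(*(leaves(k) for k in kids))

def describe(node, target):
    """Short description of `target` (a set of leaves) using hierarchy
    nodes: name `node` itself when all its leaves are wanted, else recurse."""
    if leaves(node) <= target:
        return [node]
    return [n for k in tree.get(node, [])
            for n in describe(k, target & leaves(k))]
```

For example, {bread, cheese, tea} is described by the two nodes ["food", "tea"] rather than three leaf names, and the full leaf set collapses to just ["all"].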
- Thursday, Oct 30, 2003 at 4:00pm
in CICSR104
TITLE: Locally Adaptive Dimensionality Reduction for
Indexing Large Time Series Databases
SPEAKER: Yuhan Cai (Dept of CS, UBC)
ABSTRACT:
Similarity search in large time series databases has attracted
much research interest recently. It is a difficult problem because
of the typically high dimensionality of the data. The most promising
solutions involve performing dimensionality reduction on the
data, then indexing the reduced data with a multidimensional
index structure. Many dimensionality reduction techniques have
been proposed, including Singular Value Decomposition (SVD),
the Discrete Fourier transform (DFT), and the Discrete Wavelet
Transform (DWT). In this work we introduce a new dimensionality
reduction technique which we call Adaptive Piecewise Constant
Approximation (APCA). While previous techniques (e.g., SVD,
DFT and DWT) choose a common representation for all the items
in the database that minimizes the global reconstruction error,
APCA approximates each time series by a set of constant value
segments of varying lengths such that their individual reconstruction
errors are minimal. We show how APCA can be indexed using a
multidimensional index structure. We propose two distance measures
in the indexed space that exploit the high fidelity of APCA
for fast searching: a lower bounding Euclidean distance approximation,
and a non-lower bounding, but very tight Euclidean distance
approximation and show how they can support fast exact searching,
and even faster approximate searching on the same index structure.
We theoretically and empirically compare APCA to all the other
techniques and demonstrate its superiority.
REFERENCE: Locally Adaptive Dimensionality Reduction
for Indexing Large Time Series Databases - Kaushik Chakrabarti,
Eamonn Keogh, Sharad Mehrotra, Michael Pazzani
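The core idea of approximating a series by constant segments of varying lengths can be sketched with a greedy bottom-up merge (a stand-in for illustration only; the paper's own APCA construction algorithm differs):

```python
# Greedy bottom-up sketch of adaptive piecewise-constant approximation:
# start with one segment per point, then repeatedly merge the adjacent
# pair whose merge adds the least squared reconstruction error.
def apca(series, k):
    segs = [[x] for x in series]           # each segment: a list of values

    def err(seg):
        m = sum(seg) / len(seg)
        return sum((x - m) ** 2 for x in seg)

    while len(segs) > k:
        costs = [err(segs[i] + segs[i + 1]) - err(segs[i]) - err(segs[i + 1])
                 for i in range(len(segs) - 1)]
        i = costs.index(min(costs))
        segs[i:i + 2] = [segs[i] + segs[i + 1]]

    # (mean, length) per segment -- segments may have unequal lengths
    return [(sum(s) / len(s), len(s)) for s in segs]
```

Unlike a fixed-width piecewise approximation, the segment boundaries adapt to the data: a flat run gets one long segment, a spike gets its own short one.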
- Thursday, Nov 6, 2003 at 4:00pm in
CICSR104
TITLE: A Parallel Programming Infrastructure to Support
Interactive Parallel Data Mining
SPEAKER: Peiqun Yu (Dept of CS, UBC)
ABSTRACT:
There are two main projects that are currently active. The first
is the development of a parallel programming infrastructure to
support interactive parallel data mining. This work is based on
the premise that data mining is more effective when the user can
participate in the discovery process, and that the user must be
part of the data mining loop in order to guide the process.
However, data mining is both data- and compute-intensive, and
parallel computing, in particular cluster computing, is a useful
tool for achieving the response times necessary to make this
approach feasible. We have focused on three data mining
activities: calculation of correlation matrices, datacube
querying, and cluster analysis. There are two other important
aspects to this project: a) it is an open system that allows the
addition of new tools and the ability for these tools to
interact; b) it supports the interactive mining of large
datasets. A consequence of this goal is that the algorithms must
at any time be able to respond with the "current best" result,
even after reading only a portion of the data. The user can
adjust the response time of the algorithms to give partial
results to make decisions on future explorations. Currently we
have a prototype of the tool running that supports the three
above-mentioned data mining activities. The tool has a Java
frontend and can be connected via CORBA to an arbitrary compute
cluster running MPI/LAM.
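The "current best result after reading only a portion of the data" requirement is the anytime-algorithm pattern. A minimal single-machine sketch for the correlation activity (hypothetical data; the actual system is a parallel CORBA/MPI tool, not this loop) keeps a running correlation estimate available after every row:

```python
# Anytime sketch: stream (x, y) rows and yield a "current best"
# Pearson correlation estimate after each row, via running sums.
def anytime_corr(pairs):
    n = sx = sy = sxx = syy = sxy = 0.0
    for x, y in pairs:
        n += 1
        sx += x; sy += y
        sxx += x * x; syy += y * y; sxy += x * y
        if n >= 2:
            cov = sxy - sx * sy / n
            vx = sxx - sx * sx / n
            vy = syy - sy * sy / n
            if vx > 0 and vy > 0:
                yield cov / (vx * vy) ** 0.5   # partial-data estimate
```

A caller can stop consuming the generator at any point and still hold a usable estimate, which is the property that makes interactive exploration feasible.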
- Thursday, Nov 13, 2003 at 3:00pm
in CICSR104
TITLE: Updating XML
SPEAKER: Pingdong Ai (Dept of CS, UBC)
ABSTRACT:
As XML has developed over the past few years, its role has expanded
beyond its original domain as a semantics-preserving markup
language for online documents, and it is now also the de facto
format for interchanging data between heterogeneous systems.
Data sources export XML "views" over their data, and other systems
can directly import or query these views. As a result, there
has been great interest in languages and systems for expressing
queries over XML data, whether the XML is stored in a repository
or generated as a view over some other data storage format.
Clearly, in order to fully evolve XML into a universal data
representation and sharing format, we must allow users to specify
updates to XML documents and must develop techniques to process
them efficiently. Update capabilities are important not only
for modifying XML documents, but also for propagating changes
through XML views and for expressing and transmitting changes
to documents. This paper begins by proposing a set of basic
update operations for both ordered and unordered XML data. We
next describe extensions to the proposed standard XML query
language, XQuery, to incorporate the update operations. We then
consider alternative methods for implementing update operations
when the XML data is mapped into a relational database. Finally,
we describe an experimental evaluation of the alternative techniques
for implementing our extensions.
REFERENCE: Updating XML - Igor Tatarinov,
Zachary G. Ives, Alon Y. Halevy, Daniel S. Weld - SIGMOD Conference
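The kinds of basic update operations the paper proposes (insert, delete, replace on ordered XML) can be mirrored with Python's standard ElementTree library. The document below is hypothetical, and this only illustrates the operations, not the paper's XQuery extension syntax:

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the primitives below mirror the *kind* of
# operations the paper proposes, not its proposed XQuery syntax.
root = ET.fromstring(
    "<books><book id='1'><title>XML</title></book></books>"
)

# INSERT: add a new <book> child (append keeps document order)
new = ET.SubElement(root, "book", {"id": "2"})
ET.SubElement(new, "title").text = "XQuery"

# REPLACE: overwrite the text content of the first book's title
root.find("./book[@id='1']/title").text = "Updating XML"

# DELETE: remove a node; ElementTree deletes via the parent element
root.remove(root.find("./book[@id='2']"))
```

Note that deletion goes through the parent: unlike the paper's abstract operations, ElementTree elements do not know their own parent, a common wrinkle when mapping update primitives onto concrete APIs.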
- Thursday, Nov 20, 2003 at 4:00pm
in CICSR104
TITLE: Predicting source code changes by mining revision
history
SPEAKER: Annie Ying (Dept of CS, UBC)
ABSTRACT:
Software developers are often faced with modification tasks
that involve source which is spread across a code base. Some
dependencies between source, such as the dependencies between
platform-dependent fragments, cannot be determined by existing
static and dynamic analyses. To help developers identify relevant
source code during a modification task, we have developed an
approach that applies data mining techniques to determine change
patterns---files that were changed together frequently in the
past---from the revision history of the code base. Our hypothesis
is that the change patterns can be used to recommend potentially
relevant source code to a developer performing a modification
task. We show that this approach can reveal valuable dependencies
by applying the approach to the Eclipse and Mozilla open source
projects, and by evaluating the predictability and interestingness
of the recommendations produced for actual modification tasks
on these systems.
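The change-pattern idea reduces to frequent-itemset mining over check-in transactions. A toy sketch (hypothetical file names and history; restricted to file pairs for brevity, where the paper mines general patterns):

```python
from itertools import combinations
from collections import Counter

# Hypothetical revision history: each transaction is one check-in's files.
commits = [
    {"parser.c", "parser.h", "lexer.c"},
    {"parser.c", "parser.h"},
    {"parser.c", "parser.h", "main.c"},
    {"lexer.c", "main.c"},
]

# Count how often each file pair changed together.
pair_counts = Counter()
for files in commits:
    for pair in combinations(sorted(files), 2):
        pair_counts[pair] += 1

def recommend(changed_file, min_support=2):
    """Files that co-changed with `changed_file` at least min_support times."""
    return sorted(
        other
        for (a, b), n in pair_counts.items()
        if n >= min_support
        for other in ((b,) if a == changed_file else
                      (a,) if b == changed_file else ())
    )
```

Editing parser.c would prompt a recommendation of parser.h (3 co-changes), while the one-off co-change with lexer.c falls below the support threshold.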
- Thursday, Nov 27, 2003 at 4:00pm
in CICSR104
TITLE: Subset-Conjunctive Rules for Breast Cancer Diagnosis
SPEAKER: Prof. Ramesh Krishnamurthy (Dept of CS, SFU)
ABSTRACT:
The objective of this study was to distinguish, within a population,
between patients with and without breast cancer. The study was based
on the University of Wisconsin's dataset of 569 patients, of
whom 212 were subsequently found to have breast cancer. A subset-conjunctive
model is described to distinguish between the two groups of
patients. We formulate the problem of inferring subset-conjunctive
rules as a 0-1 integer program, show that it is NP-Hard, and
prove that it admits no polynomial-time constant approximation
algorithm. We examine the performance of a randomized algorithm
and of randomization using LP rounding. In both cases, the expected
performance ratio is arbitrarily bad. We use a deterministic
greedy algorithm to identify a Pareto-efficient set of subset-conjunctive
rules; describe how the rules change with a re-weighting of
the type-I and type-II errors; how the best rule changes with
the subset size; and how much of a tradeoff between the two
types of error is required by selecting a more stringent or
more lax classification rule. An important aspect of the analysis
is that we find a sequence of closely related efficient rules,
which can be readily used in a clinical setting because they
are so simple and have the same structure as the rules currently
used in clinical diagnosis. (Joint work with Rajeev Kohli and
Kamel Jedidi at Columbia University)
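A subset-conjunctive rule of the kind described classifies a patient as positive when at least k tests from a chosen subset fire. On a toy instance (invented binary indicators, not the Wisconsin data) one can enumerate all such rules and keep the Pareto-efficient ones by type-I/type-II error; the paper uses a greedy algorithm because this enumeration does not scale:

```python
from itertools import combinations

# Toy data: binary test indicators plus label (1 = cancer). Hypothetical.
patients = [
    ((1, 1, 0), 1), ((1, 0, 1), 1), ((0, 1, 1), 1),
    ((0, 0, 1), 0), ((1, 0, 0), 0), ((0, 0, 0), 0),
]
n_feats = 3

def errors(subset, k):
    """(type-I, type-II) errors of: positive iff >= k tests in subset fire."""
    fp = fn = 0
    for x, y in patients:
        pred = sum(x[i] for i in subset) >= k
        fp += pred and not y       # type-I: healthy patient flagged
        fn += (not pred) and y     # type-II: cancer missed
    return fp, fn

# Enumerate every rule on this tiny instance, then keep the Pareto set:
# rules no other rule beats on both error types at once.
rules = [(errors(s, k), s, k)
         for r in range(1, n_feats + 1)
         for s in combinations(range(n_feats), r)
         for k in range(1, r + 1)]
pareto = [(e, s, k) for e, s, k in rules
          if not any(e2[0] <= e[0] and e2[1] <= e[1] and e2 != e
                     for e2, _, _ in rules)]
```

On this separable toy data a single rule ("at least 2 of the 3 tests") achieves zero errors, so the Pareto frontier collapses to one rule; on real data the frontier traces the type-I/type-II tradeoff the abstract describes.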