Spatial Data Mining
Spatial data mining is the branch of data mining that deals with
spatial (location) data. Consider a map of the city of Vancouver
containing various natural and manmade geographic features, and
clusters of points (where each point marks the location of a particular
house). The houses might be noteworthy because of their size, their
historical interest, or their current market value. Clustering algorithms
exist to assign each point to exactly one cluster, with the number
of clusters being defined by the user.
Clustering is a technique that is quite useful in spatial data
mining applications. It divides the initial set of objects into
a number of non-overlapping subsets, in order to identify classes
or groups of objects whose locations (in some k-dimensional space)
are close to each other. This requires a suitable similarity function.
Euclidean distance is a natural choice of similarity function for
spatial data mining. There are many clustering approaches (e.g.,
hierarchical, partitioning, k-means, k-medoids), and we add that
efficiency is very important if the clusters contain many points.
One area of our research identifies spatial relationships between
clusters and features. For example, some of the houses in a particular
cluster may border a major park or may be near some important geographic
feature(s). The objective is to identify likely relationships. The
information is ranked, so that qualitative information is presented
first. For example, trivial information such as "50% of the houses
(in a particular cluster) are within 5 kilometres of an elementary
school" is not very interesting because there are many schools in
a major city. It is much more interesting to note that "64% of the
houses are within 500 metres of John Oliver High School" or that
"22% of the houses border Queen Elizabeth Park". Such information
could be of value to realtors, investors, or prospective home buyers;
however, the techniques employed can also apply to other spatial
applications (e.g., satellite images, photographs, oil and gas exploration).
Given a cluster of houses, how can we tell which features are
close? This problem is not trivial. There may be tens of thousands
of features to consider and the input features can vary significantly
from user to user. Furthermore, many map datasets are represented
in free format without accompanying indexes. We need to be able
to detect relationships among large numbers of objects without incurring
significant overhead.
Aggregate Proximity Relationships
One area of our research involves the problem of efficiently determining
aggregate proximity relationships in spatial data mining. We mention
the word "aggregate" because most clusters are unlikely to have
a uniform distribution of houses. It makes sense to assign a greater
weight to houses nearer a relevant feature's boundary. This is a
form of nearest neighbour search. Although index structures exist
to provide fast look-up, significant pre-processing is required
(e.g., O(n log n) for tree structures or O(k**2 n log n) for order-k
Voronoi Diagrams to find the k nearest neighbours). Such pre-processing
makes index construction questionable in a data mining context because
the number of features (n) can be very large.
We have implemented a (small constant) linear time algorithm (Algorithm
CRH) to find approximate proximity relationships using filtering,
memoization, buckets, and selection. Our filtering approach is based
on approximations to the features and clusters via successive convex
approximations (i.e., circles, rectangles, and convex hulls). We
do not require the use of indexes, and have shown that excellent
performance can be obtained on large ad hoc datasets, without constructing
indexes. Our approach supports scalability and incrementality, both
of which are important properties in spatial data mining.
Concept Generalization
This area of research expands on our work involving aggregate
proximity relationships. Given n input clusters, we want to associate
the clusters with classes of features (e.g., educational institutions
which in turn may be comprised of grade schools and post-secondary
institutions ... and grade schools which may be comprised of public
and private institutions ... and private grade schools which may
be comprised of boys' schools and girls' schools). Specifically,
we want to find classes of features that are in close proximity
to most (or all) of the clusters. For example, if 2 of 3 clusters
are close to girls' private schools, and the other cluster is close
to a boys' private school, then we can generalize these observations
to the fact that all 3 clusters are close to some private grade
school. Alternatively, given two input clusters, we want to find
discriminating classes of features that distinguish one cluster
from the other.
We have proposed two algorithms that use concept generalization
to find such classes of features:
- Algorithm GenCom, which is used to extract commonalities. For
example, if n expensive housing clusters are given as input, GenCom
may discover the general pattern that they are all very close
to the shoreline. Moreover, using concept generalization, GenCom
considers not only individual features, but also classes of features
in pattern extraction. For instance, while it may not be true
that every input cluster is close to one particular golf course,
GenCom may find out that each cluster is close to some golf course
-- not necessarily the same one.
- Algorithm GenDis, which is used to find "discriminating" classes
of features that distinguish one cluster from another. For example,
if an expensive housing cluster and a poor housing cluster are
given as input, GenDis may find that "golf course" and "police
station" are discriminators, in that while the expensive housing
cluster is close to a golf course, the poor housing cluster is
not, and that while the poor housing cluster is close to a police
station, the expensive cluster is not.
Boundary Shape Matching
We are researching a specific property in knowledge discovery
among spatial objects -- namely that of partial boundary shape matching.
As above, our focus is on mining spatial data, whereby a large number
of objects called features (represented as polygons) are compared
with one or more point sets called clusters. Specifically, a cluster
of points is compared to a large number of natural or man-made features
to detect partial or total matches of the facing boundaries of the
cluster and feature. We use a geometric construct called an alpha-shape
to characterize the "shape" of an arbitrary cluster of points, thereby
getting a set of edges denoting the cluster's boundary. Our work
results in an approach for detecting a boundary shape match between
the facing curves of the cluster and feature. Furthermore, the value
of such a match can be quantified.
|