Database Usage in Steerable Multidimensional Scaling
CPSC 533C Project Proposal
Allan Rempel - agr@cs.ubc.ca
November 4, 2005
Domain, Task and Dataset
Personal Expertise
Proposed Infovis Solution
Scenario of Use
Proposed Implementation Approach
Milestones
References
Domain, Task and Dataset
Multidimensional Scaling (MDS) concerns itself with the representation of
high-dimensional (multi-variate) data sets in a 2D or 3D form that
can be displayed in an understandable way on a 2D computer monitor.
A straightforward approach would be to simply geometrically project
that high-dimensional data into a 2D plane. However, the techniques presented
in the literature use the high-dimensional distance (or dissimilarity)
between pairs of data elements, rather than the data elements themselves,
to determine their presentation in the 2D space.
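To make the distance-based formulation concrete, the following is a minimal sketch, not taken from MDSteer++. It assumes Euclidean distance as the dissimilarity measure and uses a normalized stress value, a common MDS quality measure, to compare high-dimensional distances with distances in the 2D layout. The names Point, euclideanDistance and stress are illustrative only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One data element; its dimensionality is the vector length.
using Point = std::vector<double>;

// Euclidean distance between two points of the same dimensionality.
double euclideanDistance(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Normalized stress over all pairs (i, j) with i < j:
//   sum (dHigh - dLow)^2 / sum dHigh^2
// high[i] is the original element, low[i] its position in the layout.
// Returns 0 when the layout preserves every pairwise distance exactly.
double stress(const std::vector<Point>& high, const std::vector<Point>& low) {
    assert(high.size() == low.size());
    double numerator = 0.0;
    double denominator = 0.0;
    for (std::size_t i = 0; i < high.size(); ++i) {
        for (std::size_t j = i + 1; j < high.size(); ++j) {
            const double dHigh = euclideanDistance(high[i], high[j]);
            const double dLow = euclideanDistance(low[i], low[j]);
            numerator += (dHigh - dLow) * (dHigh - dLow);
            denominator += dHigh * dHigh;
        }
    }
    return denominator > 0.0 ? numerator / denominator : 0.0;
}
```

Note that only the pairwise distances enter the stress computation; the coordinates of the high-dimensional elements are never projected directly.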
Previous research has resulted in the development of the MDSteer++ system,
which allows users to steer the scaling process to follow the most
interesting directions first, and provides techniques to place small
sets of points into bins for further processing in an effort to
efficiently provide better information progressively as the system runs.
Theoretically, the system is able to handle over one million points [2].
In practice, however, the system is limited to the number of points that
fit into the working memory of the computer on which it runs.
For larger data sets, performance would be expected to decrease
precipitously. The data set used thus far in existing tests of MDSteer++
is the Lahman baseball archive; however, other larger data sets will be
of interest as well for the purposes of this project.
Personal Expertise
I have 22 years of computer programming experience and 15 years in C/C++,
which is the language in which MDSteer++ is written. Of that, 10 years
is in industry, and the remainder is university or personal experience.
Most of my experience is with a variety of flavours of unix, with
SGI IRIX and Linux being the most recent. On those platforms, I also
have several years of experience writing database code, particularly MySQL
in C++ with Qt. My information visualization expertise is limited to
the CPSC 533C class for which this project is being developed.
Proposed Infovis Solution
I plan to modify MDSteer++
to incorporate the use of a MySQL database for data storage, so that
not all data needs to be resident in memory. I plan to use the same
data set used in [1], the
Lahman baseball archive. I also plan to analyze runs of the software
with and without the modifications, on data sets of different sizes,
to see whether databases can buy us some scalability when we run into
data sets that exhaust the available memory, and if so, what the costs
of that scalability are and at what point
the benefits of using a database outweigh the costs. In addition,
I intend to use another larger data set, yet to be determined, which
would exceed the memory capacity of a typical computer on which MDSteer++
would run.
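One way to keep only part of the data resident in memory is a bounded cache in front of the database. The following is a rough sketch under my own assumptions, not a description of existing MDSteer++ code: the PointCache class and its Loader callback are hypothetical, the loader stands in for whatever database query fetches a single point, and least-recently-used eviction is just one possible policy.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

using Point = std::vector<double>;

// Keeps at most `capacity` points in memory. On a miss, the point is
// fetched through `loader` (a stand-in for a database query) and the
// least recently used point is evicted if the cache is full.
class PointCache {
public:
    using Loader = std::function<Point(std::size_t)>;

    PointCache(std::size_t capacity, Loader loader)
        : capacity_(capacity), loader_(std::move(loader)) {
        assert(capacity_ > 0);
    }

    // The returned reference stays valid only until the point is evicted.
    const Point& get(std::size_t id) {
        auto it = index_.find(id);
        if (it != index_.end()) {
            // Hit: move the entry to the front (most recently used).
            entries_.splice(entries_.begin(), entries_, it->second);
            return it->second->second;
        }
        ++misses_;
        if (entries_.size() == capacity_) {
            // Full: drop the entry at the back (least recently used).
            index_.erase(entries_.back().first);
            entries_.pop_back();
        }
        entries_.emplace_front(id, loader_(id));
        index_[id] = entries_.begin();
        return entries_.front().second;
    }

    std::size_t misses() const { return misses_; }
    std::size_t size() const { return entries_.size(); }

private:
    using Entry = std::pair<std::size_t, Point>;
    std::size_t capacity_;
    Loader loader_;
    std::list<Entry> entries_;
    std::unordered_map<std::size_t, std::list<Entry>::iterator> index_;
    std::size_t misses_ = 0;
};
```

The miss counter gives a simple handle on the cost side of the comparison: each miss corresponds to one round trip to the database that the purely in-memory version would not pay.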
The basic (empty) MDSteer++ main user interface window is shown below,
next to an image from [1] that shows what the main window looks like
when the program is running on a sample data set:
Scenario of Use
The use scenario will be the same as it currently is for MDSteer++.
The user runs the MDSteer++ executable on a particular data set
and then watches while the system places the points in the window
in accordance with the algorithm. The system is interactive in that
the user can click on a region of the MDSteer++ window to steer
the computation in the direction that the user is interested in.
One additional feature is that there will be menu options provided to
allow the user to obtain a data set from an existing database server
and table or set of tables.
More information about the use of the system is available in
the README file [2].
Proposed Implementation Approach
I plan to use the Qt library, whose MySQL support
should facilitate the development of the database code in MDSteer++.
I have already gotten the system to run on the SuSE Linux machines
in the terminal rooms (CS 106, CS 306) which I expect to use as my
computing platform. As MDSteer++ is written in C++, that is the
language I will use as well.
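As a small illustration of what the database code might involve, the sketch below builds a query that reads a table in fixed-size batches using MySQL's LIMIT and OFFSET clauses. The function buildBatchQuery is hypothetical, and the table and column names in the usage note are examples only. In Qt, such a string would typically be executed through QSqlQuery on a connection opened with the QMYSQL driver; that part is omitted here so the sketch stays free of dependencies.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Builds a SELECT statement that returns batch number `batchIndex`
// (counting from 0) of `batchSize` rows, ordered by `keyColumn` so that
// successive batches do not overlap.
// Sketch only: identifiers are concatenated without escaping, so the
// table and column names must come from trusted input.
std::string buildBatchQuery(const std::string& table,
                            const std::string& keyColumn,
                            std::size_t batchSize,
                            std::size_t batchIndex) {
    assert(batchSize > 0);
    return "SELECT * FROM " + table +
           " ORDER BY " + keyColumn +
           " LIMIT " + std::to_string(batchSize) +
           " OFFSET " + std::to_string(batchSize * batchIndex);
}
```

For example, buildBatchQuery("Batting", "playerID", 1000, 2) yields a statement that skips the first 2000 rows and returns the next 1000.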
Milestones
- Get MDSteer++ running under Linux. (Already done.)
- Obtain information on how to use MySQL (existing servers, databases?)
in this computing environment.
- Learn MDSteer++ code base and make appropriate modifications.
- Run tests, gather data, and analyze results.
- (Outside the scope of the project for this class): Fold results into
[1] and submit for publication.
References
[1] D. Westrom, T. Munzner, and M. Tory. Progressive Binning for Steerable Multidimensional Scaling. Unpublished, 2005.
[2] Authors Unspecified. A Guide to Using MDSteer++ Alpha Release (README file), Version 0.5, February 18th 2005.
[3] M. Williams and T. Munzner. Steerable, Progressive Multidimensional Scaling. In Proc. IEEE Symposium on Information Visualization, pages 57-64, 2004.
[4] F. Jourdan and G. Melançon. Multiscale hybrid MDS. In Intl. Conf. on Information Visualization (London), pages 338-393, 2004.