Tools Selection for Teaching
Data Warehousing and Data Mining
ANDY C. H. LAW
Kwantlen University College
ABSTRACT
The
application of data warehousing (DW) and data mining (DM) concepts in data
analysis and decision making has attracted a great deal of interest in the
information technology (IT) industry. In a DW and DM course, students learn how
a data warehouse is developed through a methodological approach. They learn
different techniques in each development phase according to the methodology,
and use different tools to perform the techniques. Since DW and DM cover
various technologies such as database technology, statistics, visualization,
information science, and machine learning, software tools selection during the
course development becomes a major concern. In this paper, I describe two
different approaches of tools selection. The first approach uses an integrated
tool by a single vendor, and the second uses different tools by various
vendors. In each approach, I measure the quality of learning by making
observations and interviews. I find that the students will be more confidence
in solving DW and DM problems using the second approach.
Keywords: Data Warehousing, Data Mining, Software
Tools Selection
1.
Introduction
The
emergence of DW was initially a consequence of the observation by W. Inmon in
the early 1990s that online transaction processing (OLTP) and online analytical
processing (OLAP) cannot coexist efficiently in the same database environment,
mostly due to their very different transaction characteristics [1]. Recently,
the application of DW and DM concepts in data analysis and decision making has
attracted a great deal of interest in the IT industry. Due to the strong demand
of trained data warehouse developers, Kwantlen University College (KUC), B.C.,
Canada, has been offering a fourth year level course in DW and DM in its
Bachelor of Technology in IT (BTECH-IT) program since year 2000. Similar to the
development of most information systems, students are required to learn the
methodology, the techniques and the tools. Students in the course learn how an
enterprise data warehouse is developed through a methodological approach,
followed by how specialized types of data warehouse such as data marts and OLAP
systems are built from the enterprise model. In addition, they will learn how
DM concepts can be applied to mine the warehouse data. During each development
phase according to the methodology, students learn different techniques and use
different tools to perform the techniques. In section two, I explore the
relationship between tools and learning activities using activity theory. In
section three, I describe the DW and DM infrastructure, and explain why tools
selection is a major concern. In section four, I introduce two approaches of
tools selection. The first approach uses an integrated tool by a single
software vendor, and the second uses different tools by various vendors. In
section five, I evaluate both approaches based on students’ performance using
observation and interviews.
2.
Tools and Learning
In this section, I explore the relationship between tools
and learning activities and justify why tools selection is important in high
quality learning. In fact, a DW and DM course can be viewed as an activity
system. Activity systems, which are developed based on activity theory [2], are
the basic units of analysis to alleviate problems that afflict the approaches
to learning. In figure 1, the activity system has six components. Essential to
the system are the relations between subjects (instructor and students), the community
(DW and DM developers) of which they are members, and the object of their
activities (DW and DM development). Such relations are mediated by a number of
factors, including the tools (books, software tools) used by the subjects, the
rules (course objectives, norms in techniques) that link subjects to their
communities, and the roles (instructor educates students) adopted by the
community.
Some instructors may allow students to select any tools to
perform the techniques in DW and DM development. In fact, the activity system
suggests that any direct relationships between tools and other components
should be taken into consideration in tools selection. Thus the concerns of the students, the
instructor, and the community are equally important. If one of these components
is ignored, tensions may appear in the system. In real life, developers in the
community usually prefer an integrated tool by a single vendor. However, a
college graduate who possesses knowledge of using only one tool is definitely not
a competitive applicant for jobs in the community. Tensions in an activity have
been explained in the activity theory by contradictions. These contradictions
are the major factors of poor quality learning. Thus, it is justifiable to
educate students with different tools by various vendors in order to minimize
the number of tensions in learning. Besides, an integrated tool always comes
with a dialog driven wizard to help developers to step through the whole
development process. However, such a computer-driven and activity-oriented
approach should be minimized when learning the techniques. In accordance with
the constructivist learning environments design, instructors should educate
students to perform tasks instead of activities [3]. With such a task-oriented
approach, students are better able to deal with DW and DM problems, and better
apply their knowledge to a novel situation. In designing the DW and DM course
at KUC, I have identified eleven techniques in accordance with a generic
methodology, and have designed a separate task for each technique. Since year
2001, I have been selecting a number of tools. Some tools may semi- or fully
automate the tasks, while some are just programming tools that require students
to write computer programs to complete the tasks.
3. Tools for Teaching Data Warehousing and Data
Mining
The DW and DM infrastructure, which is depicted in Fig. 2,
exhibits various layers of data. The operational data stored in the enterprise
database systems formed the bottom layer. Data in this layer are usually
heterogeneous and stored in different database schema. The middle layer
contains reconciled data, which are the result of the integration of various
operational data and database schema [4]. Such a layer represents the data
warehouse for the enterprise. Data transformation tools, data capturing tools,
data scrubbing tools are used during the integration process. Loading tools and
indexing tools provided by most database management systems (DBMSs) are used to populate the warehouse with data. According
to Inmon [1], a data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s decision-making
process. The next layer, which contains highly aggregated data derived from the
reconciled data, supports user activities such as query processing and trend
analysis. Dart marts and OLAP systems are built on the top of the aggregated
layers [5]. Data marts, which are subsets of the data warehouse, are created
with querying tools provided by most DBMSs. OLAP
systems provide users with an analytical engine to view data in a
multi-dimensional ways, and are add-on analysis tools to some commercial DBMSs [6,7]. DM used to be a subset of the activities
associated with the data warehouse but has gained independence today. DM
supports inductive information discovery, which is the process of searching the
data using various investigatory models, methods, and techniques, and then
forming its own hypothesis during a decision-making process [8,9]. Tools selection
for teaching DW and DM concepts is difficult because the concepts are
confluences of multiple disciplines including database technology, statistics,
visualization, information science, machine learning and more.
Tools selection becomes more difficult because there is not
a standard set of DW and DM tools in the industry. In mid 1990s, many companies
used separate tools to create their data warehouse because there were not
integrated data warehouse development tools offered by vendors such as IBM, Oracle
and Microsoft. Today, each of these vendors claims that she can provide
developers an integrated tool. A developer in real-life definitely prefers an
integrated tool by a single vendor in order to avoid the high cost involved in
maintaining multiple repositories created during the data transformation,
capture, scrubbing, and loading processes. However, many issues are raised if
an integrated tool is adopted for teaching DW and DM. Since there is not a
standard set of tools in the field, integrated software tools by different
vendors are having very different user interfaces and using very different
terminology. As a result, students who have already adapted to one particular
integrated tool tend to refuse to use another.
4. Methodologies, techniques, and tools
The development of a data warehouse is similar to that of
any information system [10]. First, a methodology is chosen. Second, techniques
are suggested in each phase by the methodology. Third, tools are selected to
perform the techniques. There is quite a lot of discussion about the
methodology for the design of a data warehouse [11,12]. Each approach has its
pros and cons. Table 1 depicts a generic top-down methodology, the techniques
involved, and tools adopted by KUC’s DW and DM
course. Students in the course are required to complete twelve tasks. Each task
requires students to perform a technique.
Task
|
Methodology
– techniques
|
An
integrated tool was used in year 2000
|
Different
tools were used in year 2001
|
Different
tools were used in year 2002
|
1
|
Requirement
Analysis
|
Oracle
Designer’s (CASE tool) Process Modeler, Dataflow Diagrammer,
ER
Diagrammer
|
ER/Win
(CASE tool)
|
Visio
(Drawing tool),
ER/Win
(CASE tool)
|
2
|
Data
Transformation
|
Oracle
Server’s Import Utility
|
Microsoft
Excel’s Import Utility
|
Microsoft
Excel’s Import Utility
|
3
|
Data
Capture
|
No
tools are used
|
No
tools are used
|
Custom
Programs in VB/Java/C++
|
4
|
Data
Scrubbing
|
No
tools are used
|
No
tools are used
|
Custom
Programs in VB/Java/C++
|
5
|
Logical
Design
|
Oracle
Designer’s Design Transformer
|
ER/Win’s
Database Schema Generation
|
ER/Win’s
Database Schema Generation
|
6
|
Physical
Design
|
Oracle
Designer’s Design Transformer
|
ER/Win’s
Database Schema Generation
|
ER/Win’s
Database Schema Generation
|
7
|
Implementation
of data warehouse
|
Oracle
Server and Client
|
Oracle
Server and Client
|
Oracle
Server and Client
|
8
|
Load
data warehouse with data
|
Oracle’s
SQL Loader
|
Oracle’s
SQL Loader
|
Oracle’s
SQL Loader
|
9
|
Implementation
of data marts
|
Oracle
Server and its SQL engine
|
Oracle
Server and its SQL engine
|
Oracle
Server and its SQL engine
|
10
|
Implementation
of OLAP systems
|
Oracle’s
Express Server
|
Microsoft
Excel’s pivot tables
|
Microsoft
SQL Server’s Analysis Services – OLAP engine
|
11
|
Implementation
of DM solutions
|
No
tools are used
|
No
tools are used
|
Microsoft
SQL Server’s Analysis Services – DM Component
|
12
|
Maintenance
and Model Evolution
|
Oracle
Designer’s Capture Design
|
ER/Win’s
Reverse Engineering
|
ER/Win’s
Reverse Engineering
|
In year 2000, I selected an integrated tool by Oracle [13,14] to support my teaching. Students were required to
perform most of the tasks using Oracle’s Computer Aided Software Engineering
(CASE) tool. Since year 2001, I have been using another approach of selecting
different tools by various vendors. With the new approach, students use a CASE
tool only during requirement analysis, logical design, physical design, and
maintenance. They use other independent tools or even write computer programs
to perform data transformation, capture, and scrubbing. They implement the data
warehouse in Oracle Server, but use Microsoft SQL Server’s Analysis Services
[15] to perform both OLAP and DM.
5. Evaluation
Evaluation on tools selection was carried out in the past
three years with an attempt to collect, analyze and interpret information
through observations and interviews in order to make informed decisions that
enhance quality of learning. Two approaches of tools selection are evaluated.
The first uses an integrated tool by a single vendor, and the second uses
different tools by various vendors. Both observation and interview were done on
ten students selected randomly from each class. The results are summarized in
table 2 and table 3.
Observations
|
An
integrated tool was used in teaching during year 2000
|
Different
tools were used in teaching during year 2001/02
|
Students
were asked to develop a data warehouse data model. I wanted to find out how
they approached the problem with the tool.
|
Students
were provided with only the Oracle tool. Most students spent lots of time to
figure out how the tool was used instead of how the model was created.
|
Students
were given a choice of using either a drawing tool like Visio, or a CASE
tool. Most students started drawing the models on paper manually before
deciding which tool to be used.
|
Students
were asked to develop a data warehouse data model. I wanted to find out how
they approached the problem without the tool.
|
Most
students tried to recall all steps as suggested by the tool. They drew the
model in a way similar to the one generated by the tool.
|
Most
students tried to recall all steps as suggested by the methodology. They drew
the model with additional comments such as assumption, and database
constraints.
|
Students
were asked to draw a data warehouse data model by hand. I wanted to find out
what kinds of notation they would use.
|
Most
students tended to use the exact notation adopted by the Oracle’s tool. They
thought that the notation was the only notation for data warehouse modeling.
|
Most
students used IE or IDEFIX notation that they learned before in their
database management classes. They understood that any data modeling notation
could be used to document the data warehouse model.
|
Students
were asked to perform logical and physical data warehouse designs. I wanted
to find out how they approached both tasks with the assigned tools.
|
All
students used Oracle Designer’s Design Transformer to perform both
tasks. They simply accepted the
default settings suggested by the tool. Thus, they completed both tasks in
minutes.
|
All
students used ER/Win to perform both tasks. However, ER/Win isn’t a product
of Oracle and it actually supports many different target database platforms.
Thus, students needed to specify the target database platform and the detail
of the database schema. Students were spending hours or more in both tasks.
|
Students
were asked to perform data analysis using an OLAP system. I wanted to find
out how they approached the task with the assigned tools.
|
All
students used Oracle Express Server as the OLAP system to perform data
analysis. Since the tool came with a dialog driven wizard, it drove students
through the whole process. Students completed the task in minutes.
|
All
students used Microsoft Excel’s pivot table or SQL Server’s Analysis Services
as OLAP systems to perform the task. Since data was stored in Oracle’s DBMS,
students needed toexport the data into a format
that could be accepted by the OLAP systems. Later, students found a smarter
approach to import data into the OLAP system directly from Oracle’s DBMS
using a middleware such as the ODBC APIs. Students were spending days in the
task.
|
Interviews
|
An
integrated tool was used in teaching during year 2000
|
Different
tools were used in teaching during year 2001/02
|
Students
were asked if they have the confidence to apply the DW and DM concepts in
their work in the future using the same tools as given by the college.
|
All
students had the confidence.
|
All
students had the confidence.
|
Students
were asked if they have the confidence to apply the DW and DM concepts in
their work in the future using different tools other than ones given by the
college.
|
Very
few students had the confidence. Most students claimed that they were not
exposed to other commercial tools and thus were lack of the confidence.
|
Most
students had the confidence.
|
Students
were asked how they would select a tool in their future. If the tool could not handle one or more
tasks, how would they handle the situation?
|
All
students said they preferred the same tool that they learned in the course.
They would probably get help from the vendor if the tool could not handle
some tasks. They believed the vendor would have some suggestions.
|
Some
students preferred an integrated tool, but some had no preference. If the
tool was not able to handle a task, all students believed that they could
come up some solutions. The solutions could be customized computer programs
written in Visual Basic, Java or C++ language.
|
The integrated tool approach was used in year 2000. From
the results, it is found that students became less proactive and had not much
confidence in problem solving, and their activities were mainly driven by the
integrated tool. Being educated with an integrated tool approach, students were
concerned if they would be using the same tool in real-life. Obviously,
students are lack of confidence to apply the learned techniques if the same
tool is not given. Such a result has violated the course objectives [16,17].
Since year 2001, I have been using another approach with an attempt to improve
the quality of learning. The approach of using different tools by various
vendors has brought students both challenges and incentives in learning.
Students find challenging to complete all tasks with different tools, and
prepare themselves as highly competitive applicants for jobs in the industry.
Most of the challenges come from moving data from one environment to another.
Students are required to put in extra effort to identify some appropriate
middleware for data migration. As a result, it is the students, not the tools,
who drive all tasks. Being educated with different tools by various vendors,
students are able to perform required techniques regardless what the tool is.
Some students could even write programs in Visual Basic, Java or C++ to
performed tasks, including data capture and scrubbing, because these tasks may
not be automated by the selected tools. From the results, I find that students
are more confidence in solving DW and DM problems with the approach of using
different tools by various vendors. Such a learning outcome is what I, as an
instructor, would like to see.
6. Conclusion
Due to the strong demand of DW and DM developers in the IT
industry, KUC has been offering in its BTECH-IT program a DW and DM course
since year 2000. I have described in this paper the methodology, techniques and
tools adopted by the course. Since DW and DM are complex topics that involve
many different technologies, I have presented two approaches of tools selection
for teaching and developing the course. The first uses an integrated tool by a
single vendor, and the second uses different tools by various vendors. I have
evaluated both approaches through observation and interviews. I have found that
the second approach outperforms the first one in such a way that students
become more proactive and confidence when applying DW and DM concepts and
techniques in problem solving.
References
- W.H. Inmon. “Building the data warehouse”
John Wiley & Sons, 2002.
- J. Lave. “The Practice of Learning in Understanding
Practice: Perspectives on Activity and Context” Cambridge University Press, 1993.
- D.H. Jonassen. “Designing Constructivist Learning
Environments” Ch. 10 in Instructional-Design Theories
and Models: A New Paradigm of Instructional Theory, vol. II. Lawrence Erlbaum Associates, 1999.
- Y. Arens, C. Y. Chee, C. Hsu, C. A. Knoblock. “Retrieving
and integrating data from multiple information sources” Journal of
Intelligent and Cooperative Information Systems, 1993.
- S. Chaudhuri, U. Dayal. “An overview of data warehousing
and OLAP technology” SIGMOD, 1997.
- R. Agrawal, A. Gupta, S. Sarawagi. “Modeling
multidimensional databases” IBM Almaden Research Center, 1995.
- L. Cabibbo, R. Torlone. “A logical approach to
multidimensional databases” In proceedings of the 6th International
Conference on Extending Database Technology, Springer-Verlag, 1997.
- G. Linoff, M. B Michael, and J. A. Berry. “Data Mining
Techniques” John Wiley & Sons, 1997.
- Olivia Parr Rud. “Data Mining Cookbook” John Wiley &
Sons, 2001.
- C. Adamson, M. Venerable. “Data warehouse design
solutions” John Wiley & Sons, 1998.
- R. Kimball. “The data warehouse toolkit” John Wiley &
Sons, 1996.
- R. Kimball, L. Reeves, M. Ross, W. Thornthwaite. “The Data
warehouse lifecycle toolkit” John Wiley & Sons, 1998.
- Oracle Corporation. “Oracle server utilities user guide”
http://www.oracle.com/ , 2000.
- Oracle Corporation. “Oracle Express Server user guide”
http://www.oracle.com/ , 2000.
- Microsoft Corporation. “Microsoft SQL Server user’s guide”
http://www.microsoft.com/ , 2000.
- Kwantlen University College. “INFO4330 course outline: Data
Warehousing and Data Mining”, 1999.
- Kwantlen University College. “INFO4330 course outline (revised):
Data Warehousing and Data Mining”, 2002.
Andy C. H. Law
Department of Computing Sciences and Information Systems
Kwantlen University College
12666-72nd Ave., Surrey, B.C., Canada V3W 2M8
Andy.Law@kwantlen.ca