A Visual Language for Internet-based Data Mining and Data Visualisation
Abstract
This paper describes a novel application of enhanced
visual programming and visualisation techniques to
support data mining processes on the Internet. While the
idea of using visual languages to support data mining has
been proven to be useful, the usability of existing
implementations has been limited. Here, we consider the
issue of usability of data mining via the Internet. We also
present “interactive visual programming”, a method
which automates the construction of a visual program
through a direct manipulation interface and visualisation.
We also illustrate new techniques for data and model
visualisation that can aid the understanding of data and
models.
1. Introduction
1.1 What is Data Mining?
Data Mining is the search for valuable information in
large volumes of scientific or business data. It combines
the fields of Databases and Data Warehousing with
machine learning algorithms and statistical methods to gain
insight into hidden structures of the data. The challenge of
extracting actionable knowledge from available data
sources is met by addressing the following issues:
• The physical size of the distributed data sources.
• The computational requirements of analytical
algorithms when executed over very large data sets.
• The usability of the data mining system.
• The ability to support the data mining process.
The characteristics of the problem imply that data
mining within a client-based environment is not practical.
Data mining needs to utilise high-performance
architectures for the large-scale computational tasks
involved. The web is therefore an ideal and crucial tool for
co-ordinating data mining tasks and distributing
workloads. At present, there is a growing interest in
providing web-based support for such data-intensive
activities.
1.2 Why a visual language for data mining?
Data mining is not a one-step task. It has been defined
as an iterative process [2], where each sub-step can be
repeated. Visual languages, which present each sub-step as a
component that can be rearranged and re-executed, are
therefore well suited to providing a friendly interface to a
non-expert user. Existing systems such as Clementine (ISL) [4]
and Enterprise Miner (SAS) [5] are the most promising attempts
at integrating visual languages into the user interface for
data mining.
However, these client-based systems lack the capability of
supporting web-based or enterprise-based distributed
execution. Moreover, the size of data they can handle is
limited by the physical configuration of the client machine.
Although these tools allow the user to interact with a
visual abstraction of the data mining process, the user has
little access to the underlying data they are manipulating.
For example, in Clementine, the user is not able to view
the effect of an individual data manipulation function until
the whole procedure has been defined and executed; the
user has to add a table-viewer node at the end of the
procedure and execute it to view the resulting data. This
tedious task of adding a table-viewer node has to be
repeated each time that the user wants to see the result of a
manipulation function. We argue that direct manipulation [11]
of both data and models, that is, performing operations and
immediately seeing the results, can enhance the
effectiveness of visual programming in the context of data
mining.
In this paper, we introduce enhancements to visual
programming in the context of data mining. The value-
added features and capabilities also allow for integration
with the Internet platform and thereby increase the
usability of the system.
1.3 Kensington: An enterprise data mining system
Jaturon Chattratichat, Yike Guo, Jameel Syed
Imperial College, University of London
180 Queen’s Gate, London SW7 2BZ, United Kingdom
{jc8,yg,jas5}@doc.ic.ac.uk
The Kensington Enterprise Data Mining system is a
research product built by the Data Mining Group at
Imperial College [3]. It is a multi-user data mining system
based on a three-tier architecture. The motivation behind
this system is to deliver an easy-to-use, large-scale data
mining system through the use of the Internet. To this
end, Kensington employs visual programming and
visualisation techniques for the client front end. The
visual interface is supported by one or more
application middleware (EJB) servers and mining servers.
The components in each tier could reside in different
locations and therefore reflect the true physical distribution
of an organisation. The system employs the Internet as its
main communication infrastructure, and thereby enables
data mining to be performed anywhere.
Kensington’s middleware provides support for database
access, file access, storage and mining task execution
management. The third-tier mining servers provide support
for the bulk of the data crunching and numerical
operations.
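The division of labour between the three tiers can be illustrated with a minimal Python sketch. This is not Kensington's implementation: the class names, the `mean` task, and the round-robin dispatch policy are all hypothetical and serve only to show a thin client delegating to a middleware tier, which in turn hands numerical work to mining servers.

```python
from typing import Dict, List


class MiningServer:
    """Third tier: performs the numerical work (hypothetical stand-in)."""

    def run(self, task: str, values: List[float]) -> float:
        if task == "mean":
            return sum(values) / len(values)
        raise ValueError(f"unknown task: {task}")


class Middleware:
    """Second tier: owns data access and dispatches tasks to mining servers."""

    def __init__(self, servers: List[MiningServer], tables: Dict[str, List[float]]):
        self._servers = servers
        self._tables = tables
        self._next = 0  # index for round-robin dispatch (an assumed policy)

    def submit(self, task: str, table: str) -> float:
        server = self._servers[self._next % len(self._servers)]
        self._next += 1
        return server.run(task, self._tables[table])


class ThinClient:
    """First tier: holds no data and does no computation, only issues requests."""

    def __init__(self, middleware: Middleware):
        self._middleware = middleware

    def request(self, task: str, table: str) -> float:
        return self._middleware.submit(task, table)


middleware = Middleware(
    servers=[MiningServer(), MiningServer()],
    tables={"sales": [10.0, 20.0, 30.0]},
)
client = ThinClient(middleware)
result = client.request("mean", "sales")
print(result)
```

In this sketch the client never touches the table itself; it names the data and the task, which is the property that lets the client remain "thin".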
The main focus of this paper is on the visual language
aspect of the client application. The goal is to build a
simple yet powerful interface on a very “thin” client that
requires minimal configuration. The challenge is to take
advantage of web infrastructure transparently.
The main contributions of the system outlined in this
paper are as follows:
• Allowing data mining to be performed across the
Internet through the use of a visual language.
• Enhancing reusability and enabling rapid
redeployment of visual programs through object
serialisation and XML file output.
• Using a novel method for visually interacting with
models.
• Introducing interactive visual programming.
• Building interactive Decision Trees through
visualisation.
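As an illustration of how XML output can make a visual program reusable, the following Python sketch writes a small node-and-edge graph to XML and reads it back. The element names (`program`, `node`, `edge`), the attribute names, and the `VisualProgram` class are assumptions made for this example; the paper does not specify Kensington's actual file format.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VisualProgram:
    """A visual program as a graph (hypothetical representation)."""
    nodes: List[Tuple[str, str]] = field(default_factory=list)  # (node id, node kind)
    edges: List[Tuple[str, str]] = field(default_factory=list)  # (source id, target id)


def to_xml(program: VisualProgram) -> str:
    """Serialise the graph to an XML string."""
    root = ET.Element("program")
    for node_id, kind in program.nodes:
        ET.SubElement(root, "node", id=node_id, kind=kind)
    for source, target in program.edges:
        ET.SubElement(root, "edge", source=source, target=target)
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> VisualProgram:
    """Rebuild the graph from an XML string produced by to_xml."""
    root = ET.fromstring(text)
    nodes = [(n.get("id"), n.get("kind")) for n in root.findall("node")]
    edges = [(e.get("source"), e.get("target")) for e in root.findall("edge")]
    return VisualProgram(nodes=nodes, edges=edges)


program = VisualProgram(
    nodes=[("n1", "retrieve"), ("n2", "filter"), ("n3", "decision_tree")],
    edges=[("n1", "n2"), ("n2", "n3")],
)
xml_text = to_xml(program)
restored = from_xml(xml_text)
print(xml_text)
```

Because the XML carries only node kinds and connections, the same file could in principle be reloaded on another client and rerun against different data.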
In the next section, we will introduce the visual
programming features of Kensington. We have enhanced
existing features that contribute to the effective use of
visual programming as an interface for data mining. We
also discuss how this visual interface deals with remote
data on the Internet. In section 3, we introduce novel
techniques for integrating interactive data and model
manipulation, through appropriate visualisations. The last
section outlines the main contributions of this paper.
2. Visual programming for data mining
In this section, we outline the design of an environment
that supports the visual construction of a data mining
procedure.
2.1 The process of data mining
The data mining process commonly consists of four
separate main stages: data retrieval, data preparation, data
mining and model analysis [1][2]. The data retrieval stage
is an initial step of loading data sources. The data
preparation stage transforms and prepares the data for the
mining stage. The mining stage involves the application of
machine learning or statistical algorithms to produce a
model. This model could be further analysed for accuracy
or utilised for decision support.
In designing a visual programming language for data
mining, we take into consideration the nature of each step
mentioned earlier. Each step is supported by a set of tools
or components, which perform a task-specific function.
The four categories are described in Figure 1.
The left part of the figure describes the logical flow of
information in a data mining task. The caption between
each box describes the output from the previous step. The
retrieval step initially loads data, which is fed into the
manipulation step. The modified data then becomes the
input into the mining step. After the mining operation, a
model is produced, and finally evaluated. As shown on the
right part of the figure, component nodes are grouped and
categorised into different context-sensitive steps.
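The four-stage flow described above can be sketched in Python as a chain of functions, each consuming the output of the previous step. The sample rows, the function names, and the trivial majority-class "model" are all placeholders chosen for illustration; they are not components of the Kensington system.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List

Row = Dict[str, object]
Table = List[Row]


def retrieve() -> Table:
    """Data retrieval: load a data source (hard-coded rows stand in for a database table)."""
    return [
        {"age": 25, "income": 20000, "buys": "no"},
        {"age": 40, "income": 52000, "buys": "yes"},
        {"age": 35, "income": None, "buys": "yes"},
        {"age": 52, "income": 61000, "buys": "yes"},
    ]


def prepare(table: Table) -> Table:
    """Manipulation: filter out rows that contain missing values."""
    return [row for row in table if all(v is not None for v in row.values())]


@dataclass
class MajorityModel:
    """Placeholder model that always predicts the most frequent class."""
    label: str

    def predict(self, row: Row) -> str:
        return self.label


def mine(table: Table, target: str) -> MajorityModel:
    """Data mining: produce a model from the prepared data."""
    counts = Counter(row[target] for row in table)
    return MajorityModel(label=counts.most_common(1)[0][0])


def analyse(model: MajorityModel, table: Table, target: str) -> float:
    """Model analysis: measure accuracy of the model on a table."""
    hits = sum(model.predict(row) == row[target] for row in table)
    return hits / len(table)


data = retrieve()                       # retrieval -> data flow
clean = prepare(data)                   # manipulation -> data flow
model = mine(clean, "buys")             # mining -> model
accuracy = analyse(model, clean, "buys")  # analysis -> report
print(len(clean), model.label, accuracy)
```

Each intermediate value (`data`, `clean`, `model`) corresponds to one of the captions between the boxes in Figure 1, which is what makes the steps individually inspectable and repeatable.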
2.2 Visual task configuration
The visual interface design of the Kensington system is
based on the data mining process model. Its familiar icon-
Figure 1: Data mining task flow and associated
modules
[Figure: task flow Data Retrieval (Database + Table Model) -> data flow -> Manipulation (Split, Delete, Transform, Filter) -> data flow -> Data Mining (machine learning and statistical algorithms) -> model -> Model Analysis (model testing modules and model visualisation) -> report, reusable procedure]


