
Reclassification of Linearly Classified Data Using Constraint Databases

by Peter Revesz, Thomas Triplet
12th East European Conference on Advances of Databases and Information Systems (2008)
  • ISSN: 03029743

Abstract

In many problems the raw data is already classified according to a variety of features using some linear classification algorithm but needs to be reclassified. We introduce a novel reclassification method that creates new classes by flexibly combining the existing classes without requiring access to the raw data. The flexibility is achieved by representing the results of the linear classifications in a linear constraint database and using the full query capabilities of a constraint database system. We implemented this method based on the MLPQ constraint database system. We also tested the method on data that was already classified using a decision tree algorithm. © 2008 Springer-Verlag Berlin Heidelberg.


Peter Revesz and Thomas Triplet
University of Nebraska - Lincoln, Lincoln NE 68588, USA,
revesz@cse.unl.edu, ttriplet@cse.unl.edu
1 Introduction
Semantics in data and knowledge bases is tied to classifications of the data.
Classifications are usually done by classifiers such as decision trees [7], support
vector machines [10], or other machine learning algorithms. After being trained
on some sample data, these classifiers can be used to classify new data as well.
The reclassification problem is the problem of how to reuse the old classifiers
to derive a new classifier. For example, if one has a classifier for disease A and
another classifier for disease B, then we may need a classifier for patients who
(1) have both diseases, (2) have only disease A, (3) have only disease B, and
(4) have neither disease. In general, when combining n binary classifiers, there
are $2^n$ combinations to consider. Hence many applications would be simplified
if we could use a single combined classifier.
There are several natural questions about the semantics of the resultant
reclassification. For example, how many of the combination classes are really
possible? Even if in theory we can have $2^n$ combination classes, in practice the
number may be much smaller. Another question is how to estimate the percentage
of the patients who fall within each combination class, assuming some statistical
distribution on the measured features of the patients.
For example, Figures 1 and 2 present two different ID3 decision tree classifiers
for the country of origin and the miles per gallon fuel efficiency of cars. For
simplicity, the country of origin is classified as Europe, Japan, or USA, and the
fuel efficiency is classified as low, medium, or high. Analysis of such decision
trees is difficult. For instance, it is hard to tell by just looking at these decision
trees whether there are any cars from Europe with a low fuel efficiency.
Fig. 1. Decision tree for the country of origin of cars. The tree is drawn using
Graphviz [2].
Fig. 2. Decision tree for the miles per gallon fuel efficiency of cars. The tree is drawn
using Graphviz [2].
When a decision tree contains all the attributes mentioned in a query, then
the decision tree can be used to efficiently answer the query. Here the problem is
that the attributes mentioned in the query, that is, country/region of origin and
MPG (miles per gallon) fuel efficiency, are not both contained in either decision
tree. Reducing this situation to the case of a single decision tree that contains
both attributes would provide a convenient solution to the query.
We propose in this paper a novel approach to reclassification that results in
a single classifier and makes it possible to answer several semantic questions.
Reusability and flexibility are achieved by representing the original
classifications in linear constraint databases [6, 8] and using constraint
database queries.
Background and Contribution: Earlier papers by Geist [4] and Johnson
et al. [5] discussed the representation of decision trees in constraint databases.
However, they did not consider support vector machines (SVMs) or the
reclassification problem. The reclassification problem is a special problem, and its
detailed consideration and the explanation of the semantic issues regarding
reclassifications are important contributions of this paper. In particular, we point
out that the reclassification problem is solved nicely with the use of constraint
databases. Furthermore, our experiments compare several different possible
approaches to the reclassification problem. The experiments support the idea that
the constraint database approach to reclassification is an accurate and flexible
method.
The rest of the paper is organized as follows. Section 2 presents a review of
linear classifiers. Section 3 presents the reclassification problem. Section 4
describes two novel approaches to the reclassification problem. The first approach
is called Reclassification with an oracle, and the second approach is called
Reclassification with constraint databases. Section 5 describes some computer
experiments and discusses their significance. Finally, Section 6 gives some
concluding remarks and open problems.
2 Review of Linear Classifiers
The problem is the following: we want to classify items, which means we want to
predict a characteristic of an item based on several parameters of the item. Each
parameter is represented by a variable which can take a finite number of values.
The set of those variables is called the feature space. The actual characteristic of
the item we want to predict is called the label or class of the item. To make the
predictions, we use a machine learning technique called a classifier. A classifier
maps a feature space X to a set of labels Y. A linear classifier maps X to Y by
a linear function.
Example 1. Suppose that a disease is conditioned by two antibodies A and B.
The feature space X is X = {Antibody A, Antibody B} and the set of labels is
Y = {Disease, no Disease}. Then, a linear classifier is:
$$y = w_1 \cdot \text{Antibody A} + w_2 \cdot \text{Antibody B} + c$$
where $w_1, w_2 \in \mathbb{R}$ are constant weights and $c \in \mathbb{R}$ is a threshold constant.
The value of y can be compared with zero to yield a classifier. That is,
- If $y \le 0$ then the patient has no Disease.
- If $y > 0$ then the patient has Disease.
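The rule of Example 1 can be sketched in a few lines of Python. The weights and the threshold constant below are hypothetical values chosen only for illustration; the paper does not give numerical values for them.

```python
def linear_score(antibody_a, antibody_b, w1=1.0, w2=-0.5, c=-1.0):
    """Return y = w1*A + w2*B + c. The default weights are illustrative assumptions."""
    return w1 * antibody_a + w2 * antibody_b + c


def classify(antibody_a, antibody_b):
    """Compare y with zero, as in Example 1."""
    y = linear_score(antibody_a, antibody_b)
    return "Disease" if y > 0 else "no Disease"
```

With these assumed weights, a patient with antibody levels (3, 1) scores y = 1.5 and is labeled Disease, while a patient at (1, 0) scores exactly y = 0 and is labeled no Disease.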
In general, assuming that each parameter can be assigned a numerical value
$x_j$, a linear classifier is a linear combination of the parameters:
$$y = f\Big(\sum_j w_j x_j\Big) \qquad (1)$$
where f is a linear function. The weights $w_j \in \mathbb{R}$ entirely define the
classifier, and $y \in Y$ is the predicted label of the instance.
Decision trees: the ID3 algorithm. Decision trees, also called active classifiers,
were widely used in the nineties by artificial intelligence experts. The main
reasons are that they can be easily implemented (using ID3, for instance) and
that they give an explanation of the result.
Algorithmically speaking, a decision tree is a tree:
- An internal node tests an attribute,
- A branch corresponds to the value of the attribute,
- A leaf assigns a classification.
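The three-part structure above maps directly onto a small data structure. The following is a minimal sketch; the class names `Node` and `Leaf`, the attribute `heavy`, and the toy tree are hypothetical and are not the tree of Figure 1.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Leaf:
    """A leaf assigns a classification."""
    label: str


@dataclass
class Node:
    """An internal node tests one attribute; each branch is one attribute value."""
    attribute: str
    branches: Dict[object, Union["Node", Leaf]] = field(default_factory=dict)


def classify(tree, record):
    """Follow the branch matching the record's attribute value until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.branches[record[tree.attribute]]
    return tree.label


# Hypothetical toy tree, for illustration only.
toy_tree = Node("cylinders", {
    3: Leaf("Japan"),
    4: Node("heavy", {True: Leaf("USA"), False: Leaf("Europe")}),
})
```

Each root-to-leaf path is one conjunction of attribute tests, which is why the whole tree reads as a disjunction of conjunctions.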
The output of decision trees is a set of logical rules (disjunction of
conjunctions). To train the decision tree, we can use the ID3 algorithm, proposed
by J.R. Quinlan [7] in 1979, in the following three steps:
1. First, the best attribute, A, is chosen for the next node. The best attribute
maximizes the information gain.
2. Then, we create a descendant for each possible value of the attribute A.
3. Finally, this procedure is applied recursively to the children that are not
perfectly classified.
The information gain of an attribute A on a sample S is defined as follows:
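$$\mathrm{Gain}(S, A) = \mathrm{entropy}(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \cdot \mathrm{entropy}(S_v)$$
S is a sample of the training examples, A is a partition of the parameters, and
$S_v$ is the subset of S for which A takes the value v. As in thermodynamics, the
entropy measures the impurity of S, with purer subsets having a lower entropy: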
$$\mathrm{entropy}(S) = -\sum_{i=0}^{n} p_i \cdot \log_2(p_i)$$
S is a sample of the training examples, $p_i$ is the proportion of i-valued
examples in S, and the sum ranges over the possible label values $i = 0, \dots, n$.
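The two definitions can be checked numerically with a short Python sketch. It assumes that a sample is given as a list of records (dictionaries) with a parallel list of labels; this representation and the function names are illustrative choices, not part of the paper.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Impurity of a sample: minus the sum of p_i * log2(p_i) over the label values present."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(records, labels, attribute):
    """Entropy of the sample minus the size-weighted entropy of its subsets S_v."""
    total = len(labels)
    subsets = {}
    for record, label in zip(records, labels):
        subsets.setdefault(record[attribute], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder
```

A sample split evenly between two labels has entropy 1, and an attribute that separates the two labels perfectly has an information gain of 1 on that sample.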
ID3 is a greedy algorithm without backtracking. This means that the
algorithm is sensitive to local optima. Furthermore, ID3 is inductively biased: the
algorithm favors short trees and high information gain attributes near the root.
At the end of the procedure, the decision tree perfectly fits the training data,
including noisy data. This leads to complex trees, which usually generalize
poorly to new (unseen) data. According to Occam's razor, the shortest
explanations should be preferred. In order to avoid over-fitting, decision
trees are therefore usually pruned after the training stage by minimizing
size(tree) + error rate, where size(tree) is the number of leaves in the tree and
error rate is the ratio of the number of misclassified instances to the total number
of instances (also equal to 1 − accuracy).
Note: The ID3 decision tree and the support vector machine are linear
classifiers because their effects can be represented mathematically in the form of
Equation (1).
3 The Reclassification Problem
The need for reclassification arises in many situations. Consider the following.
Example 2. One study found a classifier for the origin of cars using
X1 = {acceleration, cylinders, displacement, horsepower}
and
Y1 = {Europe, Japan, USA}
where acceleration from 0 to 60 mph is measured in seconds (between 8 and
24.8 seconds), cylinders is the number of cylinders of the engine (between 3 and 8
cylinders), displacement is in cubic inches (between 68 and 455 cubic inches), and
horsepower is the standard measure of the power of the engine (between 46 and
230 horsepower).
Sample training data is shown in the table below.
Origin
Acceleration  Cylinders  Displacement  Horsepower  Country
12            4          304           150         USA
9             3          454           220         Europe
Another study found another classifier for the fuel efficiency of cars using
X2 = {acceleration, displacement, horsepower, weight}
and
Y2 = {low, medium, high}
where the weight of the car is measured in pounds (between 732 and 5140 lbs.).
Sample training data for the second study is shown in the table below.
Efficiency
Acceleration  Displacement  Horsepower  Weight  MPG
20            120           87          2634    medium
15            130           97          2234    high
Suppose we need to find a classifier for
X = X1 ∪ X2 = {acceleration, cylinders, displacement, horsepower, weight}
and
Y = Y1 × Y2 = {Europe-low, Europe-medium, Europe-high,
Japan-low, Japan-medium, Japan-high,
USA-low, USA-medium, USA-high}
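The target label set is simply the Cartesian product of the two original label sets, which a short Python sketch makes explicit. The hyphen used to join a country with an MPG class is a naming choice made here for readability.

```python
from itertools import product

Y1 = ["Europe", "Japan", "USA"]   # labels of the first study
Y2 = ["low", "medium", "high"]    # labels of the second study

# Every combination class pairs one country with one MPG class.
Y = [f"{country}-{mpg}" for country, mpg in product(Y1, Y2)]
```

The list `Y` has 3 × 3 = 9 entries, one for each combination class that the reclassification must be able to predict.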
Building a new classifier for (X, Y) seems easy, but the problem is that there
is no database for (X, Y). Finding such a database would require a new study
with more data collection, which would take considerable time. That motivates
the need for reclassification. As Section 4 shows, a classifier for (X, Y) can be
built by an efficient reclassification algorithm that uses only the already existing
classifiers for (X1, Y1) and (X2, Y2).
4 Novel Reclassification Methods
We now introduce several new reclassification methods. Section 4.1 describes
two variants of the Reclassification with an oracle method. While oracle-based
methods do not exist in practice, these methods give a limit to the best possible
practical methods. Section 4.2 describes the practical Reclassification with
constraint databases method. A comparison of these two methods is given later in
Section 5.
4.1 Reclassification with an Oracle
In theoretical computer science, researchers study the computational complexity
of algorithms in the presence of an oracle that tells some extra information that
can be used by the algorithm. The computational complexity results derived
this way can be useful in establishing theoretical limits to the computational
complexity of the studied algorithms.
Similarly, in this section we study the reclassification problem with a special
type of oracle. The oracle we allow can tell the value of a missing attribute of
each record. This allows us to derive essentially a theoretical upper bound on
the best reclassification that can be achieved. The reclassification with oracle
method extends each of the original relations with the attributes that occur
only in the other relation. Then one can take a union of the extended relations
and apply any classification algorithm one chooses. We illustrate the idea
behind the Reclassification with an oracle using an extension of Example 2 and
ID3.
Example 3. First, we add a weight and an MPG attribute to each record in the
Origin relation using an oracle. Suppose we get the following:
Origin
Acceleration  Cylinders  Displacement  Horsepower  Weight  Country  MPG
12            4          304           150         4354    USA      low
9             3          454           220         3086    Japan    medium
Second, we add a cylinders and a country attribute to each record in the
Efficiency relation using an oracle. Suppose we get the following:
Efficiency
Acceleration  Cylinders  Displacement  Horsepower  Weight  Country  MPG
20            6          120           87          2634    USA      medium
15            4          130           97          2234    Europe   high
After the union of these two relations, we can train an ID3 decision tree to
yield a reclassification as required in Example 2.
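The extension and union steps of Example 3 can be sketched as follows. Since a real oracle does not exist, the function `oracle` below is a hypothetical stand-in that simply looks up the values listed in the tables of Example 3; the dictionary-based record layout is likewise an assumption of this sketch.

```python
# Records of the two original relations, with values as listed in Example 3.
origin = [
    {"acceleration": 12, "cylinders": 4, "displacement": 304, "horsepower": 150, "country": "USA"},
    {"acceleration": 9, "cylinders": 3, "displacement": 454, "horsepower": 220, "country": "Japan"},
]
efficiency = [
    {"acceleration": 20, "displacement": 120, "horsepower": 87, "weight": 2634, "mpg": "medium"},
    {"acceleration": 15, "displacement": 130, "horsepower": 97, "weight": 2234, "mpg": "high"},
]

# Hypothetical oracle answers, keyed by (acceleration, displacement, horsepower).
ORACLE_ANSWERS = {
    (12, 304, 150): {"weight": 4354, "mpg": "low"},
    (9, 454, 220): {"weight": 3086, "mpg": "medium"},
    (20, 120, 87): {"cylinders": 6, "country": "USA"},
    (15, 130, 97): {"cylinders": 4, "country": "Europe"},
}


def oracle(record):
    """Return the attributes that are missing from this record."""
    key = (record["acceleration"], record["displacement"], record["horsepower"])
    return ORACLE_ANSWERS[key]


def extend_with_oracle(relation):
    """Copy each record and add the oracle-supplied attributes."""
    return [{**record, **oracle(record)} for record in relation]


# The union of the two extended relations is the training set for the combined classifier.
training_set = extend_with_oracle(origin) + extend_with_oracle(efficiency)
```

After the extension every record carries the same seven attributes, so any classification algorithm can be trained on `training_set` with the pair (country, mpg) as the target.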
A slight variation of the Reclassification with an oracle method is the
Reclassification with an X-oracle. That means that we only use the oracle to extend
the original relations with the missing X attributes. For example, in the car
example, we use the oracle to extend the Origin relation by only weight, and the
Efficiency relation by only cylinders.
When we do that, the original classification for MPG (derived from the
second study) can be applied to the records in the extended Origin relation.
Note that this avoids using an oracle to fill in the MPG values, which are Y or
target values. Similarly, the original classification for country (derived from the
first study) can be applied to the records in the extended Efficiency relation.
The Reclassification with an X-oracle is also not a practical method, because
oracles do not exist and therefore cannot be used in practice. The only exception
is when the two original studies have exactly the same set of X attributes, in
which case no oracle is needed.
4.2 Reclassification with Constraint Databases
The Reclassification with Constraint Databases method has two main steps:
Translation to Constraint Relations: We translate the original linear
classifiers to a constraint database representation. Our method does not depend
on any particular linear classification method. It can be an ID3 decision
tree method [7] or a support vector machine classification [10] or some other
linear classification method.
Join: The linear constraint relations are joined together using a constraint join
operator [6, 8].
Example 4. Figure 1, which we saw earlier, shows an ID3 decision tree for the
country of origin of the cars obtained after training on 50 random samples from
a cars database [1]. A straightforward translation from the original decision tree
to a linear constraint database does not yield a good result for problems where
the attributes can have real number values instead of only discrete values. Real
number values are often used when we measure some attribute like weight in
pounds or volume in centiliters.
Hence we improve the naive translation by introducing the comparison
constraints >, <, ≥, and ≤ to allow continuous values for some attributes.
That is, we translate each node of the decision tree by analyzing all of its
children. First, the children of each node are sorted based on the possible values
of the attribute. Then, we define an interval around each discrete value based
on the values of the previous and the following children. The lower bound of
the interval is defined as the median value (the midpoint) between the value of
the current child and the value of the previous child. Similarly, the upper bound
of the interval is defined as the median value of the current and the following
children. For instance, assume we have the values {10, 14, 20} for an attribute for
the children. This will lead to the intervals {(−∞, 12], (12, 17], (17, +∞)}.
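The interval heuristic can be sketched in a few lines of Python. The function name and the representation of an interval as a pair (low, high), read as low < x ≤ high, are assumptions of this sketch rather than the MLPQ implementation.

```python
import math


def intervals_from_values(values):
    """Turn the discrete child values of a node into half-open intervals (low, high].

    The cut points are the midpoints between neighboring sorted values; the first
    interval is unbounded below and the last one is unbounded above.
    """
    ordered = sorted(values)
    cuts = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    lowers = [-math.inf] + cuts
    uppers = cuts + [math.inf]
    return list(zip(lowers, uppers))
```

For the values {10, 14, 20} the function returns the pairs (−∞, 12), (12, 17), and (17, +∞), which correspond to the three intervals of the example.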
In the following, let a be the acceleration, c the number of cylinders, d the
displacement of the engine, h the horsepower, w the weight of the car, country
the origin of the car, and mpg the miles per gallon of the car. We use the
depth-first algorithm with the above heuristic on the cars data from [1] to generate
the following MLPQ [9] constraint database:
Origin(a,c,d,h,country) :- c = 3, country ='JAPAN'.
Origin(a,c,d,h,country) :- c = 4, d > 111, country ='USA'.
Origin(a,c,d,h,country) :- c = 4, d > 96, d <= 111, country ='EUROPE'.
Origin(a,c,d,h,country) :- c = 4, d > 87, d <= 96, h > 67, country ='USA'.
...
Similarly, we used another decision tree to classify the efficiency of the cars.
Translating the second decision tree yielded the following constraint relation:
Efficiency(a,d,h,w,mpg) :- d <= 103, mpg = 'low'.
Efficiency(a,d,h,w,mpg) :- d > 103 , d <= 112 , h < 68, mpg = 'high'.
Efficiency(a,d,h,w,mpg) :- d > 420, mpg = 'low'.
...
Now the reclassification problem can be solved by a constraint database join
of the Origin and Efficiency relations. The join is expressed by the following
Datalog query:
Car(a,c,d,h,w,country,mpg) :- Origin(a,c,d,h,country),
Efficiency(a,d,h,w,mpg).
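To illustrate what such a join computes, the following Python sketch intersects the constraints of every pair of rows and keeps only the satisfiable combinations. It is not the MLPQ implementation: the `Interval` class and the helper names are hypothetical, all variables are treated as real-valued, and only the rows listed above are included, so the rows elided by "..." are missing.

```python
from dataclasses import dataclass
from math import inf


@dataclass(frozen=True)
class Interval:
    lo: float = -inf
    hi: float = inf
    lo_open: bool = True
    hi_open: bool = True

    def intersect(self, other):
        # Larger lower bound wins; on a tie the open (stricter) bound wins.
        lo, lo_open = max((self.lo, self.lo_open), (other.lo, other.lo_open))
        # Smaller upper bound wins; on a tie the open (stricter) bound wins.
        hi, hi_closed = min((self.hi, not self.hi_open), (other.hi, not other.hi_open))
        return Interval(lo, hi, lo_open, not hi_closed)

    def is_empty(self):
        return self.lo > self.hi or (self.lo == self.hi and (self.lo_open or self.hi_open))


def eq(v): return Interval(v, v, False, False)
def gt(v): return Interval(lo=v)
def lt(v): return Interval(hi=v)
def le(v): return Interval(hi=v, hi_open=False)
def between(lo, hi): return Interval(lo, hi, True, False)   # lo < x <= hi


# Only the rows shown in the listings; each row is (constraints, label).
ORIGIN = [
    ({"c": eq(3)}, "JAPAN"),
    ({"c": eq(4), "d": gt(111)}, "USA"),
    ({"c": eq(4), "d": between(96, 111)}, "EUROPE"),
    ({"c": eq(4), "d": between(87, 96), "h": gt(67)}, "USA"),
]
EFFICIENCY = [
    ({"d": le(103)}, "low"),
    ({"d": between(103, 112), "h": lt(68)}, "high"),
    ({"d": gt(420)}, "low"),
]


def join(origin_rows, efficiency_rows):
    """Pair every Origin row with every Efficiency row and drop unsatisfiable pairs."""
    car = []
    for o_box, country in origin_rows:
        for e_box, mpg in efficiency_rows:
            box = dict(o_box)
            for var, interval in e_box.items():
                box[var] = box.get(var, Interval()).intersect(interval)
            if not any(iv.is_empty() for iv in box.values()):
                car.append((box, country, mpg))
    return car


car = join(ORIGIN, EFFICIENCY)
```

Of the 12 candidate pairs built from the listed rows, 8 are satisfiable. For example, the USA row with d > 111 cannot be combined with the low row with d ≤ 103, so that pair is discarded.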
The reclassification can be used to predict for any particular car its country
of origin and fuel efficiency. For example, if we have a car with a = 19.5, c =
4, d = 120, h = 87, and w = 2979, then we can use the following Datalog query:
Predict(country,mpg) :- Car(a,c,d,h,w,country,mpg),
a = 19.5, c = 4, d = 120, h = 87, w = 2979.
The prediction for this car is that it is from Europe and has a low fuel
efficiency. Note that instead of Datalog queries, in the MLPQ constraint database
system one can also use logically equivalent SQL queries to express the above
problems.
Semantic Analysis of the Reclassification: Returning to the semantics
questions raised in the introduction, one can test whether each constraint row of
the Car relation is satisfiable or not. That allows testing which
combination classes (out of the nine target labels) are possible. Moreover, the size
of each region can be calculated. Assuming some simple distributions of the cars
within the feature space, one can estimate the number of cars in each region.
Hence one can also estimate the percentage of cars that belong to each combination
class.
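As a rough illustration of this kind of estimate, the sketch below samples cars uniformly and independently within the attribute ranges given in Example 2 and counts how often each combination class is predicted. The uniform distribution, the function names, and the restriction to the constraint rows listed in Example 4 are all assumptions of this sketch, so the numbers it produces are not the paper's results. The car queried in the text is not covered by the listed Efficiency rows, so its prediction relies on rows elided from the listing; the point query below therefore uses a hypothetical car.

```python
import random
from collections import Counter

# Only the constraint rows listed in Example 4, written as Python predicates.
ORIGIN = [
    (lambda c, d, h: c == 3, "JAPAN"),
    (lambda c, d, h: c == 4 and d > 111, "USA"),
    (lambda c, d, h: c == 4 and 96 < d <= 111, "EUROPE"),
    (lambda c, d, h: c == 4 and 87 < d <= 96 and h > 67, "USA"),
]
EFFICIENCY = [
    (lambda d, h: d <= 103, "low"),
    (lambda d, h: 103 < d <= 112 and h < 68, "high"),
    (lambda d, h: d > 420, "low"),
]


def predict(c, d, h):
    """Return every (country, mpg) pair whose joined constraints hold for this car."""
    return [(country, mpg)
            for o_rule, country in ORIGIN if o_rule(c, d, h)
            for e_rule, mpg in EFFICIENCY if e_rule(d, h)]


def estimate_class_shares(samples=20000, seed=0):
    """Estimate the share of cars in each combination class under a uniform assumption."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(samples):
        c = rng.randint(3, 8)        # cylinders between 3 and 8
        d = rng.uniform(68, 455)     # displacement in cubic inches
        h = rng.uniform(46, 230)     # horsepower
        counts.update(predict(c, d, h))
    return {label: n / samples for label, n in counts.items()}


# Hypothetical car chosen to fall inside the listed rows.
example = predict(c=4, d=100, h=70)
```

Here `example` evaluates to a single pair, EUROPE and low. The shares returned by `estimate_class_shares` add up to less than one because the listed rows cover only part of the feature space.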
5 Experiments and Discussion
The goal of our experiments is to compare the Reclassification with constraint
databases and the Reclassification with an oracle methods. It is important to
design the experiments so that they abstract away from the side issue of which
exact classification algorithm (ID3, SVM, etc.) is used for the original
classification.
In our experiments we use the ID3 method as described in Example 4.
Therefore, we compared the Reclassification with constraint databases assuming that
ID3 was the original classification method with the Reclassification with an
oracle assuming that the same ID3 method was used within it. We also compared
Reclassification with constraint databases with the original linear classification
(decision tree) for each class. We chose the ID3 decision tree method for the
experiments because non-copyrighted software was available for it. Using ID3
already gave interesting results that helped compare the relative accuracy of the
methods. A more complex decision tree method would likely make the accuracy of
all the practical algorithms, including the reclassification methods, proportionally
better without changing their relative order.
5.1 Experiment with the dataset "Primary Biliary Cirrhosis"
The Primary Biliary Cirrhosis (PBC) data set, collected between 1974 and 1984
by the Mayo Clinic about 314 patients [3], contains the following features:
1. case number,
2. days between registration and the earliest of death, transplantation, or study
analysis time,
3. age in days,
4. sex (0=male, 1=female),
5. ascites present (0=no or 1=yes),
6. hepatomegaly present (0=no or 1=yes),
7. spiders present (0=no or 1=yes),
8. edema (0 = no edema, 0.5 = edema resolved with/without diuretics, 1 =
edema despite diuretics),
9. serum bilirubin in mg/dl,
10. serum cholesterol in mg/dl,
11. albumin in mg/dl,
12. urine copper in µg/day,
13. alkaline phosphatase in Units/liter,
14. SGOT in Units/ml,
15. triglycerides in mg/dl,
16. platelets per cubic ml/1000,
17. prothrombin time in seconds,
18. status (0=alive, 1=transplanted, or 2=dead),
19. drug (1=D-penicillamine or 2=placebo), and
20. histologic stage of disease (1, 2, 3, 4).
We generated the following three subsets from the original data set:
DISEASE with features (3, 4, 5, 7, 8, 9, 10, 13, 14, 16, 17, 20),
DRUG with features (3, 4, 6, 7, 8, 9, 10, 11, 13, 16, 17, 19), and
STATUS with features (3, 4, 5, 6, 7, 8, 9, 10, 11, 16, 17, 18).
In each subset, we used the first eleven features to predict the twelfth,
that is, the last feature.
The results of the experiment are shown in Figures 4, 5, 6 and 7.
It can be seen from Figures 4, 5 and 6 that the Reclassification with
constraint databases is significantly more accurate than the original
linear classification (ID3) of a single class.
Figure 7 shows that the Reclassification with constraint databases and the
Reclassification with an oracle perform very similarly. Hence the practical
Reclassification with constraint databases method achieves what can be considered
the theoretical limit represented by the Reclassification with an oracle method.
Note that by theoretical limit we mean only the maximum achievable with the
use of the ID3 linear classification algorithm. Presumably, if we use, for example,
support vector machines, then both methods will improve proportionally.
5.2 Experiment with the dataset "cars"
In this experiment we used the car data set (pieces of which we used in the
examples of this paper) from [1] and the MLPQ constraint database system [9].
The results of the experiment are shown in Figures 8, 9 and 10.
Fig. 3. Tree representation of the constraints used for the prediction of the status of a
patient using PBC data. Note that this tree is different from the decision tree generated
using the standard ID3 algorithm (here the values of the attributes are defined using
constraints). The tree is drawn using Graphviz [2].
Fig. 4. Comparison of the Reclassification with constraint databases (solid line) and
the original Classification with a decision tree (ID3) (dashed line) methods for the
prediction of the class DISEASE-STAGE of the patients using PBC data.
Fig. 5. Comparison of the Reclassification with constraint databases (solid line) and the
original Classification with a decision tree (ID3) (dashed line) methods for the
prediction of the class STATUS of the patients using PBC data.
Fig. 6. Comparison of the Reclassification with constraint databases (solid line) and the
original Classification with a decision tree (ID3) (dashed line) methods for the
prediction of the class DRUG of the patients using PBC data.
Fig. 7. Comparison of the Reclassification with constraint databases (solid line) and
the Reclassification with an oracle (dashed line) methods using PBC data.
Fig. 8. Comparison of the Reclassification with constraint databases (solid line) and the
original Classification with a decision tree (ID3) (dashed line) for the prediction of the
class ORIGIN of the cars using cars data.
Fig. 9. Comparison of the Reclassification with constraint databases (solid line) and the
original Classification with a decision tree (ID3) (dashed line) for the prediction of the
class MPG efficiency of the cars using cars data.
Fig. 10. Comparison of the Reclassification with constraint databases (solid line) and
the Reclassification with an oracle (dashed line) using cars data.
This second experiment agrees with the first in that constraint
database-based reclassification performs better than the original linear decision
tree-based classification.
6 Conclusions
The most important conclusion that can be drawn from the study and the
experiments is that the Reclassification with constraint databases method improves
the accuracy of linear classifiers such as decision trees. The proposed method is
also close to the theoretical optimum when joining two classes and is safe to use
in practice.
There are several open problems. We plan to experiment with other data
sets and use the linear Support Vector Machine algorithm in addition to the ID3
algorithm in the future. Also, when appropriate data can be found, we
would like to test the Reclassification method with X-oracles.
Acknowledgement: The work of the first author of this paper was
supported in part by a Fulbright Senior U.S. Scholarship from the Fulbright
Foundation-Greece.
References
1. D. Donoho and E. Ramos. The CRCARS dataset. Exposition of Statistical
Graphics Technology, Toronto, 1983.
2. J. Ellson, E. Gansner, E. Koutsofios, S. North, and G. Woodhull. Graphviz and
dynagraph - static and dynamic graph drawing tools. In M. Jünger and P. Mutzel,
editors, Graph Drawing Software, pages 127–148. Springer-Verlag, 2003.
3. T. R. Fleming and D. P. Harrington. Counting Processes and Survival Analysis.
Wiley, New York, 1991.
4. I. Geist. A framework for data mining and KDD. In Proc. ACM Symposium on
Applied Computing, pages 508–513. ACM Press, 2002.
5. T. Johnson, L. V. Lakshmanan, and R. T. Ng. The 3W model and algebra for
unified data mining. In Proc. IEEE International Conference on Very Large Databases,
pages 21–32, 2000.
6. G. M. Kuper, L. Libkin, and J. Paredaens, editors. Constraint Databases.
Springer-Verlag, 2000.
7. J. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
8. P. Revesz. Introduction to Constraint Databases. Springer-Verlag, 2002.
9. P. Revesz, R. Chen, P. Kanjamala, Y. Li, Y. Liu, and Y. Wang. The MLPQ/GIS
constraint database system. In Proc. ACM SIGMOD International Conference on
Management of Data, 2000.
10. V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
