
An Empirical Comparison of Probability Estimation Techniques for Probabilistic Rules

by Jan-Nikolas Sulzmann, Johannes Fürnkranz
Discovery Science (2009)


Available from www.springerlink.com
Jan-Nikolas Sulzmann and Johannes Fürnkranz
Department of Computer Science, TU Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
{sulzmann,juffi}@ke.informatik.tu-darmstadt.de
Abstract. Rule learning is known for its descriptive and therefore
comprehensible classification models which also yield good class predictions.
However, in some application areas, we also need good class probability
estimates. For different classification models, such as decision trees, a
variety of techniques for obtaining good probability estimates have been
proposed and evaluated. However, so far, there has been no systematic
empirical study of how these techniques can be adapted to probabilistic
rules and how these methods affect the probability-based rankings. In
this paper we apply several basic methods for the estimation of class
membership probabilities to classification rules. We also study the effect
of a shrinkage technique for merging the probability estimates of rules
with those of their generalizations.
1 Introduction
The main focus of symbolic learning algorithms such as decision tree and rule
learners is to produce a comprehensible explanation for a class variable. Thus,
they learn concepts in the form of crisp IF-THEN rules. On the other hand,
many practical applications require a finer distinction between examples than is
provided by their predicted class labels. For example, one may want to be able
to provide a confidence score that estimates the certainty of a prediction, to rank
the predictions according to their probability of belonging to a given class, to
make a cost-sensitive prediction, or to combine multiple predictions.
All these problems can be solved straightforwardly if we can predict a
probability distribution over all classes instead of a single class value. A
straightforward approach to estimating probability distributions for
classification rules is to compute the fraction of the covered examples that
belongs to each class. However, this naïve approach has an obvious
disadvantage: rules that cover only a few examples may yield extreme
probability estimates. Thus, the probability estimates need to be smoothed.
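
The problem with the naïve estimate can be illustrated with a minimal sketch. It is not taken from the paper: the function name `naive_estimate` and the two example rules are hypothetical and serve only to show how a rule with very small coverage ends up with an extreme estimate.

```python
from collections import Counter


def naive_estimate(covered_labels, classes):
    """Relative class frequencies among the examples covered by a rule."""
    total = len(covered_labels)
    if total == 0:
        raise ValueError("rule covers no examples")
    counts = Counter(covered_labels)  # missing classes count as 0
    return {c: counts[c] / total for c in classes}


classes = ["pos", "neg"]

# Hypothetical rule covering only two training examples, both positive.
small_rule = naive_estimate(["pos", "pos"], classes)

# Hypothetical rule covering 100 training examples, 90 of them positive.
large_rule = naive_estimate(["pos"] * 90 + ["neg"] * 10, classes)

print(small_rule)  # {'pos': 1.0, 'neg': 0.0}
print(large_rule)  # {'pos': 0.9, 'neg': 0.1}
```

The rule backed by only two examples claims full certainty, while the rule backed by one hundred examples receives the more cautious estimate. A ranking based on these values would therefore favor the less reliable rule.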
J. Gama et al. (Eds.): DS 2009, LNAI 5808, pp. 317–331, 2009.
© Springer-Verlag Berlin Heidelberg 2009
There has been quite some previous work on probability estimation from
decision trees (so-called probability-estimation trees, PETs). A very simple
but quite powerful technique for improving class probability estimates is the
use of m-estimates, or their special case, the Laplace estimates (Cestnik, 1990).
Provost and Domingos (2003) showed that unpruned decision trees with
Laplace-corrected probability estimates at the leaves produce quite reliable
decision tree estimates. Ferri et al. (2003) proposed a recursive computation of
the m-estimate, which uses the probability distribution at level l as the prior
probabilities for level l + 1. Wang and Zhang (2006) used a general shrinkage
approach, which interpolates the estimated class distribution at the leaf nodes
with the estimates in interior nodes on the path from the root to the leaf.
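
The techniques named above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of any of the cited works: all function names are hypothetical, the recursive variant uses one constant m on every level, and the shrinkage weights are chosen by hand, whereas the cited approaches determine such parameters in their own ways.

```python
def m_estimate(counts, prior, m):
    """m-estimate per class c: (n_c + m * prior_c) / (n + m)."""
    n = sum(counts.get(c, 0) for c in prior)
    return {c: (counts.get(c, 0) + m * prior[c]) / (n + m) for c in prior}


def laplace_estimate(counts, classes):
    """Laplace estimate: m-estimate with a uniform prior and m = number of classes."""
    k = len(classes)
    return m_estimate(counts, {c: 1.0 / k for c in classes}, k)


def recursive_m_estimate(path_counts, root_prior, m):
    """Estimate at level l serves as the prior at level l + 1 (root first, leaf last)."""
    estimate = root_prior
    for counts in path_counts:
        estimate = m_estimate(counts, estimate, m)
    return estimate


def shrinkage(estimates, weights):
    """Weighted interpolation of the estimates on a path; weights should sum to 1."""
    return {c: sum(w * e[c] for w, e in zip(weights, estimates)) for c in estimates[0]}


classes = ["pos", "neg"]
uniform = {c: 1.0 / len(classes) for c in classes}

# Hypothetical class counts on a path from the root to a leaf.
path = [{"pos": 60, "neg": 40}, {"pos": 9, "neg": 1}, {"pos": 2, "neg": 0}]

leaf_laplace = laplace_estimate(path[-1], classes)
print(leaf_laplace)  # {'pos': 0.75, 'neg': 0.25}

leaf_recursive = recursive_m_estimate(path, uniform, m=2)
print(leaf_recursive)

per_level = [laplace_estimate(c, classes) for c in path]
leaf_shrunk = shrinkage(per_level, weights=[0.2, 0.3, 0.5])  # hand-picked weights
print(leaf_shrunk)
```

All three variants pull the extreme leaf estimate of 1.0 for the positive class toward less extreme values. They differ in where the correction comes from: a uniform prior, the estimates of the ancestor nodes used as priors, or a direct interpolation with the ancestors.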
An interesting observation is that, contrary to classification, class
probability estimation for decision trees typically works better on unpruned
trees than on pruned trees. The explanation for this is simply that, as all
examples in a leaf receive the same probability estimate, pruned trees provide
a much coarser ranking than unpruned trees. Hüllermeier and Vanderlooy (2009)
have provided a simple but elegant analysis of this phenomenon, which shows
that replacing a leaf with a subtree can only lead to an increase in the area
under the ROC curve (AUC), a commonly used measure for the ranking capabilities
of an algorithm. Of course, this only holds for the AUC estimate on the
training data, but it still may provide a strong indication why unpruned PETs
typically also outperform pruned PETs on the test set.
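
The effect on the training-set AUC can be checked on a small example. The sketch below is an illustration of a single case and not a proof. It assumes that each leaf scores its examples with the relative frequency of the positive class and that tied positive/negative pairs count one half; the function name `training_auc` and the leaf counts are hypothetical.

```python
def training_auc(leaves):
    """Training-set AUC for leaves given as (positives, negatives) counts.

    Every example in a leaf receives the score pos / (pos + neg). The AUC is
    the fraction of positive/negative pairs that are ranked correctly, with
    tied pairs counted as one half.
    """
    total_pos = sum(p for p, _ in leaves)
    total_neg = sum(n for _, n in leaves)
    correct = 0.0
    for p_i, n_i in leaves:
        score_i = p_i / (p_i + n_i)
        for p_j, n_j in leaves:
            score_j = p_j / (p_j + n_j)
            pairs = p_i * n_j  # positives of leaf i paired with negatives of leaf j
            if score_i > score_j:
                correct += pairs
            elif score_i == score_j:
                correct += 0.5 * pairs
    return correct / (total_pos * total_neg)


# Hypothetical pruned tree with two leaves.
pruned = [(8, 2), (4, 6)]

# Same tree after replacing the second leaf by a subtree with two leaves.
unpruned = [(8, 2), (3, 1), (1, 5)]

print(training_auc(pruned))
print(training_auc(unpruned))
```

In the pruned tree, all ten examples of the second leaf are tied. Splitting the leaf resolves most of these ties in favor of the positive examples, so the training-set AUC of the unpruned tree is higher.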
Despite the amount of work on probability estimation for decision trees, there
has been hardly any systematic work on probability estimation for rule learning.
Although the two settings are obviously similar, we argue that a separate study
of probability estimates for rule learning is necessary.
A key difference is that in the case of decision tree learning, probability
estimates will not change the prediction for an example, because the predicted
class only depends on the probabilities of a single leaf of the tree, and such
local probability estimates are typically monotone in the sense that they all
maintain the majority class as the class with the maximum probability. In the
case of rule learning, on the other hand, each example may be classified by
multiple rules, which may predict different classes. As many tie-breaking
strategies depend on the class probabilities, a local change in the class
probability of a single rule may change the global prediction of the
rule-based classifier.
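
This non-local effect can be made concrete with a small sketch. It assumes one particular tie-breaking strategy, namely that the covering rule with the highest estimated probability for its own predicted class decides. This strategy, the function names, and the two rules are hypothetical and are not taken from the paper.

```python
def naive(counts, cls):
    """Relative frequency of class cls among the covered examples."""
    return counts[cls] / sum(counts.values())


def laplace(counts, cls):
    """Laplace-corrected frequency: (n_c + 1) / (n + number of classes)."""
    return (counts[cls] + 1) / (sum(counts.values()) + len(counts))


def predict(covering_rules, estimator):
    """Let the covering rule that is most confident in its own head decide."""
    best = max(covering_rules, key=lambda r: estimator(r["counts"], r["head"]))
    return best["head"]


# Two hypothetical rules that both cover the example to be classified.
rules = [
    {"head": "pos", "counts": {"pos": 2, "neg": 0}},
    {"head": "neg", "counts": {"pos": 1, "neg": 19}},
]

print(predict(rules, naive))    # pos
print(predict(rules, laplace))  # neg
```

With naïve estimates the first rule wins (1.0 against 0.95). After the Laplace correction its estimate drops to 0.75, while the second rule only drops to about 0.91, so the predicted class flips. Within each rule the majority class is unchanged; only the comparison across rules differs.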
Because of these non-local effects, it is not evident that the same methods that
work well for decision tree learning will also work well for rule learning. Indeed,
as we will see in this paper, our conclusions differ from those that have been
drawn from similar experiments in decision tree learning. For example, the
above-mentioned argument that unpruned trees will lead to a better (training-set)
AUC than pruned trees does not carry over straightforwardly to rule learning,
because the replacement of a leaf with a subtree is a local operation that only
affects the examples that are covered by this leaf. In rule learning, on the other
hand, each example may be covered by multiple rules, so that the effect of
replacing one rule with multiple, more specific rules is less predictable. Moreover,
each example will be covered by some leaf in a decision tree, whereas each rule
learner needs to induce a separate default rule that covers examples that are
covered by no other rule.
The rest of the paper is organized as follows: In section 2 we briefly describe the
basics of probabilistic rule learning and recapitulate the estimation techniques
