Knowledge Discovery in Multi-label Phenotype Data
- ISBN: 3540425349
- DOI: 10.1007/3-540-44794-6_4
Abstract
The biological sciences are undergoing an explosion in the amount of available data. New data analysis methods are needed to deal with the data. We present work using KDD to analyse data from mutant phenotype growth experiments with the yeast S. cerevisiae to predict novel gene functions. The analysis of the data presented a number of challenges: multi-class labels, a large number of sparsely populated classes, the need to learn a set of accurate rules (not a complete classification), and a very large amount of missing values. We developed resampling strategies and modified the algorithm C4.5 to deal with these problems. Rules were learnt which are accurate and biologically meaningful. The rules predict function of 83 putative genes of currently unknown function at an estimated accuracy of 80%.
Amanda Clare and Ross D. King
Department of Computer Science,
University of Wales Aberystwyth, SY23 3DB, UK
{ajc99,rdk}@aber.ac.uk
1 Introduction
The biological sciences are undergoing an unprecedented increase in the amount
of available data. In the last few years the complete genomes of ~30 microbes
have been sequenced, as well as that of "the worm" (C. elegans) and "the fly"
(D. melanogaster). The last few months have seen the sequencing of the first plant
genome Arabidopsis [10], and the greatest prize of all, the human genome [11, 33].
In addition to data from sequencing, new post genomic technologies are enabling
the large-scale and parallel interrogation of cell states under different stages of
development and under particular environmental conditions, generating very
large databases. Such analyses may be carried out at the level of mRNA using
micro-arrays (e.g. [6, 9]) (the transcriptome). Similar analyses may be carried out
at the level of the protein to define the proteome (e.g. [2]), or at the level of small
molecules, the metabolome (e.g. [27]). This data is replete with undiscovered
biological knowledge which holds the promise of revolutionising biotechnology
and medicine. KDD techniques are well suited to extracting this knowledge.
Currently most KDD analysis of bioinformatic data has been based on using
unsupervised methods e.g. [9, 17, 32], but some has been based on supervised
methods [4, 7, 14]. New KDD methods are constantly required to meet the new
challenges presented by new forms of bioinformatic data.
Perhaps the least analysed form of genomics data is that from phenotype
experiments [25, 22, 18]. In these experiments specific genes are removed from the
cells to form mutant strains, and these mutant strains are grown under different
conditions with the aim of finding growth conditions where the mutant and the
wild type (no mutation) differ ("a phenotype"). This approach is analogous to
removing components from a car and then attempting to drive the car under
different conditions to diagnose the role of the missing component.
In this paper we have developed KDD techniques to analyse phenotype
experiment data. We wish to learn rules that, given a particular set of phenotype
experimental results, predict the functional class of the gene mutated. This is an
important biological problem because, even in yeast, one of the best characterised
organisms, the function of 30-40% of its genes is still unknown.
Phenotype experiment data presents a number of challenges to standard data
analysis methods: the functional classes for genes exist in a hierarchy, a gene may
have more than one functional class, and we wish to learn a set of accurate rules
- not necessarily a complete classification. The recognition of functional class
hierarchies has been one of the most important recent advances in bioinformatics
[29, 1, 13]. For example in the Munich Information Center for Protein Sequences
(MIPS) hierarchy (http://mips.gsf.de/proj/yeast/catalogues/funcat/) the top
level of the hierarchy has classes such as "Metabolism", "Energy", "Transcription"
and "Protein Synthesis". Each of these classes is then subdivided into more
specific classes, and these are in turn subdivided, and then again subdivided, so
the hierarchy is up to 4 levels deep. An example of a subclass of "Metabolism"
is "amino-acid metabolism", and an example of a subclass of this is "amino-acid
biosynthesis". An example of a gene in this subclass is YPR145w (gene name
ASN1, product "asparagine synthetase"). In neither machine learning nor
statistics has much work been done on classification problems where there is a
class hierarchy. However, such problems are relatively common in the real world,
particularly in text classification [16, 24, 21]. We deal with the class hierarchy by
learning separate classifiers for each level. This simple approach has the
unfortunate side-effect of fragmenting the class structure and producing many classes
with few members - e.g. there are 99 potential classes represented in the data
for level 2 in the hierarchy. We have therefore developed a resampling method
to deal with the problem of learning rules from sparse data and few examples
per class.
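The per-level view of the class hierarchy, and the fragmentation it causes, can be illustrated with a minimal Python sketch. This is not the authors' code and it does not implement their resampling method; it only shows, under stated assumptions, how truncating each gene's class path to a given level yields one labelled dataset per level, and how class sizes shrink at deeper levels. Only YPR145w and its class path come from the text above; the other gene names, the "hypothetical subclass" names, and the helper functions labels_at_level and class_sizes are invented for illustration.

```python
from collections import defaultdict

# Toy annotations: each gene maps to one or more class paths, written from the
# top of the hierarchy downwards. Only YPR145w and its path come from the text;
# every other gene and subclass name here is a hypothetical placeholder.
ANNOTATIONS = {
    "YPR145w": [("Metabolism", "amino-acid metabolism", "amino-acid biosynthesis")],
    "geneA": [("Metabolism", "hypothetical subclass x")],
    "geneB": [("Metabolism", "hypothetical subclass y"),
              ("Energy", "hypothetical subclass z")],
    "geneC": [("Transcription",)],
}


def labels_at_level(annotations, level):
    """Return {gene: set of class paths truncated to `level`}.

    Paths shallower than `level` carry no label at that level, so a gene
    whose paths are all too shallow is left out of the result.
    """
    result = {}
    for gene, paths in annotations.items():
        truncated = {path[:level] for path in paths if len(path) >= level}
        if truncated:
            result[gene] = truncated
    return result


def class_sizes(level_labels):
    """Count how many genes fall into each class at one level."""
    counts = defaultdict(int)
    for classes in level_labels.values():
        for cls in classes:
            counts[cls] += 1
    return dict(counts)


level1 = labels_at_level(ANNOTATIONS, 1)
level2 = labels_at_level(ANNOTATIONS, 2)

print("level 1 class sizes:", class_sizes(level1))
print("level 2 class sizes:", class_sizes(level2))
```

In this toy data the largest level 1 class has three members, while every level 2 class has exactly one, which is the kind of sparsity that motivates a resampling scheme. A gene can also carry several labels at the same level (geneB here), which is why each gene maps to a set of classes rather than a single class.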
Perhaps an even greater difficulty with the data is that genes may have more
than one functional class. This is reflected in the MIPS classification scheme
(where a single gene can belong to up to 10 different functional classes). This
means that the classification problem is a multi-label one (as opposed to
multi-class, which usually refers to simply having more than two possible disjoint classes
for the classifier to learn). There is only a limited literature on such problems, for
example [12, 20, 30]. The UCI repository [3] currently contains just one dataset
("University") that can be considered a multi-label problem. (This dataset shows
the academic emphasis of individual universities, which can be multi-valued, for
example, business-education, engineering, accounting and fine-arts.) The
simplest approach to the multi-label problem is to learn separate classifiers for each
class (with all genes not belonging to a specific class used as negative examples


