Classification, Clustering and Data-Mining of Biological Data
Abstract
The proliferation of biological databases and the easy access enabled by the Internet is having a beneficial impact on biological sciences and transforming the way research is conducted. There are currently over 1100 molecular biology databases dispersed throughout the Internet. However, very few of them integrate data from multiple sources. To assist in the functional and evolutionary analysis of the abundant number of novel proteins, we introduce the PROFESS (PROtein Function, Evolution, Structure and Sequence) database that integrates data from various biological sources. PROFESS is freely available at http://cse.unl.edu/~profess/. Our database is designed to be versatile and expandable and will not confine analysis to a pre-existing set of data relationships. Using PROFESS, we were able to quantify homologous protein evolution and determine whether bacterial protein structures are subject to random drift after divergence from a common ancestor. After relevant data have been mined, they may be classified or clustered for further analysis. Data classification is usually achieved using machine-learning techniques. However, in many problems the raw data are already classified according to a set of features but need to be reclassified. Data reclassification is usually achieved using data integration methods that require the raw data, which may not be available or sharable because of privacy and legal concerns. We introduce general classification integration and reclassification methods that create new classes by combining in a flexible way the existing classes without requiring access to the raw data. The flexibility is achieved by representing any linear classification in a constraint database. We also considered temporal data classification where the input is a temporal database that describes measurements over a period of time in history while the predicted class is expected to occur in the future. 
We tested the proposed classification methods on five datasets covering the automobile, meteorological and medical areas and showed significant improvements over existing methods.
Classification, Clustering and Data-Mining of Biological Data
by
Thomas Triplet
A Dissertation
Presented to the Faculty of
The Graduate College at the University of Nebraska
In Partial Fulfillment of Requirements
For the Degree of Doctor of Philosophy
Major: Computer Science
(Bioinformatics)
Under the Supervision of Professor Peter Revesz
Lincoln, Nebraska
November, 2009
I would like to express my sincere gratitude to my thesis advisor Professor Peter
Revesz who first introduced bioinformatics to me. Without his brilliant ideas
and guidance I would not have been able to complete this dissertation. He has
helped me to expand the breadth and depth of my knowledge and research by
providing many insights into my research problems.
I would like to thank Dr. Jitender Deogun, Dr. Mark Griep and Dr. Robert
Powers for serving on my thesis committee. I sincerely appreciate their
inestimable feedback on my research and their valuable comments on my
dissertation. A special thanks to Dr. Jean-Jack Riethoven for giving me the opportunity
to work with him at the Bioinformatics Core Research Facility. I would also like
to thank Matt Shortridge for answering so many of my questions in biology.
I am grateful to the Milton E. Mohr Fellowship at the University of
Nebraska-Lincoln and the Department of Computer Science & Engineering for their
financial support related to this work. I am also especially grateful to Dr. Gregory
Butler from Concordia University (Montreal, QC) for his understanding and his
support while I was finishing my dissertation, in particular during the ADBIS
2009 conference in Riga.
No special thanks to Dell, whose customer service proved to be remarkably
worthless when my laptop burnt three weeks before submitting this dissertation
(for the second time in one year!). The numerous overnight experiments I ran
on my laptop to complete this work must have been too intensive...
Finally, I would like to thank my parents who have provided great support
and encouragement throughout my education. Last, I would like to thank my
adorable fiancée Chloe for her understanding, love, and support throughout all
these years and short nights.
2.2.2 Protein Structure Alignments
2.3 Supervised Machine Learning Algorithms
2.3.1 Linear Classifiers
2.3.2 Support Vector Machines
2.3.3 Decision Trees
2.4 Constraint Databases
2.5 Inverse Distance Weighted Interpolation
3 The PROFESS Database
3.1 Introduction
3.2 Database Integration Problem
3.3 Overview of PROFESS
3.4 Database Content
3.4.1 Functional Annotation of the Protein Data Bank
3.4.2 Functional Level
3.4.3 Phylogenetic Level
3.4.4 Structural Level
3.4.5 Sequence Level
3.5 Local-As-View Data Integration and Database Design
3.6 Functional-Style Query System
3.6.1 The PROFESSor
3.6.2 Functional-Style Query System
3.7 Web User Interface
4 Structural Comparison of Functional Orthologs
4.1 Introduction
4.2 Functional Annotation of Protein Structures
4.3 Pairwise Structure Similarity
4.3.1 Methods
4.3.2 Results
4.4 Phylogenetic Analysis of Functional Orthologs
4.4.1 Methods
4.4.2 Results
4.5 Structure Divergence Rates across Phyla
4.6 Fold dependency on Structure Similarity
4.7 Discussion
5 Experimental Datasets
5.1 CRCars Database
5.2 Google Flu Trends Dataset
5.3 Heart Disease Databases
5.4 Primary Biliary Cirrhosis Trial
5.5 Texas Commission on Environmental Quality Dataset
6 Representation and Querying of Linear Classifiers
6.1 Introduction
6.2 Representation and Querying of SVMs
6.3 Representation and Querying of ID3 Decision Trees
6.4 Representation and Querying of ID3-Interval Decision Trees
7 Data and Classifier Integration
7.1 Introduction
7.2 The Classification Problem with Multiple Sources
7.2.1 Data Integration
7.2.2 Classifier Integration with Constraint Databases
7.3 Experimental Evaluation of the Classifier Integration Method
7.3.1 Experimental Protocol and Results
7.3.2 Discussion
8 Data Reclassification
8.1 Introduction
8.2 The Reclassification Problem
8.3 Reclassification with an Oracle
8.4 Reclassification with Constraint Databases
8.5 Comparison of Reclassification with an Oracle and Constraint Databases
8.5.1 Experimental Results with the CRCARS data set
8.5.2 Experimental Results with the PBC database and Discussion
9 Temporal Classification
9.1 Introduction
9.2 Temporal Classifications with Historical Data
9.3 Experimental Evaluation of the Temporal Classification Method
9.3.1 Experimental Results with TCEQ Data
9.3.2 Experimental Results with Reduced TCEQ Data
9.4 Comparison of the Temporal Classification Method and the IDW Interpolation
9.4.1 Experimental Results with Temporal FLU Data
9.4.2 Experimental Results with Spatio-Temporal FLU Data
9.4.3 Discussion
10 Conclusion
References
2.1 Hydrolysis of the carbohydrate sucrose into the monosaccharides glucose and fructose
2.2 Double complementary strand structure of DNA
2.3 Central dogma of biology
2.4 The four hierarchical levels of protein structures
2.5 General structure of an amino-acid in its zwitterionic form
2.6 Classification of amino-acid properties using a Venn diagram
2.7 Dehydration synthesis of a tripeptide by the formation of two peptide bonds
2.8 Planar peptide groups and torsion angles in a polypeptide
2.9 Ramachandran plot representing stereochemically allowable torsion angles φ/ψ
2.10 3D representation of the multi-helical structure of the myoglobin protein
2.11 Comparison of parallel β-sheets and anti-parallel β-sheets
2.12 Examples of typical protein quaternary structures
2.13 Dynamic programming matrix for the Smith-Waterman algorithm
2.14 Training set classified without error by multiple suitable hyperplanes
2.15 Maximum-margin hyperplane built by an SVM
2.16 Mapping input data to a higher dimensional feature space using kernel tricks
2.17 Efficient representation of a moving square in a constraint database
2.18 Two methods to define the surrounding area around an interpolated point
3.1 Outline of the PROFESS database
3.2 The modules Ligands, Protein Interactions and Functions for COG 329
3.3 Module Function Summary of the PROFESS database
3.4 The modules Structure and Sequence-based Phylogeny for COG 12
8.1 Comparison of the reclassification with an oracle and reclassification with constraint databases methods
8.2 Comparison of the Reclassification with constraint databases and the original ID3 decision tree for the prediction of the fuel efficiency of cars
8.3 Comparison of the Reclassification with constraint databases and the original ID3 decision tree for the prediction of the origin of cars
8.4 Comparison of the Reclassification with constraint databases and the Reclassification with an oracle using the CRCARS database
8.5 Prediction of DISEASE from the PBC data using SVMs
8.6 Prediction of DRUG from the PBC data using SVMs
8.7 Prediction of DISEASE from the PBC data using ID3
8.8 Prediction of DRUG from the PBC data using ID3
8.9 Prediction of DISEASE DRUG with the PBC data using SVMs
8.10 Prediction of DISEASE DRUG with the PBC data using decision trees
9.1 Comparison of the standard and the temporal classification methods
9.2 Comparison of regular and temporal classification using 40 features and SVM
9.3 Comparison of regular and temporal classification using 40 features and ID3
9.4 Comparison of regular and temporal classification using 3 features and SVM
9.5 Comparison of regular and temporal classification using 3 features and ID3
9.6 ROC analyses of IDW and temporal SVMs/ID3 using temporal data
9.7 ROC analyses of IDW and temporal SVM/ID3 using spatio-temporal data
also contain the machinery necessary to store energy, manufacture their components, etc.
Despite their complexity, cells also share structural components that are conserved across
most, if not all, organisms.
The remainder of this review of biological principles is organized as follows: Section 2.1.1
explains in more detail the four classes of biomolecules. Sections 2.1.2 and 2.1.3 give a
general description of the processes that allow the synthesis of proteins from DNA. Finally,
Section 2.1.4 describes in detail the four levels of protein structure.
2.1.1 Biomolecules
All known forms of life depend on four classes of biomolecules:
Carbohydrates or saccharides are the most abundant biomolecules. They play many
roles in cells, from energy storage to structural components.
Lipids are used in cell membranes and as an efficient energy source.
Nucleic acids carry the genetic information of the organism.
Proteins include enzymes that perform most biochemical reactions for cell regulation.
2.1.1.1 Carbohydrates
Carbohydrates are simple organic compounds that have the empirical formula (CH2O)n.
The simplest carbohydrates are called monosaccharides or sugars. Glucose is the six-carbon
monosaccharide used as a basic source of energy by most heterotrophic organisms, that is,
organisms that use organic carbon for growth. Ribose and deoxyribose are the five-carbon
sugars that serve a structural role in the nucleic acids RNA and DNA respectively. Sucrose
(see Figure 2.1†) is a disaccharide composed of glucose and fructose (an isomer of glucose)
and is the major sugar transported between cells in plants, whereas glucose is the primary
sugar transported in animal cells. Most carbohydrate molecules in nature are composed of
hundreds of sugar units and are referred to as polysaccharides.
Other types of molecules, which are beyond the scope of this work, also play a critical role in cells.
†From http://www.bio.miami.edu/cmallery
Figure 2.1: The carbohydrate sucrose is a disaccharide composed of glucose and fructose. In
the presence of water, sucrose can be hydrolyzed by an enzyme called sucrase.
Carbohydrates perform several major functions in living organisms. For example,
monosaccharides serve as readily utilizable energy sources. Carbohydrates also perform
structural roles, such as cellulose in plant cell walls and chitin in the exoskeletons of
arthropods. Surface carbohydrates often form complexes with proteins, known as glycoproteins,
or with lipids. Their great potential for structural diversity, and thus specificity, makes these
molecules very useful as cell markers in cellular communication.
2.1.1.2 Lipids
Lipids are small non-polar molecules that are insoluble in water. The most important feature
of lipids is their ability to form sheetlike membranes. Membranes in both prokaryotic and
eukaryotic cells separate the cellular content from the external environment, thus allowing
the cell to function as a unit. Lipids also serve as highly efficient energy storage molecules.
2.1.1.3 Nucleic Acids
Nucleic acids occur in two forms: deoxyribonucleic acid (DNA) and ribonucleic acid (RNA).
Both are linear unbranched polymers of subunits called nucleotides. DNA is found in the
nucleus of eukaryotes and the cytoplasm or nucleoid of prokaryotes, and is the molecule
that contains the genetic material of the organism. RNA molecules are synthesized on
DNA templates (see Section 2.1.2) and participate in protein synthesis in the cytoplasm (see
Section 2.1.3). DNA is usually found in the form of a helix composed of two complementary
strands whereas RNA is single-stranded.
Figure 2.2: DNA is composed of a phosphate group, a deoxyribose sugar (cyan) and four
nitrogen-containing organic bases: adenine (pink), cytosine (purple), guanine (green) and
thymine (yellow). The DNA molecule is composed of two complementary strands, where
adenine bonds to thymine, and cytosine bonds to guanine.
Each nucleotide consists of three major parts:
1. A five-carbon carbohydrate (pentose).
2. A negatively charged phosphate group, which gives the polymer its acidic property.
3. A nitrogen-containing organic base.
The sugar β-D-ribose is found in the ribonucleotide monomers of RNA. The pentose in
the deoxyribonucleotide monomers of DNA differs only by the absence of oxygen at the #2
carbon and is thus called 2-deoxy-β-D-ribose. The organic bases are of two types:
single-ringed pyrimidines and double-ringed purines. The purines are adenine (A) and guanine (G).
The pyrimidines are cytosine (C), thymine (T) and uracil (U). Thymine is primarily found
in DNA, whereas uracil is found in RNA only.
Historically, DNA was first discovered in 1869 when Johann Friedrich Miescher isolated
a substance he called "nuclein" from the nuclei of white blood cells. By the early 1900s,
the four organic bases of DNA were known. In the 1920s, nucleic acids were classified
into two classes: DNA and RNA. Interestingly, for nearly eighty years, little attention
if any was given to DNA, because it was thought to be a simple polymer incapable of
encoding sophisticated genetic information as hypothesized by the accepted Schrödinger's
code script. Schrödinger's theory was proved to be incorrect in 1944 by Avery et al.
when they demonstrated that genes indeed reside on DNA. In 1950, Chargaff discovered
an exact one-to-one ratio of the adenine/thymine and cytosine/guanine content [21]. A year
later, Wilkins [22] and Franklin [23] obtained the first X-ray images of DNA, which suggested a
helical structure. The actual double helical structure of the DNA molecule was determined
in 1953 by James Watson and Francis Crick [24].
"It has not escaped our notice that the specific pairing we have postulated
immediately suggests a possible copying mechanism for the genetic material."
Watson and Crick, 1953
Each type of base on one strand forms a bond with just one type of base on the other
strand (see Figure 2.2). This is called complementary base pairing. Purines form hydrogen
bonds to pyrimidines, with adenine pairing only with thymine, and cytosine only with
guanine. This arrangement of two nucleotides binding together across the double helix is
called a base pair. As hydrogen bonds are not covalent, they can be broken and rejoined
relatively easily. DNA strands start at the 5'-end of the nucleotide chain and terminate
at the 3'-end of the molecule. The sugar-phosphate backbone is on the outside of the helix
where the polar phosphate groups interact with the environment. The nitrogen-containing
bases are inside, stacking perpendicular to the helix axis. In contrast, RNA molecules
are synthesized as single-stranded molecules. The single RNA strand may however fold
onto itself and form complementary base pairs to make unique secondary structures. Such
Schrödinger suggested that the rules defining life were encrypted in a complex instruction book.
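The complementary base pairing rules described above (adenine with thymine, cytosine with guanine) can be illustrated with a minimal Python sketch. The function names and the example sequence are hypothetical and chosen only for illustration. The sketch builds the complementary strand of a short DNA sequence and checks that the resulting duplex contains equal amounts of A and T and of C and G, as in Chargaff's observation.

```python
# Watson-Crick pairing rules for DNA as stated in the text: A-T and C-G.
DNA_PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand):
    """Return the complementary strand, written 5'->3'.

    The two strands of a duplex are antiparallel, so the complement
    is read in the reverse direction.
    """
    return "".join(DNA_PAIRS[base] for base in reversed(strand.upper()))


def duplex_base_counts(strand):
    """Count each base over both strands of the duplex."""
    duplex = strand.upper() + reverse_complement(strand)
    return {base: duplex.count(base) for base in "ACGT"}


strand = "ATGCGTTA"  # hypothetical example sequence
print(reverse_complement(strand))
print(duplex_base_counts(strand))
```

Because every base on one strand is matched by its partner on the other, the A/T and C/G counts over the whole duplex are equal by construction, whatever the input sequence.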
Figure 2.3: The central dogma of molecular biology stipulates that the genetic information
in DNA, which can self-replicate, is used to make RNA molecules through a process known
as transcription, and that the information in RNA is used to synthesize functional proteins
through a process called translation.
responsible for our body's defense, is based on specific structure recognition. At a
molecular level, such recognition processes consist of protein-protein interactions on the surface of
immune system cells.
Proteins typically contain thousands of atoms, which makes their structure difficult to decipher. In
order to simplify protein description, four structural levels are usually documented. These
hierarchical levels describe increasingly complex levels of protein structure and are referred to as
the primary, secondary, tertiary and quaternary structures of proteins. These structure levels
are described in greater detail in Section 2.1.4.
2.1.2 Transcription: from DNA to RNA
Shortly after the double helical structure of DNA was discovered, the hypothesis was accepted
that nuclear DNA functions as the template for mRNA molecules, which subsequently move to the
cytoplasm where they are used to determine the amino-acid sequence of proteins.
This pathway for the flow of genetic information was referred to by Crick & Watson [27]
in 1956 as the central dogma of biology (Figure 2.3). Note that the arrows in this pathway
are unidirectional, meaning that polypeptide templates are never used to synthesize mRNA
strands and mRNA templates are not used to synthesize DNA strands. More than 50 years
later, the central dogma remains essentially valid.
The synthesis of genetic material from RNA strands may actually occur on rare occasions. Retroviruses
such as the Human Immunodeficiency Virus are examples of this process, also known as retro-transcription.
The genetic information in DNA, or gene, is used to synthesize mRNA molecules
through a process known as transcription, which is carried out by RNA polymerase enzymes.
The information for synthesizing a specific mRNA strand is located in only one of the two
DNA strands. The strand that actually contains the usable genetic information to make
the mRNA molecule is called the template (or sense) strand. Its complementary strand
is usually called the nonsense strand, because it contains no useful information for the
synthesis of this specific mRNA. Note that DNA templates coding for mRNAs are not all
on the same DNA strand.
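As a rough illustration of how an mRNA strand is derived from a DNA template, the following Python sketch pairs each template base with its RNA partner (uracil replaces thymine in RNA). It assumes the template is written in the 3'->5' direction so that the mRNA comes out 5'->3'. The function name and the example sequence are hypothetical, and the sketch ignores promoters, termination signals and the polymerase itself.

```python
# Base pairing during transcription: each DNA template base is matched
# by its RNA partner, with uracil (U) taking the place of thymine.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_3to5):
    """Return the mRNA (5'->3') complementary to a DNA template strand.

    Assumption: the template is given in the 3'->5' direction, so the
    mRNA can be built base by base in the same left-to-right order.
    """
    return "".join(DNA_TO_RNA[base] for base in template_3to5.upper())


template = "TACCGGATT"  # hypothetical template strand, written 3'->5'
print(transcribe(template))  # AUGGCCUAA
```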
Transcription in prokaryotic cells differs from transcription in eukaryotic cells. A major
difference is that the genetic material in eukaryotic cells is located in a well-defined
nucleus. Transcription in eukaryotic cells hence occurs in the nucleus. mRNA is subsequently
transported to the cytoplasm to be translated.
Another major difference is that eukaryotic genes usually alternate coding sequences
called exons and non-coding sequences called introns. Transcription in eukaryotic cells
hence produces pre-mRNA strands rather than mRNA. After transcription, pre-mRNA
undergoes significant processing before being transported to the cytoplasm. The 5' end
is capped with a 7-methylguanine, which ensures stability during translation. The 3' end
is polyadenylated. This addition of a poly-A tail at the 3' end plays various roles, from
protection against enzymatic degradation to transcription termination. Finally, the pre-mRNA is
converted into mRNA by the excision of introns and the splicing of the remaining exons.
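The excision of introns and the addition of a poly-A tail can be sketched in a few lines of Python. This is a toy model under stated assumptions: exon positions are supplied by hand as half-open index ranges (real splice sites are recognized by the cellular machinery, not given in advance), and the sequence, coordinates and tail length are hypothetical. The 5' cap is a chemical modification and is not modelled.

```python
def splice(pre_mrna, exons):
    """Join the exon segments of a pre-mRNA, discarding the introns.

    `exons` is a list of half-open (start, end) index pairs, assumed to
    be non-overlapping; they are joined in order of position.
    """
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


def polyadenylate(mrna, tail_length=10):
    """Append a poly-A tail of the given length at the 3' end."""
    return mrna + "A" * tail_length


# Hypothetical pre-mRNA: exon 1, one intron, exon 2.
exon1, intron, exon2 = "AUGGCC", "GUAAGUCAG", "UUUUAA"
pre_mrna = exon1 + intron + exon2
exon_ranges = [(0, 6), (15, 21)]

mature = polyadenylate(splice(pre_mrna, exon_ranges), tail_length=5)
print(mature)  # AUGGCCUUUUAAAAAAA
```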
2.1.3 Translation: from RNA to Proteins using the Genetic Code
Translation is the production of proteins by decoding the mRNA produced during the
transcription (Section 2.1.2) of DNA. Translation is performed in the cytoplasm by a complex
biological machinery, which includes in particular large rRNA-protein complexes called ribosomes.
Ribosomes are made of two subunits which surround the messenger RNA (mRNA) and
produce a specific polypeptide according to the rules specified by the genetic code.
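To make the decoding step concrete, here is a minimal Python sketch of translation using the standard genetic code. The 64-letter string lists the amino acid (one-letter code, `*` for stop) for every codon in the conventional U, C, A, G ordering. The function name and the simplifications (translation starts at the first AUG, stops at the first stop codon, no ribosome or tRNA is modelled) are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code: codons enumerated with bases in the order
# U, C, A, G (third position varying fastest); '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA into a one-letter amino-acid string.

    Simplifying assumptions: reading starts at the first AUG codon and
    ends at the first stop codon or when fewer than three bases remain.
    """
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCCUAA"))  # MA
```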
The genetic code, which defines the codons coding for a specific amino-acid, was
determined experimentally. The problem to solve was the following: given the existence of
twenty amino-acids and only four bases, how should nucleotides be grouped to encode amino-acids?
Pairs of nucleotides could specify only 16 (= 4 × 4) amino-acids and are therefore
insufficient. Hence, as early as 1954, attention focused on triplets of nucleotides because they
allow 64 (= 4 × 4 × 4) possible permutations, enough to encode the 20 amino-acids. This
Table 2.2 (continued): Main structural and functional roles of the 20 standard amino-acids
Name | Abbr. | Main structural and functional roles in proteins
Glutamine | Q - Gln | Similar to asparagine.
Glycine | G - Gly | Glycine is a unique amino acid because it contains a hydrogen as its side chain (instead of a carbon). This allows a greater conformational flexibility of the protein. Glycine can reside in parts of protein structures that are forbidden to all other amino acids and use its sidechain-less backbone to bind to phosphates.
Histidine | H - His | Histidines are the most common amino acids in protein active or binding sites because they have a pKa close to physiological pH. They are very common in metal binding sites, often acting together with cysteines.
Isoleucine | I - Ile | Similar to leucine. In addition, like threonine and valine, isoleucine is C-beta branched. It is therefore more restricted in the conformations the main chain can adopt. It often lies within beta-sheets.
Leucine | L - Leu | The hydrophobic leucine is more likely to be buried in protein hydrophobic cores. The side chain is fairly non-reactive, and is thus rarely directly involved in protein function, though it can play a role in substrate recognition. In particular, it can be involved in binding/recognition of hydrophobic ligands such as lipids.
Lysine | K - Lys | Similar to arginine.
Methionine | M - Met | Similar to leucine. Methionine is always the first amino acid of a protein. Methionine also contains a sulphur atom that can be involved in binding to atoms such as metals. It is however connected to a methyl group, making it less reactive than the sulphur in cysteines.
Phenylalanine | F - Phe | Phenylalanines prefer to be buried in protein hydrophobic cores. The aromatic side chain also means that phenylalanines can be involved in stacking interactions with other aromatic side chains.
Proline | P - Pro | Proline is unique because its side chain is connected to the protein backbone twice. This important difference means that proline is unable to occupy many of the main-chain conformations. Hence, proline is often found in very tight turns in protein structures. It can also function to introduce kinks into alpha helices, since it is unable to adopt a normal helical conformation.
Serine | S - Ser | Serine can reside both within the interior of a protein and on the protein surface. Its small size means that it is relatively common within tight turns on the protein surface, where its side-chain hydroxyl oxygen can form a hydrogen bond with the protein backbone. The hydroxyl group is fairly reactive, being able to form hydrogen bonds with a variety of polar substrates.
Threonine | T - Thr | Similar to serine. In addition, like valine and isoleucine, it is C-beta branched, which restricts the conformations the protein can adopt.
Tryptophan | W - Trp | Similar to phenylalanine. The nitrogen of tryptophan can play a role in binding to non-protein atoms, but such instances are rare.
Tyrosine | Y - Tyr | Similar to phenylalanine. However, unlike phenylalanine, tyrosine contains a reactive hydroxyl group, thus making it much more likely to be involved in interactions with non-protein atoms.
Valine | V - Val | Similar to leucine. In addition, like isoleucine and threonine, it is C-beta branched, which restricts the conformations the protein can adopt.
Figure 2.8: Short section of polypeptide chain showing the planar peptide groups and
identifying the torsion angles φ and ψ.
and the amine of the peptide group to lie on the same plane (see Figure 2.8). Therefore,
the only source of conformational freedom that the polypeptide possesses comes from the
torsional rotation around its single bonds. There are only two remaining single bonds per
residue along the main chain, and with these bonds one may associate the torsion angles
φ and ψ. Not all combinations of φ and ψ are stereochemically possible, since many lead to
steric hindrance. Ramachandran [41,42] showed that only about one third of the φ/ψ space
is stereochemically accessible to amino acid residues in a real polypeptide. The φ/ψ space
can be represented in a two-dimensional plane, also known as the Ramachandran plot
(see Figure 2.9), where the two variables φ and ψ may vary between −180° and +180°. It
should be noted that the glycine side chain is composed of a single hydrogen atom. Hence it is less
restrictive. This can be visualized in the Ramachandran plot where the allowable area is
considerably larger when glycine is part of the polypeptide (lighter shade).
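A torsion angle such as φ or ψ is the dihedral angle defined by four consecutive backbone atoms (conventionally C, N, Cα, C for φ and N, Cα, C, N for ψ). The Python sketch below shows one common way of computing a dihedral angle from four 3D points, returning a value in the −180° to +180° range used by the Ramachandran plot. The function and helper names are hypothetical, and the coordinates in the usage example are synthetic points rather than real atom positions.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion_angle(p0, p1, p2, p3):
    """Dihedral angle in degrees, in (-180, 180], around the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Unit vector along the central bond.
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Components of the outer bonds perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Synthetic points: the last bond is rotated a quarter turn about the z axis.
print(torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Using `atan2` rather than `acos` keeps the sign of the rotation, which is what distinguishes, for example, the left and right halves of the Ramachandran plot.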
Pauling et al.43,44,45,46,47 analysed the geometry and dimensions of the peptide bonds in
the crystal structures of molecules containing either one or a few peptide bonds. Two main
classes of patterns occur in proteins: α-helices and β-sheets. Any pattern that does not fall in one
of those two classes is called random coil. On average, about 50% of the amino acids are in
a secondary structure, among which 54% are in an α-helix and 46% are in a β structure.
Table 2.3 summarizes the values of the torsion angles for the main secondary structures.
From http://www.ncbi.nlm.nih.gov/
Figure 2.10: A representation of the 3D structure of the myoglobin protein. The eight
α-helices are shown in color and represent 70% of the protein's structure.
Scholtz 48 showed that methionine, alanine, leucine, uncharged glutamate, and lysine have
high helix-forming propensities, whereas proline, glycine and negatively charged aspartate
have poor helix-forming propensities. Proline tends to kink, or break, helices because it
has no amide hydrogen to donate. However, proline is often seen as the first residue of a
helix.
A β-strand is a polypeptide segment where the torsion angle is about 120°. As a
result, the side chains of two neighboring residues in this segment point in opposite
directions from the backbone. A β-sheet is composed of two or more strands, linked together
by hydrogen bonds between amine groups of one strand and carbonyl groups on the other
strand. If the strands all run in one direction, the sheet is called a parallel β-sheet whereas
Figure 2.11: Comparison of parallel β-sheets and anti-parallel β-sheets. Hydrogen bonds
between strands in parallel β-sheets are not straight and thus weaker.
Figure 2.12: Examples of typical protein quaternary structures. Proteins may be composed
of two (dimers), three (trimers) or more polypeptides, giving the protein its final shape.
From Protein Structure and Function34, reproduced with BioMed Central's authorization.
relatively slow running time. Section 2.2.1.3 presents a very popular heuristic called Basic
Local Alignment Search Tool55 (BLAST) which dramatically increases the throughput.
2.2.1.1 Scoring Schema
When comparing protein sequences, one is looking for evidence that the sequences have
diverged from a common ancestor by a process of mutation/selection. Three basic types
of mutations are considered: substitutions, when a residue is changed for another one,
insertions and deletions, when a residue is added or removed from one of the two sequences.
Insertions and deletions are referred to as gaps.
The total score of the alignment is the sum of the scores of the individual mutations. As mentioned
in Section 2.1.4.1, some amino acid substitutions are more likely to occur than others,
based on their chemical properties. For example, both serine and threonine have a reactive
hydroxyl group, which easily forms hydrogen bonds with a variety of polar substrates.
Hence, effective substitutions of serine for threonine are expected to occur quite frequently.
In contrast, arginine usually interacts with anions whereas aspartate usually interacts
with cations. Hence, such a substitution is not expected to happen often. These substitution
likelihoods are usually represented in the form of a table, called a substitution matrix. Each
matrix is twenty-by-twenty (for the twenty standard amino acids); the value in a given cell
represents the probability of a substitution of one amino acid for another.
Point Accepted Mutation (PAM) matrices
Dayhoff et al. 56 developed one of the first substitution matrices, referred to as the Point Accepted
Mutation (PAM) matrix. The PAM matrices were derived from 1,572 observed mutations in
71 families of closely related proteins. The PAM matrices are normalized so that the PAM1
matrix corresponds to one mutation per hundred amino acids, and is appropriate for scoring
sequences which are very similar. PAM matrices for comparing sequences of lower similarity are
calculated by repeated multiplication of the PAM1 matrix by itself. PAM2 is equivalent
to two substitutions per hundred amino acids and is defined by $\mathrm{PAM2} = \mathrm{PAM1}^2$. PAM30
and PAM70 are commonly used in practice.
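As an illustration, the extrapolation from PAM1 to higher PAM numbers can be sketched as a repeated matrix product. The example below is a minimal sketch on a hypothetical three-letter alphabet; the transition probabilities in `TOY_PAM1` are invented for the example and are not Dayhoff's values, and the function names are ours.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Return the PAM-n mutation probability matrix, i.e. pam1 to the n-th power."""
    size = len(pam1)
    # Start from the identity matrix (no mutation at all).
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


# Hypothetical PAM1-like matrix over a toy alphabet {A, B, C}:
# each row sums to 1 and about 99% of the residues are unchanged.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

toy_pam2 = pam_n(TOY_PAM1, 2)
toy_pam70 = pam_n(TOY_PAM1, 70)
```

In this toy setting, the diagonal entries shrink as the PAM number grows, which reflects the larger evolutionary distance that the matrix is meant to model.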
BLOcks of Amino Acid SUbstitution (BLOSUM) matrices
Another popular family of substitution matrices, BLOcks of Amino Acid SUbstitution, was
introduced by Henikoff & Henikoff 57 , who scanned the BLOCKS database58 for gapless
conserved regions of protein families and then counted the relative frequencies of amino acids
and their substitution probabilities. BLOSUM matrices are based on observed alignments
and, unlike the PAM matrices, are not restricted to closely related proteins. Several BLOSUM
matrices were built, using different degrees of protein conservation: the conservation percentage
used was appended to the name. For example, BLOSUM80 corresponds to the matrix built
after clustering sequences that were more than 80% identical. Raw substitution probabilities are
on average 1/20. The log-odds score $s_{ij}$ for each of the 210 possible substitutions of the twenty
standard amino acids was calculated using Equation 2.1.
\[ s_{ij} = \frac{1}{\lambda} \log\left( \frac{p_{ij}}{q_i q_j} \right) \tag{2.1} \]
where $p_{ij}$ is the probability of two amino acids $i$ and $j$ replacing each other in a homologous
sequence, and $q_i$ and $q_j$ are the background probabilities of finding the amino acids $i$ and $j$
in any protein sequence at random. The constant $\lambda$ is a scaling factor, such that the matrix
contains easily computable integer values.
Table 2.4: The BLOSUM62 substitution matrix. Highest scores represent the most conservative
substitutions. Note that the matrix is symmetric and that the highest scores are on the main diagonal.
Log-odds ratios are a convenient way to represent small probabilities in a "human-readable"
format. BLOSUM62 (see Table 2.4) turned out to give the best results in practice.
Note that the matrix is symmetric, meaning that the probability that the amino acid $i$
mutates into $j$ is equal to the probability that $j$ mutates into $i$.
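To make the log-odds formula of Equation 2.1 concrete, the following minimal sketch computes rounded scores for three invented cases. The probabilities, the uniform background frequency of 1/20 and the choice of a scaling factor that expresses scores in half-bits are assumptions made for the illustration only.

```python
import math


def log_odds_score(p_ij, q_i, q_j, lam):
    """Equation 2.1: s_ij = (1 / lam) * log(p_ij / (q_i * q_j)), rounded to an integer."""
    return round(math.log(p_ij / (q_i * q_j)) / lam)


# Hypothetical values chosen for illustration only.
LAMBDA = math.log(2) / 2   # assumed scaling: scores expressed in half-bits
Q = 0.05                   # assumed uniform background frequency, 1/20

favoured = log_odds_score(0.01, Q, Q, LAMBDA)         # pair seen 4x more often than chance
neutral = log_odds_score(0.0025, Q, Q, LAMBDA)        # pair seen as often as chance
disfavoured = log_odds_score(0.000625, Q, Q, LAMBDA)  # pair seen 4x less often than chance
```

A pair observed more often than expected by chance receives a positive score, a pair observed less often a negative one, and a pair observed exactly as often as chance a score of zero.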
2.2.1.2 Smith-Waterman Algorithm
Given a scoring scheme and two sequences, we need to find the optimal alignment. Assuming
both sequences are composed of $n$ residues, there are†:
\[ \binom{2n}{n} = \frac{(2n)!}{(n!)^2} \simeq \frac{2^{2n}}{\sqrt{\pi n}} \tag{2.2} \]
possible alignments between two sequences of length $n$. The average length of a protein
is about 300 residues, leading to $1.4 \times 10^{179}$ alignments to consider. Optimistically assuming
that 1,000,000 alignments can be computed every second, it would take $4.3 \times 10^{165}$ years
to enumerate all the possible alignments, that is, $3.1 \times 10^{155}$ times the age of the known
universe. A brute force approach, consisting of enumerating all possible alignments and
then choosing the best one, is clearly not feasible. Dynamic programming algorithms must
be used instead.
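The orders of magnitude quoted above can be checked with a few lines of code. This is a minimal sketch using only the Python standard library; the year length of 365.25 days is our assumption.

```python
import math

n = 300  # typical protein length assumed in the text

exact = math.comb(2 * n, n)                        # binomial coefficient C(2n, n)
stirling = 2 ** (2 * n) / math.sqrt(math.pi * n)   # right-hand side of Equation 2.2

seconds_per_year = 365.25 * 24 * 3600
years = exact / 1_000_000 / seconds_per_year       # at one million alignments per second

print(f"exact    ~ {exact:.1e}")
print(f"stirling ~ {stirling:.1e}")
print(f"years    ~ {years:.1e}")
```

For n = 300, the exact binomial coefficient and its Stirling approximation differ by less than 0.1%.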
The first dynamic programming algorithm for global sequence alignment was devised
by Needleman & Wunsch 61 . The algorithm was later improved by Gotoh 62 . However,
protein sequences usually contain a number of irrelevant residues at the extremities of the
polypeptide chain. Hence, Smith & Waterman 54 proposed a local sequence alignment
algorithm, also based on a dynamic programming approach, which computes the best
alignment of subsequences from both proteins. Local alignments are usually favored because they
are more sensitive in capturing specific conserved domains. When using global alignment
methods, such domains may be more difficult to detect because of noisy mutations in the
remainder of the sequence.
* Styczynski et al. 59 showed in 2008 that the BLOSUM62 matrix used as a standard since 1996 is not
exactly accurate according to the algorithm described by Pietrokovski et al. 58 .
† Using Stirling's approximation of large factorials60
The idea of the algorithm is to build an optimal alignment using previously computed optimal
alignments of subsequences. The final alignment is hence computed by (1) solving the
problem for shorter, and easier to compute, subsequences, and (2) combining the individual
solutions of the smaller alignments. The alignment is computed by constructing a matrix
$D$, indexed by $i \in [1..n]$ and $j \in [1..m]$, one index representing each of the two sequences
of length $n$ and $m$ respectively. Let $x_{1..i}$ be the subsequence of $x$ from the first to the $i$th
residue. The value $D(i, j)$ represents the score of the best alignment between $x_{1..i}$ and $y_{1..j}$
and can be computed recursively as shown in Equation 2.3.
\[ D(i, j) = \max \begin{cases} 0 \\ D(i-1, j-1) + s(x_i, y_j) \\ D(i-1, j) - d \\ D(i, j-1) - d \end{cases} \tag{2.3} \]
where $s(x_i, y_j)$ is the score for the substitution of residue $i$ of $x$ by residue $j$ of $y$ given by
the substitution matrix and $d$ is the linear cost for gaps. The option "0" corresponds
to starting a new alignment: if the score of the best alignment becomes negative, it is better to start
a new one instead of extending the old one. The initial conditions for $i = 0$ or $j = 0$ are
defined by $D(i, j) = 0$. Given $D$, the alignment can then be easily obtained by retrieving
the path in the matrix that was followed to reach the highest-scoring cell of $D$. This process is
called backtracking.
For example, consider x = PAWHEAE and y = HEAGAWGHEE, using the BLOSUM62 scoring
matrix and a linear gap penalty of 5. The resulting matrix $D$ is shown in Figure 2.13. Using
the matrix $D$, we can infer the local alignment:
AWGHE
AW-HE
The complexity of the Smith-Waterman algorithm is quadratic, $O(nm)$, with $n$ and $m$ the lengths
of the two input sequences, which greatly improves on the exponential complexity of the brute
force approach. In addition, this algorithm is "correct" in the sense that it guarantees to
find the optimal alignment given a scoring scheme.
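The recurrence of Equation 2.3 and the backtracking step can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the scoring function `toy_score` (+5 for identical residues, -4 otherwise) is a hypothetical stand-in for BLOSUM62, and the function names are ours.

```python
def smith_waterman(x, y, score, d):
    """Local alignment of x and y following Equation 2.3.

    score(a, b) returns the substitution score and d is the linear gap cost.
    Returns (best_score, aligned_x, aligned_y).
    """
    n, m = len(x), len(y)
    # D has an extra row and column of zeros for the initial conditions.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(
                0,
                D[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                D[i - 1][j] - d,
                D[i][j - 1] - d,
            )
            if D[i][j] > best:
                best, best_pos = D[i][j], (i, j)

    # Backtrack from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    ax, ay = [], []
    while i > 0 and j > 0 and D[i][j] > 0:
        if D[i][j] == D[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif D[i][j] == D[i - 1][j] - d:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


def toy_score(a, b):
    """Hypothetical scoring: +5 for identical residues, -4 otherwise (not BLOSUM62)."""
    return 5 if a == b else -4


result = smith_waterman("AWGHE", "AWHE", toy_score, d=5)
```

With this toy scoring, aligning AWGHE with AWHE yields four matches and one gap, that is, the pair AWGHE / AW-HE with a score of 15.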
Figure 2.13: Dynamic programming matrix for the Smith-Waterman alignment of sequences
PAWHEAE and HEAGAWGHEE using the BLOSUM62 scoring matrix and a linear gap
penalty of 5. $D(i, j)$ represents the best alignment between $x_{1..i}$ and $y_{1..j}$. The backtrack
path used to construct the local alignment is shaded.
Variants of the original pairwise algorithm have been proposed for multiple sequence
alignment, which allows one to align several sequences together. However, the multiple
sequence alignment problem was proved to be NP-complete63,64. To reduce the complexity of
the problem, heuristics have been proposed and implemented in ClustalW65,66, which we
used in our structural comparison of functional orthologs (see Chapter 4).
2.2.1.3 Basic Local Alignment Search Tool: BLAST
However, in some cases, the complexity improvement may not be sufficient. In particular,
sequence alignments are typically used to match a sequence of interest with a sequence in a
database of known proteins. Such databases usually contain millions of protein sequences.
In that case, one must perform pairwise alignments between the protein of interest and each
of the proteins within the database. For this reason, there have been many attempts to
produce non-optimal but faster algorithms67,68,69. However, those algorithms did not perform
well when used with standard scoring matrices. Hence, new heuristics were developed, in
particular FAST-All70 and the Basic Local Alignment Search Tool55 (BLAST). The latter
was used to build our PROFESS database (see Chapter 3).
The idea behind the BLAST heuristic is that true alignments are likely to contain short
highly conserved subsequences or identities. BLAST hence looks for such subsequences,
or seeds, which can then be extended. The seeds are normally kept short so that a table
with all possible seeds can be preprocessed and used later as a lookup table. Then, BLAST
tries to extend seeds, that is, words of a given length (3 by default for protein sequences,
11 for nucleic acid sequences) that match the query sequence with a score higher than
a given threshold. This process is referred to as hit extension. The algorithm terminates
whenever the score drops below a parametrized expectation threshold. The original BLAST
algorithm only found ungapped alignments. However, more recent versions are able to
output gapped alignments71,72,73 and have greatly improved performance when querying large
sequence databases74.
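The seeding step can be illustrated with a simplified sketch. The code below only looks for exact shared words of length 3 through a lookup table; scoring neighboring words against a threshold and extending the hits, as BLAST does, are deliberately left out, and the function names are hypothetical.

```python
from collections import defaultdict


def build_word_index(sequence, k=3):
    """Map every word of length k in `sequence` to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where the two sequences share a word."""
    index = build_word_index(query, k)
    seeds = []
    for s_pos in range(len(subject) - k + 1):
        for q_pos in index.get(subject[s_pos:s_pos + k], []):
            seeds.append((q_pos, s_pos))
    return seeds


seeds = find_seeds("PAWHEAE", "HEAGAWGHEE")
```

For the two example sequences used earlier, the only shared word of length 3 is HEA, found at position 3 of the query and position 0 of the subject (counting from zero). Building the index once makes each database lookup a constant-time dictionary access, which is the source of the speed-up over a full dynamic programming comparison.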
2.2.2 Protein Structure Alignments
In addition to aligning protein sequences, there have been a number of attempts to align
the three-dimensional structures of proteins. Structural alignments provide valuable additional
information, as a single mutation in the amino acid sequence can dramatically change the
corresponding 3D structure of the protein. For example, leucine is often found in alpha
helices. A mutation of this amino acid into proline is likely to kink the alpha helix because
of the unique structure of proline.
These algorithms take as input the atomic coordinates of the proteins to align, and
output the superposed atomic coordinates. The algorithms also output the minimal root
mean square deviation (RMSD) between the structures, which measures the structural
divergence of aligned proteins. When aligning structures with significantly divergent
sequences, most structural alignment methods consider only the backbone atoms included in
the peptide bond. The coplanarity of the peptide bond is used to maximize throughput and
the α-carbon coordinates alone are usually considered for the alignment. The remaining
atoms, in particular the side chains, are used to generate the final alignment only when
the RMSD drops below a given threshold, that is, when protein structures are similar
enough. Like multiple sequence alignment, multiple structure alignment was proved to be
NP-complete. Hence, approximate polynomial-time solutions have been devised by Kolodny
& Linial 75 and Zhu 76 . Ye & Godzik 77 also proposed a solution that utilizes graph theory.
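As an illustration of the RMSD measure, the following minimal sketch computes the deviation between two sets of coordinates that are assumed to be already superposed and paired atom by atom. Finding the optimal superposition, which is the hard part of structural alignment, is not shown, and the coordinates are invented for the example.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z) points.

    The two structures are assumed to be already superposed and paired atom by atom.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("both structures need the same, non-zero number of atoms")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical alpha-carbon coordinates (in angstroms), for illustration only.
structure_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
structure_b = [(3.0, 4.0, 0.0), (6.8, 4.0, 0.0), (10.6, 4.0, 0.0)]
deviation = rmsd(structure_a, structure_b)
```

Here every atom of the second structure is displaced by 5 angstroms from its partner, so the RMSD is 5 (up to floating-point rounding).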
Several pieces of software for structural alignment have been implemented. The most popular
programs include Dali78,79, Combinatorial Extension80 (CE), MAtching Molecular Models
Obtained from THeory81 (MAMMOTH) and its multiple alignment extension
MAMMOTH-mult82, and Sequential Structure Alignment Program83 (SSAP), which was used to build the
Class, Architecture, Topology, Homology (CATH) database51. Our PROFESS database was
constructed in part by using Dali and MAMMOTH-mult.
2.3 Supervised Machine Learning Algorithms
In many problems, we need to classify items, that is, we need to predict some characteristic
of an item based on several of its parameters. Each parameter is represented by a variable
which can take a numerical value. Each variable is called a feature and the set of variables
is called the feature space. The number of features is the dimension of the feature space.
The actual characteristic of the item we want to predict is called the label or class of the
item.
To make the predictions, we use classifiers. Each classifier maps a feature space X to
a set of labels Y . Classifiers are built using machine learning algorithms, which are able to
automatically improve by the analysis of data sets, i.e. they learn by experience. Speech
and handwriting recognition are typical applications of machine learning approaches. In
computational biology, various machine learning techniques have also been used successfully,
for example neural networks for detection of signal peptides in proteins84, Hidden Markov
models for protein homology detection85 and stochastic context free grammars for modeling
and prediction of RNA secondary structures25,86.
In this work, we are interested in linear classifiers, that is, classifiers that can be
mathematically defined by a linear equation. We also assume that the set of labels
used during the training stage is known a priori, a setting known as supervised learning.
Hence, this work does not consider the variety of unsupervised learning algorithms, which
include in particular clustering and segmentation techniques.
After briefly introducing linear classifiers in Section 2.3.1, we describe two linear
classifiers: Support Vector Machines in Section 2.3.2 and Decision Trees in Section 2.3.3.
2.3.1 Linear Classifiers
A linear classifier maps a feature space X to a set of labels Y by a linear function. In other
words, a linear classifier computes the label of an item to classify using a linear function.
In general, a linear classifier $f(\vec{x})$ can be expressed as follows:
\[ f(\vec{x}) = \langle \vec{w} \cdot \vec{x} \rangle + b = \sum_i w_i x_i + b \tag{2.4} \]
where $w_i \in \mathbb{R}$ are the weights of the classifier and $b \in \mathbb{R}$ is a constant. The value of $f(\vec{x})$ for
any item $\vec{x}$ directly determines the predicted label, usually by a simple rule. For example,
in binary classification, if $f(\vec{x}) \geq 0$, then the label is $+1$, else the label is $-1$. Note that
the knowledge of the weights $w_i$ is necessary and sufficient to define the linear classifier $f$.
Example 2.1 Suppose that a disease is conditioned by two antibodies A and B. The feature
space is $X = \{\text{Antibody A}, \text{Antibody B}\}$ and the set of labels is $Y = \{\text{Disease}, \text{No Disease}\}$,
where Disease corresponds to $+1$ and No Disease corresponds to $-1$. A linear classifier is:
\[ f(\{\text{Antibody A}, \text{Antibody B}\}) = w_1\,\text{Antibody A} + w_2\,\text{Antibody B} + b \]
where $w_1, w_2 \in \mathbb{R}$ are constant weights and $b \in \mathbb{R}$ is a constant. We can use the value of
$f(\{\text{Antibody A}, \text{Antibody B}\})$ as follows:
If $f(\{\text{Antibody A}, \text{Antibody B}\}) \geq 0$ then the patient has Disease.
If $f(\{\text{Antibody A}, \text{Antibody B}\}) < 0$ then the patient has No Disease.
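The decision rule of Example 2.1 can be written in a few lines of code. In this minimal sketch, the weights, the bias and the antibody levels are hypothetical; in practice the weights would be learned from training data.

```python
def linear_classifier(weights, bias):
    """Build f(x) = sum_i w_i * x_i + b and the associated +1 / -1 decision rule."""
    def f(x):
        return sum(w * xi for w, xi in zip(weights, x)) + bias

    def predict(x):
        return +1 if f(x) >= 0 else -1

    return f, predict


# Hypothetical weights and bias; real values would come from training.
f, predict = linear_classifier(weights=[0.5, -1.0], bias=-2.0)

LABELS = {+1: "Disease", -1: "No Disease"}
patient_1 = LABELS[predict([10.0, 1.0])]   # f = 5 - 1 - 2 = 2  -> "Disease"
patient_2 = LABELS[predict([2.0, 3.0])]    # f = 1 - 3 - 2 = -4 -> "No Disease"
```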
The set of linear classification methods includes a number of algorithms, ranging from
generative probabilistic models such as Naive Bayes algorithms, where a model is built using
some data during the training stage, to discriminative models such as Logistic Regression
algorithms, Decision Trees or Support Vector Machines, where the model is obtained by
applying constraints derived from the training data. The methods proposed in this work are
applicable to any linear classifier in general. During our experiments, we used support vector
machines and decision trees, which are described in Sections 2.3.2 and 2.3.3 respectively.
Figure 2.14: A set of training examples with labels $+1$ and $-1$. This set is
linearly separable because a linear decision function in the form of a hyperplane can be
found that classifies all examples without error. Two possible hyperplanes that both classify
the training set without error are shown (solid and dashed lines). The solid line is expected
to be a better classifier than the dashed line because it has a wider margin, which is the
distance between the closest points and the hyperplane.
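equals $2/\|\vec{w}\|$ as shown in Figure 2.15. Maximizing the margin is then equivalent to solving
the following optimization problem:
\[ \min_{\vec{w} \in \mathbb{R}^n,\; b \in \mathbb{R}} \; \frac{1}{2} \|\vec{w}\|^2 \tag{2.6} \]
subject to:
\[ |f(\vec{x_i})| \geq 1 \tag{2.7} \]
Equation 2.6 leads to the following solution:
\[ f(\vec{x}) = \sum_{i=1}^{l} \alpha_i y_i \langle \vec{x_i} \cdot \vec{x} \rangle + b \tag{2.8} \]
where $\alpha_i$ are positive real coefficients and $y_i \in \{+1, -1\}$ is the label of $\vec{x_i}$.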
However, in practice, data is rarely linearly separable, and Equation 2.8 may not be applicable.
In 1995, Vapnik12 combined the technique described above with a mathematical method
called the kernel trick, introduced in 1964 by Aizerman et al.89. The kernel trick consists of
a mapping function (see Figure 2.16) such that data which is not linearly separable in
the input feature space may be linearly separable in a higher-dimensional feature space.
Figure 2.16: Mapping input data into a higher-dimensional feature space. Problems that
are not linearly separable in input space can be linearly separable in feature space.
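Let $\vec{x} = (x_1, x_2, \ldots, x_n)$. Combining Equations 2.9 and 2.10, $f(\vec{x})$ becomes:
\begin{align*}
f(\vec{x}) &= \sum_{i=1}^{l} \left( \alpha_i y_i \sum_{j=1}^{n} x_{ij} x_j \right) + b \\
f(\vec{x}) &= \sum_{j=1}^{n} \left( \sum_{i=1}^{l} \alpha_i y_i x_{ij} \right) x_j + b \\
f(\vec{x}) &= \sum_{j=1}^{n} w_j x_j + b \tag{2.11}
\end{align*}
with $w_j = \sum_{i=1}^{l} \alpha_i y_i x_{ij}$.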
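The equivalence between the sum over support vectors and the explicit weight form of Equation 2.11 can be checked numerically. The sketch below assumes the plain dot product as kernel, as in the derivation above; the support vectors, labels, coefficients and bias are hypothetical values chosen so that the arithmetic is exact.

```python
def dual_decision(alphas, labels, support_vectors, b, x):
    """Sum over support vectors: sum_i alpha_i * y_i * <x_i . x> + b."""
    return sum(
        a * y * sum(xij * xj for xij, xj in zip(xi, x))
        for a, y, xi in zip(alphas, labels, support_vectors)
    ) + b


def primal_weights(alphas, labels, support_vectors):
    """Weights of Equation 2.11: w_j = sum_i alpha_i * y_i * x_ij."""
    n = len(support_vectors[0])
    return [
        sum(a * y * xi[j] for a, y, xi in zip(alphas, labels, support_vectors))
        for j in range(n)
    ]


# Hypothetical support vectors, labels and coefficients, for illustration only.
support_vectors = [(1.0, 2.0), (3.0, 1.0), (2.0, 4.0)]
labels = [+1, -1, +1]
alphas = [0.5, 0.75, 0.25]
b = 0.5

w = primal_weights(alphas, labels, support_vectors)
x = (2.0, 2.0)
primal = sum(wj * xj for wj, xj in zip(w, x)) + b
dual = dual_decision(alphas, labels, support_vectors, b, x)
```

Both forms give the same value for any item, but the weight form only needs the n weights and the constant b, not the support vectors themselves.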
2.3.3 Decision Trees
Decision trees were frequently used in the nineties by artificial intelligence experts because
they can be easily implemented and they provide an explanation of the result. A decision
tree is a tree with the following properties:
Each internal node tests an attribute,
Each branch corresponds to the value of the attribute,
Each leaf assigns a classification.
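These three properties map directly onto a small data structure. The following minimal sketch represents a decision tree as nested dictionaries and classifies an item by walking from the root to a leaf; the attributes, thresholds and labels are hypothetical and reuse the antibody setting of Example 2.1.

```python
# Internal nodes test an attribute against a threshold, branches correspond to the
# outcome of the test, and leaves hold a class label. All values are hypothetical.
TREE = {
    "attribute": "antibody_a",
    "threshold": 5.0,
    "below": {"label": "No Disease"},
    "above": {
        "attribute": "antibody_b",
        "threshold": 2.0,
        "below": {"label": "Disease"},
        "above": {"label": "No Disease"},
    },
}


def classify(tree, item):
    """Follow the branches selected by the item's attribute values down to a leaf."""
    node = tree
    while "label" not in node:
        branch = "below" if item[node["attribute"]] < node["threshold"] else "above"
        node = node[branch]
    return node["label"]


prediction = classify(TREE, {"antibody_a": 7.5, "antibody_b": 1.0})
```

The sequence of tests followed from the root to the leaf is the explanation of the result mentioned above: here, antibody A is at least 5.0 and antibody B is below 2.0, hence the label Disease.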
Figure 2.17: Moving square that may be efficiently represented in a constraint database.
Example 2.2 Figure 2.17 shows a moving square, which at time $t = 0$ starts
at the first square of the first quadrant of the plane and moves to the northeast with a speed
of one unit per second to the north and one unit per second to the east.
When $t = 0$, the constraints are $x \geq 0, x \leq 1, y \geq 0, y \leq 1$, which is the unit
square in the first quadrant. We can calculate similarly the position of the square at
any time $t > 0$ seconds. For example, when $t = 5$ seconds, the constraints become
$x \geq 5, x \leq 6, y \geq 5, y \leq 6$, which is another square with lower left corner $(5, 5)$ and upper
right corner $(6, 6)$. In general, for any time $t$, the constraints describing the moving square
in a constraint database can be expressed as in Table 2.5.
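The general constraints of the moving square, x ≥ t, x ≤ t + 1, y ≥ t, y ≤ t + 1 and t ≥ 0, can be evaluated directly. The sketch below is only an illustration of how one constraint tuple stands for infinitely many points; it is not a constraint database engine, and the function name is ours.

```python
def in_moving_square(x, y, t):
    """True if point (x, y) lies inside the moving square at time t.

    Encodes the constraints x >= t, x <= t + 1, y >= t, y <= t + 1, t >= 0.
    """
    return t >= 0 and t <= x <= t + 1 and t <= y <= t + 1


inside_at_start = in_moving_square(0.5, 0.5, 0)   # unit square at t = 0
inside_at_five = in_moving_square(5.5, 6.0, 5)    # square from (5, 5) to (6, 6)
left_behind = in_moving_square(0.5, 0.5, 5)       # the square has moved away
```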
2.5 Inverse Distance Weighted Interpolation
Most scientific experiments are based on data collection. A data set thus formed is usually
composed of discrete values obtained after some data sampling method. For example, in
the case of meteorological studies, the temperature may be measured every hour. However,
although this meteorological data set will give accurate values when the temperature was
measured, it does not provide any information about the temperature evolution between
Table 2.5: Constraint database representation of a moving square
X Y T
x y t $x \geq t,\ x \leq t + 1,\ y \geq t,\ y \leq t + 1,\ t \geq 0$
electronics led to the development of database models that are far more efficient for dealing
with large volumes of information than flat databases. The most notable advance is the
relational model, which was proposed by Codd in 1970102, making databases an efficient
tool to support the research community. Since then, the number of databases has
dramatically increased117, and the 2009 Molecular Biology Database Collection118 includes 1,170
databases, each containing thousands, if not millions, of entries. These databases constitute
the extent of our knowledge related to genomics, proteomics, metabolomics and structural
genomics. However, most serve only as data warehouses with simple interfaces for data
retrieval3.
3.2 Database Integration Problem
To address increasingly complex questions, biologists are routinely required to develop new
databases by filtering information from existing databases119. Even though this is extremely
inefficient, there is a growing number of specialized databases designed around single
topics. Unfortunately, this simply propagates the underlying inability to utilize the data
outside the constraints imposed by the database designers120.
As an example, the Protein FAMilies121 (PFAM) database contains a large collection of
high quality, manually curated families, which are helpful to identify conserved domains that
occur within proteins and therefore provide insights into the proteins' functions. Besides
sequence-based queries using BLAST74, the primary means to mine data is based on a
randomly assigned accession number or keywords. There exist only very limited interactions
with other databases. As a consequence, the set of possible queries available to mine the
data is limited as well. In particular, it is not possible to search PFAM for the enzyme classes122
that are common to a specific family, for example.
Capitalizing on the potential of biological information requires the development of a
next-generation database that enables biologists to explore biological data in new ways.
The key to solving this problem is to move the design focus from the database structure
to a fluid association that can be adapted to a biologist's questions123 without re-designing
the underlying data structure. However, there are barriers to linking individual databases
because of different data formats and structures124,125.
Table 3.1: Core databases currently integrated in PROFESS (last update May 2009)
Name | Code | Level | Link | Ref.
CATH database | CATH | Structure | http://www.cathdb.info/ | 51
Clusters of Orthologous Groups of proteins | COG | Function | http://www.ncbi.nlm.nih.gov/COG/ | 129
Enzyme Classification | EC | Function | http://www.chem.qmul.ac.uk/iubmb/enzyme/ | 122
Database of Essential Genes | DEG | Evolution | http://www.essentialgene.org/ | 130
Database of Interacting Proteins | DIP | Function | http://dip.doe-mbi.ucla.edu/ | 131
Functional structure/sequence-based phylogeny | | Evolution | Proposed in Chapter 4 |
Functional structure similarity comparisons | | Structure | Proposed in Chapter 4 |
Gene Ontology | GO | Function | http://www.geneontology.org/ | 132
GenBank | GENBANK | Sequence | http://www.ncbi.nlm.nih.gov/Genbank/ | 133
KEGG Ligands | KEGG | Function | http://www.genome.jp/kegg/ligand.html | 134
Protein Data Bank | PDB | Structure | http://www.rcsb.org/ | 7
Protein Families database | PFAM | Function | http://pfam.sanger.ac.uk/ | 121
Protein/Protein interactions in E. coli | PIN | Function | http://genome.cshlp.org/content/16/5/686 | 135
Structural Classification of Proteins | SCOP | Structure | http://www.bio.cam.ac.uk/scop/ | 50
UniProtKB Taxonomy | NEWT | All | http://www.uniprot.org/taxonomy/ | 136
PROFESS was designed to assist in the functional and evolutionary analysis of proteins
continually identified from whole-genome sequencing. Hence, a requirement for PROFESS
was to be both versatile and extendable. For these reasons, the PROFESS database was
created using wrappers following a Local-As-View (LAV) approach (Section 3.5). The web interface
was also designed using a modular approach (Section 3.7), each module representing a
unique view of the data.
Figure 3.1 outlines the structure of our system: users interact with PROFESS through
a web interface using a functional-style query language that is translated to SQL for mining
PROFESS (A). The core of PROFESS (B) consists of the COG-PDB relationship (Sec-
tion 3.4.1). Other databases interact with PROFESS core through the use of wrappers. As
a consequence, all integrated databases can be uniformly queried (Section 3.6). The flexible
design of PROFESS coupled with user-friendly searching capabilities makes PROFESS
particularly useful for asking a range of questions about the sequence, structure and functional
relationship of orthologous proteins.
PROFESS is freely accessible through the URL http://cse.unl.edu/~profess. Data
can be downloaded as parseable files in Comma Separated Values (CSV) format from the
web-interface or using HTTP GET requests that may be batched in scripts.
Table 3.2: List of modules available in the user interface of PROFESS
Name | Level | Data sources
Essential genes | Evolution | COG, DEG, NEWT, PDB
Functions | Function | COG, EC, GO, KEGG, NEWT, PDB, PFAM
Functions summary | Function | COG, EC, GO, PDB, PFAM
Ligands | Function | COG, KEGG, PDB
Protein interactions | Function | COG, PIN
Sequences | Sequence | COG, GENBANK, NEWT, PDB
Sequence-based phylogeny | Evolution | COG, GENBANK, PDB
Sequence similarities | Sequence | COG, GENBANK, PDB
Structures | Structure | CATH, COG, NEWT, PDB, SCOP
Structure-based phylogeny | Evolution | COG, PDB
Structural comparisons | Structure | COG, NEWT, PDB
similar functionality. However, a major difference with PROFESS is that a search on the
PDB for a specific ligand will return the list of all bound protein structures. PROFESS only
returns those which are orthologous and thus expected to have similar functions. To provide
rapid access to biologically relevant data, common buffers, detergents, ions and solvents are
listed separately.
Module Protein/Protein Interactions
The interactions between proteins are critical for most biological functions. For example,
signal transduction is based on interactions between extra-cellular signaling molecules and
membrane proteins, and plays a fundamental role in many biological processes and in many
diseases. The Protein Interactions module (Figure 3.2c) lists protein interactions found in
E. coli. The data was extracted from Arifuzzaman et al. 135 and interactions were correlated
to the corresponding protein structures by matching bait and prey genes to their represen-
tative COG. The Protein Interactions module also integrates the 55,692 manually curated
protein/protein interactions (as of August 2009) in 243 organisms from the Database of
Interacting Proteins131.
Module Functional Summary
The Function level of PROFESS also summarizes the main biological functions of an
orthologous cluster (example of Dihydrodipicolinate synthases (COG 329) shown in
Figure 3.3). For three primary descriptions of protein function (the Protein Families121, the
Enzyme Classification122 and the Gene Ontology132), the numbers of proteins within each
class (within the current orthologous cluster) are computed and the distributions are
represented as pie charts. This synthetic view allows the user to quickly differentiate relevant
classes from outliers. Classes are sorted by decreasing number of proteins. The darker the
color in the pie chart, the higher the number of proteins. We implemented PHP scripts to
generate the raw data, whereas the pie charts are generated using the Google Chart API.
While the webpage displays chart thumbnails, the pie charts can also be downloaded in
high definition.
Documentation available at http://code.google.com/apis/chart/
Figure 3.3: The module Function Summary represents the distribution of the proteins in
all PFAM, E.C. and GO classes within the current orthologous cluster.
3.4.3 Phylogenetic Level
Phylogeny is the study of the evolution of organisms and describes how they evolved
together. The evolution and relations between organisms or proteins of interest are usually
represented in an evolutionary, or phylogenetic, tree. The Evolution level of PROFESS
displays a table of essential genes, and sequence and structure-based phylogenetic trees.
Module Sequence-based Phylogeny
The Sequence Tree (Figure 3.4a) shows the unrooted phylogenetic tree generated using
protein chain sequences from the PDB. First, the sequences were aligned using ClustalW265.
Second, the tree was computed with ClustalW2 using the multiple sequence alignment as
a guide. The final image was generated using DrawTree from the PHYLIP package138.
Module Structure-based Phylogeny
The Structure Tree (Figure 3.4b) module shows the unrooted phylogenetic tree generated
using protein structures from the PDB. To increase throughput, the structures were aligned
using MAMMOTH-mult82, a tool for multiple structure alignments. The tree was computed
by ClustalW2 using the multiple structure alignment as a guide. The final image was
generated using DrawTree. The two phylogenetic trees allow a quick visual comparison of
the divergence between sequences and structures of orthologous proteins. In both cases, the
tree can be downloaded as an ASCII file in PHYLIP format as well as a high-definition
picture.
(a) Module Sequence
(b) Module Sequence Similarities
Figure 3.6: The two modules (top rows only) from the Sequence level of the PROFESS
database for Dihydrodipicolinate synthases (COG 329). Moving the cursor over a
cross-reference activates tooltips that provide additional information.
website rather than reproducing SCOP data on our pages.
Module Structure Comparisons
Additionally, the Structure level contains all pairwise structure alignments of an or-
thologous cluster (Figure 3.5b). The pairwise structure comparison tool DaliLite was used
to measure the backbone structure similarity of proteins within each orthologous cluster
defined by the COG database. All-against-all pairwise structural comparisons were carried
out for all COGs that were represented by a minimum of two organisms. The Dali Z-scores
were normalized to calculate a Fractional Structure Similarity (FSS) score:
\[ \mathrm{FSS} = \frac{Z_{AB}}{\max(Z_{AA}, Z_{BB})} \tag{3.1} \]
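As a small illustration of this normalization, the sketch below computes an FSS value from three Dali Z-scores. It assumes that the denominator is the larger of the two self-comparison Z-scores of structures A and B; the numerical values are hypothetical.

```python
def fractional_structure_similarity(z_ab, z_aa, z_bb):
    """FSS = Z_AB / max(Z_AA, Z_BB); the arguments are Dali Z-scores."""
    return z_ab / max(z_aa, z_bb)


# Hypothetical Z-scores, for illustration only.
fss = fractional_structure_similarity(z_ab=30.0, z_aa=48.0, z_bb=60.0)
```

With these values, the pairwise score is half of the largest self-comparison score, giving an FSS of 0.5.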
higher the normal form, the more robust the database structure is against inconsistencies.
PROFESS was designed using the fifth normal form proposed by Fagin 142 . The resulting
Entity-Relationship diagram is shown in Figure 3.7.
However, selective denormalization was subsequently performed for performance
reasons143. In particular, the PROFESSor (see Section 3.6.1) queries data from a unique table
(precalc professor) that includes pre-computed joins between relations. To keep the
data consistent, routines were implemented along with the wrappers to regenerate the table
after data is inserted in PROFESS.
3.6 Functional-Style Query System
3.6.1 The PROFESSor
The primary search tool to query the database is the PROFESSor (Figure 3.9), a unified
text field that assists the user in refining complex queries by dynamically suggesting
entries from any integrated database.
Figure 3.9: The PROFESSor is a search tool generated from the core databases. It displays
interactive suggestions to help the user refine complex queries. Using the PROFESSor,
users can quickly and accurately find all functional, structure and sequence information
about a particular protein and its relation to other protein functions and folds.
The PROFESSor assists the user by correcting spelling errors using the
Damerau-Levenshtein metric144. The Damerau-Levenshtein distance between two sequences of
characters is the minimum number of basic operations (substitution, insertion, deletion and
transposition) necessary to transform one sequence into the other. For example, the
Damerau-Levenshtein distance between "homo sapiens" and "homu spaiens" is 2: u is changed to
o, and p and a need to be switched. This is an effective means to detect spelling errors,
as Damerau145 showed that 80% of misspellings can be corrected with one basic operation
only, which corresponds to a distance of one.
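The distance described above can be computed with a short dynamic program. The sketch below implements the restricted variant (often called optimal string alignment), which counts adjacent transpositions as one operation; whether the PROFESSor uses this or the unrestricted variant is not stated here, and the `suggest` helper and its `max_distance` parameter are hypothetical.

```python
def damerau_levenshtein(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def suggest(query, vocabulary, max_distance=2):
    """Return vocabulary entries within max_distance of the query, closest first."""
    scored = [(damerau_levenshtein(query, term), term) for term in vocabulary]
    return [term for dist, term in sorted(scored) if dist <= max_distance]
```

For the example in the text, `damerau_levenshtein("homo sapiens", "homu spaiens")` evaluates to 2 (one substitution and one transposition).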
The PROFESSor also provides a user-defined, focused browsing feature. For instance, upon
typing in the query FAD (Flavin-Adenine Dinucleotide), the PROFESSor returns a drop-down
list of protein folds and functions that have a known relation with the FAD ligand (Figure 3). If
a user selects the ligand suggestion, PROFESS will return all functional clusters known to
interact with FAD. The PROFESSor searches all other data sources within PROFESS in
the same manner. A user can rapidly identify other protein functions that have the same
fold, bind similar ligands, or identify cellular localizations all in one search. In addition to
the PROFESSor, a traditional search form is also available.
The PROFESSor may also be queried using many keywords from several databases combined with
boolean logic. Using regular expressions, the general syntax for queries is defined as:
([KEY]{0,1} \w ([OR] \w)*)([OR]{0,1} [KEY]{0,1} \w ([OR] \w)*)*
KEY depends on the database and may be one of the following (note that this list will
grow with the number of core databases): ALL, COG, EC, GO, LIGAND, PDB, PFAM.
By default, all keywords after a [KEY] are considered as a single string for the query. This
behavior can be altered by prefixing the keywords with [OR]. The wildcard characters %
(any number n of characters, with n ≥ 0) and _ (exactly one character) may be used in a
query. A logical AND is performed between different keys.
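The wildcard semantics can be illustrated by translating a keyword into a regular expression. This is a sketch under stated assumptions: the single-character wildcard is taken to be the SQL-style underscore, matching is assumed to be case-insensitive, and the helper names are hypothetical (in a SQL-backed system the same effect would normally come from a LIKE clause).

```python
import re


def wildcard_to_regex(pattern):
    """Translate '%' (any run of characters) and '_' (one character) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # every other character matches literally
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)


def matches(pattern, value):
    """True when the whole value matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(value) is not None
```

For example, `matches("dihydro%", "Dihydrodipicolinate synthase")` is true, while `matches("COG32_", "COG32")` is false because `_` requires exactly one character.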
3.6.2 Functional-Style Query System
A fundamental component of our approach required the development of an intuitive
functional-style query system that incorporated a variety of similarity functions53 capable of
generating data relationships not conceived during the creation of the database. For example, one may
query for a relationship between the PFAM and the COG databases, although this relation
is not explicitly defined in the database. Functional-style queries are composed of a set of
atomic functions provided to the users. Each function takes as input a set of parameters
and gives as output a well-defined value or set of values. A full query is defined as a pipeline
Structural Comparison of
Functional Orthologs
"We are finding that things that once appeared to be
biologically independent are closely connected."
Peter K. Sorger
4.1 Introduction
Quantifiable models of protein evolution are useful for developing robust tools to identify
suitable drug-binding sites, to predict increases in susceptibility to a human genetic disease,
and to predict and modify organismal niches. Some of the strongest arguments in favor
of biological evolution draw from studies on protein evolution using sequence homology146.
Multiple sequence alignments are routinely used to create phylogenetic relationships147,148,
which highlight sequence variability between organisms. The accepted view of protein
evolution is that changes to a protein's gene sequence are selected and modulated by a
number of factors that include protein structure149,150.
What is the impact on protein structure as its sequence undergoes genetic drift?
Maintaining the correct protein fold is fundamental to preserving its function151, but evolving
the sequence would also be expected to result in structural changes152,153. The resulting
paradox is that sequence determines a protein's structure, but the structure is relatively
invariant over a large range of sequences. This paradox is highlighted by the tremendous
difference between the number of known protein structures and the number of protein folds154. Even
though the Protein Data Bank contains 61,086 protein structures*, there are only 1,110
unique topologies† and 1,195 unique folds‡ in the CATH and SCOP structure classification
databases, respectively. The significant reduction in the number of protein folds relative to
the number of protein sequences implies a much stronger correlation between structure and
function. Protein structures are generally viewed as more conserved than their sequences.
Evolution rates of protein sequences and structures were quantified by Illergard et al.155.
An intuitive explanation is that substitutions of residues with similar chemical
properties are relatively frequent and that many mutations are silent, that is, they do not lead to
any significant structural modifications. However, the explicit reason for the reduction in
fold space remains unclear, although some have suggested that the protein fold space is more
likely to be represented as a continuum instead of a collection of discrete folds156. Using
a continuous representation of the fold space, a protein fold should be considered as being
plastic, where sequence changes are accommodated by local perturbations in the structure
while maintaining the general characteristics of a particular fold155,157,158. Correspondingly,
the genetic drift in a protein's sequence may imply a similar gradual divergence in structure
instead of a sudden dramatic transition to a new fold. If this perspective is accurate, then
a comparative analysis of homologous proteins should identify correlated rates of structure
and sequence divergence. Previous studies have examined homologous structure similarity,
but they did not address the phylogenetic consequences of structure
divergence155,157,158. To help understand how protein plasticity affects organism divergence, we
compared 48 sets of homologous protein families annotated in the COG database for two
bacterial phyla, Firmicutes and Proteobacteria.
* As of October 29, 2009
† As of version 3.2.0 released in August 2008
‡ As of version 1.75 released in February 2009
Table 4.1 (continued): The 48 COG structure families with two Firmicutes and two Proteobacteria organisms after manual curation
COG  Function  Tree  CATH
366  Glycosidases  Starburst  2.60.40.1180
454  Histone acetyltransferase HPA2 and related acetyltransferases  Starburst  3.40.630.30
491  Zn-dependent hydrolases, including glyoxylases  Starburst  3.60.15.10
500  SAM-dependent methyltransferases  Starburst  3.40.50.150
526  Thiol-disulfide isomerase and thioredoxins  Starburst  3.40.30.10
590  Cytosine/adenosine deaminases  Starburst  3.40.140.10
637  Predicted phosphatase/phosphohexomutase  Starburst  1.10.164.10
664  cAMP-binding proteins  Starburst  1.10.10.10
745  Response regulators consisting of a CheY-like receiver domain and a winged-helix DNA-binding domain  Starburst  3.40.50.2300
753  Catalase  Starburst  3.30.63.10
778  Nitroreductase  Starburst  3.40.109.10
784  FOG: CheY-like receiver  Starburst  3.40.50.2300
796  Glutamate racemase  Starburst  3.40.50.1860
1028  Dehydrogenases with different specificities  Starburst  3.40.50.720
1151  6Fe-6S prismane cluster-containing protein  Starburst  1.20.1270.30
1309  Transcriptional regulator  Starburst  1.10.357.10
1396  Predicted transcriptional regulators  Starburst  1.10.260.40
1404  Subtilisin-like serine proteases  Starburst  3.40.50.200
1733  Predicted transcriptional regulators  Starburst  1.10.510.10
1846  Transcriptional regulators  Starburst  1.10.10.10
2159  Predicted metal-dependent hydrolase of the TIM-barrel fold  Starburst  3.20.20.140
2367  Beta-lactamase class A  Starburst  3.40.710.10
2730  Endoglucanase  Starburst  3.20.20.80
3693  Beta-1,4-xylanase  Starburst  3.20.20.80
4948  L-alanine-DL-glutamate epimerase and related enzymes of enolase superfamily  Starburst  3.20.20.120
The protein structures in COG 28 (thiamine pyrophosphate requiring enzymes)
provide a useful example of the structural divergence that occurred after the Firmicutes and
Proteobacteria phyla split. The overall fold is conserved between the phyla while there are
discrete localized structural elements that are unique to each phylum. The two Firmicutes
structures (Figure 4.3A) yield a Z-score of 59.6 and an FSS of 0.83, indicating very high
structural conservation. The four Proteobacteria structures (Figure 4.3B) yield an average
Z-score of 37.7 ± 1.6 and an average FSS of 0.58 ± 0.03. Again, the structures share a similar
fold despite the slightly lower scores.
The major structural differences between the Firmicutes and Proteobacteria are
highlighted in red on a representative Firmicutes structure (Figure 4.3C) from L. plantarum
(PDB ID: 1POW) and a representative Proteobacteria structure (Figure 4.3D) from
P. fluorescens (PDB ID: 2AG0). The comparison of protein structures between phyla yields an
average Z-score of 34.8 ± 1.2 and an average FSS of 0.49 ± 0.02, which is significantly lower
than the comparisons within each phylum. This suggests a divergence in structural details
while conserving the overall fold. A detailed analysis reveals localized differences between
the structures from the two phyla. In the Firmicutes representative structure, there is a
continuous helix compared to helical breaks and loop insertions in the Proteobacteria structure.
This is similar to the C-terminal domain of primase, where a long continuous helix found
in the E. coli structure is broken by a loop region in B. stearothermophilus160,161,162,163.
4.4 Phylogenetic Analysis of Functional Orthologs
4.4.1 Methods
In addition to the pairwise structural alignments, all the protein structures from each COG
were simultaneously aligned using the multiple structure alignment program
MAMMOTH-multi82 (software available at http://ub.cbm.uam.es/mammoth/mult/). We implemented a
script that uses the resulting aligned structures and the structure-based sequence alignment
to calculate an all-versus-all matrix of per-residue α-carbon (Cα)
distances. Standard bootstrapping techniques164 were then applied to the all-versus-all
matrix of per-residue Cα distances to generate 100 distance matrices. Columns of the
structure-based sequence alignment, with the corresponding Cα distances, were randomly selected
until the total number of columns in the original sequence alignment was reached. The
resulting set of Cα distances was then used to calculate a root mean square deviation (rmsd)
between each pair of structures in the matrix. The 100 distance matrices were imported
into PHYLIP (v3.68)138 to generate a consensus phylogenetic tree and bootstrap confidence
levels.
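The column-resampling step can be sketched as follows. This is an illustration rather than the script used in the study: it assumes columns are drawn with replacement (the usual bootstrap convention), that every pair of structures has a Cα distance at every column (gaps are ignored), and that the input is a hypothetical dictionary mapping a pair of structure indices to its per-column distances.

```python
import math
import random


def rmsd(distances):
    """Root mean square of a list of per-residue distances."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def bootstrap_rmsd_matrices(pair_distances, n_structures, n_replicates=100, seed=0):
    """Build bootstrapped rmsd distance matrices.

    pair_distances maps (i, j), with i < j, to the per-column C-alpha distances
    between structures i and j; all lists must have the same length.
    """
    rng = random.Random(seed)
    n_columns = len(next(iter(pair_distances.values())))
    matrices = []
    for _ in range(n_replicates):
        # Resample alignment columns until the original column count is reached.
        columns = [rng.randrange(n_columns) for _ in range(n_columns)]
        matrix = [[0.0] * n_structures for _ in range(n_structures)]
        for (i, j), distances in pair_distances.items():
            value = rmsd([distances[c] for c in columns])
            matrix[i][j] = matrix[j][i] = value  # keep the matrix symmetric
        matrices.append(matrix)
    return matrices
```

Each returned matrix would then be written out in the format expected by the downstream tree-building program.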
Each set of 100 bootstrapped distance matrices was analyzed by the Fitch-Margoliash
method165 implemented in PHYLIP. Each matrix was jumbled with 100 replicates. This
resulted in 10,000 (= 100 × 100) unique and random distance matrices for each COG. The
best tree was identified with the program Consense implemented in PHYLIP using the
extended majority rule consensus. Since the bootstrapped trees do not show distance
relationships, the original distance matrix generated by MAMMOTH-multi was used to
generate a distance-based phylogenetic tree. Each original distance matrix was jumbled
with 100 replicates. The distance-based phylogenetic tree was drawn using the program
Drawtree implemented in PHYLIP. Each tree was visually inspected and compared with
the DaliLite analysis using the bootstrap values to determine if a tree fit the star, split or
undetermined classification.
4.4.2 Results
Structure-based phylogenies were created from root mean square differences (rmsd) in
per-residue Cα positions for optimally aligned protein structures using MAMMOTH-multi (see
Section 2.2.2). A separate phylogenetic tree was generated for each COG, where three
distinct patterns were observed as reported in Table 4.1. Fifteen trees exhibited a strong
split at the phylum level, 29 exhibited a starburst pattern suggesting little to no evidence
for a split according to phyla, and 4 exhibited a strong split at the phylum level but with
the exception of a single structure (split +1).
The fifteen COG phylogenies with strong phylum-splitting patterns had two branches,
one with closely related Firmicutes structures and the other with closely related
Proteobacteria structures. Figure 4.4a shows COG28 (Thiamine pyrophosphate requiring enzymes)
and COG446 (Uncharacterized NAD(FAD)-dependent dehydrogenases) as examples. The
\[
\Delta FSS = \frac{\mathrm{Avg}(FSS_{+/-})}{\dfrac{\mathrm{Avg}(FSS_{+/+})}{2} + \dfrac{\mathrm{Avg}(FSS_{-/-})}{2}} \tag{4.1}
\]
Similarly, the sequence identity ratio ΔSeqID (Equation 4.2) was determined by
calculating an average sequence identity for the Proteobacteria-Firmicutes structure
comparisons, and dividing by the mean of the average Proteobacteria-Proteobacteria and
Firmicutes-Firmicutes comparisons (that is, the sum of half of each average).
\[
\Delta SeqID = \frac{\mathrm{Avg}(SeqID_{+/-})}{\dfrac{\mathrm{Avg}(SeqID_{+/+})}{2} + \dfrac{\mathrm{Avg}(SeqID_{-/-})}{2}} \tag{4.2}
\]
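Both ratios share the same form: the average over cross-phylum comparisons divided by the mean of the two within-phylum averages. A minimal sketch, with a hypothetical function name and plain lists of pairwise scores as assumed inputs:

```python
from statistics import mean


def phylum_ratio(cross, within_first, within_second):
    """Average cross-phylum score divided by the mean of the within-phylum averages.

    cross: scores for comparisons between the two phyla
    within_first, within_second: scores for comparisons inside each phylum
    The same function applies to FSS values or to sequence identities.
    """
    denominator = mean(within_first) / 2 + mean(within_second) / 2
    if denominator == 0:
        raise ValueError("within-phylum averages must not both be zero")
    return mean(cross) / denominator
```

A value below one means that proteins are, on average, less similar across phyla than within each phylum.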
In general, most starburst phylogenetic trees (see Figure 4.4b) had a branch length
between members of different phyla that was much shorter than the branch lengths between
members within the same phylum. That is, a starburst phylogeny was expected to have ΔFSS
and ΔSeqID values greater than one. Likewise, most split phylogenies had longer branches
between phyla than within each phylum (see Figure 4.4a) and were expected to yield ΔFSS
and ΔSeqID values of less than one.
Figure 4.5: The relationship between structure and sequence change was constant
regardless of the phylogenetic starburst or split pattern. The best-fit line is defined by
ΔFSS = 0.55 ΔSeqID + 0.45 and yields a correlation factor R² = 0.7.
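As a quick reading of the fitted line, the reported coefficients can be evaluated at two sequence identity ratios. The input values 1.0 and 0.5 are illustrative only and are not taken from the data:

\[
0.55 \times 1.0 + 0.45 = 1.00, \qquad 0.55 \times 0.5 + 0.45 = 0.725
\]

Under this fit, a family whose cross-phylum sequence identity equals its within-phylum value is predicted to show no relative structural divergence, while halving the sequence identity ratio lowers the predicted structure similarity ratio only to about 0.73.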
represented by the starburst phylogenetic tree pattern. The remaining 12 COGs correspond
to 11 split and 1 split +1 phylogenetic tree patterns.
The second most populous CATH family is CATH 1.10 (mainly α, orthogonal bundle),
with 15% of our COGs belonging to this CATH family. Most (85.7%) of the COGs (6 of 7)
in the CATH 1.10 family are represented by the starburst phylogenetic tree pattern, with
only one COG represented by a split pattern. There appears to be a limit in structure
similarity at approximately 0.6 FSS and a corresponding sequence identity limit at 40% for
CATH 1.10. This limit is not observed in the CATH 3.40 family. The sequence
and structure similarity limit for CATH 1.10, combined with a larger percentage of COGs
assigned to the starburst family, suggests that CATH 1.10 is more susceptible to mutations
that affect the protein structure. The results suggest a faster evolutionary rate leading to
a higher structural divergence relative to other CATH architectures.
4.7 Discussion
The comparison of homologous protein structures with the same function provides
quantitative evidence that protein structures diverged following the speciation events that
created the modern bacterial phyla of Firmicutes and Proteobacteria. The abrupt cutoff at
61% sequence identity and 0.84 fractional structure similarity observed between
Firmicutes and Proteobacteria proteins was mirrored by an approximate 60% protein sequence
identity between these two phyla observed by 16S rRNA sequence similarity167,168. Thus,
this maximum observed sequence identity imparts limits to the maximum possible
structure similarity between homologous proteins from these two phyla. This is consistent with
prior observations that a sequence identity of 40-50% sometimes results in significant structural
and functional differences152,153,169. Furthermore, the results imply an inherent allowable
structural plasticity that does not perturb function. Additionally, the random drift
after speciation inexorably leads to non-identical structures despite maintenance of function.
There are a number of cases where FSS was below 0.20, indicating a significant structural
change. Proteins with completely different folds but the same function are extreme
examples of the plasticity of the structure-function relationship and include such proteins as
peptidyl-tRNA hydrolases170, pantothenate kinase171, polypeptide release factors172 and
lysyl-tRNA synthetases173.
Forty percent of the COGs we examined have evolved slowly enough that it was
possible to generate phylogenetic trees consistent with this ancient split. The other COGs
have either evolved too rapidly or are otherwise subject to few evolutionary constraints
to provide evidence for this split. This distinction between the COGs is clearly apparent
from the comparison of ΔFSS and ΔSeqID in Figure 4.5. The linear relationship implies
a fixed relative structure drift rate, where structure changes half as fast as sequence across
phyla. This correlation in the divergence of protein sequences and protein structures has
additional ramifications beyond bacterial evolution. Our analysis implies a continuum of
protein folds that adapt to large sequence changes by incurring local structural
modifications155,156,157,158. This continuum of protein folds makes it challenging to apply protein
structural classification to identify function, as has been previously noted174,175.
Does the nature of the protein's three-dimensional structure play a role in protein
structure divergence? Our analysis demonstrates that some proteins evolve slowly and maintain
high sequence identity (80%) and structure similarity (0.80 FSS) while other proteins
exhibit rapid evolution rates where the sequence identity is lower than 20% and the FSS
below 0.40. This implies that the underlying architecture of a particular protein may be
more or less amenable to amino-acid substitutions in order to maintain functional activity.
A specific protein fold may have a higher intrinsic plasticity that enables it to readily
accommodate sequence changes through local conformational changes without a detrimental
impact on activity. This is exactly what was observed: structural variations were localized
to specific regions, as illustrated by the comparison of the COG 28 protein structures
(see Figure 4.3). This is consistent with the observation that there are different structure
divergence rates within a protein176,177. Regions of the protein that do not impact biological
activity are expected to yield a higher divergence rate and incur larger local structural
changes152,178. As a result, a fold with a relatively high plasticity would experience an
elevated structural diversity between phyla, where the rate of change may closely parallel
the mutation rate155. Conversely, another fold may be extremely sensitive to amino-acid
substitutions, where minor sequence perturbations may result in a decrease in structural
integrity and a corresponding loss of activity. As a result, the sequence and structure of this
Experimental Datasets
"It is a capital mistake to theorize before one has data."
Sir Arthur Conan Doyle
In order to experimentally test and evaluate the various classification methods proposed
in Chapters 6, 7, 8 and 9, we used five different data sets, namely CRCARS (Section 5.1),
FLU (Section 5.2), HDD (Section 5.3), PBC (Section 5.4) and TCEQ (Section 5.5). Table 5.1
summarizes the utilization of the datasets during the experimental evaluation of the proposed
methods. These datasets will also be used throughout the remainder of this dissertation to
explain the classification methods and illustrate the successive steps of the algorithms.
These data sets are based on real (non-synthetic) data and cover broad knowledge domains,
in particular the automobile (CRCARS), the meteorological (TCEQ) and the medical (HDD, FLU and
PBC) fields. Unlike FLU and TCEQ, the CRCARS, PBC and HDD databases contain static
data, that is, data that do not vary over time. It should be noted that the PBC, HDD and
TCEQ databases contain a relatively large number of features used by the classification
algorithms. The main characteristics of the databases are summarized in Table 5.2.
Table 5.1: Usage of the datasets during the experimental evaluation of the proposed
methods.
Method CRCARS FLU HDD PBC TCEQ
Classification Integration
Reclassification
Temporal Classification
Table 5.2: Main characteristics of the CRCARS, FLU, HDD, PBC and TCEQ data sets
used during our experiments. The data sets are based on real, non-synthetic data.
Dataset                    Domain          Records  Features  Classes  Temporal  Ref.
CRCars                     Automobile      406      5         3/3      No        181
Google Flu Trends          Medical         15,352   4         2        Yes       182
Heart Disease Diagnostic   Medical         720      10        2        No        183
Primary Biliary Cirrhosis  Medical         314      17        3/2/4    No        184
TCEQ Ozone                 Meteorological  2,534    40        2        Yes       185
Datasets were remapped during the experiments. As a result, the number of actual records, features
and classes used during the experiments may vary from the above numbers. See relevant sections for details.
5.1 CRCars Database
The CRCars database was used during the Second Exposition of Statistical Graphics
Technology organized by the Committee on Statistical Graphics of the American Statistical
Association in 1983. This data set was collected by Donoho et al.181 in 1982. It comprises
406 observations on the following 7 measurements:
- acceleration: from 0 to 60 mph, measured in seconds (between 8 and 24.8 seconds),
- number of cylinders of the engine (between 3 and 8 cylinders),
- engine displacement in cubic inches (between 68 and 455 cubic inches),
- horsepower: power of the engine (between 46 and 230 horsepower),
- vehicle weight, measured in pounds (between 732 and 5140 lbs.),
- mpg: fuel efficiency (between 8.0 and 24.6 miles per gallon),
- origin of car (American, European or Japanese).
Because of its historical nature, this data set is particularly suitable for temporal
classification. Hence we used it to evaluate the proposed temporal linear classification discussed
in Chapter 9.
5.3 Heart Disease Databases
The heart disease diagnosis data set183 describes the medical records of patients regarding
their heart condition and was collected by Robert Detrano, Andras Janosi and William
Steinbrunn from the following three hospitals:
Cleveland Clinic Foundation, USA (303 patients),
Institute of Cardiology, Budapest, Hungary (294 patients),
University Hospital of Zurich, Switzerland (123 patients).
The dataset contains no missing values. Each hospital records the following ten features:
1. age: years
2. gender: (0 = female, 1 = male)
3. cp: chest pain (1=typical angina, 2=atypical, 3=non-anginal, 4=asymptomatic)
4. trestbps: resting blood pressure (in mmHg on admission to the hospital)
5. chol: serum cholesterol in mg/dl
6. restecg: resting electrocardiographic results
7. thalach: maximum heart rate achieved
8. exang: exercise induced angina (0 = no, 1 = yes)
9. oldpeak: ST depression induced by exercise relative to rest
10. disease: presence of heart disease (-1 = not present, 1 = present)
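The ten features listed above map naturally onto a small record type. The sketch below is only an illustration of that layout: the class name, the parser, the comma-separated input format and the column order are assumptions, and only the value ranges stated in the list are validated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartRecord:
    age: int          # years
    gender: int       # 0 = female, 1 = male
    cp: int           # chest pain type, 1 to 4
    trestbps: float   # resting blood pressure (mmHg)
    chol: float       # serum cholesterol (mg/dl)
    restecg: int      # resting electrocardiographic results
    thalach: float    # maximum heart rate achieved
    exang: int        # exercise induced angina, 0 = no, 1 = yes
    oldpeak: float    # ST depression induced by exercise relative to rest
    disease: int      # class label, -1 = not present, 1 = present

    def __post_init__(self):
        if self.gender not in (0, 1):
            raise ValueError("gender must be 0 or 1")
        if self.cp not in (1, 2, 3, 4):
            raise ValueError("cp must be between 1 and 4")
        if self.exang not in (0, 1):
            raise ValueError("exang must be 0 or 1")
        if self.disease not in (-1, 1):
            raise ValueError("disease must be -1 or 1")


def parse_record(line):
    """Parse one comma-separated line in the feature order listed above (assumed)."""
    f = [field.strip() for field in line.split(",")]
    if len(f) != 10:
        raise ValueError("expected 10 fields, got %d" % len(f))
    return HeartRecord(int(f[0]), int(f[1]), int(f[2]), float(f[3]), float(f[4]),
                       int(f[5]), float(f[6]), int(f[7]), float(f[8]), int(f[9]))
```

Keeping the class label as the last field mirrors the list, where `disease` is the value the classifiers are expected to predict.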