Sign up & Download
Sign in

Computational Discovery of Gene Modules, Regulatory Networks and Expression Programs

by Georg Kurt Gerber
Science (2007)

Cite this document (BETA)

Available from Georg Gerber's profile on Mendeley.
Page 1
hidden

Computational Discovery of Gene Modules, Regulatory Networks and Expression Programs

Computational Discovery of Gene Modules,
Regulatory Networks and Expression Programs
by
Georg Kurt Gerber
B.A., Mathematics, UC Berkeley (1991)
M.P.H., Infectious Diseases, UC Berkeley (1994)
S.M., Electrical Engineering and Computer Science, MIT (2003)
Submitted to the Harvard-MIT Division of Health Sciences and
Technology
in partial ful llment of the requirements for the degree of
Doctor of Philosophy in Computer Science and Medical Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2007
c
Massachusetts Institute of Technology 2007. All rights reserved.
Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Harvard-MIT Division of Health Sciences and Technology
July 1, 2007
Certi ed by. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
David K. Gi ord
Professor of Electrical Engineering and Computer Science
Thesis Supervisor
Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Martha L. Gray
Edward Hood Taplin Professor of Medical and Electrical Engineering
Co-Director, Harvard-MIT Division of Health Sciences and
Technology
Page 3
hidden
Computational Discovery of Gene Modules, Regulatory
Networks and Expression Programs
by
Georg Kurt Gerber
Submitted to the Harvard-MIT Division of Health Sciences and Technology
on July 1, 2007, in partial ful llment of the
requirements for the degree of
Doctor of Philosophy in Computer Science and Medical Engineering
Abstract
High-throughput molecular data are revolutionizing biology by providing massive
amounts of information about gene expression and regulation. Such information
is applicable both to furthering our understanding of fundamental biology and to
developing new diagnostic and treatment approaches for diseases. However, novel
mathematical methods are needed for extracting biological knowledge from high-
dimensional, complex and noisy data sources. In this thesis, I develop and apply
three novel computational approaches for this task. The common theme of these
approaches is that they seek to discover meaningful groups of genes, which confer ro-
bustness to noise and compress complex information into interpretable models. I rst
present the GRAM algorithm, which fuses information from genome-wide expression
and in vivo transcription factor-DNA binding data to discover regulatory networks of
gene modules. I use the GRAM algorithm to discover regulatory networks in Saccha-
romyces cerevisiae, including rich media, rapamycin, and cell-cycle module networks.
I use functional annotation databases, independent biological experiments and DNA-
motif information to validate the discovered networks, and to show that they yield
new biological insights. Second, I present GeneProgram, a framework based on Hi-
erarchical Dirichlet Processes, which uses large compendia of mammalian expression
data to simultaneously organize genes into overlapping programs and tissues into
groups to produce maps of expression programs. I demonstrate that GeneProgram
outperforms several popular analysis methods, and using mouse and human expres-
sion data, show that it automatically constructs a comprehensive, body-wide map
of inter-species expression programs. Finally, I present an extension of GenePro-
gram that models temporal dynamics. I apply the algorithm to a compendium of
short time-series gene expression experiments in which human cells were exposed to
various infectious agents. I show that discovered expression programs exhibit tem-
poral pattern usage di erences corresponding to classes of host cells and infectious
agents, and describe several programs that implicate surprising signaling pathways
and receptor types in human responses to infection.
Thesis Supervisor: David K. Gi ord
Title: Professor of Electrical Engineering and Computer Science
3
Page 5
hidden
Biographical information
Personal
Born: December 31, 1970 in Los Angeles, CA.
Education
B.A., Mathematics, summa cum laude, UC Berkeley, 1991.
M.P.H., Infectious Diseases, UC Berkeley, 1994.
S.M., Electrical Engineering and Computer Science, MIT, 2003.
Professional experience
 Graduate Research Assistant, MIT Computer Science and Arti cial Intelligence
Laboratory (Supervisor: Prof. David Gi ord), 2000{present.
 Instructor, Northeastern University Bioinformatics Essentials Graduate Certi -
cate Program, 2002.
 Chief Technology Ocer, IS?TV, Inc., 2000.
 Senior Vice President, L-Squared Entertainment, Inc., 1995{1999.
 President and Chief Executive Ocer, In nite Interactions, Inc., 1993{1995.
 Graduate Research Assistant, UC Berkeley Department of Computer Science
(Supervisor: Prof. Brian Barsky), 1993{1994.
 Research Associate/Laboratory Automations Specialist, Metabolex, Inc., 1992.
 Undergraduate Teaching Assistant, UC Berkeley Department of Mathematics,
1991.
 Undergraduate Research Assistant, UC Berkeley Department of Biophysics (Su-
pervisor: Prof. Hans Bremermann), 1990.
 President, Speak Easy Communications System, 1984{1989.
Academic awards and honors
 National Defense Science and Engineering Graduate (NDSEG) Fellowship, 2002{
2005.
 Sigma Xi honor society of science and engineering, Associate Member, elected
2003.
 National Science Foundation (NSF) Fellowship (awarded), 2002.
5
Page 6
hidden
 University of California Regents' Graduate Fellowship, 1992{1994.
 UC Berkeley Highest Distinction in General Scholarship in Mathematics, 1991.
 Phi Beta Kappa honor society, elected 1990.
 University of California Regents' Undergraduate Scholarship, 1989{1991.
 UC Berkeley Dean's List, every semester attended, 1989{1991.
 J&M Seitz Scholarship, 1989.
6
Page 7
hidden
Acknowledgments
\It was cold consolation to think that I, who looked upon it
with my eyes and fondled it with my ten
esh-and-bone ngers,
was no less monstrous than the book. I felt it was a nightmare
thing, an obscene thing, and that it de led and corrupted reality.
I considered re, but I feared that the burning of an in nite book
might be similarly in nite, and su ocate the planet in smoke."
|Jorge Luis Borges, The Book of Sand
In scientific papers, we boil our work down into gures, tables and equations.
At the end of most papers, there is a tiny section, usually one to two sentences,
called \acknowledgements." It's unfortunate that section is so short, because it often
contains the keys to the real story behind the science|a story about people. In
the seven years I've spent at MIT and Harvard, I've come to realize that science
ultimately springs from the incredibly rich and complex interactions we have with
colleagues, friends, family and society at large. So, in this section I will try to reveal
a little of the real story behind this thesis.
However, I must say upfront that the origins of the most mathematical portions of
this work remain more mysterious to me. On those occasions, when traces of in nity
shook loose from equations, I credit primarily twilight, the lizards, and the way that
light in the mid-summer late in the afternoon sometimes spreads across the bare walls
of old houses.
First, I must acknowledge the most direct collaborators on this work. Ziv Bar-
Joseph co-developed the GRAM algorithm with me when he was a fellow graduate
student at MIT. Ziv has been more than just a fantastic intellectual collaborator; he
has been a true friend. Robin Dowell, a post-doc in the Gi ord lab, was also a great
collaborator. She worked with me on various aspects of GeneProgram, particularly
the cross-species analysis. Robin is not only a talented researcher, but she's also a
gifted conversationalist. Over the last two years, I don't know how many hours we've
spent talking about everything from molecular evolution to car engines.
7
Page 8
hidden
Next, I must acknowledge the members of my thesis committee, who have guided
my work for almost seven years. Professor David Gi ord, my thesis supervisor, intro-
duced me to the eld of computational functional genomics and has been instrumental
in encouraging my research by providing ideas, resources, and guidance while also fos-
tering the critical collaborative relationships outside of our immediate group. I think
that part of the reason that Professor Gi ord and I have worked well together is
that we share an entrepreneurial spirit. He is always starting new projects, and in-
vestigating novel technologies, research ideas and collaborations. Moreover, he's not
afraid to step outside the boundaries of traditional academic disciplines, and he's
been particularly encouraging of my pursuit of a medical degree simultaneously with
my Ph.D.
Professor Tommi Jaakkola introduced me to many topics in machine learning and
statistics, and suggested a number of the core computational ideas in my research.
He's one of the smartest people I've ever encountered, and his deep knowledge of his
eld is truly inspirational. Also, Professor Jaakkola is one of the nicest and most
encouraging faculty members I've ever met. Despite his busy schedule, he always
made time to talk and always treated me as a collaborator. Even when research
wasn't going well, Professor Jaakkola never failed to remind me of the true purpose
of the academic enterprise|teaching, learning and developing new ideas.
Professor Richard Young has provided much inspiration for the biological parts
of this thesis. His passion for biology and unlocking the secrets of genetic regulation
is infectious. Professor Young has also been critical in fostering true collaborations
between biologists in his lab and computer scientists in my group.
I must also acknowledge a number of graduate students and post-docs in Professor
Gi ord, Jaakkola, and Young's labs. John Barnett, Tim Danford, Tong Lee, Chris
Reeder, Kenzie MacIsaac, Reina Riemann, Alex Rolfe, Alan Qi and Nicola Rinaldi
have let me bounce ideas o them, have provided critical comments on manuscripts
and presentations, and have been great all-around friends!
Finally, I must acknowledge my family. My brother Karl was my constant com-
panion throughout childhood and it was with him that I set out one September, long
ago, for Los Angeles Valley College, where I think the whole idea about getting a
Ph.D. got started. My sister Margot supposedly rst suggested the Valley College
plan, so she also shares some part in the saga. Then, of course, there are my parents.
My father Barry rst introduced me to computers, possibly as early as 1974. Some
of my rst memories are of my father, brother and me playing Wumpus together on
an old terminal. Undoubtedly, it was those experiences that sparked my interest in
computers and programming. As far as I can tell, most of what my father knows
he's taught himself, and I must credit him for passing on his drive for learning inde-
pendently, which has served me well as a researcher. My other earliest memories are
of my mother reading to me about dinosaurs, reptiles and pre-historic cultures. To
her, it is the world of people and other living things that matters most, and I think
she always knew I would end up studying biology and medicine someday. She has
supported every step of my academic odyssey from the baked bungalows of Valley
College, to the misty hills of Berkeley, to the In nite Corridor of MIT, to the stone
quad of Harvard Medical School.
8
Page 9
hidden
Finally, I must acknowledge my wife, Anne-Marie. I fell in love with this smart,
somewhat o beat, pretty, brown-haired, blue-eyed girl more than a decade ago. Since
then, she has shared my long, strange journey from Hollywood executive to compu-
tational biology researcher and medical student. Although this path has not always
been an easy one to follow, we have traveled it together, with hands joined, and only
in that way has it been possible to make it out of the labyrinth.
9
Page 11
hidden
To autumn and all that November may bring,
and to my wife, Anne-Marie, who faces that
wondrous mystery with me every day.
11
Page 17
hidden
List of Figures
1-1 Summary of the ChIP-chip experimental protocol. . . . . . . . . . . . 31
2-1 Pseudo-code for the GRAM algorithm . . . . . . . . . . . . . . . . . 41
2-2 GRAM rich media gene modules network. . . . . . . . . . . . . . . . 45
2-3 The GRAM algorithm improves the quality of DNA-binding informa-
tion by using expression data. . . . . . . . . . . . . . . . . . . . . . . 49
2-4 The GRAM algorithm assigns di erent regulators to genes with simi-
lar expression patterns that cannot be distinguished using expression
clustering methods alone. . . . . . . . . . . . . . . . . . . . . . . . . . 50
2-5 Motif enrichment for GRAM modules. . . . . . . . . . . . . . . . . . 52
2-6 GRAM algorithm modes of combinatorial regulation. . . . . . . . . . 56
2-7 Rapamycin gene modules network. . . . . . . . . . . . . . . . . . . . 59
2-8 The cell-cycle dynamic regulatory network. . . . . . . . . . . . . . . . 62
2-9 Comparison of results of the GRAM algorithm to those of Ihmels et al. 64
2-10 Comparison of the results of the GRAM algorithm to those of Pilpel
et al. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3-1 Putative origin of the Dirichlet Process. . . . . . . . . . . . . . . . . . 72
3-2 Graphical model depiction of a nite mixture model. . . . . . . . . . 73
3-3 Graphical model depiction of the in nite mixture model. . . . . . . . 75
3-4 Graphical model depiction of the Hierarchical Dirichlet Process. . . . 77
3-5 The basic MCMC sampling scheme for in nite mixture models. . . . 78
3-6 The basic MCMC sampling scheme for Hierarchical Dirichlet Processes. 80
3-7 GeneProgram algorithm steps overview. . . . . . . . . . . . . . . . . . 83
3-8 Conceptual overview of the data generation process for gene expression
in mammalian tissues. . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3-9 GeneProgram probability model overview. . . . . . . . . . . . . . . . 87
3-10 Graphical model depiction of the GeneProgram model. . . . . . . . . 89
3-11 Synthetic data experiments using GeneProgram and several other well-
known expression analysis methods. . . . . . . . . . . . . . . . . . . . 98
17
Page 18
hidden
3-12 GeneProgram discovered 19 consensus tissue groups using gene expres-
sion data from 140 human and mouse tissue samples. . . . . . . . . . 101
3-13 Correspondence plots comparing GeneProgram to biclustering algo-
rithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3-14 Automatic inference of tissue groups improves cross-validation perfor-
mance of the GeneProgram model. . . . . . . . . . . . . . . . . . . . 105
3-15 The generality score organized expression programs discovered by Gene-
Program into a comprehensive body-wide map. . . . . . . . . . . . . 110
4-1 GeneProgram++ probability model overview. . . . . . . . . . . . . . 116
4-2 Graphical model depiction of a simpli ed version of GeneProgram++
without tissue groups. . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4-3 Graphical model depiction of the full GeneProgram++ model. . . . . 120
4-4 Correspondence plots comparing GeneProgram++ to biclustering al-
gorithms in terms of gene set consistency. . . . . . . . . . . . . . . . . 127
4-5 Schematic of six pre-de ned temporal patterns used in analyzing the
infection time-series compendium. . . . . . . . . . . . . . . . . . . . . 131
4-6 Examples of genes with the six temporal expression patterns used for
data analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4-7 Summary of expression programs (EPs) 1{25 discovered by GenePro-
gram++ in the infection data. . . . . . . . . . . . . . . . . . . . . . . 135
4-8 Summary of expression programs (EPs) 26{50 discovered by GenePro-
gram++ in the infection data. . . . . . . . . . . . . . . . . . . . . . . 137
4-9 Summary of expression programs (EPs) 51{75 discovered by GenePro-
gram++ in the infection data. . . . . . . . . . . . . . . . . . . . . . . 139
4-10 Summary of expression programs (EPs) 76{104 discovered by Gene-
Program++ in the infection data. . . . . . . . . . . . . . . . . . . . . 141
4-11 The Wnt signaling pathway and overlapping genes in EPs 50 and 84. 143
4-12 The Toll-like receptor signaling pathway and overlapping genes in EP
99. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
18
Page 19
hidden
List of Tables
2.1 Previous evidence for regulator-regulator interactions predicted by GRAM
modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.2 The GRAM algorithm can improve the true-positive binding rate with-
out increasing the false-positive binding rate. . . . . . . . . . . . . . . 50
2.3 Activators identi ed by the GRAM algorithm. . . . . . . . . . . . . . 53
2.4 Repressors identi ed by the GRAM algorithm. . . . . . . . . . . . . . 54
3.1 Summary of random variables in the GeneProgram model. . . . . . . 88
3.2 Tissue population means for synthetic data. . . . . . . . . . . . . . . 95
3.3 Comparison of GeneProgram to biclustering algorithms for recovery of
biologically interpretable gene sets. . . . . . . . . . . . . . . . . . . . 103
4.1 Summary of random variables used in the simpli ed GeneProgram++
model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.2 Summary of random variables used in the full GeneProgram++ model. 121
4.3 Comparison of GeneProgram++ to biclustering algorithms for recovery
of biologically interpretable gene sets. . . . . . . . . . . . . . . . . . . 126
4.4 Summary of infection time-series data sets analyzed. . . . . . . . . . 129
4.5 Overview of NF-B proteins. . . . . . . . . . . . . . . . . . . . . . . . 130
19
Page 22
hidden
in the Celestial Emporium of Benevolent Knowledge, not all groupings are good ones.
The Greek philosopher Aristotle (384-322 BC) developed fairly sophisticated ideas
about what constitutes good categories. Chief among his contributions in this eld
is the notion that a collection of traits, rather than a single characteristic, should be
used to di erentiate members of a group. In his work, Historia Animalium, he wrote:
\The method then that we must adopt is to attempt to recognize the natu-
ral groups, following the indications a orded by the instincts of mankind,
which led them for instance to form the class of Birds and the class of
Fishes, each of which groups combines a multitude of di erentiae, and is
not de ned by a single one as in dichotomy [78]."
Of course, Aristotle didn't quite gure everything out, and subsequent genera-
tions of philosophers and scientists have expanded on his ideas. Three important
modern criteria for evaluating the inherent meaningfulness of groups are based on
evolutionary, mathematical and cognitive perspectives.
From the evolutionary perspective, a system|from an individual cell to entire
ecologies|can be divided into modules, which are relatively independent \building-
blocks" comprised of functionally related components [37]. The inherent decompos-
ability of a system into modules is best justi ed from the point of view of evolutionary
tness. Nobel Prize winner and complexity theorist Herbert Simon summarized the
argument for modularity eloquently:
\But if the e ectiveness of design of each organ depends on the design
of the organs with which it interacts, then there is no guarantee that
improvements of one organ will not worsen the performance of others : : :
Suppose, instead, that the e ectiveness of each organ depends very little
on the design of others, provided that the inputs each requires are supplied
by the others. Then, up to a scale factor, the design of each organ can
be improved independently of what is happening in the others; and it is
easy to show that tness will rise much more rapidly than when there is
mutual dependence of design [37]."
From the mathematical perspective, good groupings of data provide compression
of information without loss of important details. Mathematical techniques for com-
pressing information enable many modern2 technologies including cell-phones and
digital video [47]. The basic idea behind many compression techniques is to assign
data items to groups, and then to replace individual items with statistics derived from
the groups. In compression algorithms, one must generally trade-o the competing
objectives of making the representation of the compressed data as small as possible
with retaining as much information as possible [212]. From the practical standpoint,
a compressed data set can serve as a useful summary of the original data. Further,
if there is noise in data, compression can result in a smoother, more representative
view of the information.
2Well, these are modern as of 2007!
22
Page 26
hidden
analysis of the results . . . Computational biology will perform a critical
and expanding role in this area: whereas structural genomics has been
characterized by data management, functional genomics will be charac-
terized by mining the data sets for particularly valuable information."
One of the most valuable tools in functional genomics is the DNA microarray,
which allows researchers to measure the expression levels of thousands of mRNAs
simultaneously. Although there are a variety of DNA microarray technologies in use,
all operate on the basic principle that complementary strands of nucleic acids can
\recognize" or hybridize with each other with high speci city through base-pairing.
DNA microarrays consist of a large number of di erent DNA molecules, called
\probes," which are immobilized on a solid surface in an ordered matrix. The posi-
tions of the DNA molecules on the array are predetermined, and the di erent arrayed
pieces of DNA usually correspond to portions of genes or inter-genic regions of inter-
est. For genome-wide expression analysis, messenger RNA from a group of cells is
worked up to produce a sample of nucleic acid \targets," some of which will hybridize
with the DNA probes on the microarray [187]. Various methods, usually involving
either
uorescence or radioactivity, can be used to detect the degree of hybridization
that occurs and thus quantify the expression levels of mRNAs present. The two most
popular DNA microarray technologies di er principally in the probes or arrayed ma-
terial used: either short sequences of DNA (oligonucleotides) [126] or complementary
DNA (cDNA) [56, 43].
1.1.3 Oligonucleotide microarrays
Oligonucleotide arrays use as probes short DNA sequences, typically of 10-60 nu-
cleotides, which are synthesized directly on microarray slides. The three major
technologies for production of oligonucleotide microarrays involve photolithography,
miniature mirrors, or ink-jet printing.
One of the main producers of oligonucleotide arrays is the company A ymetrix,
Inc., which uses a photolithographic process borrowed from semiconductor manu-
facturing [3]. A ymetrix's microarray manufacturing process begins with a quartz
wafer that is coated with silane molecules, forming a uniform matrix. Nucleotides
are then attached to this matrix via linker molecules. A photolithographic mask
determines what area of the wafer will receive nucleotides. The mask has small
windows, and when ultraviolet light is shined through these openings, the exposed
linkers become chemically \de-protected," allowing coupling to added nucleotides.
The added nucleotides also have chemical groups at a particular position that are re-
moved when exposed to ultraviolet light. By using a sequence of di erent masks, and
adding nucleotides in the appropriate order, a large number of oligonucleotides of cho-
sen sequences may be synthesized directly on the quartz wafer. Current A ymetrix
GeneChips c
have over 1.3 million spatially distinct \features" (oligonucleotides of
di erent sequences) on a single array. Each oligonucleotide probe is typically 25 bases
long (a 25mer). For expression analysis applications, approximately twenty probes
are carefully chosen to represent a given gene transcript. Computational and em-
26
Page 28
hidden
any source may be used. Specialized arrays have been constructed with cDNAs from
human lymphocytes and other tissues, fruit
ies, yeast, bacteria, and many other
sources [7, 220, 202, 63].
As with the procedure for expression analysis using oligonucleotide arrays de-
scribed above, the target material to be hybridized to a cDNA array consists of a
sample of mRNA prepared from a population of cells. However, in the case of cDNA
arrays, labelling is done during the reverse transcription to cDNA step. The in vitro
transcription step is skipped, and the cDNA is hybridized directly to the microarray.
These is another important di erence in target preparation that is done to com-
pensate for the somewhat inexact printing process. Because inconsistent amounts of
cDNA can be deposited at a given spot, direct comparisons across di erent arrays may
be inaccurate. Thus, a method of competitive hybridization is used in which two sep-
arate samples labelled with di erent
uorescent dyes are simultaneously hybridized
to the array.
The remaining laser excitation and scanning processes for cDNA arrays are also
similar to those used for oligonucleotide arrays. One di erence is that two lasers of
di erent wavelengths are used, so that the intensity of
uorescence of the two samples
can be measured separately. A ratio is then reported, giving the fold di erence in
expression between the two samples. Thus, with this technique expression levels are
always reported as ratios relative to one sample.
1.1.5 Comparison of DNA microarray technologies
All the microarray technologies described above have been applied to a variety of bi-
ological research problems [56, 43, 3, 154, 5]. In the past, cDNA microarrays were the
dominant technology. Knowledge of a gene's DNA sequence was not required, longer
probes were presumed to lead to less cross-hybridization, and they were cheaper to
produce. However, with the sequencing of many biologically important organisms,
the development of oligonucleotide arrays using longer probes or mismatch pairs, and
cost reductions in oligonucleotide array manufacture, these advantages have become
less important. Thus, recent experimental studies now most commonly employ com-
mercially manufactured oligonucleotide arrays.
Each microarray manufacturer claims various technological advantages. Agilent
argues that their technology is accurate, very
exible for producing custom arrays,
and the most cost e ective. A ymetrix claims to make the best quality products,
because of the superiority of their manufacturing process and more extensive expe-
rience producing microarrays. NimbleGen produces the highest density microarrays,
and argues that their \virtual mask" technology allows them to make very high qual-
ity custom arrays more cost e ectively than can A ymetrix. At this time, it is unclear
which, if any, of these technologies has signi cant practical advantages over the others.
1.1.6 Microarrays for measuring DNA-protein interactions
Although DNA microarrays were originally used for measuring the expression levels
of genes, they have been more recently applied to localizing DNA-protein interactions
28
Page 29
hidden
genome-wide. In this location analysis application, a chemical such as formaldehyde
is rst used to cross-link proteins to DNA in cells. Extracted DNA is then broken
into randomly sized pieces using sonication. An antibody to a selected protein is
then used to immunoprecipitate DNA fragments bound to the protein. The resulting
precipitated nucleic acids are puri ed, PCR ampli ed and
uorescently labeled to
provide the material for hybridization to a microarray for detection. Figure 1-1 shows
a cartoon of the experimental protocol.
This method of chromatin immunoprecipitation, followed by DNA microarray hy-
bridization (termed ChIP-chip), has emerged as a powerful tool for studying in vivo
genome-wide protein-DNA interactions including transcription factor binding [174,
125, 103, 199, 118, 90, 217, 122, 218, 82, 40, 177, 164], DNA replication and recombi-
nation [222, 77], and nucleosome occupancy and histone modi cation state [28, 153,
179, 144, 113, 27, 223]. Such information has been used to discover transcription
factor DNA binding motifs, to predict gene expression, and to construct large-scale
regulatory network models [82, 134, 121, 85, 21, 113, 27, 131].
A major di erence between microarrays used for ChIP-chip experiments and those
used for measuring gene expression levels is that the former typically need to array a
much larger portion of the genome. The coding regions of genes comprise a relatively
small part of the total DNA of eukaryotic organisms [203]. Because proteins may in-
teract with DNA anywhere throughout the genome, microarrays used for ChIP-chip
analysis should in principle cover the entire genome. Cost and technology limitations
drove initial ChIP-chip array designs. Early studies employed printed cDNA arrays,
with only a single 500-2,000 base-pair (bp) probe representing each intergenic re-
gion [118, 82, 113]. Improvements in array technology and reduced costs have allowed
for smaller probe lengths and more complete genome coverage. For instance, a recent
printed array design using Agilent technology covers the non-repeat portions of the
yeast or human genomes with oligonucleotide probes every few hundred bases, yield-
ing approximately 42,000 array features for yeast and 5 million for humans [164, 117].
Another recent array design uses A ymetrix microarrays with probes representing
every 20 to 35 base-pairs of sequence across human chromosomes 21 and 22 [40, 27].
1.1.7 Limitations of DNA microarrays
As with any technology, DNA microarrays have their limitations. The biggest issue
is that microarray data is extremely noisy. Indeed, some recent studies have found
microarray data to be only moderately reproducible at best [100, 173]. Noise is intro-
duced during the many steps of array manufacture, sample preparation, hybridization,
and detection [84]. For instance, pipette error, temperature
uctuations, and reagent
quality can introduce variation in mRNA ampli cation and the eciency of
uores-
cent tagging. The variability of all these factors|from chip-to-chip, laboratory-to-
laboratory, and even experimenter-to-experimenter|makes it challenging to compare
results or to quantify precisely the detection limits of microarrays.
High levels of noise are especially problematic for microarrays, because they mea-
sure the expression levels of thousands of genes simultaneously. Thus, it becomes
extremely likely by chance alone that at least some genes will have very high or
29
Page 30
hidden
Further sample
preparation
Hybridize
Fragment
Microarray
readout
Immuno-
precipitate
Crosslink
30
Page 35
hidden
79 human and 61 mouse tissues. Using this data set, we compare GeneProgram's
ability to recover biologically relevant gene sets to that of biclustering methods, and
produce a body-wide map of expression programs organized by their functional gen-
erality scores. Finally, in Section 3.6, we discuss the signi cance of our results and
comment on possible future research directions.
1.3.3 The GeneProgram++ algorithm
Problem overview
In many microarray gene expression experiments, we are interested in genes' behavior
relative to some baseline condition. For instance, we may be interested in the extent
of induction or repression of gene expression after cells are exposed to environmental
stresses [71], infected with microorganisms [202, 149, 150, 107, 33, 91, 160, 80], or
observed throughout development [220]. The simplest such studies may consider only
a few experiment-control pairs. However, in more complex studies, researchers may
seek to explore complex patterns of change, such as temporal dynamics.
In analyzing patterns of gene expression change, we would like to discover sets of
genes that behave coherently. In Chapter 3, we present the GeneProgram algorithm,
and show that it has a number of advantages over previous methods in the discovery of
biologically relevant sets of genes from large compendia of data. However, a limitation
of the GeneProgram algorithm is that it does not explicitly model patterns of gene
expression change.
There are at least three ways we could imagine extending the GeneProgram al-
gorithm to model patterns of gene expression change. To simplify the discussion, we
will describe examples in terms of induction or repression changes. First, we could
introduce an additional parameter for each gene in each expression program, indicat-
ing whether it was induced or repressed. In this scenario, expression programs would
consist of sets of genes in which each gene is consistently either induced or repressed
(but not both) in a subset of tissue samples. The problem with this approach is that
expression programs could be dicult to interpret and would not really match our
intuition for what constitutes an \atomic" or modular biological process, in which
we expect the expression of all relevant genes to coordinately change in the same
direction in response to some stimuli. This expectation suggests a second possibility
for extending the algorithm, in which we could introduce an additional parameter at
the level of the expression program, indicating whether the genes in the program were
induced or repressed. With this model, the direction of expression change for genes
in a program would be consistent and would be the same for all tissue samples using
the program. However, it is easy to imagine compendia of experiments in which an
expression program is induced in some tissue samples and repressed in others. This
then suggests the third possibility|the one GeneProgram++ uses|in which we in-
troduce a parameter for each tissue sample at the level of the expression program,
specifying the direction of expression change for genes in the program. Thus, with
this model, the direction of expression change is consistent for all genes in a program
for a particular sample.
35
Page 36
hidden
Chapter 4 organization overview
In Chapter 4, we present GeneProgram++, an extension of our original GeneProgram
algorithm that explicitly models general patterns of expression changes, including
not only induction or repression, but also temporal dynamics. Patterns of expression
change are modeled using the novel concept of program usage modi ers. A usage
modi er is a variable that is speci c to a tissue-expression program pair and describes
how a tissue uses the program. For instance, usage modi ers can specify the temporal
phase and direction (induction or repression) of expression. Thus, the genes used by
a tissue from a program and the manner in which they are expressed (e.g., early
induction versus late repression) are chosen probabilistically and in
uenced by the
behavior of similar tissues. Further, usage is by de nition consistent across a program
for a particular tissue, which facilitates biological interpretation.
After some introductory material, we begin by discussing the GeneProgram++
algorithm in detail in Section 4.1. We rst describe how a simpli ed version of
GeneProgram without tissue groups is extended to include program usage modi ers.
We then describe the full GeneProgram++ model and Markov Chain Monte Carlo
(MCMC) sampling methods for the model. In Section 4.2, we describe some addi-
tional technical improvements relating to posterior distribution summarization, and
benchmark the performance of the improved algorithm on expression data. In Sec-
tion 4.3, we apply GeneProgram++ to a compendium of 62 short time-series gene
expression experiments in which various human cell types have been exposed to dif-
ferent infectious agents or immune-modulating substances [107], and produce a map
of expression programs organized by functional generality scores. We evaluate the
biological relevance of the discovered expression programs using biological process
categories and pathway databases, as well as genome-wide data pro ling binding of
human transcription factors. Finally, we provide examples of discovered expression
programs involved in key pathways related to the response to infection. We conclude
the chapter with Section 4.4, in which we discuss the signi cance of our results.
36
Page 37
hidden
CHAPTER 2
Genetic RegulAtory Modules (GRAM)
\This we know: the earth does not belong to man, man belongs
to the earth. All things are connected like the blood that unites
us all. Man did not weave the web of life, he is merely a strand
in it."
|From the apocryphal Chief Seattle speech1
Understanding of regulatory interactions and molecular mechanisms governing
genetic networks is of fundamental importance to basic biology, and is also relevant
to improved diagnosis and treatment of human diseases. A variety of new high-
throughput data sources have recently become available, and these hold the promise
of revolutionizing molecular biology by providing a large-scale view of the regulation
of genes in the cell. Fundamental goals at this scale involve discovering patterns of
combinatorial regulation and how the activity of genes involved in related biological
processes is coordinated and interconnected. However, each high-throughput data
source measures only a particular aspect of cellular activity and su ers from limita-
tions in accuracy. Thus, an important goal is to integrate information from multiple
data sources, so that each type of data can compensate for the limitations of the
others. A further goal is to develop automated methods that can aid in deducing
abstractions that can conceptually reduce genetic network complexity without signif-
icant loss of explanatory power.
Initial work on constructing genome-wide regulatory networks relied exclusively
on expression data (see section 2.1 for details). However, these approaches assume
1These words are commonly thought to have been spoken by Chief Seattle, the Native American
leader, in 1854. However, they were actually written by screenwriter Ted Perry for a 1972 movie,
Home, about ecology [10]. However, because the apocryphal speech has been quoted so much, it is
possible that history has nally been altered, and Chief Seattle really did say all this.
37
Page 38
hidden
that expression levels of regulated genes depend on expression levels of regulators.
This assumption is often not biologically realistic, because the expression levels of
many regulators do not re
ect their physiologic activity due to factors such as post-
translational modi cations, protein degradation mechanisms, and cellular sequestra-
tion of regulators [143].
Large scale, genome-wide location analysis for DNA-binding regulators o ers a
second means for identifying regulatory relationships [118]. Location analysis iden-
ti es physical interactions between regulators and DNA regions, providing strong
direct evidence for genetic regulation. Although useful, binding information is also
limited, as the presence of the regulator at a promoter region indicates binding but
not function: the regulator may act positively, negatively or not at all. In addition,
as with all microarray based data sources, location analysis data contains substantial
experimental noise. Because expression and location analysis data provide comple-
mentary information, our goal was to develop an ecient computational method for
integrating these data sources. We expected that such an algorithm could provide
assignments of groups of genes to regulators that would be both more accurate and
more biologically relevant than assignment based solely on either data source alone.
In this chapter, we present a novel algorithm, GRAM (Genetic RegulAtory Mod-
ules), which fuses information from genome-wide expression and in vivo transcription
factor-DNA binding data sets to discover regulatory networks of gene modules. A
gene module is de ned as a set of genes that are both co-expressed and bound by the
same set of transcription factors. Unlike previous approaches [58, 162, 99, 26, 191]
that have relied primarily on functional information from expression data, the GRAM
algorithm explicitly links genes to the factors that regulate them by using DNA bind-
ing data to incorporate direct physical evidence of regulatory interactions.
We use the GRAM algorithm to discover a genome-wide regulatory network using
binding information for 106 transcription factors in Saccharomyces cerevisiae in rich
media conditions and over 500 expression experiments. We validate the quality of
these results by performing analyses using four independent data sources. We then
use the discovered modules to label transcription factors as activators or repressors
and identify patterns of combinatorial regulation. Further, we present a method
for using modules to build automatically genetic regulatory sub-networks for speci c
biological processes, and use this to reconstruct accurately key elements of the cell-
cycle in yeast. Finally, we analyze a new genome-wide location analysis data set for
regulators in yeast cells treated with rapamycin, and use the GRAM algorithm to
provide biological insights in this regulatory network.
2.1 Related work
Computational discovery of genetic regulatory networks has been a very popular
research area in recent years. In this section, we will focus on previous work most
relevant to our own. For additional prospectives on this topic, the reader is encouraged
to consult a recent review such as [169].
Many studies have focused on networks derived from a single data type, such as
38
Page 39
hidden
gene expression or protein interaction data. For instance, Friedman et al. [68] and
Hartemink et al. [84] used limited expression data to learn static Bayesian networks
for the regulation of gene expression in S. cerevisiae. In a more recent approach, Segal
et al. constructed a probabilistic model that uses expression data to construct a large-
scale map linking regulators to regulated genes [191]. All these studies assume that
expression levels of regulated genes depend on expression levels of regulators, which
can be biologically unrealistic as discussed in the introduction to this chapter.
Other studies have combined genome-wide binding data with other data types,
but, unlike our method, these have used a strict cuto for binding data, reducing
it to a binary relationship. For example, Hartemink et al. [85] used binding data
to constrain the structure of a learned static Bayesian network. Ideker et al. [96]
combined protein interaction and binding data to construct a network structure, and
then used expression data to identify speci c subnetworks in that network.
Several methods have been developed that combine DNA sequence motifs or in-
formation from gene function databases with expression data to discover sets of pre-
sumably co-regulated genes [99, 162]. As an example, the methods of Ihmels et al.
and Pilpel et al. start with an initial set of genes that are selected using a certain
criteria (e.g., DNA binding motif or MIPS functional category) and use expression
data to re ne the initial set [99, 162]. Although these methods represent an impor-
tant rst step, our method improves upon them in several ways. First, the GRAM
algorithm exhaustively and eciently searches the combinatorial space of subsets of
transcriptional factors. This allows us to distinguish modules that were found to be
identical in previous work but that are controlled by di erent transcription factors
(see Section 2.6.1). Second, unlike these previous methods, the GRAM algorithm
comprehensively combines the two data sources, binding and expression data, by re-
visiting the binding data after re ning an initial gene set. Finally, our method focuses
not only on the genes themselves, but also on the relationships between transcription
factors and genes. This allows us to further re ne our understanding of the cell's
regulatory network, by assigning functional annotations to transcriptional factors.
We also note that all the prior approaches discussed above generate static or
steady-state networks. In contrast, we present methods to reconstruct regulatory
networks using temporal information. This is especially important in the analysis of
dynamic processes in the cell, such as replication.
2.2 The GRAM algorithm
In this section we present the GRAM algorithm for discovering gene modules. As
described above, modules are sets of genes that are both co-expressed and regulated
by the same set of transcription factors (TFs). The computational challenges we face
are:
1. High dimensionality of expression data. Our algorithm must handle continuous
data for thousands of genes measured in hundreds of experiments.
2. Large number of potential regulators. Each module is potentially regulated by
39
Page 40
hidden
any combination of over one hundred transcription factors.
3. Noisy data. Both genome-wide binding and expression data are measured ex-
perimentally using DNA microarrays, which produce notoriously noisy output.
The GRAM algorithm addresses these challenges by using ecient methods for
robust determination of nearby genes in the high-dimensional expression data space
and search over combinations of transcriptional regulators. The algorithm begins by
performing an exhaustive search over all possible subsets of regulators indicated by
the DNA-binding data with a stringent criterion for determining binding. Once a set
of genes bound by a common set of transcriptional regulators is found, the algorithm
identi es a subset of these genes with highly correlated expression, which serves as a
\seed" for a gene module. The algorithm then revisits the binding data, and seeks
to add additional genes to the module that are similarly expressed and bound by the
same set of transcriptional regulators using a relaxed binding criteria. Note that the
GRAM algorithm allows genes to belong to more than one module. In the following
subsections we present a formal description of the algorithm.
2.2.1 Formal model description
Figure 2-1 provides pseudo-code for the GRAM algorithm. The inputs consist of
matrices of expression data values and binding data p-values. That is, let E denote
the matrix of continuous expression data, where the rows represent genes and the
columns represent D experiments. We assume that expression data has been mean-
centered and normalized by the standard deviation for each gene. Let B denote the
matrix of binding p-values, where rows correspond to genes and columns correspond
to transcription factors, i.e., bit denotes the binding p-value of TF t for gene i. Below
we discuss details of each step of the algorithm.
Initialization with TF binding patterns
The rst step of the algorithm is the construction of a series of sets, F1; : : : ;FJ , of
all possible regulatory TF combinations strictly implied by the binding data, and all
subsets of these combinations. Each set Fj contains all sets of TFs of size j that
are implied by the data. To be precise, let T (i; p) denote the set of all transcription
factors that bind to a gene i with p-value less than p, i.e., the list of factors t such
that bit < p. We denote an element of Fj by Fk, i.e., Fk 2 Fj implies that jFkj = j
and Fk  T (i; p1) for some gene i and a strict binding p-value p1.
The algorithm proceeds by iterating over all elements of each Fj beginning with the
highest numbered set, i.e., with j = J . Note that there is a potentially exponential
number of combinations of TFs to be considered as regulators for gene modules.
However, because we restrict our search to combinations \con dently" implied by the
data, the number of subsets to be searched over is much smaller. This feature of the
algorithm enabled it to operate on a data set containing genome-wide binding data
for over 100 TFs, as discussed in section 2.3.
40
Page 41
hidden
GRAM(E,B,p1,p2)
// E and B are matrices of expression and binding experiments, respectively
// p1 and p2 are strict and relaxed binding thresholds, respectively
Initialization:
Construct a series of sets, F1; : : : ;FJ , of all possible regulatory TF combinations
strictly implied by the binding data, and all subsets of these combinations.
GJ all genes
// Gj is the working set of genes in iteration j
For all sizes of TF regulatory sets j = J; : : : ; 1:
For all subsets of TFs Fk 2 Fj:
G G(Fk; p1), a set of genes for a candidate module
c core expression pro le for the set G(Fk; p1)
Expand G to M(Fk), the nal module, by including genes with a relaxed binding threshold
Output M(Fk), if the module has a sucient number of genes
Gj1 Gjn
S
kM(Fk) such that jFkj = j
Figure 2-1: Pseudo-code for the GRAM algorithm for identifying gene modules from
genome-wide expression and transcription factor binding data. See the text for details.
Finding core expression pro les
The algorithm seeks to nd a core expression pro le for all genes strictly bound
by a set of TFs, i.e., a point in expression space that is signi cantly close to as
many co-bound genes as possible. Note that the core expression pro le is a robust
estimate in the sense that it is insensitive to co-bound genes with outlying expression
measurements.
Let Gj denote the set of genes being considered by the algorithm for sets of TFs
of size j, i.e., jFkj = j. Let G(Fk; p) denote the set of all genes in Gj to which all the
TFs in Fk are bound with a given p-value threshold p, i.e., all the genes l 2 Gj such
that Fk  T (l; p). Then, for every Fk 2 Fj, the genes in G(Fk; p1) serve as candidates
for a module regulated by the factors in Fk.
For each candidate set G(Fk; p1) of n genes (where n is greater than some thresh-
old), the algorithm attempts to nd a core expression pro le. That is, the algorithm
seeks a point c in expression space such that for an expression distance threshold rn
(depending on the number of genes in the set, as described below), the ball centered
at c of radius rn contains as many genes in G(Fk; p1) as possible.
We de ne the distance between two points x and y in normalized expression space
as:
d(x;y) =
q
PD
d=1(xd yd)
2
D
We then denote by C(Fk; p1; c) the set of genes in G(Fk; p1) that are all at distance
41
Page 45
hidden
Figure 2-2: Visualization of the transcriptional regulatory network discovered by the
GRAM algorithm as a graph with edges between gene modules and regulators shows that
there are many groups of connected gene modules/regulators involved in similar biological
processes. The network consists of 106 modules containing 655 distinct genes regulated by
68 transcription factors. In most cases in which a gene module is controlled by one or more
regulators, there was previous evidence suggesting that these regulators physically or func-
tionally interact (see Table 2.1 for details). The directed arrows point from transcription
factors to the gene modules that they regulate. Blue arrows indicate discovered activator
regulatory relationships (see Table 2.3 and the text for details). Gene modules are colored
according to the MIPS category to which a signi cant number of genes belong (signi cance
test using the hypergeometric distribution, p < 0:005). Modules containing many genes
with unknown function or an insigni cant number belonging to the same MIPS category
are uncolored. When the gene modules discovered by the GRAM algorithm were compared
to results generated using location data alone, the GRAM algorithm yielded an almost
three-fold increase in modules signi cantly enriched for genes in the same MIPS category.
45
Page 52
hidden
Figure 2-5: Motif enrichment: genes in modules discovered by the GRAM algorithm are
more likely to show independent evidence of co-regulation by the regulators assigned to
the module when compared to sets of genes obtained using genomic location analysis data
alone, as demonstrated by an enrichment for the presence of known DNA-binding motifs.
We identi ed 34 transcriptional regulators that bind to genes in at least one module and
have well-characterized DNA binding motifs in the TRANSFAC database [136]. For each
of these 34 transcriptional regulators, we generated a list of genes in modules bound by the
regulator and a second list of genes bound by the regulator using location analysis data alone
(stringent p-value cuto of 0.001). We then computed the percentage of genes from each list
that contained the appropriate known motif in the upstream region of DNA. In most cases,
the percentage of genes containing the correct motif was higher when modules generated
using the GRAM algorithm were used versus sets of genes generated from location analysis
data alone. See the supplementary materials for [21] for a complete list of transcription
factors.
52
Page 53
hidden
might serve as a repressor under certain conditions and as an activator under others.
Activators identi ed by our algorithm
Factor Module function Corr.
w/
mod-
ule
Comments
Ste12 Pheromone response +0.64 Activator, required for
pheromone response
Hap4 Respiration +0.60 Activator of CCAAT box
containing genes
Yap1 Detoxi cation +0.53 Activator, possibly involved
in oxidative stress response
Nrg1 Carbohydrate transport +0.50 Previously identi ed as a re-
pressor
Fkh1 Cell-cycle +0.49 Activator of cell-cycle genes
Cad1 Detoxi cation +0.47 Activator, involved in multi-
drug resistance
Aro80 Energy and metabolism +0.40 Activator, involved in regula-
tion of amino acid synthesis
Swi6 Cell-cycle +0.39 Activator of cell-cycle genes
Msn4 Stress response +0.38 Activator, involved in stress
response
Fkh2 Cell-cycle +0.37 Activator of cell-cycle genes
Hsf1 Stress response +0.36 Activator of heat shock re-
lated genes
Table 2.3: Eleven activators were identi ed by our algorithm by computing the correlation
between the expression patterns of genes in a module and its regulators. Ten of the eleven
activators were previously identi ed in the literature.
2.3.4 Discovery of modes of combinatorial regulation
A central feature of eukaryotic transcriptional regulation is combinatorial control, the
ability of di erent combinations of transcription factors to specify distinct biological
outcomes [200]. We sought to determine how combinations of regulators a ect ex-
pression by examining the correlation between expression pro les of genes from pairs
of modules that share at least one transcription factor but have no genes in common.
This analysis suggests four distinct types of coordinate regulation:
 Additive|the binding of additional factors increases the expression levels of
the bound genes. Our analysis suggested that this is the most common type
of coordinate regulation. The complete set of pairs of modules that exhibit
additive control appears on the supplementary website [20].
53
Page 56
hidden
Figure 2-6: Combinatorial regulation modes: Our analysis of pairs of modules that share at
least one transcription factor but have no genes in common revealed several distinct types
of coordinate regulation. Examples of two such types are shown in the above gures. (Top
gure) Delayed regulation: in delayed control, a factor regulates two or more modules in
a similar way, but the expression of these sets of genes are temporally separated, an e ect
brought about by the activity of di erent bound partner factors. As can be seen, the average
expression of genes in the module regulated by Swi6/Mbp1 lags that of genes in the module
regulated by Swi6/Swi4, though both belong to the G1 phase. (Bottom gure) Conditional
regulation: in conditional control, a partner factor a ects expression primarily in a subset
of conditions. As can be seen, under many conditions the average expression pro le of genes
in the module regulated by Met4 is very similar to that of genes in the module regulated by
Met4/Cbf1. However, there are some experiments in which the average expression pro les
of genes in the two modules are anti-correlated, most notably under stress conditions.
56
Page 57
hidden
decreased translation initiation, decreased ribosome biogenesis, amino acid perme-
ase regulation, and autophagy [195, 38, 86, 175]. Expression analysis indicates that
Tor signaling also controls transcriptional regulation of metabolic pathways involving
nitrogen metabolism, glycolysis and the TCA cycle [83, 195, 38].
The rapamycin response presented an ideal opportunity for applying the GRAM
algorithm to analyzing a novel transcriptional regulatory network. Previous studies
suggested a speci c set of regulators that are likely to function in the transcriptional
response to rapamycin [83, 195]. Also, several publicly available genome-wide expres-
sion datasets measuring response after rapamycin treatment are available [83, 195].
More importantly, the fact that there is little information about the transcriptional
regulatory network involved and how this transcriptional network may contribute
to the overall response to rapamycin treatment presented an opportunity for new
biological insights.
2.4.1 Selection of relevant factors
We selected 14 transcriptional regulators that seemed likely to function in the ra-
pamycin response in S. cerevisiae based on evidence from the literature. These factors
and our rationale for choosing them are:
 Gln3 and Gat1|these factors have been identi ed as general activators of ex-
pression of nitrogen responsive genes. The activated Tor proteins lead to seques-
terization of Gln3/Gat1 in the cytoplasm, and subsequent rapamycin treatment
and Tor inactivation allows Gln3/Gat to enter the nucleus [48, 195]. Gln3/Gat1
apparently activate the allantoin pathway [188].
 Gzf3|this factor and Dal80 have been identi ed as general repressors of the
nitrogen responsive genes, and there is evidence that these repressors operate
by competing with Gln3/Gat1 DNA-binding [46].
 Dal80|known to repress the allantoin pathway [188]. See also Gln3/Gat1 and
Gzf3.
 Dal81 and Dal82|these factors have been identi ed as positive regulators of
the allantoin pathway, which degrades allantoin and its metabolic products to
ammonia and carbon dioxide [188].
 Msn2 and Msn4|rapamycin apparently allows the transcription factors Msn2/4
to enter the nucleus. Msn2/4 generally respond to cellular stress, including
carbon source limitation [48, 195].
 Rtg1 and Rtg3|the activated Tor proteins apparently maintain the transcrip-
tional regulators Rtg1/Rtg3 in the cytoplasm and rapamycin treatment allows
nuclear release. Rtg1/Rtg3 generally regulate TCA cycle and glyoxylate cycle
genes [48, 195].
57
Page 59
hidden
conditions. Thirty-nine gene modules containing 317 unique genes and regulated by 13
transcription factors were discovered (see Figure 2-7 and the supplementary material
for [21]). It should be noted that many of the transcription factors pro led have
been demonstrated to be regulated by cytoplasmic sequesterization [195, 49], so we
did not expect to be able to identify activator/repressor relationships by searching
for transcription factor expression pro le correlations with regulated modules. The
GRAM algorithm added 192 gene-regulator interactions that would not have been
identi ed with a strict p-value (0.001) in the location analysis experiments.
Figure 2-7: Rapamycin gene modules network: analysis of the rapamycin transcriptional
regulatory network revealed a number of novel biological insights, including evidence that
some transcriptional regulators may control genes involved in biological pathways di erent
from those generally associated with these regulators. Further, analysis of the network
suggested more complex regulatory interactions in which there is communication among
modules. Such complicated network topologies may be important for facilitating rapid and
exible responses to changing environmental conditions. See the text for further details.
Thirty-nine modules containing 317 unique genes and regulated by 13 transcription factors
were discovered. Red arrows between transcriptional regulators indicate that the source
transcription factor binds at least one module containing the target transcription factor.
Modules are colored according to the MIPS category to which a signi cant number of genes
belong (signi cance test using the hypergeometric distribution, p-value < 0:05).
As in the case for the rich media gene modules network, many features of the
rapamycin regulatory network discovered by the GRAM algorithm were consistent
59
Page 62
hidden
Figure 2-8: Cell-cycle dynamic regulatory network: in order to investigate the yeast cell-
cycle, we applied our sub-network discovery algorithm in combination with a continuous
temporal alignment algorithm to uncover not only modules and their regulating transcrip-
tion factors, but also the temporal relationships among these modules. The automatically
recovered network is similar to the one described in Simon et al. [199], which required con-
siderable prior biological knowledge to construct. Modules are shown as ovals containing
the names of regulating factors. Blue lines indicate that a transcription factor regulates
a module (the arcs indicate the temporal extent of the factor's regulatory activity). Red
dashed lines indicate that a transcription factor is itself contained in a module.
.
62
Page 63
hidden
Our algorithm correctly assigned all the regulators to stages of the cell-cycle in
which they have been described to function in previous studies (see Simon et al. [199]
and references within). Signi cantly, this accurate reconstruction of the regulatory
architecture was automatic and required no prior knowledge of the regulators that
control transcription during the cell-cycle. Many of the discovered modules were
regulated by sets of transcription factors that are known to be associated or in com-
plexes, such as Swi4/Swi6, Swi6/Mbp1, Swi5/Ace2, and Fkh1/Fkh2/Ndd1/Mcm1.
Interestingly, the recovered network suggests that combinatorial factor interactions
may provide control that allows for sub-dividing cell-cycle phases into di erent bio-
logical functions. For instance, a module regulated by Mbp1/Swi4/Swi6 contained
genes involved with budding and could be localized at the M/G1 boundary. Another
module regulated by Mbp1/Swi6 could be localized at almost the same time, but
contained a number of genes involved with DNA recombination and repair. A mod-
ule in the middle of the G1 phase was regulated by Swi4/Swi6 and contained genes
involved in cell-wall synthesis, and one module at the G1/S boundary was regulated
by Swi4/Fkh2/Ndd1 and contained many genes involved with histone synthesis.
2.6 Discussion
2.6.1 Comparison to previous methods
Several other methods have also been used to discover gene modules. Two methods
that also analyzed regulatory networks in yeast produce modules that are directly
comparable to those discovered by the GRAM algorithm [162, 99]. These two meth-
ods, described in Pilpel et al. [162] and Ihmels et al. [99] start with an initial set of
genes that are selected using a certain criteria (DNA binding motif in the promoter or
MIPS functional category) and use expression data to re ne the initial set. Figures 2-9
and 2-10 present a comparison between modules discovered by the GRAM algorithm
and those generated by the methods of Pilpel et al. [162] and Ihmels et al. [99]. As
can be seen in these gures, our results represent a re nement of the modules from
the other studies. In particular, our method identi es not only the genes that par-
ticipate in a certain module, but also provides evidence as to the factors that are
used to activate these genes. Our dynamic sub-network discovery procedure provides
further re nement by distinguishing between genes that are regulated di erently and
participate in di erent cell-cycle phases.
2.6.2 Limitations of the GRAM algorithm
Although we clearly demonstrated that the GRAM algorithm produces biologically
meaningful results and improves on previous methods, several limitations should be
pointed out. To begin, the exhaustive search used by the algorithm is ecient for the
data set we analyzed, but is infeasible for extremely large data sets of thousands of
transcription factors. As more binding data is collected for other organisms such as
humans, data sets may approach such sizes.
63
Page 66
hidden
Another issue with our approach relates to the method for detecting activators and
repressors. The algorithm labeled only a relatively small number of transcription fac-
tors as activators or repressors. This may be due to several issues. First, the method
used cannot detect the many factors that are post-transcriptionally activated and
thus have expression levels that are not expected to
uctuate signi cantly. Second,
we used the entire set of expression experiments to determine the activator/repressor
relationships. Although using more data can produce more statistically signi cant
results, it may be that only under certain conditions a factor serves as an activator
or repressor. Finally, we required a very high correlation between modules and reg-
ulating factors. In general, by relaxing threshold parameters, the algorithm can be
used in an exploratory mode to discover more relationships but with less con dence.
2.6.3 Conclusion
In this chapter we presented GRAM, a novel algorithm for discovering modules of
genes that are both similarly expressed and regulated by the same set of transcription
factors. GRAM integrates expression and binding data to help compensate for tech-
nical limitations in either data source alone. We applied GRAM to several biological
data sets in order to determine how genes are regulated in cells, and how various
systems in the cell respond to external stimuli. In the future, genomic binding data
obtained under a variety of conditions is likely to become available and should be of
great value in further discovery of genetic regulatory networks. Overall, the GRAM
algorithm facilitates a genome-wide approach to analysis of transcriptional regula-
tory networks that can suggest speci c novel regulatory models, which can then be
validated in more directed experimental studies.
66
Page 67
hidden
CHAPTER 3
GeneProgram
\The fear of in nity is a form of myopia that destroys the pos-
sibility of seeing the actual in nite, even though it in its highest
form has created and sustains us, and in its secondary trans -
nite forms occurs all around us and even inhabits our minds."
|Georg Cantor
An important research problem in computational biology is the identi cation of
expression programs, sets of co-activated genes orchestrating physiological processes,
and the characterization of the functional breadth of these programs. The use of
mammalian expression data compendia for discovery of such programs presents sev-
eral challenges, including: 1) cellular inhomogeneity within samples, 2) genetic and
environmental variation across samples, and 3) uncertainty in the numbers of pro-
grams and sample populations.
In this chapter, we present GeneProgram, a new unsupervised computational
framework that uses expression data to simultaneously organize genes into overlap-
ping programs and tissues into groups to produce maps of inter-species expression
programs, which are sorted by generality scores that exploit the automatically learned
groupings. Our method addresses each of the above challenges by using a probabilis-
tic model that: 1) allocates mRNA to di erent expression programs that may be
shared across tissues, 2) is hierarchical, treating each tissue as a sample from a pop-
ulation of related tissues, and 3) uses Dirichlet Processes, a non-parametric Bayesian
method that provides prior distributions over numbers of sets while penalizing model
complexity. Using synthetic and real gene expression data, we show that GenePro-
gram outperforms several popular expression analysis methods in recovering coherent
and biologically interpretable gene sets. From a large compendium of mouse and
human expression data, GeneProgram discovers 19 tissue groups and 100 expression
programs active in mammalian tissues.
67
Page 69
hidden
from simpler organisms. First, tissue samples usually represent collections of diverse
cell-types mixed together in di erent proportions. Even if a sample consists of a
relatively homogenous cell population, the cells can still behave asynchronously, due
to factors such as microenvironments within the tissue that receive di erent degrees
of perfusion. Second, each tissue sample is often from a di erent individual, so that
the compendium represents a patchwork of samples from di erent genetic and envi-
ronmental backgrounds. Finally, the number of expression programs and distinct cell
populations present in a compendium is e ectively unknown a priori.
We present a novel methodology, GeneProgram, designed for analyzing large com-
pendia of mammalian expression data, which simultaneously compresses sets of genes
into expression programs and sets of tissues into groups. Speci c features of our algo-
rithm address each of the above issues relating to analysis of compendia of mammalian
gene expression data. First, our method handles tissue inhomogeneity by allocating
the total mRNA recovered from each tissue to di erent gene expression programs,
which may be shared across tissues. The number of expression programs used by a
tissue therefore relates to its functional homogeneity. We address the second issue, of
tissue samples coming from di erent individuals, by explicitly modeling each tissue
as a sample from a population of related tissues. That is, related tissues are assumed
to use similar expression programs and to similar extents, but the precise number of
genes and the identity of genes used from each program may vary in each sample.
Additionally, populations of related tissues are discovered automatically, and provide
a natural means for characterizing the generality of expression programs. Finally,
uncertainty in the numbers of tissue groups and expression programs is handled by
using a non-parametric Bayesian technique, Dirichlet Processes, which provides prior
distributions over numbers of sets.
To understand the novel contributions of the GeneProgram algorithm, it is useful
to view our framework in the context of a lineage of unsupervised learning algorithms
that have previously been applied to gene expression data. These algorithms are di-
verse, and can be classi ed according to various features, such as whether they use
matrix factorization methods [8], heuristic scoring functions [42], generative proba-
bilistic models [197], statistical tests [190, 209], or some combinations of these meth-
ods [23, 55]. The simplest methods, such as K-means clustering, assume that all genes
in a cluster are co-expressed across all tissues, and that there is no overlap among
clusters. Next in this lineage are biclustering algorithms [42, 209, 39, 133, 210], which
assume that all genes in a bicluster are co-expressed across a subset rather than
across all tissues. In many such algorithms, genes can naturally belong to multiple
biclusters.
GeneProgram is based on two newer unsupervised learning frameworks, the topic
model [64, 79] and the Hierarchical Dirichlet Process mixture model [211]. The topic
model formalism allows GeneProgram to further relax the assumptions of typical
biclustering methods, through a probabilistic model in which each gene in an expres-
sion program has a (potentially) di erent chance of being co-expressed in a subset
of tissues. The hierarchical structure of our model, which encodes the assumption
that groups of tissues are more likely to use similar sets of expression programs in
similar proportions, also provides advantages. Hierarchical models tend to be more
69
Page 70
hidden
robust to noise, because statistical strength is \borrowed" from items in the same
group for estimating the parameters of clusters. Additionally, hierarchical models
can often be interpreted more easily|in the context of the present application, the
inferred expression programs will tend to be used by biologically coherent sets of tis-
sues. Finally, through the Dirichlet Process mixture model formalism, GeneProgram
automatically infers the numbers of gene expression programs and tissue groups. Be-
cause this approach is fully Bayesian, the numbers of mixture components can be
e ectively integrated over during inference, and the complexity of the model is auto-
matically penalized. This is in contrast to previous methods that either require the
user to specify the number of clusters directly or produce as many clusters as are
deemed signi cant with respect to a heuristic or statistical score without providing a
global complexity penalty. We note that Medvedovic et al. have also applied Dirichlet
Process mixture models to gene expression analysis, but not in the context of topic
models, Hierarchical Dirichlet Processes, or mammalian data [139].
As with previous methods [9, 25, 204, 225], we leverage the power of cross-species
information to discover biologically relevant sets of co-expressed genes. However,
these previous analyses generally required genes to be co-expressed across large sets
of experiments [25, 204, 225, 123]. In contrast, GeneProgram uses expression data
more
exibly, and is thus able to produce a re ned picture of gene expression across
species: expression programs may be used by only a subset of tissues, and may be
unique to one species or shared across multiple species; tissue groups are similarly
exible. This probabilistic view of expression programs captures the intuition that
the general structure of many programs is evolutionarily conserved, but some genes
may be interchanged or lost.
3.2 Dirichlet Processes
The task of assigning data to clusters is a classic problem in machine learning and
statistics. A common approach to this problem is to construct a model in which data
is generated from a mixture of probability distributions.
In nite mixture models, data is assumed to arise from a mixture with a pre-
determined number of components [138]. The diculty with such models is that the
appropriate number of mixture components is not known a priori for many modeling
applications. This issue is generally addressed by constructing a series of models
with di erent numbers of components, and evaluating each model using some quality
score [138].
An alternate, fully Bayesian approach is to build an in nite mixture model, in
which the number of mixture components is potentially unlimited, and is itself a
random variable that is part of the overall model. Obviously, only a nite number
of mixture components can have data assigned to them. However, we still imagine
the data as arising from an in nite number of components; as more data is collected,
more components may be used to model the data more accurately. Thus, the in nite
mixture model is a nonparametric model, in the sense that the number of model
parameters grows with the amount of data. The challenge with such a model is how
70
Page 71
hidden
to place an appropriate prior on the in nite number of mixture component parameters
and weights.
The Dirichlet Process (DP), a type of stochastic process rst introduced in the
1960's [67] and originally of mostly theoretical interest [61, 62], has recently become
an important modeling tool as a prior distribution for in nite mixture models1. In
this section, we will introduce the main concepts of DPs necessary to understand the
GeneProgram model. In this regard, we will focus on a constructive de nition of DPs
in the context of priors for in nite mixture models. This development, which avoids
measure theory, closely parallels that presented by [151] and [171].
A recent extension to the standard DP model is the Hierarchical Dirichlet Process
(HDP), in which dependencies are speci ed among a set of DPs by arranging them in
a tree structure [211]. HDPs are useful as priors for hierarchical mixture models, in
which data is arranged into populations that preferentially share the usage of mixture
components. In this section, we will discuss the original HDP formulation by Teh et
al. in the context of in nite mixture models.
The use of DPs for real-world applications is predicated on practical inference
methods. A great advance in this regard has been the development of ecient Markov
Chain Monte Carlo (MCMC) methods for approximate inference for in nite mixture
models using DP priors [193, 152, 171]. Although other approximate inference meth-
ods have been developed [141, 29, 114], MCMC remains the most widely used and
versatile method. In particular, ecient MCMC schemes have been developed for
HDP models [211], and can be readily extended for the GeneProgram model. Thus,
our discussion of DP inference in this section will be restricted to MCMC methods.
The remainder of this section is organized as follows. First, we describe how
Dirichlet Processes arise as priors in terms of the in nite limit of mixture models.
Next, we describe the extension of DPs to HDPs. Finally, we describe basic MCMC
sampling schemes for DPs and HDPs.
3.2.1 Probability models
Bayesian nite mixture models
We begin by de ning a typical Bayesian nite mixture model, which we will sub-
sequently extend to the in nite case. Figure 3-2 depicts the model using standard
graphical model notation with plates. The model consists of J mixture components,
where each component j has associated with it a mixture weight denoted j and
a parameter vector denoted j. Assume we have N data points denoted xi, where
1  i  N . Each data point is assigned to a mixture component via an indica-
tor variable zi, i.e., the probability that data point i is assigned to component j is
p(zi = j j ) = j or zi j  Multinomial( j ). The conditional likelihood for each
1It is widely believed that the Dirichlet Process was named after the mathematician, Johann
Peter Gustav Lejeune Dirichlet (1805{1859). However, there is some evidence that its name may be
related to certain dubious and fantastical tax collection practices of the French government during its
colonization of the Ivory Coast (see Figure 3-1). The sole source for this reference is a possibly forged,
old, \Ripley's Believe It or Not!" newspaper cartoon, so the veracity of this claim is questionable.
71
Page 72
hidden
Figure 3-1: The Dirichlet Process may be related to certain dubious and fantastical tax
collection practices of the French government during its colonization of the Ivory Coast.
The \Ripley's Believe It or Not!" newspaper cartoon shown above lends credence to this
claim.
data point may then be written as:
p(xi j zi = j;) = F (xi j j)
Here, F ( j ) is a probability density function parameterized by .
To complete the model, we need to de ne prior distributions over the parame-
ters. We will assume that the component parameters are drawn i.i.d. from some
base distribution H, i.e., j  H(). We also need to specify a prior distribution for
the weight parameters. As is typical for Bayesian mixture models, we will assume a
symmetric Dirichlet prior on the mixture weights, i.e.,  j J;  Dirichlet( j =J).
One consequence of using a symmetric prior is that it is not sensitive to the order of
the component parameters. Note that the Dirichlet prior is conjugate to the multi-
nomially distributed weights, so that the posterior is also a Dirichlet distribution.
To summarize, our J-dimensional mixture model is de ned as:
 j ; J  Dirichlet( j =J)
j j H  H()
72
Page 74
hidden
second line of the derivation follows simply from Bayes' theorem. The nal line of
the derivation follows from conjugacy between the Dirichlet prior on the weights and
the multinomial distribution on the assignment variables. Thus, the density function
under the integral is that of a non-symmetrical Dirichlet distribution, allowing us to
derive the nal closed form expression.
In nite mixture models and Dirichlet Processes
In this subsection we show how the Dirichlet Process arises as a prior for in nite
mixture models.
Figure 3-3 depicts an in nite mixture model using standard graphical model no-
tation with plates. As can be seen from the gure, the model is almost structurally
identical to the nite version. The distinguishing feature is that the weight and
parameter vectors are now in nite dimensional.
The challenge with this model is then to de ne an appropriate prior for the in-
nite dimensional parameters and weights. As with any mixture model, the in nite
dimensional weights must sum to one. A probability distribution that generates such
weights is the stick-breaking distribution, denoted Stick( ), where is a scaling or
concentration parameter (discussed in more detail below). This distribution is de-
ned constructively. Intuitively, we imagine starting with a stick of unit length and
breaking it at a random point. We retain one of the pieces, and break the second
piece again at a random point. This process is repeated in nitely, producing a set of
random weights that sum to one with probability one [193]. To be more precise, the
jth weight j is constructed as:
0j j  Beta(1; )
j = 
0
j
j1Y
l=1
(1 0l)
The in nite mixture model can be constructed using the stick-breaking distribu-
tion as a prior on the mixture weights and the base distribution H as a prior on the
component parameters. This can be summarized as:
 j  Stick( )
j j H  H()
zi j   Multinomial( j )
xi j zi = j;  F ( j j)
Note that this construction produces a vector  with a countably in nite number
of dimensions, whose components all sum to one, and H is sampled independently
a countably in nite number of times to generate the mixture component parameter
values.
To establish the connection between Dirichet Processes and the model described
74
Page 75
hidden
xi N
zi
π Η
8

θj
α
Figure 3-3: A graphical model depiction of the in nite mixture model. Circles represent
variables, and arrows denote dependencies among variables. Vectors are depicted with bold
type, and observed variables are shown inside shaded circles. Rectangles represent plates,
or repeated sub-structures in the model.
above, we consider the distribution over all possible component parameter values for
the in nite mixture model. This distribution will be non-zero at a countably in nite
number of values. Formally, we denote this distribution by G and can write it as:
G( ) =
1X
j=1
j( j)
Here, is an arbitrary parameter value, and () is the standard delta-function, which
is non-zero only when its argument is zero.
Each distribution G thus constructed can be viewed as a sample from a stochastic
process, which can in fact be proven to be the Dirichlet Process (see [102] and [193]).
In general, we will characterize a Dirichlet Process by a scalar parameter , called
the concentration parameter, and a base distribution H. A sample from a Dirichlet
Process, which we denote G j ;H  DP( ;H), is thus a distribution that is non-zero
over a countably in nite number of values (with probability one). As we have seen,
each sample e ectively parameterizes an in nite dimensional mixture model.
The concentration parameter a ects the expected number of mixture compo-
nents containing data items when the DP is used as a prior for the in nite mixture
model. As shown in [12], the expected number of non-empty mixture components J
75
Page 79
hidden
We can thus combine equations 3.2, 3.3 and 3.4 to obtain the posterior distribu-
tions for the assignment variables:
p(zi = j j zi; ;; x) /
nij
N 1 +
p(xi j j) for nij > 0 (3.5)
p(zi 6= zl;8 l 6= i j zi; ;; x) /

N 1 +
Z
F (xi j )H( )d (3.6)
Thus, for each iteration, we sample the mixture component assignments for all
data points using equations 3.5 and 3.6. For the rst J components already con-
taining data items, we use equation 3.5 to compute the assignment probability. We
use equation 3.6 to compute the probability of assigning the data point to a new
mixture component. Notice that in equation 3.6, we integrate over the mixture com-
ponent parameters, as any component parameters are possible for a new component.
Sampling is most ecient when F () and H() are conjugate. However, in cases of
non-conjugacy of these distributions, Monte Carlo methods may be used [152, 171].
We also need to sample from the posterior for the concentration parameter . It
can be shown that the conditional distribution for is given by (see [151]):
p( j J;N;a ) / a

1 +J1ea

2 B( ;N)
Here, B(; ) is the standard Beta function de ned as:
B(u; v) =
(u)(v)
(u+ v)
=
Z 1
0
u1(1 )v1d
Escobar and West describe an ecient sampling scheme for [60]. They noted
that p( j J;N) can be written as a marginalization over an auxiliary variable :
p( j J;N;a ) /
Z 1
0
p( ;  j J;N;a )d
p( ;  j J;N;a ) / a

1 +J1ea

2  1(1 )N1
From the joint distribution, we can see that:
p( j ; J;N;a ) / Gamma( j a 1 + J 1; a

2 ln )
p( j ; J;N) / Beta( j ;N)
Thus, by sampling from the above two conditional distributions, we can sample from
the posterior for to update the concentration parameter during the MCMC sampling
iterations.
79
Page 80
hidden
Hierarchical Dirichlet Process models
Teh et al. described an MCMC method for HDP in nite mixture models that uses
auxiliary variables to make sampling from the conditional distributions ecient [211].
Figure 3-6 provides an overview of the sampling scheme.
Repeat for all data subsets t = 1 : : : T and data items i = 1 : : : N :
Sample zti, the assignment of data item i from subset t to a mixture component,
from its posterior, i.e., p(zti j zi; 0;; x; 1)
If the data item has been assigned to a new component, sample a new
top-level mixture weight 0 from the stick-breaking distribution and
a new mixture component parameter  from its posterior
Repeat for all non-empty mixture components j = 1 : : : J :
Sample the component parameter j from its posterior
Sample the top-level mixture weights 0 from their posterior
Sample the concentration parameters 0 and 1 from their posteriors
Figure 3-6: One iteration of the basic MCMC sampling scheme for the Hierarchical Dirichlet
Process mixture model with two levels.
The rst task is to sample the data point assignment variables, z. The method
for this is similar to that used for ordinary Dirichlet Process mixture models. We
begin by considering a nite mixture model of dimension J and integrating out the
individual mixture weights t to obtain the conditional probability of z given 0:
p(z j 0; 1) =
TY
t=1
( 1)
( 1 +Nt)
JY
j=1
( 1 0j + ntj)
( 1 0j )
(3.7)
Here, Nt denotes the number of data items in subset t, and ntj represents the number
of data items from subset t assigned to mixture component j. It can be shown that in
the limit of an in nite mixture model, the conditional probability has a particularly
simple form:
p(zti = j j zi; 0; 1) / 1 0j + n
i
tj
By combining this with the conditional likelihood for data points, F ( j ), we obtain
the posterior distribution for assigning data points to mixture components:
p(zti = j j zi; 0;; x; 1) / ( 1 0j + n
i
tj )F (xti j j) (3.8)
This equation holds if j is a non-empty component. The posterior distribution for
assigning a data point to a new component is given by:
p(zti 6= ztl 8 t; l 6= i j zi; 0;; x; 1) / ( 1 0)
Z
F (xti j )H( )d (3.9)
80
Page 81
hidden
Here, we de ne 0 = 1
PJ
l=1
0
l , where there are J components with data points
assigned to them. As with ordinary DPs, Monte Carlo methods may be used if F ( j )
and H() are non-conjugate distributions.
So, to sample the data point assignments we use equations 3.8 and 3.9. If a data
point is assigned to a new component, we must also generate a new weight 0J+1 using
the stick-breaking distribution, i.e., we sample b  Beta(1; 0) and set 0J+1 b
0
 .
To sample from the model posterior, we also must sample the top-level weights 0.
The method for this relies on a \trick" using auxiliary variables. For the derivation,
we need to use a general property of ratios of Gamma functions given by:
(n+ a)
(a)
=
nX
m=0
s(n;m)am (3.10)
Here, n and a are natural numbers. In equation 3.10, the ratio of Gamma functions
has been expanded into a polynomial with a coecient s(n;m) for each term. These
coecients are called unsigned Stirling numbers of the rst kind, which count the
permutations of n objects having m permutation cycles (see [2]). By de nition,
s(0; 0) = 1, s(n; 0) = 0, s(n; n) = 1 and s(n;m) = 0 for m > n. Additional coecients
are then computed recursively using the equation s(n+1;m) = s(n;m1)+ns(n;m).
Note that the 0 weights in the conditional probability p(z j 0) in equation 3.7
occur as arguments of ratios of Gamma functions. These ratios can be expanded to
yield polynomials in the 0 weights:
( 1 0j + ntj)
( 1 0j )
=
ntjX
mtj=0
s(ntj;mtj)( 1
0
j )
mtj (3.11)
An ecient sampling method can be derived by introducing m as auxiliary variables.
The conditional distributions for sampling m and 0 can be shown to be:
p(mtj = m j z;mtj; 0) / s(ntj;m)( 1 0j )
m (3.12)
p( 0 j z;m) / ( 0)
01
JY
j=1

P
tmtj1
j / Dirichlet(
X
t
mt1; : : : ;
X
t
mtJ ; 0) (3.13)
Finally, we need to sample the concentration parameters 0 and 1 for the HDP.
As with the regular DP model, we will assume Gamma priors on the concentration
parameters.
For 0, it can be shown that:
p(J = J j 0;m) / s(M;J) J0
( 0)
( 0 +M)
Here, M =
P
t
P
jmtj and J is the number of non-empty mixture components. Com-
bining the above equation with the prior for 0 yields the conditional probability for
0, which can be sampled using the same method as described for sampling concen-
81
Page 82
hidden
tration parameters for regular DPs.
Sampling 1 requires the introduction of two additional auxiliary variables w and
b. The following update equations can then be derived:
p(wt j 1) / w
1
t (1 wt)
Nt1
p(bt j 1) /

Nt
1
bt
p( 1 j w;b;a 1) / Gamma(a
1
1 +
TX
t=1
(Mt bt); a
1
2
TX
t=1
logwt)
Here, a 11 and a
1
2 are the hyperparameters for the Gamma prior on 1 and Mt =PJ
j=1mtj.
3.3 The GeneProgram algorithm and probability
model
3.3.1 Algorithm overview
The GeneProgram algorithm consists of data integration (pre-processing), model in-
ference, and distribution summary steps as depicted in Figure 3-7. Data integration
makes data from multiple species comparable and discretizes it in preparation for in-
put to the model. The rst data integration step combines replicates and normalizes
microarray data to make measurements of gene expression comparable across tissues.
The second data integration step uses a pre-de ned homology map to convert species
speci c gene identi ers into meta-gene identi ers. Meta-genes are virtual genes that
correspond one-to-one with genes in each species. Some genes do not have counter-
parts in other species, and these are ltered out. In the nal data integration step,
continuous expression measurements are discretized. The model inference step seeks
to discover underlying expression programs and tissue groups in the data probabilis-
tically. To accomplish this, we use Markov Chain Monte Carlo (MCMC) sampling
to estimate the model posterior probability distribution. Each posterior sample de-
scribes a con guration of expression programs and tissue groups for the entire data
set; more probable con gurations tend to occur in more samples. The nal step
of the algorithm is model summarization, which produces consensus descriptions of
expression programs and tissue groups from the posterior samples.
3.3.2 The probability model
Intuitive overview
We can understand the GeneProgram probability model intuitively as a series of
\recipes" for constructing the gene expression of tissues. Figure 3-8 presents a cartoon
of this process, in which we imagine that we are generating the expression data for the
82
Page 88
hidden
Var. Dim. Description Cond. distribution or prior
xti 1 Expression event i in tissue t; Multinomial, given the assignment to
corresponds directly to observed expression program j.
data.
zti 1 Assignment variable of an Generated from mixing probabilities over
expression event to an expression expression programs for the tissue, i.e.,
program, i.e., zti = j indicates that p(zti = j j t) = tj .
expression event i in tissue t is
assigned to expression program j.
t 1 Mixing probabilities over expression DP, given the assignment of the tissue to
programs for tissue t. group k, its parent DP mixing probabilities,
and its concentration parameter, i.e.,
t j qt = k; 1;
k  DP( 1; k):
k 1 Mixing probabilities over expression DP, given its parent mixing probabilities
programs at the level of and concentration parameters, i.e.,
tissue group k; middle k j 0; 0  DP( 0; 0).
level in the DP hierarchy.
0 1 Root level mixing probabilities in DP, generated from the stick-breaking
the DP heterarchy. distribution given its concentration
parameter, i.e., 0 j 0  Stick( 0).
j G Parameters for expression, Dirichlet distribution prior
program j, describing a multinomial (parameterized by ).
distribution over G meta-genes.
 1 Pseudo-count parameter for a Gamma distribution prior with a two-
symmetric Dirichlet distribution. dimensional hyperparameter vector a.
qt 1 Assignment variable of tissues to Generated from mixing probabilities over
groups, i.e., qt = k indicates that tissue groups, i.e, p(qt = k j ) = k.
tissue t belongs to tissue group k.
 1 Mixing probabilities over the tissue DP, generated from the stick-breaking
groups. prior given its concentration parameter,
i.e,  j
 Stick(
).
1 1 Concentration parameter for t. Gamma distribution prior with two-
dimensional hyperparameter vector a 1 .
0 1 Concentration parameter for 0 Gamma distribution prior with two-
and k. dimensional hyperparameter vector a 0 .
1 Concentration parameter for . Gamma distribution prior with two-
dimensional hyperparameter vector a
.
Table 3.1: Summary of random variables in the GeneProgram model. The columns are:
variable name (vectors are in bold type), dimensions of the variable, description, and the
conditional or prior distribution on the variable.
88
Page 89
hidden
εx
ti
8
T
N
t
aα0
z
ti
π
t
λ

8
aα1aγγ
θ
j
q
t
β0
βkα1 α0
Figure 3-10: The GeneProgram model is depicted using graphical model notation with
plates. Circles represent variables, and arrows denote dependencies among variables. Vec-
tors are depicted with bold type, and observed variables are shown inside shaded circles.
Rectangles represent plates, or repeated sub-structures in the model. See the text and
Table 3.1 for details.
generated by an expression program. The variable zti assigns gene expression events
to programs, i.e., zti = j indicates that xti was generated from the jth expression
program. An expression program is a multinomial probability distribution over genes.
To be precise, let j represent a parameter vector of size G for expression program
j. Then, the probability of generating expression event xti corresponding to gene g
given that it is assigned to expression program j is p(!(xti) = g j zti = j;j) = jg.
We use a symmetric Dirichlet prior for j with parameter , and a Gamma prior for
 with hyperparameter vector a.
The mixing probabilities over expression programs are generated by the DPs in
the hierarchy. To be precise, let t denote the mixing probabilities at the leaf level
in the DP hierarchy. That is, t denotes the mixing probabilities over expression
programs for tissue t, i.e., p(zti = j j t) = tj. Let k denote the mixing prob-
abilities at the middle level in the DP hiearchy. That is, k denotes the mixing
probabilities over expression programs at the level of tissue group k. Finally, we let
0 denote the root-level mixing probabilities. In the stick-breaking construction for
HDP models, it is assumed that root level mixing probabilities are generated by the
stick-breaking distribution, i.e., 0 j 0  Stick( 0), where 0  Gamma(a 0). The
hierarchical structure of the model then implies that k is conditionally distributed
as a Dirichlet Process, i.e., k j 0; 0  DP( 0; 0), where we assume that k also
89
Page 95
hidden
gene set no. tissue pop. 1 tissue pop. 2 tissue pop. 3 tissue pop. 4
1 30 25 3 3
2 3 30 25 3
3 5 3 37 20
4 3 3 20 20
Table 3.2: Tissue population means for synthetic data. Each tissue population was asso-
ciated with a mean vector of the numbers of genes to be used from each gene set. For a
tissue from a population, the number of genes to be used from a gene set was sampled from
a Poisson distribution using the population mean.
3.4.2 GeneProgram accurately recovered coherent gene sets
from the noisy synthetic data that other algorithms
could not
Hierarchical clustering is one of the most frequently used methods for clustering mi-
croarray gene expression data. Figure 3-11 (part B) shows the results of hierarchical
clustering, using Pearson correlation as a similarity metric and the average linkage
method [135], applied to sorting both rows (genes) and columns (tissues) of the syn-
thetic data. As can be seen, hierarchical clustering did reasonably well at sorting
tissues and genes independently, although it did not separate gene sets three and four
correctly. But, this method's failure to consider genes and tissues simultaneously is
known to break up coherent \overhanging" blocks of genes, making interpretation
dicult [42, 209, 39]. This issue was demonstrated in this example by gene sets one
and two that \overhang," thus causing gene set two to be broken up horizontally
(blue rows in Figure 3-11 part A and part B).
The inability of hierarchical clustering to handle \overhanging" block structures
in data was one of the motivations for the development of biclustering algorithms
that take genes and tissues into account simultaneously [42, 209, 133]. To investigate
the behavior of biclustering algorithms, we used Samba, an algorithm that has been
shown previously to outperform other biclustering methods [209, 194]. Samba pro-
duced 23 biclusters from the synthetic data (not shown). This method tended to nd
small subsets of genes co-expressed in some tissues, but did not recover the four gene
sets as coherent biclusters. Presumably, this is because Samba does not attempt to
incorporate more global constraints on biclusters.
Singular Value Decomposition (SVD) is a matrix factorization method that can
be used to approximate a matrix using a smaller number of factors or components.
In the context of gene expression data, the method has previously been used to
decompose data into \eigengenes" and \eigenexperiments," linear combinations of
genes and experiments respectively [8, 9]. However, SVD often produces components
that are dicult to interpret [39, 36, 111, 115]. As can be seen in Figure 3-11 (part
C), components produced by SVD [135] do not clearly correspond to the distinct gene
sets or tissue populations in the synthetic data. For instance, the rst component
is to some extent a composition of gene sets one, three and four, and subsequent
95
Page 96
hidden
components then add and subtract di erent combinations of the gene sets.
The development of non-negative matrix factorization (NMF) methods was in
part driven by the aforementioned problems with SVD. NMF algorithms decompose
a matrix into the product of non-negative matrices [115]. These algorithms generally
produce more interpretable factors than does SVD, and have been successfully applied
to various problems including gene expression analysis [39, 36, 111]. Figure 3-11 (part
D) shows the application of an NMF-based algorithm to the synthetic data. We used
a publicly available implementation [36], which searches for an optimal number of
factors using a cophenetic clustering coecient metric, and in this case found three
factors to be optimal. As can be seen in the gure, NMF did fairly well at recovering
genes sets one and two, although there was some overlap between the sets. However,
gene sets three and four were indistinguishable.
Figure 3-11 (part E) demonstrates the application of a simpli ed version of Gene-
Program in which tissue groups were not modeled (all tissues were attached to the
root of the hierarchy). As can be seen, this version of the algorithm accurately re-
covered gene sets one and two. However, as with NMF, gene sets three and four
completely overlapped.
Figure 3-11 (part F) shows the application of GeneProgram with full automatic
learning of tissue groups enabled. As can be seen, the algorithm accurately recovered
all four gene sets. By leveraging hierarchical structure in the data, the algorithm
had additional information (the pattern of expression program use by related tis-
sues), which presumably allowed it to correctly recover all the synthetic gene sets|
something the other methods were not capable of.
3.5 GeneProgram discovered biologically relevant
tissue groups and expression programs in a
large compendium of human and mouse body-
wide gene expression data
Our objective was to apply GeneProgram to a large compendium of mammalian
gene expression data, both to compare our method's performance against that of
other algorithms, as well as to explore the biological relevance of discovered tissue
groups and expression programs. In this regard, we used the Novartis Gene Atlas
v2 [205], consisting of genome-wide expression measurements for 79 human and 61
mouse tissues. This dataset was chosen because it contains a large set of relatively
high-quality expression experiments, with body-wide samples representative of normal
tissues measured on similar microarray platforms. Further, the data is from two
species, potentially allowing for the discovery of higher quality cross-species gene
expression programs.
96
Page 97
hidden
A. Simulated data B. Hierarchical clustering
E. GeneProgram (flat hierarchy)
F. GeneProgram (full model)
D. Non-negative Matrix Factorization (NMF)
C. Singular Value Decomposition (SVD)
1 2 3 4
1
2
3
4
ge
ne
s
tissues
97
Page 99
hidden
in at least one expression program and 31% were shared by several expression pro-
grams. Forty-two percent of tissue groups and 33% of expression programs contained
at least one tissue from each species. It is important to realize that the number of
cross-species tissue groups and expression programs was limited by the data set: only
62 out of the 140 tissue samples could be directly paired between species and some
key tissues with distinct functions, such as the stomach and eye, were represented in
only one species.
3.5.3 GeneProgram automatically assigned tissues to biolog-
ically relevant groups
To provide a quantitative assessment of the biological relevance of sets of tissues,
we manually classi ed tissues into 10 high-level, physiologically based categories and
then calculated an enrichment score for each discovered tissue group using the hyper-
geometric distribution. See the supplemental online material for [76] for the complete
manually derived tissue categories. To correct for multiple hypothesis tests, we used
the procedure of Benjamini and Hochberg [24] with a false-discovery rate cut-o of
0.05.
Seventy-nine percent of tissue groups had signi cant enrichment scores, and in
all such cases, the score was signi cant for only a single category (see Figure 3-
12). For instance, tissue group \L," which was signi cantly enriched only for the
\hematological/immune" category, consisted exclusively of human immune cells such
as natural killer cells, and CD4+ and CD8+ T-cells. As another example, tissue group
\B," signi cantly enriched only for the \neural" category, consisted exclusively of
neural tissues from both species. We note that GeneProgram discovered these groups
in a wholly unsupervised manner, and that many of the groups clearly represent
a more re ned picture of the data than the 10 broad categories we had manually
compiled.
3.5.4 GeneProgram outperformed biclustering algorithms in
the discovery of biologically relevant gene sets
Because expression programs characterize both genes and tissues, we used both Gene
Ontology (GO) categories [11] and the 10 manually derived tissue categories to as-
sess GeneProgram's ability to recover biologically relevant gene sets and to compare
this performance to that of two biclustering algorithms, Samba [209, 194] and a
non-negative matrix factorization (NMF) implementation [36]. We chose these two
algorithms for comparison because they are popular in the gene expression analysis
community, they have previously outperformed other biclustering algorithms, and
available implementations are capable of handling large data sets.
We mapped genes to GO annotations using RefSeq identi ers from the May 2004
(hg17) and August 2005 (mm7) assemblies of human and mouse genomes [11, 219].
For calculating enrichments, we used both mouse and human GO annotations from
the biological process categories with between 5 and 200 genes. Enrichment score
99
Page 100
hidden
embryo day 8.5
blastocysts
embryo day 9.5
embryo day 10.5
embryo day 7.5
embryo day 6.5
caudate nucleus
whole brain
medulla oblongata
prefrontal cortex
occipital lobe
subthalamic nucleus
cingulate cortex
fetal brain
dorsal root ganglia
dorsal striatum
amygdala
hypothalamus
cerebellum
spinal cord
pituitary
pituitary
eye
medial olfactory epi.
hippocampus
parietal lobe
thalamus
pons
amygdala
hypothalamus
olfactory bulb
trigeminal
cerebellum
spinal cord lower
substantia nigra
spinal cord upper
preoptic
frontal cortex
cerebral cortex
temporal lobe
globus pallidus
cerebellum ped.
seminiferous tubule
testis
Leydig cell
testis, interstitial
testis, germ cell
testis
CD4+ T-cells
CD8+ T-cells
B220+ B-cells
thymus
lymph node
lymph node
tonsil
thymus
spleen
fertilized egg
oocyte
large intestine
small intestine
kidney
stomach
liver
liver
kidney
fetalliver
heart
skeletal muscle
brown fat
skeletal muscle
heart
tongue
sup. cerv. gangl.
trigeminal gangl.
AV node
ciliary ganglion
appendix
uterus corpus
ovary
skin
dorsal root gangl.
adrenal cortex
adrenal gland
olfactory bulb
umbilical cord
placenta
ovary
uterus
bladder
bone marrow
bone
bone marrow
CD14+ monocytes
CD19+ B-cells
CD33+ myeloid
BDCA4+ dentritic
CD56+ NK-cells
whole blood
CD8+ T-cells
CD4+ T-cells
CD34+ cells
promyel. leukemia (hl60)
lymphoblastic leukemia (molt4)
chronic myel. leukemia (k562)
Burkitts lymphoma (Daudi)
CD71+EarlyErythroid
colorectal adenocarcinoma
Burkitts lymphoma (Raji)
CD105+ endothelial
bronchial epithelial cells
B-lymphoblasts (721)
adrenal gland
trachea
adipose tissue
lung
trachea
thyroid
salivary gland
mammary gland
prostate
salivary gland
vomeralnasal organ
cardiac myocytes
smooth muscle
adipocyte
uterus
fetal thyroid
thyroid
prostate
fetal lung
placenta
lung
pancreas
pancreas
pancreatic islets
digits
epidermis
snout epi.
tongue
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
A. Reproductive/embryonic
B. Neural
C. Reproductive/embryonic
D. Hematologic/immune
E. Reproductive/embryonic
F. Digestion/GI
G. Digestion/GI
H. Muscle
I. N/A
J. Reproductive/embryonic
K. Hematologic/immune
L. Hematologic/immune
M. Malignancy
N. N/A
O. Respiratory
P. N/A
Q. N/A
R. Digestion/GI
S. Epidermis
Significant tissue
category enrich-
ments for groups
100

Sign up today - FREE

Mendeley saves you time finding and organizing research. Learn more

  • All your research in one place
  • Add and import papers easily
  • Access it anywhere, anytime

Start using Mendeley in seconds!

Already have an account? Sign in

Readership Statistics

2 Readers on Mendeley
by Discipline
 
by Academic Status
 
50% Post Doc
 
50% Assistant Professor
by Country
 
100% United States