Reconstructing gene regulatory ne...
Reconstructing gene regulatory networks: from random to scale-free connectivity J. Wildenhain and E.J. Crampin Abstract: The manipulation of organisms using combinations of gene knockout, RNAi and drug interaction experiments can be used to reveal regulatory interactions between genes. Several algorithms have been proposed that try to reconstruct the underlying regulatory networks from gene expression data sets arising from such experiments. Often these approaches assume that each gene has approximately the same number of interactions within the network, and the methods rely on prior knowledge, or the investigator���s best guess, of the average network connec- tivity. Recent evidence points to scale-free properties in biological networks, however, where network connectivity follows a power-law distribution. For scale-free networks, the average number of regulatory interactions per gene does not satisfactorily characterise the network. With this in mind, a new reverse engineering approach is introduced that does not require prior know- ledge of network connectivity and its performance is compared with other published algorithms using simulated gene expression data with biologically relevant network structures. Because this new approach does not make any assumptions about the distribution of network connections, it is suitable for application to scale-free networks. 1 Introduction The development of computational techniques to identify the transcription networks underlying observed gene expression patterns is an important challenge in the analysis of gene expression data. Significant progress has been made in the last few years in characterising regulatory interactions at the genomic level [1���9], including methods for identify- ing gene and protein interactions, regulatory modules occurring with high frequency in the genome and the identi- fication of transcription motifs [10���13]. In parallel, several different approaches have been pro- posed by which gene regulatory networks can be identified directly from gene expression data sets [2, 14���16]. The aim of these so-called ���reverse-engineering��� approaches is to allow investigators to analyse data sets directly without making prior assumptions about the underlying networks. Gene regulatory networks, identified purely from transcrip- tional data, reflect regulatory influences, rather than mapping direct physical interactions between molecules. Methods for gene network reconstruction have been pro- posed on the basis of statistical analyses such as Bayesian networks [17, 18], Boolean models [19] and graphical Gaussian models [15, 16]. In this report, we focus on analysing networks from gene perturbation experiments, analysed previously using singular value decomposition (SVD) [14, 20, 21] and a linear regression approach [2]. From the perspective of network reconstruction, one of the most important features of gene regulatory networks is that they are sparsely connected: the average number of connections per node in the network is small in comparison to the number of nodes [22]. Jeong et al. [23] analysed protein���protein interaction maps from Saccharomyces cer- evisiae and estimated an average degree kav per gene of 2.4. Guelzim et al. [24] showed for the yeast transcriptional network that the incoming and outgoing connections (in- and out-degrees) have a similar proportion in the genome. For this reason, progress can be made despite difficulties in achieving adequate coverage of all potential regulatory interactions in data sets (for example, there are typically fewer data points than interactions to be determined in gene expression data sets, especially for large-scale studies, including microarray approaches). Yeung et al. [14] have presented an approach based on SVD to infer the regulatory interactions, and they had success with limited amounts of data for networks of more than 200 genes. On the basis of this work, an improved approach to experimental design was developed, in which genes are selected iteratively for perturbation to reveal the architec- ture of the underlying network [20]. There is mounting evidence that many biological net- works, including metabolic [25], protein���protein inter- action [26] and transcriptional networks [27], as well as many other genomic indices [28], share the common prop- erty that the distribution of connections follows a power law, P(k) k2g. Here the degree distribution P(k) is the probability that a ���node��� (gene) of the network is connected to exactly k other nodes. Networks with this property are known as scale-free networks, as the relative probability for two different connectivities k depends only on the ratio of the connectivities, rather than on their absolute values. Although the power-law degree distribution is the distinguishing feature of scale-free networks, few # The Institution of Engineering and Technology 2006 IEE Proceedings online no. 20050092 doi:10.1049/ip-syb:20050092 Paper first received 15th November 2005 and in revised form 10th February 2006 J. Wildenhain was with the Bioinformatics Institute, School of Biological Sciences, University of Auckland, Private Bag 92019, Auckland, New Zealand and is now with Tyers Lab, Samuel Lunenfeld Research Institute, 600 University Avenue, Toronto, Ontario, Canada M5G 1X5 E.J. Crampin is with the Bioengineering Institute, Department of Engineering Science, University of Auckland, Private Bag 92019, Auckland, New Zealand E-mail: e.crampin@auckland.ac.nz IEE Proc.-Syst. Biol., Vol. 153, No. 4, July 2006 247
reverse-engineering algorithms have considered this prop- erty, and several of the most prominent algorithms implicitly assume that networks are well characterised by their average network connectivity. In this work, we consider the influence of the distribution of connections on network identification. Recently, Gardner et al. [2] described a reverse engineering algorithm based on a multiple regression approach, called NIR. Their approach requires that the average number of connections in the network be specified a priori, and that each node in the inferred network has this number of connections. This is clearly a limitation if biological networks are not well characterised by their average degree. Farina and Mogno [21] proposed the FAST algorithm that follows the SVD implementation by Yeung et al. [14], having a low compu- tational complexity compared with the NIR algorithm. Two new algorithms for reverse engineering gene regu- latory networks are proposed in this paper, called simple-to-general (S2G) and general-to-specific (G2S) [29]. Both algorithms use an iterative model selection tech- nique, described subsequently, to find the true underlying network. The two algorithms differ from one-another in that S2G iteratively builds up a network model, starting with no connections, whereas G2S starts with a fully con- nected network and sheds connections until an optimal network model is found. In either case, no assumptions need to be made about the network connectivity, distri- bution or average degree for the nodes, and it is this increased flexibility that makes these algorithms suitable for the identification of networks with more appropriate ���biological��� degree distributions, such as scale-free networks. Efforts to develop algorithms for identification of regulat- ory networks are, however, hindered by the quality and reliability of available data sets. In particular, benchmark- ing data sets on networks with known interactions are not readily available. For example, Gardner et al. [2] collected data for a sub-network of the SOS pathway in Escherichia coli, perturbing each gene in the sub-network in turn and recording the steady-state expression level. Using these data, Gardner et al. [2] applied the NIR algorithm and were able to infer many of the known interactions for what is a fairly well characterised pathway. However, the algorithm also identified a significant number of regulatory influences not previously recorded in the literature, which could correspond either to false positives or to the correct identification of previously unrecognised, yet genuine, regulatory interactions. This makes it difficult to use such data sets to assess the performance of their and other algor- ithms. In this situation, in silico modelling of gene networks can provide a platform by which to assess the performance of different algorithms [14, 20, 30���32]. In silico gene net- works generate gene expression data with well-defined properties. The parameters of such artificial networks can be varied systematically, and the data sets obtained provide an objective comparison of reverse engineering algorithms on gene networks with different topologies, and with varying degrees of biological variation and measurement noise. 2 Methods We developed a simulation environment similar to the one introduced by Mendes et al. [31]. This differential equation model, described subsequently, was used to generate steady-state gene expression data from perturbation exper- iments. To analyse the influence of network connectivity on the performance of the reverse engineering algorithms, we have used simulated data sets for random and scale-free networks of different sizes and noise levels in order to assess the performance of the existing and newly proposed algorithms. 2.1 Simulating gene expression data Initially, we specify parameters defining the network con- nectivity in order to generate a network. Gene expression data can then be simulated for this network by specifying kinetic functions representing the regulatory interactions. In our model, the number of regulatory interactions of each gene is given by the degree, k, of the corresponding node, and the different network topologies correspond to different distributions of k across the nodes in the network. We simulated three types of networks: random [33], scale-free and hierarchical network topologies [34], shown in Fig. 1. Random network. For a network of N genes, the random procedure connects each pair of nodes i and j with equal probability z, where the threshold for interaction z is equal to the average connectivity of the network, kav. Scale-free network. Following the work of Baraba ��si and Albert [35], the scale-free topology is built using the preferential attachment rule zi �� ki P j kj ��1�� where ki is the degree of node i and zi the probability threshold for new interactions for gene i. Initially all nodes i �� 1, . . . , N are assumed to have the same interaction probability. If a node acquires an interaction, the probability for new interactions increases according to (1). The algor- ithm selects two nodes randomly and tests against the prob- abilities zi(k) and zj(k). This procedure is terminated when the specified average connectivity for the network kav is reached, which, providing it is well below fully connected, generates the power-law property. Hierarchical network. In addition, we simulated hier- archical networks that are built by cloning a scale-free network and then adding connections between the clones, based on the probabilities zi(k). These hierarchical networks retain the scale-free property. For each network, the ratio of positive to negative con- nections is specified by the parameter r. If r �� 0.5, then the probabilities for an edge to be activating or inhibiting are equal. Increasing r leads to more positive regulatory interactions, decreasing it to more negative interactions. Savageau [36] developed a theoretical basis that established some properties of either mode of regulation, but to our knowledge little research has been published about this pro- portion in genomes [37]. After defining the network structure, we generate large-scale expression data sets for the given topology, based on a model of genetic networks using ordinary differ- ential equations (ODEs) [31]. Assuming that the levels of mRNA species are continuous and depend on the balance between transcription and degradation, (2) describes the general structure of the mathematical model dxi dt �� fi��x1 . . . xN �� bixi ��2�� where xi represents the abundance of the mRNA of gene i, fi(x1, . . . , xN) is the rate of transcription and bi represents the breakdown rate of species i. The regulatory effects of activating and inhibitory genes are represented in the rate function fi, which represents the influence of all regulatory IEE Proc.-Syst. Biol., Vol. 153, No. 4, July 2006 248