The Interplay between Microscopic and Mesoscopic Structures in Complex Networks
- DOI: 10.1371/journal.pone.0021282
- PubMed: 21829597
- arXiv: 1012.4524
Abstract
Understanding a complex network's structure holds the key to understanding its function. The physics community has contributed a multitude of methods and analyses to this cross-disciplinary endeavor. Structural features exist on both the microscopic level, resulting from differences between single node properties, and the mesoscopic level resulting from properties shared by groups of nodes. Disentangling the determinants of network structure on these different scales has remained a major, and so far unsolved, challenge. Here we show how multiscale generative probabilistic exponential random graph models combined with efficient, distributive message-passing inference techniques can be used to achieve this separation of scales, leading to improved detection accuracy of latent classes as demonstrated on benchmark problems. It sheds new light on the statistical significance of motif-distributions in neural networks and improves the link-prediction accuracy as exemplified for gene-disease associations in the highly consequential Online Mendelian Inheritance in Man database.
The Interplay between Microscopic and Mesoscopic Structures in Complex Networks
Jorg Reichardt,1, 2 Roberto Alamino,3 and David Saad3
1Institute for Theoretical Physics, University of Wurzburg, Wurzburg, Germany
2Complexity Sciences Center, UC Davis, Davis, USA
3The Nonlinearity and Complexity Research Group, Aston University, Birmingham,UK
Not all nodes in a network are created equal. Dierences and similarities exist at both individual node
and group levels. Disentangling single node from group properties is crucial for network modeling and
structural inference. Based on unbiased generative probabilistic exponential random graph models and
employing distributive message passing techniques, we present an ecient algorithm that allows one to
separate the contributions of individual nodes and groups of nodes to the network structure. This leads
to improved detection accuracy of latent class structure in real world data sets compared to models that
focus on group structure alone. Furthermore, the inclusion of hitherto neglected group specic eects
in models used to assess the statistical signicance of small subgraph (motif) distributions in networks
may be sucient to explain most of the observed statistics. We show the predictive power of such
generative models in forecasting putative gene-disease associations in the Online Mendelian Inheritance
in Man (OMIM) database. The approach is suitable for both directed and undirected uni-partite as well
as for bipartite networks.
Networks are fascinating objects. Charting the interactions between system constituents, abstracted as edges and nodes, has
allowed us to marvel the interconnectedness of systems and appreciate their complexity. Whether in understanding foodwebs [1],
social communities [2], protein-interaction [3, 4], metabolism [5], neural networks [6] or communication [7] the network-metaphor
has been highly successful in advancing our understanding of complex systems. While being esthetic charts of complex systems,
networks only reveal insights through rigorous statistical analysis and modeling. The abstraction of complex systems as networks, i.e.
graphs connecting nodes through links, provides a tremendous simplication. Not all network constituents are created equal; network
structure strongly depends on node properties and functions which may dier radically among individual nodes, or may be shared by a
whole group of nodes. However, in many circumstances, node properties or functions may not be accurately or not completely known
in contrast to the interactions between nodes, which often can be mapped accurately and eciently. A primary goal of network
research is to deduce unobserved, or latent, node properties through structural analysis.
Methodologically, structural analysis can proceed using either purely descriptive statistics or a generative model coupled with
inference techniques; both of which are complementary. However, a descriptive approach can only make assertions about a single
network, while generative models gives rise to a whole ensemble of networks that are statistically equivalent to the observed dataset.
Such ensembles are vital to the study of dynamical processes on networks but above all, they allow us to potentially dierentiate
between more and less important structural features. We will hence use generative models in our approach.
Thematically, research on network structural features has two foci. One is the study of microscopic properties, such as very small
subgraphs of a given size known as motifs [8] and their respective distributions in networks, or the properties of single nodes, like node
degree, centrality, betweenness or page-rank, which are correlated with latent node characteristics such as expansiveness, esteem,
functional importance or authority. The second investigates mesoscopic structural features indicative of properties or functions
shared by groups of nodes. Diering link densities between entire classes of nodes, assortative or dissortative mixing [9], community
structure [10, 11] or more generally, block structures [12, 13], are all examples of the latter. Interestingly, the two themes run
in parallel. When modeling degree distributions or analyzing motif distributions, group eects are rarely taken into account, while
individual node properties are generally neglected in inferring latent node classes from network structure via community or block
structure detection algorithms.
With this paper, we intend to close this gap. We present a principled probabilistic approach to the inference of latent node classes
based on a generative model that includes node specic features; these provide a more realistic model and enable one to quantify
probabilistic measures and condence levels of the observed structural details. To estimate model parameters, we employ distributive
message-passing techniques, with computational complexity scaling linearly with the problem size. Using real world data, social and
biological networks, we demonstrate the signicance of node specic parameters to the quality of latent class inference. Through
an example from neuroscience, we show that including group specic eects in the random null model used to assess the statistical
signicance of motif counts may provide a group based explanation for many motifs. Finally, we demonstrate how such generative
probabilistic models can be used for the prediction of putative new gene-disease associations.
I. METHODS
A. Exponential random graphs
When inferring structural features in networks with probabilistic generative models, we have to establish what constitutes a good
model. Within the innite set of possible models, exponential random graph models (ERGMs) [14, 15] stand out as combining desired
ar
X
iv
:1
01
2.
45
24
v1
[
co
nd
-m
at.
sta
t-m
ec
h]
2
1 D
ec
20
10
represented as squares, variables of the generative model as circles. Connections indicate which variables enter the calculation
of which factor. A) For a bipartite actor-event networks represented by an N ×M adjacency matrix Ai, class label i and
activity i of actor i enter in the calculation of all factors in row i. Equivalently, class label and popularity of event
enter in the calculation of all factors in column . The variables Brs denoting preference of actors in class r for events in
class s enter in every factor. Note that while each factor depends on only O(1) variables, the and variables enter in the
calculation of O(N), the and variables in O(M) and the Brs variables in O(NM) factors. B) Pictorial representation of
the messages involved in calculating Ri(i) sent from factor Ai to variable i according to equation (9). C) For directed
networks represented by non-symmetric N ×N adjacency matrices, the factors correspond to dyads Dij = (Aij ;Aji). Additional
to the interclass preference matrix, a symmetric matrix of reciprocities rs enters into the model (10). Every node i carries a
single class label i, activity i and attractiveness parameter i. The variables associated with node i enter in the calculation
of factors in both row i and column i.
properties that make them the preferred choice for the task. ERGMs are mean unbiased and make the observed data maximally likely.
They are maximum entropy models thus ensuring that one avoids adding redundant assumptions to the network's given structural
features. In other words, they parameterize the largest ensemble of networks compatible with our observations, while making the
observed network typical for the ensemble. Plus, their parameters have a very intuitive interpretation.
Within the framework of ERGMs it is easy to combine node specic with group specic eects. Consider a given, bipartite network
specied by its N ×M adjacency matrix A, representing for instance the attendance of N actors in M events. If actor i has attended
event , then Ai = 1 and otherwise Ai = 0. Equally, A could represent the choices of N consumers from a list of M products, or
the association of N diseases with M dierent genes. The possibilities are many and we will use the actor-event picture, but without
limiting the applicability of the model to this case alone.
We restrict ourselves to dyadic models, i.e. we assume the entries of the adjacency matrix Ai to be modeled by the conditionally
independent random variables Di ∈ {0;1}. A simple ERGM that captures both individual (actor- and event-specic) and group-
specic factors is given in terms of the odds ratio of actor i attending event [44]:
P(Di = 1∣⃗)
P(Di = 0∣⃗)
=
i
(1 − i)
(1 − )
Bi
(1 −Bi)
: (1)
The shorthand ⃗ denotes the set of all model parameters, i.e. in this case ⃗ ≡ (1; ::; N ; 1; ::; M ; 1; ::; N ; 1; ::; M ;B). Of these,
only a small subset is relevant for an individual dyad Di. The parameter i ∈ (0;1) denotes the global activity of actor i, higher
i meaning higher odds of attending any event. Correspondingly, ∈ (0;1) denotes the global popularity of event . Furthermore,
every actor i and every event carry a class index i and , respectively[45]. The matrix Brs ∈ (0;1)∀ r; s, models the data at a
coarser, group specic level, denoting the tendency or preference of an actor of class r to attend an event of class s. Higher entries
mean higher odds for the attendance of any actor of class r to any event of class s. The matrix Brs is also called a block model of
the data.
The rich literature on ERGMs [16] has generally assumed prior knowledge of the class labels i and in (1), or other covariates
[17{20]. Then, learning the parameters of (1) practically reduces to a simple logistic regression. However, the learning task is
considerably more complicated if the latent class labels i and are unknown and need to be inferred as hidden variables. On
the other hand, a growing body of work is dedicated to the development of ecient algorithms for learning general stochastic block
models [21{25] including the hidden assignment of nodes into classes, but without the incorporation of node specic eects, i.e. a
model specied by
P(Di = 1∣⃗)
P(Di = 0∣⃗)
=
Bi
1 −Bi
: (2)
This model is also referred to, with slight variations, as innite relational model [26] or mixed membership stochastic block model [27].
The common thread among these is the goal to model network structure entirely in terms of groups of nodes. Some attempts to
include the estimation of node specic eects have resulted in biased models [28, 29]. Within the framework of ERGMs, node and
group specic properties have been combined in so called latent space models [30, 31] where nodes are assigned a position in an
abstract space and links form as a function of their distance. Such models are well motivated for social networks, where homophily
than stochastic block models being caught in the predicament of placing groups of nodes with similar interaction partners in close
proximity while at the same time having to place them apart if these nodes are not densely connected among each other. Our
approach facilitates parameter estimates and latent class inference in model (1) which combines node specic eects with the more
general stochastic block models for group structure.
B. Model Inference
To describe the algorithm for estimating the parameters ⃗, we rst write the likelihood of the entire observed network adjacency
matrix A in terms of our model (1):
L(⃗) ≡ P(A∣⃗) =∏
i
P(Di = Ai∣⃗) (3)
For a dyadic model, the likelihood factorizes into terms that involve parameters associated with only two nodes.
Commonly used methods to estimate the parameters and hidden variables in such a model are to employ maximum likelihood
(ML) techniques in the form of an expectation-maximization type algorithm or Monte Carlo sampling [32]. We prefer a Bayesian
approach, based on Maximum A Posteriori (MAP) estimates that does not incur the computational cost of Monte Carlo sampling
while being less sensitive to initial conditions and more stable numerically than ML, especially as the parameters which maximize (3)
may lie on the the borders of the admissible interval (0;1). Furthermore, the MAP approach provides a natural Occam's razor as the
posterior distributions of parameter estimates can only reduce in variance with the provision of more data, while the ML approach
assumes point estimates or −functions for the posterior from the start. This is an important feature of the Bayesian approach as it
provides a natural limit for the number of inferred classes and condence levels in the assignments. Classes cannot be arbitrarily small
if the posterior for the inter-class link preference B is to be localized. In contrast, under an ML approach the likelihood increases
monotonically when more and hence smaller classes are used and model selection criteria, as in [20], are needed. Finally, Bayesian
techniques oer a principled way to incorporate prior domain knowledge for obtaining a more accurate approximate marginal posterior
distribution P(k ∣A), where k represents one of the parameters i; i; ; or Brs.
A message passing or belief propagation algorithm provides a principled way to calculate approximate posterior marginal distributions
[33, 34]. The starting point for this algorithm is a so-called factor- or dependency-graph, a graphical representation of the probabilistic
dependencies between the variables (model parameters) we wish to infer from the data, and the individual factors that constitute the
likelihood (3). Figure 1A shows this for the case of a bi-partite network, likelihood (3) and model (1).
The algorithm proceeds by exchanging messages, conditional probabilities, between factors and variables connected in the depen-
dency graph until convergence. Using the denitions:
Ri(k) ≡ P(Di = Ai∣k;A/Ai) and
Qi(k) ≡ P(k ∣A/Ai); (4)
one can interpret Ri(k) (R-Message) as the likelihood of a single observed matrix entry Ai given only the parameter k and
all the data matrix except for entry Ai. Equally, Qi(k) (Q-Message) is interpreted as the posterior probability distribution of
parameter k given the entire data matrix except for entry Ai. For the sake of notational economy, we have adopted to identify
functions by their argument. It is to be understood that Ri(i) is a dierent function than Ri() and not the same function
Ri(x) evaluated at the points i and as should be clear from the denitions (4).
Formally, we obtain the R-Message from Ai to k, by integrating out all parameters except k from a likelihood function
Ri(k) = ⨋ P(Di = Ai∣⃗;A/Ai)P(⃗/k ∣k;A/Ai)d⃗/k (5)
Using the independence of given data entries Ai we can readily identify P(Ai∣⃗;A/Ai) with the P(Ai∣⃗) of (1). Assuming the
joint distribution P(⃗∣A/Ai) factorizes with respect to every single k, one obtains the following closed set of equations:
Ri(k = x) = ⨋ P(Di = Ai∣⃗)∏
`≠k
Qi(`)d` and
Qi(k = x) ∝ P(k) ∏
j≠i
Rj(k = x): (6)
Although the factorization assumption may seem strong, it merely means that the Q-Messages P(k ∣A/Ai) for any two variables
k and ` with k ≠ ` are assumed independent. Given that these distributions are conditioned on the entire data matrix except for
one entry, the error we make using this assumption is considered negligible for large systems. The form of calculating Qi(k = x)
in (6) follows directly from Bayes' theorem and P(k) is the distribution we use to include prior information. These equations can
be iterated until convergence after which we nally obtain the desired approximate marginal posterior distribution, for every single
parameter, as:
P(k ∣A)∝ P(k)∏
i
Ri(k): (7)
c
t
o
r
s
a
E v e n t s
mesoscopic structure only
mesoscopic and microscopic structure combinedb
no. of participants
per event
no. of events
attended per actor
posterior probability of assigning
an actor to class 1 ( ) or class 2 ( )
an event to class 1 ( ) or class 2 ( )
no. of participants
per event
no. of events
attended per actor
an actor to class 1 ( ) or class 2 ( )
an event to class 1 ( ) or class 2 ( )
posterior probability of assigning
FIG. 2: Attendance record of 18 women (rows) to 14 informal social events (columns) due to ethnographers Davis, Gardner
and Gardner [35]. Black squares indicate attendance. a) Attendance matrix with posterior probability of class assignment for
actors P() and events P() as found by learning a standard stochastic block model (2). Classication inferred divides events
according to number of attendants and actors according to number of events participated in. The Inset shows the observed
numbers of attendances do not agree well with the expectations due to model (2). b) The same attendance matrix as in a)
but reordered due to the classication given in the original study indicated by dashed boxes [35]. Posterior probability of class
assignments inferred using model (1) is almost perfectly compatible with the expert's classication. Including node specic
popularity and activity parameters and allows to match observed numbers of attendances vs. expectations from model (1)
as shown in inset.
To illustrate these ideas, explicit update equations for the inference of the hidden class index i of node i appear below. Expressions
for other parameters will be reported elswhere. With
Xirs ≡ ∫ P(Di = Ai∣i; ; i = r; = s;Brs) ×
Qi(i)Qi()Qi(Brs)diddBrs ; (8)
we can write for the R- and Q-Messages between Ai and i:
Ri(i = r) = ∑
s
XirsQi( = s) and
Qi(i = r) ∝ P(i = r)∏
≠
Ri(i = r): (9)
The dependency graph greatly facilitates setting-up these update equations. Following the rules that R-Messages are always
sent from factors to variables and Q-Messages from variables to factors; and that in R-Messages, we sum or integrate over the
incoming Q-messages, while Q-Messages are proportional to the product of incoming R-Messages, we can write the equations based
on the dependency graph. Figure 1B shows a detail of 1A focussing on factor Ai to illustrate the messages involved in the
calculation of Ri(i) sent to variable i as in (9). Generalizing the algorithm and update equations to undirected, uni-partite
networks is straightforward. A more dicult task is the generalization to directed networks. Directed uni-partite networks are
generally represented by an N×N adjacency matrix A. They may exhibit a feature that undirected networks cannot have: non-trivial
reciprocity, i.e. the fact that the link from node i to node j is generally not independent of the link from node j to node i. Still, we
can model such networks using dyads (i; j) with i < j. But now, a pair of nodes can not only be connected or disconnected, but we
have four dierent states a dyad can be in: it can be disconnected: Dij = (0;0), with a connection running only from node i to j:
Dij = (1;0), or only from j to i: Dij = (0;1) or the connections between i and j may be reciprocated: Dij = (1;1). Hence, we
have to incorporate a parameter that can model this reciprocity. Instead of one global parameter, here we allow reciprocities to dier
depending on the latent classes:
P(Dij = (1;1)∣⃗)
P(Dij = (0;0)∣⃗)
=
ij
(1 − ij )
P(Dij = 1∣⃗)
P(Dij = 0∣⃗)
P(Dji = 1∣⃗)
P(Dji = 0∣⃗)
P(Dij = (1;0)∣⃗)
P(Dij = (0;0)∣⃗)
=
P(Dij = 1∣⃗)
P(Dij = 0∣⃗)
(10)
the parameters i and i model the tendency of node i to initiate and attract links, respectively. The model amounts to coupling
two independent models of the type (1) via the reciprocity parameters.
The likelihood of the observed data under such a model for given parameters is then given analogously to (3) by
L(⃗) ≡ P(A∣⃗) =∏
i<j
P(Dij = (Aij ;Aji)∣⃗) (11)
and factorizes into terms involving only the parameters of two dierent nodes. Figure 1C shows the dependency graph for such a
directed network. With the specication of the model and the dependency graph, the update equations are derived in the same way
as above.
II. RESULTS
Next, we will demonstrate the impact of microscopic (node specic) eects on inferred mesoscopic latent class structure by
comparing model (1) with the less expressive standard stochastic block model (2). For this, we use a dataset from sociology: the
Southern Women. Then, we will show the importance of mesoscopic group eects to the interpretation of microscopic structural
features by studying the motif distributions in the neural network of the nematode C. elegans. Last, we will determine the in
uence
of both node specic and group specic eects on the accuracy of predicting new links in networks. To this end, we compare model
(1) with both a less expressive model and a more expressive model in terms of classication accuracy and predictive power on the
network of gene-disease associations from the Online Mendelian Inheritance in Man (OMIM) database.
Southern women: This classic bipartite data set is due to ethnographers Davis, Gardner and Gardner [35]. A 18×14 matrix records
the attendance of 18 women in southern Alabama to 14 informal social events over the course of a nine month period in the 1930s.
The authors' aim was to study how an individual's social class in
uences her pattern of informal social interaction. Based on intuition
and experience in the eld, but without formal analysis, the authors suggested the existence of two latent classes of 9 women each,
with only little overlap in the attendance at events. Over the years, the data has become a standard test case of network analysis
algorithms, a meta-analysis of which can be found in [36]. We are interested in whether an inference based approach can assert the
presence of latent classes and whether the class assignments found correspond to those suggested by the eld experts.
First, we learn the standard stochastic block model with two latent classes for both actors and events, i.e. model (2), and only
estimate class membership i, and preference matrix Brs. Figure 2a shows the data, with rows and columns of the attendance
matrix reordered such that events/actors predominantly assigned to the same class are adjacent. The resulting block model is in
contrast to ndings of the original authors [35]. Events seem divided according to number of participants, i.e. popularity, while actors
seem divided according to number of events participated in, i.e. activity. The expert classication due to social class is not correctly
captured when trying to model the network through group eects alone. The reason for this failure is that under model (2), the
degree distribution for members of the same latent class is assumed to be Poissonian. The expected degree is the same for each
member of a class. The inset in gure 2a shows that this assumption cannot capture the observed degree distribution. Since the
standard stochastic block model does not model node degree independently of class preference; variance in degree distributions of
both actors and events confuses the inference of group membership.
In contrast, the inset in gure 2b shows the expected degree vs. the observed degree when including activity and popularity
parameters in the model as in (1) and allowing for two classes. Now, the observed degree distribution can be accounted for. The
introduction of activity and popularity parameters has also dramatic eects on the latent classes inferred. Figure 2b again shows
the attendance matrix, but this time, rows and columns are ordered as given in [35] and the authors' assignment to social class is
indicated by dashed boxes. The experts' classication matches almost perfectly that inferred using model (1). We can see that events
such as 8 and 9 which are attended by most actors receive high values and thus have very little discriminative power. Also, actors
who are very active and occasionally participate in events predominantly frequented by actors from the other group such as Mrs. N.
F. can still be assigned with high probability to a class, despite con
icting evidence in their participation record. Using model (1)
eectively allows one to decouple the preference eects of a group of actors for a group of events from global eects that contribute
to the variance in node connectivity.
Caenorhabditis elegans: In our second application, we explore the extent to which a dyadic model may explain the distribution of
small sub-graphs, termed motifs, in a network. Motifs have received considerable attention as possible entities of network formation,
i.e. building blocks larger than single edges. Their distribution relative to random null models has been suggested to characterize
entire classes of networks [8]. The over/under-representation of certain motifs with respect to random null models is often attributed
to possible evolutionary pressures due to a motif's potential in
uence on the performance of the network's function [38, 39].
We study the distribution of all 16 possible 3-node motifs in the 279 neuron chemical synapse network of C. elegans [37]. The null
model commonly used to assess whether a particular motif is under- or over-represented in a network is generated by randomizing the
original network conserving only microscopic structural features, i.e. the number of incoming, outgoing and reciprocated links at each
node is preserved. All other structural features and correlations are removed by the randomization. Figure 3b shows box-plots for
motif counts in 1000 such random networks and in comparison the actual count of the 16 motifs in the chemical synapse network of
C. elegans normalized to the mean count found in the set of null models. We can see that using this null model, 11 of the 16 motifs
are strongly over/under-represented and hence would qualify as possible starting points for further research on putative functional
relevance.
However, the standard null model also removes all mesoscopic structures, in particular structure due to groups of more than
three nodes. A dyadic model such as (10) lacks any parameter for three-node motifs but can generate an ensemble of null models
1.0
10
0.1
1.0
10
neural network
a b
c
link randomized model retaining only microscopic features ERGM model retaining both microscopic and mesoscopic features
FIG. 3: Motif counts in the synapse network of C. elegans compared to two random null models. a) Adjacency matrix of the
observed neural network [37]. b) Adjacency matrix of a typical realization of a link randomized version of the original data
and resulting Z-score statistics of motif counts. Counts in the original data (red x) are compared to box plots of counts in 1000
link randomized null models. Strong deviations are found at 11 of the 16 motifs. Since the link randomized null models retain
only node specic features, i.e. the numbers of incoming, outgoing and reciprocated links at each node, the cannot capture the
apparent mesoscopic structure in the original network and hence may over-estimate the statistical signicance of some motifs.
c) Adjacency matrix of a typical network generated from model (10) with both node specic as well as class specic parameters
estimated from the original network. 15 classes were used in this example. Using 1000 networks generated from model (10) as
a reference ensemble, the Z-score statistics show mild deviations only at 3 of the 16 motifs. This indicates that class structure
may oer a more parsimonious explanation for the observed motif distribution.
b
ERGM (1)
Newman/Leicht
Stochastic Block M. (2)
a
ERGM (1)
Newman/Leicht
Stochastic Block M. (2)
random expectation
FIG. 4: Classication accuracy and predictive power of network models (1), (2) and that by Newman/Leicht (NL) [29] . a)
Overlap of an expert classication of diseases in the Diseasosome-Network [40] and that inferred using models and the data of
known gene-disease associations recorded in the Online Mendelian Inheritance in Man (OMIM) database by Dec. 2005. Measure
of overlap is normalized mutual information (NMI) [41]. b) Prediction accuracy at 16 classes for conrmed associations added
to the OMIM database between Dec. 2005 and Jun. 2010. For each model, a candidate list of associations is obtained by
sorting all possible associations in descending order according to their probability under that model with parameters estimated
from the Dec. 2005 data. We plot which fraction of actually conrmed associations is found in the corresponding top fraction
of the candidate list. Entries due to new variants of a previously recorded association are listed as \repeated associations" while
genuine new associations are reported as \new associations". For example: In the top 1% of any candidate list, we expect to
nd 1% of new associations due to chance alone. We do nd 15% of all conrmed new associations if the list was due to model
(2), 20% if the list was due to the NL model and 30% if the list due to model (1). See text for details.
that matches the observed network in terms of the observed node specic degrees as well as with respect to mesoscopic structural
features. Such mesoscopic structure inevitably exists as neurons are located in dierent somatic regions and synaptic connections
between closely located neurons are more likely than between distant ones [42]. Neurons are also aggregated in dierent ganglia
making intra-ganglia connections more likely than inter-ganglia synapses. Furthermore, they serve dierent functions that in
uence
their connectivity. For example, stimuli may be processed in a sensory neuron - interneuron - motor neuron cascade. The latent
classes we infer from the data using (10) can be explained using a combination of these factors (see suppl. material). For instance,
some classes combine motor neurons only, some combine both sensory, inter- and motor-neurons which are all located in a particular
somatic region, while others can be explained post-hoc as lying in the same ganglia. More important than the interpretation of these
classes is whether a dyadic model that assumes all pairs of nodes as conditionally independent can account for the observed three
node motif-counts in the network.
Figure 3c shows the box-plots of motif counts in 1000 networks generated from (10) allowing for 15 dierent classes of neurons and
using the parameters estimated from the original network, again normalized to the mean count. The comparison with the motif-count
in the C. elegans network now shows that only 3 out of 16 motifs cannot be explained by the null model. This result is remarkable
explain a large portion of the three-node statistics in the observed data is a strong corroboration for our claim that latent classes of
nodes are important determinants of network structure. Furthermore, it oers a very parsimonious explanation of motif statistics in
this network and a more conservative estimation of their statistical signicance.
OMIM: As a last example, we study the in
uence of classication accuracy on the predictive power of probabilistic models using
a bi-partite network known as the human \Diseasosome-Network" [40]. It represents known associations between genes and diseases
recorded in the OMIM database [43]. The network was rst published in 2005 and we focus on the analysis of the largest connected
component involving 516 dierent diseases and 903 dierent genes connected by 1550 dierent associations known in 2005 [40]. The
original publication provided an expert classication of the diseases into 22 types. The type of disease is predominantly based on the
tissues and organs involved (such as bone, connective tissue, muscular, dermatological, hematological, renal, etc.) or based on the
aected system (such as skeletal, cardiovascular, immonological, metabolic or endochrinal, etc.)
To what extent does such a classication overlap with one inferred from a network of common genetic causes? We compare model
(1) with the less expressive standard stochastic block model (1) and a more expressive model due to Newman and Leicht (NL) [29].
The latter includes both individual and group eects as in (1), but instead of a single parameter for the overall activity or popularity
of a node, it features one such parameter per latent class, i.e. it models activity or popularity with respect to each class of nodes.
We compare the overlap between the expert classication of diseases and the one found algorithmically, based on the gene-disease
association network alone. We restricted ourselves to using the same number of classes for both genes and diseases. The comparison
of models (1), NL and the standard stochastic block model (2) is shown in gure 4a. As expected from the earlier discussion,
neglecting individual node eects as in model (2) reduces the overlap with an expert classication compared to model (1). But,
interestingly, the same applies when including gene-specic eects for every class of diseases and disease-specic eects for every
class of genes as in the NL model. Too many explanatory variables per individual node seem to reduce the detection quality of latent
classes.
Since 2005, the OMIM database has been steadily growing and 292 new associations between those 516 genes and 903 diseases
had been added until June 2010. Using the data from 2005 as a training set and these new additions as a test set, we compare the
predictive power of the dierent models for future associations. New entries to OMIM comprise both new variants of already known
gene-disease associations (repeated associations) as well as genuine new associations of genes with diseases that were not linked
previously. Hence, the data oers the opportunity to dierentiate predictive power with respect to these two types of entries. Using
the parameters estimated from the 2005 data set for each model (1), NL and (2), we calculate the probability for association of each
gene i with each disease as P(Di∣⃗). Then we sort these probabilities in descending order and hence obtain a candidate list for
new or repeated associations. For instance, in the case of models with 16 classes, gure 4B now shows how far we have to go down
the candidate list to nd a certain fraction of the associations that were added to the database over the course of 4 1/2 years.
Variants of already known associations seem to be added approximately randomly to the database as models (1), NL and (2)
all perform close the random expectation for these repeated associations. For the genuinely new associations, however, we observe
that all models strongly deviate from the random expectations. In particular (1) outperforms both NL and (2), with the latter two
performing similarly.
Comparing gures 4a and 4b, we conclude that the standard stochastic block model may be too simple to capture the biologically
relevant network structure as seen from a low overlap with an expert disease classication and low predictive power. With the inclusion
of node specic eects, the NL model is more
exible in capturing the observed network leading to a higher overlap with the expert
classication of diseases. It does, however, not generalize well as its predictive ability is still lower than that of the ERGM model
(1). The latter appears to provide the best compromise between
exibility and parsimony as it combines excellent ability to represent
biologically relevant structures with high classication accuracy and a parsimonious inclusion of node-specic eects, leading to the
best predictions.
III. DISCUSSION
We have presented an ecient, distributive algorithm that successfully estimates the parameters and latent group assignments of
an exponential random graph model including both node specic and group specic properties. We have shown that including node
specic eects in the estimation of latent classes leads to improved recovery of class assignments by domain experts. Additionally,
we have shown that using a simple dyadic model, a large part of the triad statistics in networks may be explained, shedding new
light on the discussion of motif distributions in complex networks. We expect our results to stimulate a discussion on the use of
appropriate null models in the analysis of sub-graph distributions and their universality for certain classes of networks. Finally, we
have explored the predictive power of the model to identify new gene-disease associations, using the OMIM database. Through these
specic examples, we have demonstrated that node specic and group specic properties should be both incorporated when inferring
and modeling structural features in complex networks.
We would like to thank M. Weigt, S. Bornholdt, D.R. White, and J.P. Crutcheld for stimulating discussions. J.R. thanks the
members of the Complexity Sciences Center at UC Davis for their hospitality.
This work was partially supported by the Volkswagen Foundation through a Fellowship Computational Sciences for J.R. and DAAD
[1] J. A. Dunne, R. J. Williams, and N. D. Martinez, Proc Natl Acad Sci USA 99, 12917 (2002).
[2] M. Girvan and M. E. J. Newman, Proc Natl Acad Sci USA 99, 7821 (2002).
[3] R. Sharan, I. Ulitsky, and R. Shamir, Molecular Systems Biology 3:88 (2007).
[4] S. Pinkert, J. Schultz, and J. Reichardt, PLoS Computational Biology 6, e1000659 (2010).
[5] R. Guimera and L. A. N. Amaral, Nature 433, 895 (2005).
[6] C. J. Honey, O. Sporns, L. Cammoun, X. Gigandet, J. P. thiran, R. Meuli, and P. Hagman, Proc Natl Acad Sci USA 106, 2035
(2009).
[7] A.-L. Barabasi and R. Albert, Science 286, 509 (1999).
[8] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chkloviskii, and U. Alon, Science 298, 824 (2002).
[9] M. E. J. Newman, Phys. Rev. E. 67, 026126 (2003).
[10] S. Fortunato, Physics Reports 486, 75 (2010).
[11] J. Reichardt and S. Bornholdt, Phys. Rev. Lett. 93, 218701 (2004).
[12] P. Doreian, V. Batagelj, and A. Ferligoj, Generalized Blockmodeling (Cambridge University Press, New York, NY, USA, 2005).
[13] J. Reichardt and D. R. White, Eur. Phys. J. B 60, 217 (2007).
[14] P. Holland and S. Leinhardt, J. Am. Stat. Assoc. 76, 33 (1981).
[15] S. Wasserman and P. Pattison, Psychometrica 61, 401 (1996).
[16] various authors, Soc. Networks 29 (2007).
[17] S. E. Fienberg and S. Wasserman, Categorical data analysis of single sociometric relations (Jossey-Bass, San Francisco, 1981),
pp. 156{192.
[18] P. W. Holland, K. B. Laskey, and S. Leinhardt, Soc. Networks 5, 109 (1983).
[19] Y. J. Wang and G. Y. Wong, J. Am. Stat. Assoc. 82, 8 (1987).
[20] G. Bianconi, P. Pin, and M. Marsili, Proc Natl Acad Sci USA 106, 11433 (2009).
[21] K. Nowicki and T. Snijders, J. Am. Stat. Assoc. 96, 1077 (2001).
[22] T. A. Snijders and K. Nowicki, Journal of Classication 14, 75 (1997).
[23] J.-J. Daudin, F. Picard, and S. Robin, Stat. Comput. 18, 173 (2008).
[24] R. Guimera and M. Sales-Pardo, Proc Natl Acad Sci USA (2009).
[25] P. J. Bickel and A. Chenb, Proc Natl Acad Sci USA 106, 21068 (2009).
[26] C. Kemp and J. B. Tenenbaum, in Proceedings of the 21st National Conference on Articial Intelligence (2006).
[27] E. M. Airoldi, D. M. Blei, S. Fienberg, and E. P. Xing, Journal of Machine Learning Research 9, 1981 (2008).
[28] M. Morup and L. K. Hansen, in NIPS Workshop on Analyzing Networks and Learning with Graphs (2009) (2009).
[29] M. Newman and E. Leicht, Proc Natl Acad Sci USA 104, 9564 (2007).
[30] P. D. Ho, A. E. Raftery, and M. S. Handcock, Journal of the American Statistical Association 97, 460 (2002).
[31] P. N. Krivitsky, M. S. Handcock, A. E. Raftery, and P. D. Ho, Soc. Networks 31, 204 (2009).
[32] M. S. Handcock, D. R. Hunter, C. T. Butts, S. M. Goodreau, and M. Morris, Journal of Statistical Software 24, 1 (2008).
[33] D. MacKay, Information Theory, inference and learning algorithms (Cambridge University Press, 2003).
[34] M. Opper and D. Saad, eds., Advanced Mean Field Methods (MIT Press, 2001).
[35] A. Davis, B. B. Gardner, and M. R. Gardner, Deep South: A social anthropological study of caste and class (University of
Chicago Press, 1941).
[36] L. C. Freeman, Dynamic Social Network Modeling and Analysis (The National Academies Press, 2003), chap. Finding Social
Groups: A Meta-Analysis of the Southern Women Data, pp. 39{77.
[37] B. L. Chen, D. H. Hall, and D. B. Chklovskii, Proc Natl Acad Sci USA 103, 4723 (2006).
[38] R. Milo, S. Itzkovitz, N. Kashtan, R. Levitt, S. Shen-Orr, I. Ayzenshtat, M. Steer, and U. Alon, Science 303, 1538 (2004).
[39] M. Reigl, U. Alona, and D. B. Chklovkii, BMC Biology 2 (2004).
[40] K.-I. Goh, M. E. Cusick, D. Valle, B. Childs, M. Vidal, and A.-L. Barabasi, Proc Natl Acad Sci USA 104, 8685 (2007).
[41] A. L. Fred and A. K. Jain, Computer Vision and Pattern Recognition, IEEE Computer Society Conference on 2, 128 (2003).
[42] Y. Artzy-Randrup, S. J. Fleishman, N. Ben-Tal, and L. Stone, Science 305, 1107 (2004).
[43] A. Hamosh, A. F. Scott, J. S. Amberger, and C. A. B. an dVictor A. McKusick, Nucleic Acids Res. 33, D514 (2004).
[44] The reader wondering where the \exponential" notion is in our specication of ERGMs may rewrite the product of parameters
in equation (1) as exp(ai + b +Ci) where ai = lni/(1 − i), b = ln/(1 − ), and Ci = lnBi/(1 −Bi)
[45] The number of classes is determined a priori here. It represents a free parameter that denes the coarseness or resolution of the
grouping sought.
Sign up today - FREE
Mendeley saves you time finding and organizing research. Learn more
- All your research in one place
- Add and import papers easily
- Access it anywhere, anytime


