Inferring sparse Gaussian graphical models with latent structure
- DOI: 10.1214/08-EJS314
- arXiv: 0810.3177
Abstract
Our concern is selecting the concentration matrix's nonzero coefficients for a sparse Gaussian graphical model in a high-dimensional setting. This corresponds to estimating the graph of conditional dependencies between the variables. We describe a novel framework taking into account a latent structure on the concentration matrix. This latent structure is used to drive a penalty matrix and thus to recover a graphical model with a constrained topology. Our method uses an ell1 penalized likelihood criterion. Inference of the graph of conditional dependencies between the variates and of the hidden variables is performed simultaneously in an iterative textscem-like algorithm. The performances of our method is illustrated on synthetic as well as real data, the latter concerning breast cancer.
Author-supplied keywords
Inferring sparse Gaussian graphical models with latent structure
models with latent structure
Christophe Ambroise, Julien Chiquet and Catherine Matias
Laboratoire Statistique et Genome
523, place des Terrasses de l'Agora
91000 Evry, FRANCE
e-mail: christophe.ambroise; julien.chiquet; catherine.matias@genopole.cnrs.fr
url: http://stat.genopole.cnrs.fr
Abstract: Our concern is selecting the concentration matrix's nonzero
coecients for a sparse Gaussian graphical model in a high-dimensional
setting. This corresponds to estimating the graph of conditional depen-
dencies between the variables. We describe a novel framework taking into
account a latent structure on the concentration matrix. This latent struc-
ture is used to drive a penalty matrix and thus to recover a graphical model
with a constrained topology. Our method uses an `1 penalized likelihood
criterion. Inference of the graph of conditional dependencies between the
variates and of the hidden variables is performed simultaneously in an iter-
ative em-like algorithm. The performances of our method is illustrated on
synthetic as well as real data, the latter concerning breast cancer.
AMS 2000 subject classications: Primary 62H20, 62J07; secondary
62H30.
Keywords and phrases: Gaussian graphical model, Mixture model, `1-
penalization, Model selection, Variational inference, EM algorithm.
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 A latent structure model for network inference . . . . . . . . . . . . . 7
2.1 Gaussian graphical models: general settings . . . . . . . . . . . . 7
2.2 Providing the network with a latent structure . . . . . . . . . . . 7
2.3 The complete likelihood . . . . . . . . . . . . . . . . . . . . . . . 8
3 Inference strategy by alternate optimization . . . . . . . . . . . . . . . 10
3.1 Variational estimation of the latent structure . . . . . . . . . . . 10
3.2 A Lasso-like method to estimate the concentration matrix . . . . 13
3.3 Choice of penalty parameters . . . . . . . . . . . . . . . . . . . . 17
4 Link with Meinshausen and Buhlmann's approach . . . . . . . . . . . 19
5 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.1 Synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.2 Breast Cancer data . . . . . . . . . . . . . . . . . . . . . . . . . . 23
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
A Appendix section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
A.1 Proof of the equivalence between the constraints . . . . . . . . . 29
A.2 Fixed-point study . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
A.3 Proof of Lemma 2 (Lasso with pathwise coordinate optimization) 33
A.4 Reconstruction of the concentration matrix . . . . . . . . . . . . 34
A.5 Pseudo-likelihood of a Gaussian vector . . . . . . . . . . . . . . . 34
A.6 Penalization upper bound . . . . . . . . . . . . . . . . . . . . . . 35
1
ar
X
iv
:0
81
0.
31
77
v1
[
sta
t.M
E]
1
7 O
ct
20
08
1. Introduction
Estimating the concentration matrix (namely the inverse of the covariance ma-
trix) of a Gaussian vector in a sparse, high-dimensional setting has received
much attention recently. Graphical models provide a convenient setting for mod-
elling multivariate dependence patterns. In this framework, an undirected graph
is matched to the Gaussian random vector, where each vertex corresponds to
one coordinate of the vector, and an edge is not present between two vertices
if the corresponding random variables are independent, conditional on the re-
maining variables. Now, conditional independence between two coordinates of
the Gaussian random vector corresponds exactly to a zero entry in the concen-
tration matrix. Thus, detecting nonzero elements in the concentration matrix
is equivalent to reconstructing the Gaussian graphical model (GGM, see e.g.
Lauritzen 1996).
We focus here on the crucial problem of selecting the concentration matrix's
nonzero coecients. In other words, we focus on variable selection rather than
estimation. Application areas include gene-regulation graph inference in Biol-
ogy (using gene expression microarray data), as well as spectroscopy, climate
studies, functional magnetic resonance imaging, etc. We provide a very novel
approach driving the graph selection according to an unobserved modular struc-
ture on the vertices.
The idea of covariance selection rst appeared in the work of Dempster
(1972). In the so-called \large p, small n" setting (namely when the number
of observations is smaller than the dimension of the observed response), the
need for covariance selection is huge, as the empirical covariance matrix is no
longer regular.
In Drton and Perlman (2007), a classication of the dierent methods for
model selection/estimation in GGMs into three group types is suggested: constr-
aint-based methods, performing statistical tests; Bayesian approaches; and score-
based methods, maximizing a model-based criterion. The multiple testing prob-
lem has been taken into account in Drton and Perlman (2007; 2008). The authors
perform GGM covariance selection by multiple testing of hypotheses about van-
ishing partial correlation coecients. Such procedures may also be implemented
using the PC-algorithm (Kalisch and Buhlmann 2007). Starting from a com-
plete, undirected graph, the PC-algorithm deletes edges recursively, according
to conditional independence decisions. However, the statistical procedure in Dr-
ton and Perlman (2007) relies on asymptotic considerations, a regime never
attained in real situations.
Another attempt in this vein is to consider limited-order partial correlations
(Wille and Buhlmann 2006, Castelo and Roverato 2006). In Wille and Buhlmann
(2006) the authors consider only zero and rst-order conditional dependencies.
They argue that for sparse graphical models, these low-order dependencies still
re
ect reasonably well the full-order conditional dependency structure. More-
over, these dependencies may be well estimated even with a small number of
observations. In Castelo and Roverato (2006), the authors introduce a non-
rejection rate to reduce the multiple testing and computational problems to
which these approaches give rise.
A Bayesian framework is proposed in Dobra et al. (2004), Jones et al. (2005)
and their method was applied for evaluating patterns of association in large-scale
gene expression data. The approach is based on dependency networks, namely
a collection of conditional distributions fP(XijXni)g (where Xni stands for the
set of all variables but Xi). However, such conditional distributions will not in
general result in a coherent joint distribution. This point is further discussed
below concerning a similar problem appearing in Meinshausen and Buhlmann
(2006). Moreover, constructing priors on the set of concentration matrices is not
a trivial task (it mainly relies on Wishart priors). The use of MCMC procedures
limits the range of applications to moderate-sized networks.
Before focusing on score-based methods, let us rst introduce regularization
procedures. In the context of linear regression, the Lasso (least absolute shrink-
age and selection operator) technique was introduced by Tibshirani (1996). This
procedure performs model selection and parameter estimation at the same time.
The idea is that ordinary least-squares criterion may be improved in a sparse
context, using an `1-norm penalty. The `1-norm penalty shrinks the estimates
to zero while preserving the convexity of the optimization problem. Note that
the `1-norm penalization is also known as 'basis pursuit' in signal processing
(Chen et al. 2001).
It is well known that if the ultimate goal is parameter estimation, model
selection and estimation should be done in a single step. Performing the model
selection prior to parameter estimation in the selected model will, in fact, result
in a non-robust procedure. However, our primary focus here is on model selec-
tion, as we want to infer sparse networks. We therefore concentrate on model
selection rather than on estimation performances.
The Lars algorithm (Efron et al. 2004) is one of the most popular techniques
for solving the Lasso problem. It gives the path of solutions obtained when
varying the penalty parameter (the penalty parameter is used as a scaling factor
of the `1-norm penalty). The larger the penalty parameter, the sparser the
Lasso solution.
Using convex optimization techniques (see for instance Minoux 1986), the
Lasso problem may be stated as a primal problem, whose dual formulation
may be solved more easily. This approach is taken in Osborne et al. (2000b).
The authors obtain an iterative algorithm, the \homotopy method" (Osborne
et al. 2000a). Other very ecient approaches are based on focusing on each coor-
dinate iteratively. Indeed, for each coordinate, the Lasso problem is solved very
simply (assuming the other coordinates are xed) by soft-thresholding (Donoho
and Johnstone 1995). Thus, dierent 'coordinate optimization' procedures have
been proposed in the literature. Following the work of Fu (1998), a cyclic pro-
cedure is proposed in Friedman et al. (2007), where optimization with respect
to each coordinate is done iteratively; whereas Wu and Lange (2008) propose a
greedy approach, computing the solution for each coordinate and choosing that
which provides the largest decrease in a surrogate objective function. Note that
these approaches rely on the underlying assumption that the predictors for the
regression problem are uncorrelated.
Let us now come back to covariance (or concentration) matrix inference in
GGMs, using maximization of a model-based criterion.
Meinshausen and Buhlmann (2006) were the rst authors to apply Lasso
techniques for inferring a covariance matrix in a GGM. Their approach is to solve
p dierent Lasso regression problems, where p is the dimension of the observed
vector. The main drawback of such a procedure is that a symmetrization step is
required to obtain the nal network. It might, for instance, be the case that the
estimator of the regression coecient for Xi on Xj is zero, whereas the estimator
for Xj on Xi is not zero. Meinshausen and Buhlmann propose to use either an
\AND" or an \OR" nal step procedure to recover an undirected correlation
graph. However, these two procedures might result in dierent estimates and
there is no way of choosing between them. Moreover, as previously stated, a set
of conditional distributions does not necessarily cohere into a joint distribution.
Using a set of possibly non-coherent conditional distributions corresponds to
a pseudo-likelihood approach. This aspect was not underlined in Meinshausen
and Buhlmann (2006), and we clarify this point in Section 4.
Subsequently, two other articles, Banerjee et al. (2008) and Yuan and Lin
(2007) independently provided an improvement of the initial work of Mein-
shausen and Buhlmann (2006). In both works, the problem is seen as a penalized
maximum-likelihood (PML) problem. Instead of considering p dierent regres-
sion problems, these two articles focus on the likelihood of the Gaussian vector,
penalizing the entries of the concentration matrix with an `1-norm penalty.
They explain how the PML estimation may be solved as a \Lasso-like" prob-
lem. The major issue with PML strategies in the context of the concentration
matrix estimation of GGMs is to obtain a positive denite estimate. However,
the approach for solving the problem in Yuan and Lin (2007) is not suited to
high-dimensional settings, in contrast to the approach proposed in Banerjee
et al. (2008). In Yuan and Lin (2007), a non-negative garrote-type estimator is
used, and asymptotic properties (as n tends to innity while p is held xed)
are given. In Banerjee et al. (2008), two dierent algorithms are proposed for
solving problems in a high-dimensional setting. The rst approach relies on a
block-coordinate descent algorithm. The second is a semi-denite programming
algorithm, based on Nesterov's method, which is computationally intensive.
The next improvement in this vein comes with Friedman et al. (2008). Re-
lying on coordinate descent techniques, previously described in Friedman et al.
(2007), the authors revisit Banerjee et al.'s rst approach and propose an e-
cient algorithm to solve the PML estimation problem, under the positive denite
constraint. In fact, they use the block coordinate descent approach proposed by
Banerjee et al. (2008) and combine it with a second coordinate descent method.
Our method will make use of this approach.
To conclude this part, we remark that a completely dierent shrinkage esti-
mate was proposed by Schafer and Strimmer (2005) in the same context of large-
scale covariance matrix estimation. The approach consists in using a weighted
average of two dierent estimators, the rst being unconstrained (thus having
small bias but large variance), the second being low-dimensional (and thus ex-
hibiting small variance but large bias).
Now let us motivate the use of hidden structures in networks. Modularity
is a property observed in real (biological) networks (see for instance Ihmels
et al. 2002). Heterogeneity in the node behaviors is an important property of
these data. For example, so-called 'hubs' are highly-connected nodes, showing
a dierent behavior from the rest of the graph nodes. An interesting model
capturing these network features is a mixture model for random graphs (see
for instance Daudin et al. 2008). This model has been rediscovered many times
in the literature, and a non-exhaustive bibliography should include Frank and
Harary (1982), Snijders and Nowicki (1997), Nowicki and Snijders (2001), Tall-
berg (2005), Daudin et al. (2008), Mariadassou and Robin (2007), Zanghi et al.
(2008). To state it simply, this model assumes that each node belongs to some
unobserved group. Conditional on the node groups, the (weighted) edges are
independent and identically-distributed (i.i.d.) random variables, whose distri-
bution depends on the groups of the nodes to be connected. As we are interested
in GGMs, weighted edges correspond to entries of the concentration matrix.
In this work, we aim at estimating a hidden structure, namely node groups,
while discovering the network. This hidden structure should help us in choosing
adaptive penalty parameters. Indeed, we wish to penalize the elements of the
concentration matrix, according to the unobserved clusters to which the nodes
belong. For instance, if two nodes belong to the same unobserved group, we wish
to lower the penalty parameter acting on the corresponding entry in the concen-
tration matrix. Conversely, if we increase the penalty parameters on the entries
corresponding to nodes belonging to dierent groups, we shrink the estimated
coecient to zero. Our approach is completely new and improves inference of
sparse modular networks.
Another adaptive Lasso procedure is given in Zou (2006), whose idea is to
lower the bias of the large coecients by adapting the penalty parameter of each
coecient so that it automatically scales with the inferred value. It is known
that the non-adaptive Lasso procedure may result in inconsistent parameter
estimation. An illustration of the con
ict between optimal prediction and con-
sistent variable selection for the Lasso procedure is given in Meinshausen and
Buhlmann (2006). They proved that the optimal penalty parameter for predic-
tion gives inconsistent variable selection results, motivating the use of another
penalty parameter to ensure the control of the probability of falsely connecting
two or more distinct connectivity components of the graph. Like them, we also
focus on optimal selection rather than on optimal prediction. The adaptivity of
our procedure is not used for lowering the bias of large coecients, but instead
for constraining the prediction to t the underlying structure of the graph.
Model. Let us now brie
y describe the general approach of our work. The
model will be presented in detail in Section 2. Let X = (X1; : : : ; Xp)| be a
Gaussian random vector in Rp, with zero mean and positive denite covariance
matrix , namely X N (0p;): We observe independent and identically-
distributed (i.i.d) vectors (X1; : : : ; Xn) with the same distribution as X. The
matrix K = 1 is the concentration matrix of the model. Let S be the empirical
covariance matrix. The log-likelihood of the observations is given by
L(K) =
n
2
log det(K)
1
2
nX
k=1
(Xk)|KXk + c =
n
2
log det(K)
n
2
Tr(SK) + c
where c is a constant term.
The `1-penalized estimator proposed by Banerjee et al. (2008) is given by
bK = arg max
K0
log det(K) Tr(SK) kKk`1 ; (1)
where K 0 stands for positive deniteness, > 0 is a penalty parameter and
kKk`1 =
P
ij jKij j.
A natural generalization of this approach is to have dierent penalty param-
eters for dierent entries Kij . Namely,
log det(K) Tr(SK) k(K)k`1 ;
where (K) = (ij(Kij))i;j2P is a matrix of penalty functions acting on each
entry. As a general rule, using as many penalty functions as there are entries in
the concentration matrix to be estimated is not meaningful.
Here, we propose to take into account a hidden structure on the correlations
between the coordinates random variables Xk. Thus, we consider latent i.i.d.
random variables Z1; : : : ;Zp with values in a nite set f1; : : : ; Qg. Each variable
Zi describes the state of Xi, and we wish to adapt the penalty function ij with
respect to the states of Xi; Xj . More precisely, we wish to use a criterion of the
form
log det(K) Tr(SK) kZ(K)k`1 ;
where Z(K) = (ZiZj (Kij))i;j2P is a matrix of random penalty functions whose
entries depend on the latent structure Z = Z1; : : : ;Zp. However, the hidden
structure is not supposed to be known, thus we cannot rely on the previous
criteria. Intuitively, following the principle of Expectation-Maximization (em)
algorithm of Dempster et al. (1977), the idea will be to replace the unobserved
value Z(K) with its conditional expectation E(Z(K)jX
1; : : : ; Xn; K(m)) under
some model with parameter K(m), and iterate the following steps
(e) Compute E(Z(K)jX
1; : : : ; Xn; K(m))
(m) Update K(m+1) = argmaxK0 E(Z(K)jX
1; : : : ; Xn; K(m)).
One of our aims is to provide a very simple framework for such an analysis.
Note that the `1-norm used here acts on diagonal elements of the matrix K. It
is counter-intuitive to penalize diagonal elements of the concentration matrix,
as these do not re
ect sparsity in the correlation structure. However, from a
technical point of view, this strategy ensures that the procedure will select a
positive denite estimator (see Remark 2). This point was not emphasized in
the previous procedures using `1 penalized likelihood of GGMs.
Road-map. In Section 2 we present the model and the penalized maximum-
likelihood criterion on which we base our inference procedure, described in Sec-
tion 3. This procedure relies on a variational em algorithm, combined with a
Lasso-like procedure. Section 4 explains how Meinshausen and Buhlmann's
approach may be interpreted as a penalized pseudo-likelihood method. Finally,
Section 5 illustrates the performance of the method on synthetic data, for which
an R{package, SIMoNe (Statistical Inference for Modular Network), can be down-
loaded from the rst author's website. We also test our algorithm on a real data
set provided by Hess et al. (2006) and concerning n = 133 patients with breast
cancer treated using chemotherapy. According to Hess et al. (2006) and Natow-
icz et al. (2008), the patient response to the treatment can be classied either
as a pathologic complete response (pCR), or as a residual disease (not-pCR).
The prediction of the patient response is achieved accurately by studying the
expression levels of a limited number of genes (p = 26). Our algorithm is ap-
plied on each class of patients (pcR and not-pCR). Two distinct gene-regulatory
networks are thus inferred, showing a very dierent structure according to the
selected class of patients.
2. A latent structure model for network inference
In this section we present a framework for modelling heterogeneity among depen-
dencies between the variables. To this end, let us rst recall classical notations
from Gaussian Graphical Models (see Lauritzen 1996, for elementary results
about GGMs).
2.1. Gaussian graphical models: general settings
Let P = f1; : : : ; pg be a set of xed vertices, X = (X1; : : : ; Xp)| a random
vector describing a signal over this set and a sample (X1; : : : ; Xn) of size n with
the same distribution as X.
The vector X is assumed to be Gaussian with positive denite covariance
matrix = (ij)(i;j)2P2 . No loss of generality is involved when centering X,
so we may assume that X N (0p;). GGMs are based on a classical result,
originally emphasized by Dempster (1972), claiming that any couple of entries
(Xi; Xj) with i 6= j are independent conditional on all other variables indexed
by Pnfi; jg, if and only if the entry ( 1)ij is zero. The inverse of the covari-
ance matrix K = (Kij)(i;j)2P2 =
1, known as the concentration matrix, thus
describes the conditional independence structure of X. Moreover, each entry
Kij ; i 6= j is directly linked to the partial correlation coecient rijjPnfi;jg be-
tween variables Xi and Xj . In fact, we have rijjPnfi;jg = Kij=
p
KiiKjj , and
also Kii = Var(XijXPni) 1. Hence, after a simple rescaling, the matrix K can
be interpreted as the adjacency matrix of an undirected weighted graph G rep-
resenting the partial correlation structure between variables X1; : : : ; Xp. This
graph has no self-loop, with a random set of edges composed by all pairs (i; j)
such that Kij 6= 0. Note that we are seeking only pairs of vertices (i; j) such
that i < j, since there is no self-loop, and since Kij = Kji. Inferring nonzero
entries of K is equivalent to inferring G, and is therefore a highly relevant issue
in this framework.
2.2. Providing the network with a latent structure
Let us now extend the modeling by providing the network with an internal latent
structure.
The model proposed in Daudin et al. (2008) attempts a better t of data, as
it places the network G in the mixture framework, in order to take account of
the heterogeneity among vertices. The same general mixture model is adopted
here: vertices of P are distributed among a set Q = f1; : : : ; Qg of hidden clusters
that model the latent structure of the network. For any vertex i, the indicator
variable Ziq is equal to 1 if i 2 q and 0 otherwise, hence describing which cluster
the vertex i belongs to. A vertex is assumed to belong to one cluster only,
thus the random vector Zi = (Zi1; : : : ; ZiQ) obviously follows a multinomial
distribution. Namely,
Zi M(1;); (2)
where = (1; : : : ; Q) is a vector of cluster proportions, so that
P
q q = 1.
The concentration matrix structure. We shall now extend the clustering
of vertices from P to the concentration matrix K. Accordingly, both the exis-
tence and the weight of edges, described by the o-diagonal elements of K, will
depend on the cluster each vertex belongs to. Conditional on the events i 2 q
and j 2 ` where q; ` are clusters chosen from Q, each Kij (i 6= j) is a random
variable whose probability density function is denoted by fq`, that is,
Kij j fZiqZj` = 1g fq`(); i 6= j: (3)
It will be remarked that in this formulation the variables Kij are assumed to
be independent, conditional on the clusters the vertices belong to. Moreover,
we are considering only undirected graphs, so we may assume that fq` = f`q.
For technical reasons (see Remark 2), we also assume a prior distribution on
diagonal elements of K, namely
Kii f0():
Our suggestion is to adopt Laplace distributions; hence
8x 2 R; fq`(x) =
1
2q`
exp
jxj
q`
; and f0(x) =
1
20
exp
jxj
0
; (4)
where q`; 0 > 0 are scaling parameters and q` = `q. Below, the parameter
0 will be xed and not estimated.
The reason for choosing a Laplace distribution is that it is reminiscent of
the `1-norm, itself linked to Lasso-techniques for which appropriate tools are
available. In fact, when considering the general penalized least-square problem,
the penalty term can be seen as a log-prior density on the vector of parameters.
In the case of Lasso, the prior distribution corresponding to the `1-norm is
actually the Laplace distribution (see, e.g. Hastie et al. 2001).
The aliation model. The aliation model is a special case of network
structure (to be investigated below), where there are many dierent clusters,
but where the focus is restricted to two types of edges: edges between nodes
of the same cluster, and edges between nodes from dierent clusters. In the
aliation model the densities fq` in (4) are of only two kinds; that is, for all
q; ` 2 Q, let
fq` =
fqq = fin(;in) if q = `; the intra-cluster density of edges;
fq` = fout(;out) if q 6= `; the inter-cluster density of edges:
(5)
2.3. The complete likelihood
Having described the modeling of the network, we now focus on the inference
issue.
We denote as X the np matrix that contains the data-set fX1; X2; : : : ; Xng
row-wisely organized, i.e., (Xk)| is the kth row of X. Furthermore, we denote
as Z = fZiqgi2P;q2Q the set of all latent indicator variables for vertices. For the
sake of simplicity, the number of clusters Q and the parameters = (q)q2Q
and = fq`gq;`2Q are assumed to be known for the moment.
The data experiments X are the only observations available, and from these
we should like to be able to infer the graph G of conditional dependencies or,
equivalently, nonzero entries of K. As the matrix K has been given a prior
distribution, our aim is to maximize the posterior probability of K, given the
data X, or equivalently, the logarithm of the joint distribution. The estimate is
thus dened as follows:
bK = arg max
K0
P(KjX) = arg max
K0
logP(X;K);
where K 0 stands for positive-deniteness.
To solve this problem, we place ourselves in the classical complete-data frame-
work. The distribution of K is only known conditionally on the latent structure
described by Z. We denote as Z the set of all possible clusterings over nodes
from P. The marginalization over the latent clusters Z leads to
bK = arg max
K0
log
X
Z2Z
Lc(X;K;Z);
where the so-called complete-data likelihood Lc(X;K;Z) = P(X;K;Z) is the
function we shall develop using an em-like strategy hereafter. For this purpose,
a closed form of Lc is required.
Proposition 1. The following relation holds for the complete-data likelihood
Lc.
logLc(X;K;Z) =
n
2
(log det(K) Tr(SK)) kZ(K)k`1
X
i;j2P;i6=j
q;`2Q
ZiqZj` log(2q`) +
X
i2P;q2Q
Ziq logq + c; (6)
where S = n 1(X X)|(X X) is the empirical covariance matrix, c is a
constant term and Z(K) =
ZiZj (Kij)
i;j2P
is dened by
ZiZj (Kij) =
8
>>>><
>>>>:
X
q;`2Q
ZiqZj`
jKij j
q`
if i 6= j;
jKiij
0
otherwise:
(7)
Proof. Using the Bayes rule, Lc divides into three terms:
logLc(X;K;Z) = logP(X;K;Z) = logP(XjK) + logP(KjZ) + logP(Z);
where we make use of the fact that logP(XjK;Z) = logP(XjK).
The rst term is the likelihood associated with a size-n sample of a multi-
variate Gaussian distribution, since X N (0p;). Routine computations lead
to
logP(XjK) =
n
2
log det(K)
n
2
Tr(SK)
np
2
log(2):
As regards the second term, using the expression (4), we have
logP(KjZ) =
X
i;j2P;i6=j
q;`2Q
ZiqZj` log fq`(Kij) +
X
i2P
log f0(Kii)
=
X
i;j2P;i6=j
q;`2Q
ZiqZj`
jKij j
q`
+ log(2q`)
X
i2P
jKiij
0
p log(20):
From (2), we have logP(Z) =
P
i;q Ziq logq, and the result follows.
3. Inference strategy by alternate optimization
In the classical em framework developed by Dempster et al. (1977), where X
is the available data, inferring the unknown parameters K spread over a latent
structure Z would make use of the following conditional expectation:
Q
KjK(m)
= E
n
logLc(X;K;Z)
X; K(m)
o
=
X
Z2Z
P
Z
X;K(m)
logLc(X;K;Z) =
X
Z2Z
P
Z
K(m)
logLc(X;K;Z);
(8)
where K(m) is the estimation of K from the previous step of the algorithm.
The usual em strategy would be to alternate an E-step computing the con-
ditional expectation (8) with an M-step maximizing this quantity over the pa-
rameter of interest K. Unfortunately, no closed form of Q
KjK(m)
can be
formulated in the present case. The technical diculty lies in the complex de-
pendency structure contained in the model. Indeed, P(ZjK) cannot be factor-
ized, as argued in Daudin et al. (2008). This makes the direct calculation of
Q
KjK(m)
impossible. To tackle this problem we use a variational approach
(see, e.g., Jaakkola 2000, for elementary results on variational methods). In
this framework, the conditional distribution of the latent variables P(ZjK(m))
is approximated by a more convenient distribution denoted by Rm(Z), which
is chosen carefully in order to be tractable. Hence, our em-like algorithm deals
with the following approximation of the conditional expectation (8)
ERm flogLc(X;K;Z)g =
X
Z2Z
Rm(Z) logLc(X;K;Z): (9)
In the following section we develop a variational argument in order to choose
an approximation Rm(Z) of P(ZjK(m)). This enables us to compute the condi-
tional expectation (9) and proceed to the maximization step.
3.1. Variational estimation of the latent structure (E-step)
In this part, K is assumed to be known, and we are looking for an approximate
distribution R() of the latent variables. The variational approach consists in
maximizing a lower bound J of the log-likelihood logP(X;K), dened as follows:
J (X;K; R(Z)) = logP(X;K) DKL fR(Z)kP(ZjK)g (10)
where DKL is the Kullback-Leibler divergence. This measures the dierence
between the probability distribution P(jK) in the underlying model and its ap-
proximation R(). An intuitively straightforward choice for R() is a completely
factorized distribution (see Mariadassou and Robin 2007, Zanghi et al. 2008)
R (Z) =
Y
i2P
h i(Zi); (11)
where h i is the density of the multinomial probability distribution M(1; i),
and i = (i1; : : : ; iQ) is a random vector containing the variational param-
eters to optimize. The complete set of parameters = fiqgi2P;q2Q is what
we are seeking to obtain via the variational inference. In the case in hand the
variational approach intuitively operates as follows: each iq must be seen as
an approximation of the probability that vertex i belongs to cluster q, condi-
tional on the data, that is, iq estimates P(Ziq = 1jK), under the constraintP
q iq = 1. In the ideal case where P(ZjK) can be factorized as
Q
i P(ZijK)
and the parameters iq are chosen as iq = P(Ziq = 1jK), the Kullback-Leibler
divergence is null and the bound J reaches the log-likelihood.
The following proposition gives the form of the lower bound J to be maxi-
mized in order to estimate .
Proposition 2. Let us assume that R can be factorized as in (11), and let us
denote J (X;K) := J (X;K; R (Z)). Then J satises the following expres-
sion
J (X;K) = c
X
i2P
q2Q
iq log iq +
X
i2P
q2Q
iq logq
k (K)k`1
X
i;j2P;i6=j
q;`2Q
iqj` log 2q`; (12)
where c does not depend on and (K) = ( i j (Kij))i;j2P2 is dened simi-
larly as (7), replacing Ziq by iq.
Proof. Starting from (10), classical results on variational methods show that
J (X;K) = bQ (K) +H(R (Z));
where H(R ()) is the entropy of the distribution R () and bQ (K) is the ap-
proximation of the complete log-likelihood conditional expectation, computed
under the distribution R . Namely,
bQ (K) = ER flogLc(X;K;Z)g =
X
Z2Z
R (Z) logLc(X;K;Z): (13)
In the special case of factorized distribution (11), the entropy is
H(R (Z)) =
X
i2P
H(h i(Zi)) =
X
i2P;q2Q
iq log iq:
Moreover,
bQ (K) = logP(XjK) + ER [logP(KjZ)] + ER [logP(Z)]:
Equation (12) follows via Proposition 1, by using that ER (Ziq) = iq and
ER (ZiqZj`) = iqj`.
The optimal approximate distribution R is then derived by direct maxi-
mization of J . The following proposition gives the estimate b that solves the
problem.
Proposition 3. Let and be known. The following xed-point relationship
holds for the optimal variational parameters b = arg max J
biq _ q
Y
j2Pnfig
`2Q
1
2q`
exp
jKij j
q`
bj`
; (14)
where _ means that there is a scaling factor such that for any i 2 P, we have
P
q biq = 1.
Proof. This is just an adaptation to the Laplace case of Mariadassou and Robin
(2007, Proposition 3).
The initial value of is chosen using a classication algorithm such as spectral
clustering (see for instance Ng et al. 2002). As a consequence, the initial values
for iq lie in f0; 1g. We then use an iterative procedure setting b
(m+1) = g(b (m)),
where g is the function (implicitly dened above) for which b is a xed point.
Note that we cannot ensure uniqueness of the xed point for g, nor convergence
of this iterative procedure. In practice, we can always use a maximal number
of iterations, and if convergence has not occurred, we keep the initial value of
given by the clustering method. In appendix A.2 we explain that at least in
the aliation model (5), if the current values K(m)ij of the precision matrix are
small enough, and if the penalty parameters 1in and
1
out are well-chosen, then
uniqueness of the xed point is ensured. However, such a result does not hold
in the general case, which is one of the drawbacks of of the variational approach
in this context.
Estimation of and . The parameters and have been previously
considered as known to keep the statement as clear as possible.
Two dierent strategies may be used with respect to these parameters. The
rst approach is to x their values. Fixing the value of comes down to choos-
ing a priori the proportions of the groups, which is quite a common strategy
in mixture models. As for the choice of , this is equivalent to choosing the
penalty parameter in the classical lasso. Concerning general parameters , a
number of values need to be determined, which might be a problem. However
in the particular aliation model (5), only 2 parameters have to be xed: a pa-
rameter in that corresponds to a light penalty, since many intra-cluster edges
are expected, and another parameter out that ts with a heavier penalty, since
we do not expect many inter-cluster edges. This is typically the kind of strategy
that will be used for numerical applications (see Section 5). More generally, the
matrix penalty can be tuned to obtain a desired quantity of inferred edges, or
to constrain the topology of the graph, e.g. graphs with hubs.
The second strategy is to make use of the current inferred graph to estimate
the parameters. The basic idea is to include this estimation in the variational
method. Unfortunately, the maximization of J given in equation (12) with
respect to , and at the same time is not possible. To tackle this problem,
we use an alternate strategy. The parameter is computed with the xed-
point relationship (14) for xed values of and . Then we maximize J with
respect to and , once R is xed (that is, once is xed), as in the following
proposition. We successively iterate these two steps until stabilization.
Proposition 4. For xed values of , the parameters ^, ^ maximizing J are
given by
8q; ` 2 Q; ^q =
1
p
X
i2P
iq and ^q` =
P
i 6=j iqj`jKij j
P
i 6=j iqj`
:
Proof. Once terms that do not depend on the parameters of interest have been
removed from J , the problem becomes
^q = argmax
q
X
i
iq logq and ^q` = argmax
q`
X
i 6=j
iqj`
jKij j
q`
+ log 2q`
:
Null-dierentiation with respect to q (under the constraint
P
q q = 1) and
q` leads straightforwardly to the result.
3.2. A Lasso-like method to estimate the concentration matrix (the
M-step)
Now that we are able to compute the approximate conditional expectation
bQ (K) dened by (13), we wish to infer the concentration matrix K, assuming
is known. This is the aim of the m-step of our em{like strategy, that deals
with the maximization problem arg maxK0 bQ (K).
Using Proposition 1 and the equality ER (ZiqZj`) = iqj`, it is a simple
matter to rewrite the problem as follows
bK = argmax
K0
nn
2
(log det(K) Tr(SK)) k (K)k`1
o
: (15)
Hence, our m{step can be seen as a penalized maximum likelihood estimation
problem, exactly like in Friedman et al. (2008), Banerjee et al. (2008). The
likelihood considered here is P(XjK), that is, the likelihood which corresponds
to the n realizations of the Gaussian vector X for a given concentration matrix
K. The dierence of our approach lies in the complexity of the penalty term,
and in slight discrepancies as regards some constant factors.
Remark 1. Since we are using a penalty term 1=0 on matrix K's diagonal
elements, the solution to (15) satises
8i 2 P; bK 1ii = Sii + 2=(n0); (16)
when 10 < njSiij=2 for any i 2 P. Indeed, the sub-gradient equation is
n=2(K 1ii Sii) + sgn(Kii)=0 = 0, and Kii 0 since it is the inverse of a
conditional variance.
Let us now look at the solution of the m-step: the following proposition gives
an equivalent formulation of (15) that is more likely to be solved. The result
draws its inspiration from Banerjee et al. (2008).
Proposition 5. The maximization problem (15) over the concentration matrix
K is equivalent to the following, dealing with the covariance matrix
b = argmax
k( S)=Pk11
log det(); (17)
where
is the term-by-term division and
P = (P i j )i;j2P with P i j =
2n 1
P
q;` iqj`
1
q` i 6= j;
2(n0) 1 i = j:
Remark 2. By penalizing the diagonal terms of the concentration matrix K in
the initial problem, the set of matrices over which we maximize our criterion
contains, for instance, the matrix S+2=(n0)I, (where I stands for the identity
matrix). Thus, provided that the value of penalty parameter 1=0 is set su-
ciently high, this set contains positive denite matrices. This ensures that our
estimator is always invertible. Obviously, when S is invertible, which is usually
true for n greater or equal than p, penalizing the diagonal terms becomes futile.
In this case 1=0 is set to zero.
Proof. The penalty term in (15) can be written as follows
k (K)k`1 =
X
q;`2Q
X
i;j2P
i 6=j
jKij j
q`
iqj` +
X
i2P
jKiij
0
=
X
q;`2Q
kTq` ?Kk`1 ;
where ? is the term-by-term product. The set fTq`gq;`2Q contains p p sym-
metric matrices, dened, for each couple (q; `), by
Tq` = (Tq`;ij)i;j2P with 8i 6= j; Tq`;ij =
iqj`
q`
and Tq`;ii =
1
0Q2
:
Let us now use the fact that kAk`1 = maxkUk11 Tr(AU), for a given matrix
A. The optimization problem (15) can now be written as
max
K0
min
fUq`:kUq`k11g
8
<
:
n
2
log det K Tr
0
@
n
2
SK +
X
q;`2Q
(Tq` ?K) Uq`
1
A
9
=
;
;
since the trace operator is linear. The dual version of the above expression is ob-
tained by swapping max and min. The maximization is solved by dierentiating
with respect to K. To do this, we recall that in our specic case the matrices
T are symmetrical, and thus Tr ((T ?K)U) = Tr (K(T ?U)). Then, applying
the usual rules for the derivative of the trace operator, null-dierentiation with
respect to K yields
:= K 1 = S +
2
n
X
q;`2Q
(Uq` ?Tq`) : (18)
The dual problem therefore becomes
min
fUq`:kUq`k11g
n
n
2
log det()
np
2
o
;
or in other words,
max
fUq`:kUq`k11g
log det():
Finally, we need to write the constraint as a function of rather than the set
fUq`g. In fact, we simply need to show that
fUq`;8q; ` 2 Q; kUq`k1 1g =
;
( S)
P
1
1
;
which is straightforward (see Appendix A.1 for details).
To solve (17) and thus obtain the estimate b, we successively use two coordi-
nate descent methods. The rst corresponds to a block-wise strategy suggested
by Banerjee et al.. The second one is used to solve the resulting Lasso problem
and was suggested by Friedman et al. (2007).
Let us rst explain the block-wise strategy. For this purpose, we introduce
the following notation for b, S and the penalty matrix P
b =
"
b11 b12
b|12 b22
#
; S =
S11 s12
s|12 S22
; P =
P11 p12
p|12 P22
; (19)
where b11, S11 and P11 are (p 1)(p 1) matrices, b12, s12 and p12 are (p 1)
length column vectors and b22, S22 and P22 are real numbers. We have already
remarked (Remark 1) that the solution to (17) satises b22 = S22 + 2=(n0).
Moreover, using Schur complement, the vector b12 satises
b12 = argmin
fy:k(y s12)=p12k11g
n
y| b
1
11 y
o
: (20)
We have det(b) = det(b11)(b22 b
|
12
b
1
11 b12). The full matrix b is approx-
imated in the following way: rst, if required when p is greater than n, we
initialize the procedure with S+2=(n0)I, where 0 > 0 is chosen so as to make
S + 2=(n0)I invertible; secondly, we permute the columns (and thus the rows)
of b and iteratively solve problems like (20) until convergence of the procedure.
This convergence is ensured by the following lemma.
Lemma 1. The procedure which starts with a positive denite matrix and iter-
atively updates the columns and rows of this matrix according to the solutions
of (20) converges to the solution b of (17).
Proof. The proof relies on Banerjee et al. (2008, Theorem 3) and Tseng (2001,
Theorem 4.1). Convergence of block-coordinate descent methods is a well-documented
topic in convex optimization literature. Here, we have to bear in mind that using
`1-norm penalty leads to non-dierentiable functions. Thus, we rely on a result
by Tseng (2001, Theorem 4.1), which in our case ensures the convergence of the
procedure, provided there is at most one solution to each minimization problem
(20). This point is proved in Banerjee et al. (2008, Theorem 3).
Then, starting from a result given in Banerjee et al. (2008), an interpretation
of (20) as an `1{penalized problem is given in Friedman et al. (2008). This `1{
penalized problem is reminiscent of the Lasso and may thus be solved using a
coordinate descent strategy (Friedman et al. 2007). The following proposition
enunciates a result similar to those obtained in Banerjee et al. (2008, equation
(6)) and Friedman et al. (2008, equation (2.4)), although with a more general
penalty term and a factor 12 that diers. Since none of these articles gives an
explicit proof for this result, it is tting that we provide our own proof here.
Proposition 6. Solving (20) is equivalent to solving the dual problem
b = argmin
1
2
b
1=2
11 b
1=2
11 s12
2
2
+ kp12 ? k`1 ; (21)
where solution b12 to (20) and b to (21) are linked through
b12 = b11b=2: (22)
Proof. Problem (20) can be written as follows, by splitting the constraint:
8
<
:
miny y| b
1
11 y
subject to (p12)i yi (s12)i (p12)i 0; 8i = 1; : : : ; p 1;
or (p12)i yi + (s12)i (p12)i 0; 8i = 1; : : : ; p 1:
Let us introduce L the so-called Lagrangian, with vectors of Lagrange coe-
cients denoted by 1 = (1i )ip 1;
2 = (2i )ip 1 with nonnegative entries.
Also, let = 2 1. The Lagrange version of the above problem is
min
y
y| b
1
11 y + max
L()
; (23)
where, in the present case, L is given by
L() =
X
i
1i (yi (s12)i (p12)i) +
X
i
2i ( yi + (s12)i (p12)i) ;
The coecients 1i and
2
i maximizing L() are null when the constraints are
satised, and for each index i, at least one coecient among f1i ;
2
i g is zero.
Then
kk`1 =
X
i
jij =
X
i
1i +
2
i
:
Meanwhile, consider the dual problem of (23), swapping min and max: the
solution that minimizes the dual problem with respect to y satises the null-
gradient hypothesis. We obtain 2b
1
11 y = 0, that is y =
1
2
b11 (which
proves equation (22)). Introducing this result in the dual of (23), we get
max
1
4
| b11 + s
|
12
X
i
1i +
2
i
(p12)i;
also equivalent to
min
1
4
| b11 s
|
12 + kp12 ? k`1 :
Expressing this quantity by using the Euclidean norm achieves the proof.
Hence, the column b12 of the estimated covariance matrix b is computed by
solving the Lasso problem (21) using another coordinate descent method.
Lemma 2. The solution to (21) is computed by updating the jth coordinate of
b via
bj = 2S
0
@(s12)j
1
2
X
k 6=j
(b11)jk bk ; (p12)j
1
A =(b11)jj ; (24)
where S(x; ) = sgn(x)(jxj )+ is the soft-thresholding operator.
Moreover, the procedure which iteratively updates the entries of vector b12 =
b11b=2 according to the solutions b of (24) converges to the solution of (20).
Proof. The proof of this lemma is postponed to Appendix A.3.
Finally, the estimate of the matrix of concentration K is recovered by in-
verting b, which can be done at low computational cost (see appendix A.4 for
details). Hence, we solve the initial maximization problem (15) that denes the
m-step of our algorithm.
Implementation of the full em algorithm is outlined in Algorithm 1.
Algorithm 1: The full em{like algorithm
while bQ (bK(m)) has not stabilized do
//THE E-STEP: LATENT STRUCTURE INFERENCE
if m = 1 then
// First pass
Apply spectral clustering on the empirical covariance S to initialize b
else
Compute b with the xed-point relationship (14), using bK(m 1)
//THE M-STEP: NETWORK INFERENCE
Construct the penalty matrix P according to b
while b
(m)
has not stabilized do
for each column of b
(m)
do
Compute b12 by solving the lasso{like problem with path-wise coordinate
optimization
Compute bK(m) by block inversion of b
(m)
m m + 1
3.3. Choice of penalty parameters
As previously stated, the penalty parameters may be estimated in the e-step
of the algorithm (see subsection 3.1). However, this choice is not necessarily
optimal for the estimation of K, and other choices might in practice lead to a
better solution. A good strategy is to keep the estimated value of in the e-step
that leads to the estimation of , and to impose another value of during the
m-step. In this part, we indicate a possible choice for the penalty parameters to
use in the m-step, ensuring a small error on the connectivity components of the
estimated graph.
Let us rst introduce some notation. For any node i 2 P, let Ci denote the
connectivity component of node i in the true underlying conditional dependency
graph, and bCi the corresponding component resulting from the estimate bK of
this graph structure. The following proposition is based on Meinshausen and
Buhlmann (2006, Theorem 2) and Banerjee et al. (2008, Theorem 2).
Proposition 7. Fix some " > 0 and choose the penalty parameters such that,
for all q; ` 2 Q,
2p2Fn 2
0
@
2
nq`
max
i6=j
SiiSjj
1
2q`
! 1=2
(n 2)1=2
1
A "; (25)
where 1 Fn 2 is the c.d.f. of Student's t-distribution with n 2 degrees of
freedom. Then
P(9k; bCk * Ck) ": (26)
Proof. Here we simply indicate the main dierences between the proof of Baner-
jee et al. (2008, Theorem 2) and what is valid in our context. Note that according
to (15), the estimator bK must satisfy the following sub-gradient equation
8i 6= j;
n
2
bK 1ij Sij
0
@
X
q;`
ZiqZj`
q`
1
A ij = 0
where ij 2 sgn( bKij). Following the proof of Banerjee et al. (2008, Theorem 2),
we easily get
P(9k; bCk * Ck) p2 max
i2P;j =2Ci
P
0
@
n
2
jSij j
X
q;`
ZiqZj`
q`
1
A :
Performing some computations involving the correlation between variables Xi
and Xj , we also obtain
P(9k; bCk * Ck) 2p2 max
q;`2Q
Fn 2
0
@
2(n 2)1=2
nq`
max
i2P;j =2Ci
SiiSjj
1
2q`
! 1=2
1
A ;
which entails the conclusion.
Remark 3. Following Banerjee et al. (2008), note that in order to ensure (25),
it is enough to choose the penalty parameter such that, for all q; ` 2 Q,
q`(")
2
n
n 2 + t2n 2
"
2p2
1=2
max
i 6=j
SiiSjj
1=2
tn 2
"
2p2
1
;
where tn 2(u) is the (1 u)-quantile of Student's t-distribution with (n 2)
degrees of freedom, i.e. Fn 2(tn 2(u)) = u.
Remark 4. Inequality (25) does not take into account that dierent penalty
parameters are used for dierent hidden classes q; ` 2 Q. An adaptation of the
preceding strategy is to use current values Z(m) obtained from the probabilities
(m) of the hidden classes and to choose the current penalty parameters (m)
accordingly. More precisely, let us set, for instance
8i 2 P; Z(m)iq =
1 if q = argmax`
(m)
i`
0 otherwise:
Then, when
2p2Fn 2
0
B
B
B
@
2
n(m)q`
0
B
B
@ maxi6=j
Z(m)iq Z
(m)
j` =1
SiiSjj
1
((m)q` )
2
1
C
C
A
1=2
(n 2)1=2
1
C
C
C
A
"; (27)
for all q; ` 2 Q, the current estimate bK(m) of the dependency graph will approx-
imately satisfy (26). Moreover, in order to ensure (27), it is enough to choose,
for all q; ` 2 Q,
(m)q` (")
2
n
n 2 + t2n 2
"
2p2
1=2
0
B
B
@ maxi 6=j
Z(m)iq Z
(m)
j` =1
SiiSjj
1
C
C
A
1=2
tn 2
"
2p2
1
:
(28)
Typically, the kind of values obtained with (28) will lead to large penalties
and, consequently, to very sparse graphs: practically, more informative networks
can be obtained by replacing the term "=2p2 in (28) by greater values. In any
cases, (28) should be seen as a starting value.
4. Link with Meinshausen and Buhlmann's approach
We should also like to ll the gap between, on the one hand solving (15) and,
on the other, the approach proposed in Meinshausen and Buhlmann (2006),
where p independent penalized regression problems are solved using the Lasso.
In fact, we shall show that Meinshausen and Buhlmann's approach is equivalent
to maximizing the penalized pseudo log-likelihood corresponding to the size-
n sample of the multivariate Gaussian vector X on the set of non symmetric
matrices. Let us denote as eL this pseudo-likelihood, dened by
log eL(X; K) =
X
i2P
nX
k=1
logP(Xki jX
k
Pni; Ki)
!
;
where XkPni is the kth realization of the Gaussian vector X, once the ith coor-
dinate has been removed. In this section, the `1-norm of matrices is restricted
to o-diagonal elements only, that is, kAk`1 =
P
i 6=j jAij j.
Proposition 8. Consider the solution bKpseudo to the penalized pseudo-likelihood
problem
bKpseudo = argmax
fKij ;i6=jg
log eL(X; K) kP ?Kk`1 ; (29)
(whose diagonal is xed) and the solution bKMB given in Meinshausen and
Buhlmann (2006) to the p dierent regression problems, using the matrix penalty
2P=n. The two solutions have exactly the same null entries.
Proof. Denote by Knini and Snini, respectively, the matrices K and S once
their ith row and ith column have been removed. Moreover, Kini and Sini are
the ith rows of the matrices with the ith term removed. After some routine
computations, and using classical results for Gaussian multivariate vectors (see
Appendix A.5), it can be shown that
log eL(X; K) =
n
2
X
i2P
logKii KiiSii 2SiniKini
1
Kii
KiniSniniK
|
ini
+ c;
(30)
where c does not depend on K. Thus, if we forget the symmetry constraint
on K, maximizing the pseudo-likelihood (30) with respect to the non-diagonal
entries of K is equivalent to p independent maximization problems with respect
to each column K|ini. Consider, for instance, the last column of K, that is, for
i = p, and the relative term in (30). This term can be written as
n
2K22
2K22s
|
12K
|
ini + K
|
iniS11Kini
=
n
2K22
S
1=2
11 K
|
ini +K22S
1=2
11 s12
2
2
+ c0;
where we use the block-wise notation dened above (19). The term C 0 does not
depend on Kini, which is the current column of the concentration matrix to
infer. Namely, c0 = K222s
|
12S
1
11 s12.
Consider now the penalized version of the log-likelihood (29): we wish to solve
p penalized problems of minimization as dened above, which can be written as
follows
min
S
1=2
11 +K22S
1=2
11 s12
2
2
+
2K22
n
kp12 ? k`1 : (31)
Meinshausen and Buhlmann wish to solve p Lasso-problems, for instance for
the last variable p,
min
1
n
Xp Xnp
2
2
+
2n 1p12 ?
`1
; (32)
where Xp is the pth column of X and Xnp is the matrix of data the pth column
has been removed (note that we adapted the penalization term corresponding
to the framework developed here).
The minimum is reached in (31) for null-dierentiation, and we get
2S11 + 2K22s
|
12 +
2K22
n
p12 ? = 0;
where 2 sign(). The same for (32), and we get
2
n
X|npXnp
2
n
X|pXnp + 2n
1p12 ?
= 0;
where
2 sign(). Now, just note that n 1X|npXnp = S11 and n
1X|pXnp =
s|12, and problems (31) and (32) are equivalent, provided that = =K22.
Thus, the columns of the concentration matrix (with a removed diagonal
term) inferred from the penalized maximum pseudo-likelihood problem (29),
and those inferred with Meinshausen and Buhlmann's approach, share exactly
the same null-entries, that is, the same network of conditional dependencies.
5. Numerical experiments
In this section we present numerical experiments on both synthetic data, to in-
vestigate how well the proposed selection procedure behaves, and real data, to
demonstrate the practical use of GGM covariance selection with latent struc-
ture. In the remainder of this section we focus on an aliation model (5), the
choice of the penalty being made in line with Section 3.3. More precisely, we x
the ratio in=out = 1:2 and either let the value 1=in vary when considering
precision/recall curves for synthetic data, or x this parameter according to (28)
when dealing with real data.
5.1. Synthetic data
We perform numerical experiments to assess the performance of our approach
(SIMoNe, Statistical Inference for Modular Network) and compare it to already
existing methods for GGM covariance selection: GLasso (Friedman et al. 2008)
and GeneNet (Schafer and Strimmer 2005).
Data synthesis in our framework requires the simulation of a structured sparse
inverse covariance matrix. To this aim, we rst simulate a graph with an ali-
ation structure. We consider a simple binary aliation model where two types
of edges exist: edges between nodes of the same class and edges between nodes
of dierent classes. The binary incidence matrix of the graph is transformed
by randomly
ipping the sign of some elements in order to simulate both pos-
itively and negatively correlated variables. Positive deniteness of this matrix
is ensured by adding a large enough constant to the diagonal. The matrix is
then further normalized to have a diagonal of ones. A Gaussian sample of size
n with zero mean and the above covariance matrix is then simulated 50 times.
The results we present below are averaged over the 50 samples. At the end of
this section we discuss the performances of our method when there is no latent
structure on the data.
(a) (b) (c)
Fig 1. Simulation of the structured sparse concentration matrix. Adjacency matrix without
(a) and with (b) rows and columns reorganized according the aliation structure and corre-
sponding graph (c).
We simulate sparse graphs with p = 200 and n from 100 to 2000 (n=p 2
f1=2; 2; 3; 6; 10g). We use a probability of intra-cluster connection of 0:125, a
probability of inter-cluster connection of 0:0025, Q = 3 groups and equal group
proportions i = 1=3. With these settings, the theoretical expected number of
edges is about 862 and the total number of potential edges is 19900. A sample
graph is given in Figure 1. The running times of GLasso and SIMoNe are of the
same order. For the settings described above the running time varies from a few
seconds to a few minutes, according to the penalty parameter.
We focus the experiments on the ability to recover existing edges of the net-
work, that is the nonzero entries of the concentration matrix. This is a binary
decision problem where the compared algorithms are considered as classiers.
The decision made by a binary classier can be summarized using four num-
bers: True Positives (TP ), False Positive (FP ), True Negatives (TN) and False
Negatives (FN). We have chosen to draw precision/recall curves to display this
information and compare how well the methods perform (Figure 2).
Precision (TP=(TP + FP )) is the ratio of the number of true nonzero el-
ements to the total number of nonzero elements in the estimated concentra-
tion matrix bK. Recall that (TP=(TP + FN)) is the ratio of true nonzero el-
ements in bK to all nonzero entries of the real concentration matrix K. In a
sparse context where the number of actual positives (TP + FN) is small com-
pared to the number of actual negatives (FP + TN), precision/recall curves
give a more informative picture of an algorithm's performance than classical
Receiver Operator Characteristic (ROC) curves. Indeed, ROC curves plot the
False Positive Rate (FPR = FP=(FP + TN)) against the True Positive Rate
(TPR = TP=(TP + FN)). When the number of total positives is small com-
pared to the number of total negatives, small variations of FP and TP will
result in small variations of FPR and large variations of TPR, which is not
relevant for comparing performances. In a statistical framework, the recall is
equivalent to the power and the precision is equivalent to one minus the False
Discovery Proportion.
Additionally to the GLasso (Friedman et al. 2008) and GeneNet (Schafer and
Strimmer 2005) we consider two other procedures:
When n is greater than p, a straightforward way to obtain an estimate of
the inverse covariance matrix is to invert the empirical covariance matrix.
Although this approach is unlikely to perform well in a selection context
(since it is designed for estimation purposes), it is worth comparing it to
its competitors in order to assess the scale of improvement. We call this
procedure InvCor.
When the latent structure Z of the concentration matrix is known, our
method can be applied without its E-step and produce a relevant selec-
tion of the nonzero entries of the concentration matrix. This approach
represents the upper limit of our method, since it makes use of an usu-
ally unavailable source of information. This procedure is denoted perfect
SIMoNe.
In some problems the latent structure of the graph is partially known and
this information can be used in the E-step to improve the estimation of the
latent structure. For example, when inferring gene regulation networks, a
subset of identied genes may be known to belong to the same functional
module.
The approach of Meinshausen and Buhlmann (2006) was also tested. The
principle of this approach, and the performances obtained are close to those of
GLasso, but it was always slightly outperformed. We have therefore decided, for
the sake of brevity, to report only the four previously described procedures.
For the methods based on penalization (GLasso, SIMoNe and Perfect SIMoNe),
the precision/recall curves are plotted by varying the penalty parameter (namely
1=in in our case). The penalty parameter varies from close to zero to a maxi-
mum value which forces all o-diagonal elements of bK to be null (see Appendix
A.6). The GeneNet and InvCor methods are plotted by sorting the elements of
bK according to their absolute values, and choosing dierent thresholds to nd
nonzero entries.
Even when n is really greater than p (Figures 2 (a-b)) Invcor is always
dominated by the other methods from a selection point of view. This simple
check shows that even in a favorable context with abundant data, penalization
procedures improve the selection of nonzero entries of the concentration matrix,
in comparison with methods based on estimation of these entries.
Although GeneNet and GLasso can provide dierent results on a given run,
both methods perform similarly on average (50 runs for our experiment). The
only parameter we change in this experimental setting is the n=p ratio.
Perfect SIMoNe's curves dominate all other curves for any n=p ratio. This
clearly shows that the knowledge of the structure provides a valuable informa-
tion for selecting the nonzero entries of the concentration matrix. When the
structure is hidden, the main problem of our approach is then to nd a reliable
estimate of this structure from the initial data.
Perfect SIMoNe and SIMoNe perform equivalently when n = 10p and when
the ratio n=p decreases, Perfect SIMoNe tends to outperform SIMoNe more
clearly. This means that SIMoNe is able to recover the latent structure when
there is enough data, but does not nd a substantial structure when n drops
below p.
When p > n, the empirical covariance matrix ceases to be invertible. Thus,
Figures 2 (e-f) do not display the InvCor results. Although it is possible to show
that both GLasso and SIMoNe increase the number of inferred true nonzero
elements with the number of iterations in all settings, precision/recall curves
show the relative poor performances for all tested algorithms when p n.
Notice that when p > n, the estimated latent structure is not reliable. Never-
theless, the performance of SIMoNe remains comparable to that of GLasso. We
can therefore see that assuming the existence of a latent structure when there
is none does not impair the selection of nonzero entries of the matrix K.
5.2. Breast Cancer data
We tested our algorithm on a gene expression data set provided by Hess et al.
(2006) and concerning 133 patients with stage I III breast cancer. The pa-
tients were treated with chemotherapy prior to surgery. Patient response to the
treatment is classied as either a pathologic complete response (pCR) or a resid-
ual disease (not-pCR). Hess et al. (2006) and Natowicz et al. (2008) developed
and tested a multigene predictor for treatment response on this data set. They
focused on a set of 26 genes having a high predictive value (see Table 1). We
thus consider a total of n = 133 cases containing p = 26 gene expression levels.
When dealing with gene regulatory networks, we typically observe n inde-
pendent microarray experiments, each giving the expression levels of the same
p genes. If the same experimental conditions are used for all microarrays, these
may be considered as a sample of the same experiment. In the application in
question, cases from the pCR class (34 cases) and from the not-pCR class (99
cases) clearly do not have the same distribution. We apply our algorithm on
each class of patients. Two distinct gene regulatory networks are thus inferred.
Figure 3 plots the resulting networks obtained for three dierent penaliza-
tions. The penalization parameters were heuristically chosen from the number of
expected nonzero entries. We used Q = 2 latent clusters, and it is interesting to
note that when assuming more than two clusters, the algorithm systematically
produces exactly two non-empty clusters.
The inferred networks exhibit very dierent structures according to the class
of patients. This in itself is interesting and suggests that gene regulation diers
with respect to the presence or absence of a pCR.
The network obtained with not-pCR cases displays a two-star pattern. Each
star connects to a unique gene, either SCUBE2 or IGFBP4. Almost all the most
signicant connections imply SCUBE2. This star pattern suggests that further
studies of this particular gene would be of interest for understanding residual
disease.
The network estimated with the pCR cases has a dierent two-cluster struc-
ture. In particular, it groups IGFBP4 and SCUBE2 in the same cluster with a
direct signicant link. This again indicates a completely dierent relationship
between the genes in pCR versus non-pCR.
Gene symbol Gene name
MAPT Microtubule-associated protein
BBS4 Bardet-Biedl syndrome 4
THRAP2 Thyroid hormone receptor associated protein 2
MBTP-S1 Hypothetical protein
PDGFRA Human clone 23,948 mRNA sequence
ZNF552 Zinc nger protein 552
RAMP1 Receptor (calcitonin) activity modifying protein 1
BECN1 Beclin 1 (coiled-coil, myosin-likeBCL2 interacting protein)
BTG3 BTG family, member 3
SCUBE2 Signal peptide, CUB domain,EGF-like 2
MELK Maternal embryonic leucine zipper kinase
AMFR Autocrine motility factor receptor
CTNND2 Catenin, delta 2
GAMT Guanidinoacetate N-methyl transferase
CA12 Carbonic anhydrase XII
FGFR1OP FGFR1 oncogene partner
KIAA1467 KIAA1467 protein
MTRN Meteorin, glial cell dierentiation regulator
FLJ10916 Hypothetical protein FLJ10916
E2F3 E2F transcription factor 3
ERBB4 V-erb-a erythroblastic leukemiaviral oncogene homolog 4(avian)
JMJD2B Jumonji domain containing 2B
RRM2 Ribonucleotide reductase M2polypeptide
FLJ12650 Hypothetical protein FLJ12650
GFRA1 GDNF family receptor 1
IGFBP4 Insulin-like growth factor binding protein 4
Table 1
The key genes that composed the inferred networks.
not-pCR pCR
L
ow
p
en
al
ty
AMFR
BB_S4
BECNI BTG3
CA12
CTNND2
E2F3
ERBB4
FGFRIOP
FLJ10916
FLJI2650 GAMT
GFRAI
IGFBP4
JMJD2B
KIA1467
MAPT
MBTP_SI
MELK
METRN
PDGFRA
RAMPI
RRM2
SCUBE2
THRAP2
ZNF552
AMFR
BB_S4 BECNI
BTG3
CA12
CTNND2
E2F3
ERBB4
FGFRIOP
FLJ10916
FLJI2650
GAMT
GFRAI
IGFBP4JMJD2B
KIA1467
MAPT
MBTP_SI
MELK
METRN
PDGFRA RAMPI
RRM2SCUBE2
THRAP2
ZNF552
M
ed
iu
m
p
en
al
ty
AMFR
BB_S4
BECNI
BTG3
CA12
CTNND2
E2F3
ERBB4
FGFRIOP
FLJ10916
FLJI2650
GAMT
GFRAI
IGFBP4
JMJD2B
KIA1467
MAPT
MBTP_SI
MELK
METRN
PDGFRA
RAMPI
RRM2
SCUBE2
THRAP2
ZNF552
AMFR
BB_S4
BECNI
BTG3
CA12
CTNND2
E2F3
ERBB4
FGFRIOP
FLJ10916
FLJI2650
GAMT
GFRAI
IGFBP4
JMJD2B
KIA1467
MAPT
MBTP_SI
MELKMETRN
PDGFRA
RAMPI
RRM2
SCUBE2
THRAP2
ZNF552
H
ig
h
p
en
al
ty
AMFR
BB_S4
BECNI
BTG3
CA12
CTNND2
E2F3
ERBB4
FGFRIOP
FLJ10916
FLJI2650
GAMT
GFRAI
IGFBP4
JMJD2B
KIA1467
MAPT
MBTP_SIMELK
METRN
PDGFRA
RAMPI
RRM2
SCUBE2THRAP2
ZNF552 AMFR
BB_S4
BECNI
BTG3
CA12
CTNND2
E2F3
ERBB4
FGFRIOP
FLJ10916
FLJI2650
GAMT
GFRAI
IGFBP4
JMJD2B
KIA1467
MAPT MBTP_SI
MELK
METRN
PDGFRA
RAMPI
RRM2
SCUBE2
THRAP2
ZNF552
Fig 3. Inferred graphs for three dierent penalization's levels.
References
O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse
maximum likelihood estimation for multivariate Gaussian or binary data. J.
Mach. Learn. Res., 9:485{516, 2008.
R. Castelo and Alberto Roverato. A robust procedure for Gaussian graphical
model search from microarray data with p larger than n. J. Mach. Learn.
Res., 7:2621{2650, 2006.
Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic
decomposition by basis pursuit. SIAM Rev., 43(1):129{159 (electronic), 2001.
ISSN 0036-1445. Reprinted from SIAM J. Sci. Comput. 20 (1998), no. 1,
33{61 (electronic) [ MR1639094 (99h:94013)].
J.-J. Daudin, F. Picard, and S. Robin. A mixture model for random graphs.
Stat. Comput., 18(2):173{183, 2008.
A. P. Dempster. Covariance selection. Biometrics, Special Multivariate Issue,
28:157{175, 1972.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from
incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B, 39(1):
1{38, 1977. ISSN 0035-9246. With discussion.
Adrian Dobra, Chris Hans, Beatrix Jones, Joseph R. Nevins, Guang Yao, and
Mike West. Sparse graphical models for exploring gene expression data. J.
Multivariate Anal., 90(1):196{212, 2004.
David L. Donoho and Iain M. Johnstone. Adapting to unknown smoothness via
wavelet shrinkage. J. Amer. Statist. Assoc., 90(432):1200{1224, 1995.
Mathias Drton and Michael D. Perlman. Multiple testing and error control in
gaussian graphical model selection. Statist. Sci., 22:430, 2007.
Mathias Drton and Michael D. Perlman. A SINful approach to gaussian graph-
ical model selection. J. Statist. Plann. Inference, 138(4):1179{1200, 2008.
Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least
angle regression. Ann. Statist., 32(2):407{499, 2004. With discussion, and a
rejoinder by the authors.
Ove Frank and Frank Harary. Cluster inference by using transitivity indices
in empirical graphs. J. Amer. Statist. Assoc., 77(380):835{840, 1982. ISSN
0162-1459.
J. Friedman, T. Hastie, H. Ho
ing, and R. Tibshirani. Pathwise coordinate
optimization. The Annals of Applied Statistics, 1(2):302{332, 2007.
J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation
with the graphical lasso. Biostatistics, 9(3):432{441, 2008.
W.J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Compu-
tational and Graphical Statistics, 7(3):397{416, 1998.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning.
Springer, 2001.
K.R. Hess, K. Anderson, W.F. Symmans, V. Valero, N. Ibrahim, J.A. Mejia,
D. Booser, R.L. Theriault, U. Buzdar, P.J. Dempsey, R. Rouzier, N. Sneige,
J.S. Ross, T. Vidaurre, H.L. Gomez, G.N. Hortobagyi, and L. Pustzai. Phar-
macogenomic predictor of sensitivity to preoperative chemotherapy with pa-
clitaxel and
uorouracil, doxorubicin, and cyclophosphamide in breast cancer.
Journal of Clinical Oncology, 24(26):4236{4244, 2006.
Jan Ihmels, Gilgi Friedlander, Sven Bergmann, Ofer Sarig, Yaniv Ziv, and
Naama Barkai. Revealing modular organization in the yeast transcriptional
network. Nature Genetics, pages 370{377, July 2002.
Jaakkola. Advanced mean eld methods: theory and practice. MIT Press, 2000.
Beatrix Jones, Carlos Carvalho, Adrian Dobra, Chris Hans, Chris Carter, and
Mike West. Experiments in stochastic computation for high-dimensional
graphical models. Statist. Sci., 20(4):388{400, 2005.
Markus Kalisch and Peter Buhlmann. Estimating high-dimensional directed
acyclic graphs with the pc-algorithm. J. Mach. Learn. Res., 8:613{636, Mar
2007.
Steen L. Lauritzen. Graphical models, volume 17 of Oxford Statistical Science
Series. The Clarendon Press Oxford University Press, New York, 1996. ISBN
0-19-852219-3. Oxford Science Publications.
M. Mariadassou and S. Robin. Uncovering latent structure in valued graphs:
a variational approach. Technical Report 10, Statistics for Systems Biology,
2007.
Nicolai Meinshausen and Peter Buhlmann. High-dimensional graphs and vari-
able selection with the lasso. Ann. Statist., 34(3):1436{1462, 2006.
M. Minoux. Mathematical programming. Theory and algorithms. John Wiley
and Sons, 1986.
R. Natowicz, R. Incitti, E.G. Horta, B. Charles, P. Guinot, K. Yan, C. Coutant,
F. Andre, and R. Pusztai, L. Rouzier. Prediction of the outcome of a preop-
erative chemotherapy in breast cancer using dna probes that provide infor-
mation on both complete and incomplete response. BMC Bioinformatics, 9
(149), 2008.
A.Y. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an
algorithm. In NIPS 14, 2002.
Krzysztof Nowicki and Tom A. B. Snijders. Estimation and prediction for
stochastic blockstructures. J. Amer. Statist. Assoc., 96(455):1077{1087, 2001.
ISSN 0162-1459.
M. R. Osborne, Brett Presnell, and B. A. Turlach. A new approach to variable
selection in least squares problems. IMA J. Numer. Anal., 20(3):389{403,
2000a.
Michael R. Osborne, Brett Presnell, and Berwin A. Turlach. On the LASSO
and its dual. J. Comput. Graph. Statist., 9(2):319{337, 2000b.
Juliane Schafer and Korbinian Strimmer. A shrinkage approach to large-scale
covariance matrix estimation and implications for functional genomics. Sta-
tistical Applications in Genetics and Molecular Biology, 4(1), 2005.
Tom A. B. Snijders and Krzysztof Nowicki. Estimation and prediction for
stochastic blockmodels for graphs with latent block structure. J. Classi-
cation, 14(1):75{100, 1997. ISSN 0176-4268.
C. Tallberg. A Bayesian approach to modeling stochastic blockstructures with
covariates. Journal of Mathematical Sociology, 29(1):1{23, 2005.
Robert Tibshirani. Regression shrinkage and selection via the lasso. J. Roy.
Statist. Soc. Ser. B, 58(1):267{288, 1996.
P. Tseng. Convergence of a block coordinate descent method for nondieren-
tiable minimization. J. Optim. Theory Appl., 109(3):475{494, 2001.
Anja Wille and Peter Buhlmann. Low-order conditional independence graphs for
inferring genetic networks. Statistical Applications in Genetics and Molecular
Biology, 5(1), 2006.
Tong Tong Wu and Kenneth Lange. Coordinate descent algorithms for lasso
penalized regression. Ann. Appl. Stat., 2(1):224{244, 2008.
Ming Yuan and Yi Lin. Model selection and estimation in the Gaussian graphical
model. Biometrika, 94(1):19{35, 2007.
H. Zanghi, F. Picard, V. Miele, and C. Ambroise. Strategies for online inference
of network mixture. Technical Report 14, Statistics for Systems Biology, 2008.
Hui Zou. The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc.,
101(476):1418{1429, 2006.
Appendix A: Appendix section
A.1. Proof of the equivalence between the constraints
When kUq`k1 1, we have for each couple (i; j) 2 P
2,
( S)ij
=
2
n
X
q;`
j(Uq`)ij (Tq`)ij j
2
n
X
q;`
Tq`;ij = P i j :
Thus kUq`k1 1) k( S) =Pk1 1.
On the other hand, assume that k( S) =Pk1 1, that is, for all i; j 2
P, we have
P i j ( S)ij P i j :
This also means that there exists some ij 2 [0; 1] such that
( S)ij = ijP i j + (1 ij)( P i j ) =
2
n
X
q;`
(2ij 1)Tq`;ij :
We choose Uql such that (Uql)ij = (2ij 1) for all q; ` 2 Q. Then, since
ij 2 [0; 1], we have
1 (Uq`)ij 1; 8i; j 2 P;
which proves that k( S) =Pk1 1) kUq`k1 1.
A.2. Fixed-point study
Let us rst introduce some notation. For any i; j 2 P and any q; ` 2 Q, consider
the random variables
Lijq` =
jKij j
q`
+ log 2q`:
Let u : RpQ ! RpQ be dened by its coordinate functions u = (uiq)i2P;q2Q in
the following way
8a = (aiq)i2P;q2Q 2 RpQ;
uiq(a) = q exp
n
X
j 6=i
X
`
aj`Lijq`
o
= q exp
n
X
j 6=i
X
`
aj`
jKij j
q`
+ log 2q`
o
;
and let g = (giq)i2P;q2Q : RpQ ! RpQ satisfy
8a 2 RpQ; giq(a) =
uiq(a)
P
` ui`(a)
:
According to Proposition 3, the optimal parameter b is a xed-point of g.
Now, let
=
n
a = (aiq)i2P;q2Q 2 RpQ;8i 2 P; q 2 Q; aiq 2 [0; 1] and
X
q
aiq = 1
o
:
We wish to study the xed-points of g in . First, let us note that as is a
compact state space and as the function g satises g : ! and is continuous,
the existence of a xed-point of g follows from Brouwer's Theorem.
We now restrict our attention to a smaller set than the whole state space .
For any " > 0, let
" =
n
a 2 ;8i 2 P; q 2 Q; aiq 2 ["; 1 "]
o
:
Note that we do not claim that g : " ! ". However, the existence of a xed-
point of g is ensured in and if we assume q > 0 for any q 2 Q (which is a
reasonable assumption if the number of classes Q is not too large), it can easily
be seen that any xed-point satises aiq > 0, for any i 2 P and any q 2 Q.
Thus for suciently small " > 0, the xed-points of g belong to ".
In order to study the behaviour of g in the vicinity of a xed-point, we need
to look at some kind of contraction property for g. To this end we introduce a
distance d on " that will make use of the form of the state space ". For all
a; b 2 ", denote by ai = (aiq)q2Q 2 RQ and bi = (biq)q2Q 2 RQ. Moreover, let
d(a; b) = max
i2P
d0(ai; bi) = max
i2P
log
maxq2Q aiq=biq
minq2Q aiq=biq
= max
i2P
max
q;`2Q
log
aiqbi`
biqai`
:
It is well known that d0 is a distance in ["; 1 "]Q, and it is easy to check that
the resulting d is also a distance in ".
Now, x a; b 2 RpQ and consider the distance d(g(a); g(b)). It is easily checked
that
d(g(a); g(b)) = max
i2P
d0(gi(a); gi(b)) = max
i2P
d0(ui(a); ui(b)) = max
i2P
d0(ui(a); ui(b));
where u = (ui)i2P = (uiq)i2P;q2Q is dened in the following way
8a = (aiq)i2P;q2Q 2 RpQ;
uiq(a) = exp
nX
j 6=i
X
`
aj`Lijq`
o
= exp
nX
j 6=i
X
`
aj`
jKij j
q`
+ log 2q`
o
:
In the following, x " > 0 and a; b 2 " and denote by
8i 2 P; ci1 = min
q2Q
aiq
biq
; ci2 = max
q2Q
aiq
biq
:
With these notations, we have
d(a; b) = max
i2P
d0(ai; bi) = max
i2P
log
ci2
ci1
: (33)
We only consider the aliation model described in (5). Thus, there are only
two dierent values for q`, namely in and out for intra and extra cluster
connectivity.
Lemma 3. If for any i; j 2 P; i 6= j and any 2 fin; outg, we have
0 <
jKij j
+ log 2 <
"
2(p 1)(1 + ")
almost surely, (34)
then the function g satises a contraction property on ".
Before proving the lemma, let us explain the consequences of this result.
Consider the function hK dened on (0;+1) by
hK() =
jKj
+ log 2:
This function rst decreases from +1 to the value 1 + log 2jKj on the interval
(0; jKj) and then increases from 1 + log 2jKj to +1 on (jKj;+1).
At any step of the algorithm, if the current values K(m)ij of the concentration
matrix are small enough, namely smaller than 1=(2e) ' 0:184 then the functions
h
K(m)ij
take all the values between 1+log 2jKj < 0 and +1. Thus, there is room
for choosing in; out such that (34) is satised. In such a case, the xed-point
we are looking for is unique and the iterative procedure setting b (s+1) = g(b (s))
converges.
Proof. Using that for any j 2 P and any ` 2 Q, we have cj1bj` aj` c
j
2bj`
and Lijq` > 0, we get
exp
0
@
X
j 6=i
cj1
X
`
bj`Lijq`
1
A uiq(a) exp
0
@
X
j 6=i
cj2
X
`
bj`Lijq`
1
A :
Thus, it follows
exp
0
@
X
j 6=i
(cj1 1)
X
`
bj`Lijq`
1
A
uiq(a)
uiq(b)
exp
0
@
X
j 6=i
(cj2 1)
X
`
bj`Lijq`
1
A :
(35)
In the case of the aliation model, for xed i; j 2 P and q 2 Q, the set of
random variables fLijq`g`2Q is reduced to only two random values, namely
Linij =
jKij j
in
+ log 2in; L
out
ij =
jKij j
out
+ log 2out:
For the sake of simplicity, we assume Q = 2 groups (our arguments may be
easily generalized to 3 groups or more). Now, denoting Lmaxij = max(L
in
ij ; L
out
ij )
and Lminij = min(L
in
ij ; L
out
ij ), it can easily be seen that (for " < 1=2),
sup
b2"
X
`
bj`Lijq` = (1 ")L
max
ij + "L
min
ij
inf
b2"
X
`
bj`Lijq` = (1 ")L
min
ij + "L
max
ij ;
almost surely. Note that if we have Q 3 groups, explicit bounds can also be
obtained (their expression is only slightly more complicated). Coming back to
(35), we get
exp
0
@
X
j 6=i
(cj1 1)f(1 ")L
min
ij + "L
max
ij g
1
A
uiq(a)
uiq(b)
exp
0
@
X
j 6=i
(cj2 1)f(1 ")L
max
ij + "L
min
ij g
1
A :
This leads to
d0(ui(a); ui(b)) = log
maxq2Q uiq(a)=uiq(b)
minq2Q uiq(a)=uiq(b)
X
j 6=i
(cj2 1)f(1 ")L
max
ij + "L
min
ij g
X
j 6=i
(cj1 1)f(1 ")L
min
ij + "L
max
ij g
X
j 6=i
Lmaxij fc
j
2 1 "(c
j
2 + c
j
1 2)g+ L
min
ij f1 c
j
1 + "(c
j
2 + c
j
1 2)g:
Finally, recall that d(g(a); g(b)) = maxi d0(ui(a); ui(b)), leading to
d(g(a); g(b)) max
i2P
n
ci2 1 "(c
i
2 + c
i
1 2)
_
1 ci1 + "(c
i
2 + c
i
1 2)
o
max
i2P
X
j 6=i
(Lmaxij + L
min
ij ):
Now, using the inverse triangle inequality, and the fact that ci1 1 c
i
2, we get
for any i 2 P,
jci2 + c
i
1 2j =
jci2 1j j1 c
i
1j
jci2 c
i
1j = c
i
2 c
i
1:
Moreover, we have 0 ci2 1 c
i
2 c
i
1 and 0 1 c
i
1 c
i
2 c
i
1. This leads to
d(g(a); g(b)) (1 + ") max
i2P
(ci2 c
i
1)max
i2P
X
j 6=i
(Lmaxij + L
min
ij )
(1 + ") max
i2P
(ci2 c
i
1) 2(p 1) max
j 6=i
Lmaxij : (36)
Since a and b belong to ", we get that ci1; c
i
2 2 ["; "
1] and thus
ci2 c
i
1 = exp(log c
i
2) exp(log c
i
1)
1
"
log
ci2
ci1
:
In particular, recalling (33), we have
0 max
i2P
ci2 c
i
1
1
"
d(a; b):
Coming back to (36), we get
d(g(a); g(b)) (1 + "


