Sign up & Download
Sign in

Inferring sparse Gaussian graphical models with latent structure

by Christophe Ambroise, Julien Chiquet, Catherine Matias
Electronic Journal of Statistics (2008)

Abstract

Our concern is selecting the concentration matrix's nonzero coefficients for a sparse Gaussian graphical model in a high-dimensional setting. This corresponds to estimating the graph of conditional dependencies between the variables. We describe a novel framework taking into account a latent structure on the concentration matrix. This latent structure is used to drive a penalty matrix and thus to recover a graphical model with a constrained topology. Our method uses an ell1 penalized likelihood criterion. Inference of the graph of conditional dependencies between the variates and of the hidden variables is performed simultaneously in an iterative textscem-like algorithm. The performances of our method is illustrated on synthetic as well as real data, the latter concerning breast cancer.

Cite this document (BETA)

Available from arxiv.org
Page 1
hidden

Inferring sparse Gaussian graphical models with latent structure

Inferring sparse Gaussian graphical
models with latent structure
Christophe Ambroise, Julien Chiquet and Catherine Matias
Laboratoire Statistique et Genome
523, place des Terrasses de l'Agora
91000 Evry, FRANCE
e-mail: christophe.ambroise; julien.chiquet; catherine.matias@genopole.cnrs.fr
url: http://stat.genopole.cnrs.fr
Abstract: Our concern is selecting the concentration matrix's nonzero
coecients for a sparse Gaussian graphical model in a high-dimensional
setting. This corresponds to estimating the graph of conditional depen-
dencies between the variables. We describe a novel framework taking into
account a latent structure on the concentration matrix. This latent struc-
ture is used to drive a penalty matrix and thus to recover a graphical model
with a constrained topology. Our method uses an `1 penalized likelihood
criterion. Inference of the graph of conditional dependencies between the
variates and of the hidden variables is performed simultaneously in an iter-
ative em-like algorithm. The performances of our method is illustrated on
synthetic as well as real data, the latter concerning breast cancer.
AMS 2000 subject classi cations: Primary 62H20, 62J07; secondary
62H30.
Keywords and phrases: Gaussian graphical model, Mixture model, `1-
penalization, Model selection, Variational inference, EM algorithm.
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 A latent structure model for network inference . . . . . . . . . . . . . 7
2.1 Gaussian graphical models: general settings . . . . . . . . . . . . 7
2.2 Providing the network with a latent structure . . . . . . . . . . . 7
2.3 The complete likelihood . . . . . . . . . . . . . . . . . . . . . . . 8
3 Inference strategy by alternate optimization . . . . . . . . . . . . . . . 10
3.1 Variational estimation of the latent structure . . . . . . . . . . . 10
3.2 A Lasso-like method to estimate the concentration matrix . . . . 13
3.3 Choice of penalty parameters . . . . . . . . . . . . . . . . . . . . 17
4 Link with Meinshausen and Buhlmann's approach . . . . . . . . . . . 19
5 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.1 Synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.2 Breast Cancer data . . . . . . . . . . . . . . . . . . . . . . . . . . 23
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
A Appendix section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
A.1 Proof of the equivalence between the constraints . . . . . . . . . 29
A.2 Fixed-point study . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
A.3 Proof of Lemma 2 (Lasso with pathwise coordinate optimization) 33
A.4 Reconstruction of the concentration matrix . . . . . . . . . . . . 34
A.5 Pseudo-likelihood of a Gaussian vector . . . . . . . . . . . . . . . 34
A.6 Penalization upper bound . . . . . . . . . . . . . . . . . . . . . . 35
1
ar
X
iv
:0
81
0.
31
77
v1
[
sta
t.M
E]
1
7 O
ct
20
08
Page 2
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 2
1. Introduction
Estimating the concentration matrix (namely the inverse of the covariance ma-
trix) of a Gaussian vector in a sparse, high-dimensional setting has received
much attention recently. Graphical models provide a convenient setting for mod-
elling multivariate dependence patterns. In this framework, an undirected graph
is matched to the Gaussian random vector, where each vertex corresponds to
one coordinate of the vector, and an edge is not present between two vertices
if the corresponding random variables are independent, conditional on the re-
maining variables. Now, conditional independence between two coordinates of
the Gaussian random vector corresponds exactly to a zero entry in the concen-
tration matrix. Thus, detecting nonzero elements in the concentration matrix
is equivalent to reconstructing the Gaussian graphical model (GGM, see e.g.
Lauritzen 1996).
We focus here on the crucial problem of selecting the concentration matrix's
nonzero coecients. In other words, we focus on variable selection rather than
estimation. Application areas include gene-regulation graph inference in Biol-
ogy (using gene expression microarray data), as well as spectroscopy, climate
studies, functional magnetic resonance imaging, etc. We provide a very novel
approach driving the graph selection according to an unobserved modular struc-
ture on the vertices.
The idea of covariance selection rst appeared in the work of Dempster
(1972). In the so-called \large p, small n" setting (namely when the number
of observations is smaller than the dimension of the observed response), the
need for covariance selection is huge, as the empirical covariance matrix is no
longer regular.
In Drton and Perlman (2007), a classi cation of the di erent methods for
model selection/estimation in GGMs into three group types is suggested: constr-
aint-based methods, performing statistical tests; Bayesian approaches; and score-
based methods, maximizing a model-based criterion. The multiple testing prob-
lem has been taken into account in Drton and Perlman (2007; 2008). The authors
perform GGM covariance selection by multiple testing of hypotheses about van-
ishing partial correlation coecients. Such procedures may also be implemented
using the PC-algorithm (Kalisch and Buhlmann 2007). Starting from a com-
plete, undirected graph, the PC-algorithm deletes edges recursively, according
to conditional independence decisions. However, the statistical procedure in Dr-
ton and Perlman (2007) relies on asymptotic considerations, a regime never
attained in real situations.
Another attempt in this vein is to consider limited-order partial correlations
(Wille and Buhlmann 2006, Castelo and Roverato 2006). In Wille and Buhlmann
(2006) the authors consider only zero and rst-order conditional dependencies.
They argue that for sparse graphical models, these low-order dependencies still
re
ect reasonably well the full-order conditional dependency structure. More-
over, these dependencies may be well estimated even with a small number of
observations. In Castelo and Roverato (2006), the authors introduce a non-
rejection rate to reduce the multiple testing and computational problems to
which these approaches give rise.
A Bayesian framework is proposed in Dobra et al. (2004), Jones et al. (2005)
Page 3
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 3
and their method was applied for evaluating patterns of association in large-scale
gene expression data. The approach is based on dependency networks, namely
a collection of conditional distributions fP(XijXni)g (where Xni stands for the
set of all variables but Xi). However, such conditional distributions will not in
general result in a coherent joint distribution. This point is further discussed
below concerning a similar problem appearing in Meinshausen and Buhlmann
(2006). Moreover, constructing priors on the set of concentration matrices is not
a trivial task (it mainly relies on Wishart priors). The use of MCMC procedures
limits the range of applications to moderate-sized networks.
Before focusing on score-based methods, let us rst introduce regularization
procedures. In the context of linear regression, the Lasso (least absolute shrink-
age and selection operator) technique was introduced by Tibshirani (1996). This
procedure performs model selection and parameter estimation at the same time.
The idea is that ordinary least-squares criterion may be improved in a sparse
context, using an `1-norm penalty. The `1-norm penalty shrinks the estimates
to zero while preserving the convexity of the optimization problem. Note that
the `1-norm penalization is also known as 'basis pursuit' in signal processing
(Chen et al. 2001).
It is well known that if the ultimate goal is parameter estimation, model
selection and estimation should be done in a single step. Performing the model
selection prior to parameter estimation in the selected model will, in fact, result
in a non-robust procedure. However, our primary focus here is on model selec-
tion, as we want to infer sparse networks. We therefore concentrate on model
selection rather than on estimation performances.
The Lars algorithm (Efron et al. 2004) is one of the most popular techniques
for solving the Lasso problem. It gives the path of solutions obtained when
varying the penalty parameter (the penalty parameter is used as a scaling factor
of the `1-norm penalty). The larger the penalty parameter, the sparser the
Lasso solution.
Using convex optimization techniques (see for instance Minoux 1986), the
Lasso problem may be stated as a primal problem, whose dual formulation
may be solved more easily. This approach is taken in Osborne et al. (2000b).
The authors obtain an iterative algorithm, the \homotopy method" (Osborne
et al. 2000a). Other very ecient approaches are based on focusing on each coor-
dinate iteratively. Indeed, for each coordinate, the Lasso problem is solved very
simply (assuming the other coordinates are xed) by soft-thresholding (Donoho
and Johnstone 1995). Thus, di erent 'coordinate optimization' procedures have
been proposed in the literature. Following the work of Fu (1998), a cyclic pro-
cedure is proposed in Friedman et al. (2007), where optimization with respect
to each coordinate is done iteratively; whereas Wu and Lange (2008) propose a
greedy approach, computing the solution for each coordinate and choosing that
which provides the largest decrease in a surrogate objective function. Note that
these approaches rely on the underlying assumption that the predictors for the
regression problem are uncorrelated.
Let us now come back to covariance (or concentration) matrix inference in
GGMs, using maximization of a model-based criterion.
Meinshausen and Buhlmann (2006) were the rst authors to apply Lasso
techniques for inferring a covariance matrix in a GGM. Their approach is to solve
Page 4
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 4
p di erent Lasso regression problems, where p is the dimension of the observed
vector. The main drawback of such a procedure is that a symmetrization step is
required to obtain the nal network. It might, for instance, be the case that the
estimator of the regression coecient for Xi on Xj is zero, whereas the estimator
for Xj on Xi is not zero. Meinshausen and Buhlmann propose to use either an
\AND" or an \OR" nal step procedure to recover an undirected correlation
graph. However, these two procedures might result in di erent estimates and
there is no way of choosing between them. Moreover, as previously stated, a set
of conditional distributions does not necessarily cohere into a joint distribution.
Using a set of possibly non-coherent conditional distributions corresponds to
a pseudo-likelihood approach. This aspect was not underlined in Meinshausen
and Buhlmann (2006), and we clarify this point in Section 4.
Subsequently, two other articles, Banerjee et al. (2008) and Yuan and Lin
(2007) independently provided an improvement of the initial work of Mein-
shausen and Buhlmann (2006). In both works, the problem is seen as a penalized
maximum-likelihood (PML) problem. Instead of considering p di erent regres-
sion problems, these two articles focus on the likelihood of the Gaussian vector,
penalizing the entries of the concentration matrix with an `1-norm penalty.
They explain how the PML estimation may be solved as a \Lasso-like" prob-
lem. The major issue with PML strategies in the context of the concentration
matrix estimation of GGMs is to obtain a positive de nite estimate. However,
the approach for solving the problem in Yuan and Lin (2007) is not suited to
high-dimensional settings, in contrast to the approach proposed in Banerjee
et al. (2008). In Yuan and Lin (2007), a non-negative garrote-type estimator is
used, and asymptotic properties (as n tends to in nity while p is held xed)
are given. In Banerjee et al. (2008), two di erent algorithms are proposed for
solving problems in a high-dimensional setting. The rst approach relies on a
block-coordinate descent algorithm. The second is a semi-de nite programming
algorithm, based on Nesterov's method, which is computationally intensive.
The next improvement in this vein comes with Friedman et al. (2008). Re-
lying on coordinate descent techniques, previously described in Friedman et al.
(2007), the authors revisit Banerjee et al.'s rst approach and propose an e-
cient algorithm to solve the PML estimation problem, under the positive de nite
constraint. In fact, they use the block coordinate descent approach proposed by
Banerjee et al. (2008) and combine it with a second coordinate descent method.
Our method will make use of this approach.
To conclude this part, we remark that a completely di erent shrinkage esti-
mate was proposed by Schafer and Strimmer (2005) in the same context of large-
scale covariance matrix estimation. The approach consists in using a weighted
average of two di erent estimators, the rst being unconstrained (thus having
small bias but large variance), the second being low-dimensional (and thus ex-
hibiting small variance but large bias).
Now let us motivate the use of hidden structures in networks. Modularity
is a property observed in real (biological) networks (see for instance Ihmels
et al. 2002). Heterogeneity in the node behaviors is an important property of
these data. For example, so-called 'hubs' are highly-connected nodes, showing
a di erent behavior from the rest of the graph nodes. An interesting model
capturing these network features is a mixture model for random graphs (see
for instance Daudin et al. 2008). This model has been rediscovered many times
Page 5
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 5
in the literature, and a non-exhaustive bibliography should include Frank and
Harary (1982), Snijders and Nowicki (1997), Nowicki and Snijders (2001), Tall-
berg (2005), Daudin et al. (2008), Mariadassou and Robin (2007), Zanghi et al.
(2008). To state it simply, this model assumes that each node belongs to some
unobserved group. Conditional on the node groups, the (weighted) edges are
independent and identically-distributed (i.i.d.) random variables, whose distri-
bution depends on the groups of the nodes to be connected. As we are interested
in GGMs, weighted edges correspond to entries of the concentration matrix.
In this work, we aim at estimating a hidden structure, namely node groups,
while discovering the network. This hidden structure should help us in choosing
adaptive penalty parameters. Indeed, we wish to penalize the elements of the
concentration matrix, according to the unobserved clusters to which the nodes
belong. For instance, if two nodes belong to the same unobserved group, we wish
to lower the penalty parameter acting on the corresponding entry in the concen-
tration matrix. Conversely, if we increase the penalty parameters on the entries
corresponding to nodes belonging to di erent groups, we shrink the estimated
coecient to zero. Our approach is completely new and improves inference of
sparse modular networks.
Another adaptive Lasso procedure is given in Zou (2006), whose idea is to
lower the bias of the large coecients by adapting the penalty parameter of each
coecient so that it automatically scales with the inferred value. It is known
that the non-adaptive Lasso procedure may result in inconsistent parameter
estimation. An illustration of the con
ict between optimal prediction and con-
sistent variable selection for the Lasso procedure is given in Meinshausen and
Buhlmann (2006). They proved that the optimal penalty parameter for predic-
tion gives inconsistent variable selection results, motivating the use of another
penalty parameter to ensure the control of the probability of falsely connecting
two or more distinct connectivity components of the graph. Like them, we also
focus on optimal selection rather than on optimal prediction. The adaptivity of
our procedure is not used for lowering the bias of large coecients, but instead
for constraining the prediction to t the underlying structure of the graph.
Model. Let us now brie
y describe the general approach of our work. The
model will be presented in detail in Section 2. Let X = (X1; : : : ; Xp)| be a
Gaussian random vector in Rp, with zero mean and positive de nite covariance
matrix , namely X  N (0p;): We observe independent and identically-
distributed (i.i.d) vectors (X1; : : : ; Xn) with the same distribution as X. The
matrix K = 1 is the concentration matrix of the model. Let S be the empirical
covariance matrix. The log-likelihood of the observations is given by
L(K) =
n
2
log det(K)
1
2
nX
k=1
(Xk)|KXk + c =
n
2
log det(K)
n
2
Tr(SK) + c
where c is a constant term.
The `1-penalized estimator proposed by Banerjee et al. (2008) is given by
bK = arg max
K0
log det(K) Tr(SK) kKk`1 ; (1)
Page 6
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 6
where K  0 stands for positive de niteness,  > 0 is a penalty parameter and
kKk`1 =
P
ij jKij j.
A natural generalization of this approach is to have di erent penalty param-
eters for di erent entries Kij . Namely,
log det(K) Tr(SK) k(K)k`1 ;
where (K) = (ij(Kij))i;j2P is a matrix of penalty functions acting on each
entry. As a general rule, using as many penalty functions as there are entries in
the concentration matrix to be estimated is not meaningful.
Here, we propose to take into account a hidden structure on the correlations
between the coordinates random variables Xk. Thus, we consider latent i.i.d.
random variables Z1; : : : ;Zp with values in a nite set f1; : : : ; Qg. Each variable
Zi describes the state of Xi, and we wish to adapt the penalty function ij with
respect to the states of Xi; Xj . More precisely, we wish to use a criterion of the
form
log det(K) Tr(SK) kZ(K)k`1 ;
where Z(K) = (ZiZj (Kij))i;j2P is a matrix of random penalty functions whose
entries depend on the latent structure Z = Z1; : : : ;Zp. However, the hidden
structure is not supposed to be known, thus we cannot rely on the previous
criteria. Intuitively, following the principle of Expectation-Maximization (em)
algorithm of Dempster et al. (1977), the idea will be to replace the unobserved
value Z(K) with its conditional expectation E(Z(K)jX
1; : : : ; Xn; K(m)) under
some model with parameter K(m), and iterate the following steps
(e) Compute E(Z(K)jX
1; : : : ; Xn; K(m))
(m) Update K(m+1) = argmaxK0 E(Z(K)jX
1; : : : ; Xn; K(m)).
One of our aims is to provide a very simple framework for such an analysis.
Note that the `1-norm used here acts on diagonal elements of the matrix K. It
is counter-intuitive to penalize diagonal elements of the concentration matrix,
as these do not re
ect sparsity in the correlation structure. However, from a
technical point of view, this strategy ensures that the procedure will select a
positive de nite estimator (see Remark 2). This point was not emphasized in
the previous procedures using `1 penalized likelihood of GGMs.
Road-map. In Section 2 we present the model and the penalized maximum-
likelihood criterion on which we base our inference procedure, described in Sec-
tion 3. This procedure relies on a variational em algorithm, combined with a
Lasso-like procedure. Section 4 explains how Meinshausen and Buhlmann's
approach may be interpreted as a penalized pseudo-likelihood method. Finally,
Section 5 illustrates the performance of the method on synthetic data, for which
an R{package, SIMoNe (Statistical Inference for Modular Network), can be down-
loaded from the rst author's website. We also test our algorithm on a real data
set provided by Hess et al. (2006) and concerning n = 133 patients with breast
cancer treated using chemotherapy. According to Hess et al. (2006) and Natow-
icz et al. (2008), the patient response to the treatment can be classi ed either
as a pathologic complete response (pCR), or as a residual disease (not-pCR).
The prediction of the patient response is achieved accurately by studying the
expression levels of a limited number of genes (p = 26). Our algorithm is ap-
plied on each class of patients (pcR and not-pCR). Two distinct gene-regulatory
Page 7
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 7
networks are thus inferred, showing a very di erent structure according to the
selected class of patients.
2. A latent structure model for network inference
In this section we present a framework for modelling heterogeneity among depen-
dencies between the variables. To this end, let us rst recall classical notations
from Gaussian Graphical Models (see Lauritzen 1996, for elementary results
about GGMs).
2.1. Gaussian graphical models: general settings
Let P = f1; : : : ; pg be a set of xed vertices, X = (X1; : : : ; Xp)| a random
vector describing a signal over this set and a sample (X1; : : : ; Xn) of size n with
the same distribution as X.
The vector X is assumed to be Gaussian with positive de nite covariance
matrix  = (ij)(i;j)2P2 . No loss of generality is involved when centering X,
so we may assume that X  N (0p;). GGMs are based on a classical result,
originally emphasized by Dempster (1972), claiming that any couple of entries
(Xi; Xj) with i 6= j are independent conditional on all other variables indexed
by Pnfi; jg, if and only if the entry (1)ij is zero. The inverse of the covari-
ance matrix K = (Kij)(i;j)2P2 = 
1, known as the concentration matrix, thus
describes the conditional independence structure of X. Moreover, each entry
Kij ; i 6= j is directly linked to the partial correlation coecient rijjPnfi;jg be-
tween variables Xi and Xj . In fact, we have rijjPnfi;jg = Kij=
p
KiiKjj , and
also Kii = Var(XijXPni)1. Hence, after a simple rescaling, the matrix K can
be interpreted as the adjacency matrix of an undirected weighted graph G rep-
resenting the partial correlation structure between variables X1; : : : ; Xp. This
graph has no self-loop, with a random set of edges composed by all pairs (i; j)
such that Kij 6= 0. Note that we are seeking only pairs of vertices (i; j) such
that i < j, since there is no self-loop, and since Kij = Kji. Inferring nonzero
entries of K is equivalent to inferring G, and is therefore a highly relevant issue
in this framework.
2.2. Providing the network with a latent structure
Let us now extend the modeling by providing the network with an internal latent
structure.
The model proposed in Daudin et al. (2008) attempts a better t of data, as
it places the network G in the mixture framework, in order to take account of
the heterogeneity among vertices. The same general mixture model is adopted
here: vertices of P are distributed among a set Q = f1; : : : ; Qg of hidden clusters
that model the latent structure of the network. For any vertex i, the indicator
variable Ziq is equal to 1 if i 2 q and 0 otherwise, hence describing which cluster
the vertex i belongs to. A vertex is assumed to belong to one cluster only,
thus the random vector Zi = (Zi1; : : : ; ZiQ) obviously follows a multinomial
distribution. Namely,
Zi M(1; ); (2)
where = ( 1; : : : ; Q) is a vector of cluster proportions, so that
P
q q = 1.
Page 8
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 8
The concentration matrix structure. We shall now extend the clustering
of vertices from P to the concentration matrix K. Accordingly, both the exis-
tence and the weight of edges, described by the o -diagonal elements of K, will
depend on the cluster each vertex belongs to. Conditional on the events i 2 q
and j 2 ` where q; ` are clusters chosen from Q, each Kij (i 6= j) is a random
variable whose probability density function is denoted by fq`, that is,
Kij j fZiqZj` = 1g  fq`(); i 6= j: (3)
It will be remarked that in this formulation the variables Kij are assumed to
be independent, conditional on the clusters the vertices belong to. Moreover,
we are considering only undirected graphs, so we may assume that fq` = f`q.
For technical reasons (see Remark 2), we also assume a prior distribution on
diagonal elements of K, namely
Kii  f0():
Our suggestion is to adopt Laplace distributions; hence
8x 2 R; fq`(x) =
1
2q`
exp


jxj
q`

; and f0(x) =
1
20
exp


jxj
0

; (4)
where q`; 0 > 0 are scaling parameters and q` = `q. Below, the parameter
0 will be xed and not estimated.
The reason for choosing a Laplace distribution is that it is reminiscent of
the `1-norm, itself linked to Lasso-techniques for which appropriate tools are
available. In fact, when considering the general penalized least-square problem,
the penalty term can be seen as a log-prior density on the vector of parameters.
In the case of Lasso, the prior distribution corresponding to the `1-norm is
actually the Laplace distribution (see, e.g. Hastie et al. 2001).
The aliation model. The aliation model is a special case of network
structure (to be investigated below), where there are many di erent clusters,
but where the focus is restricted to two types of edges: edges between nodes
of the same cluster, and edges between nodes from di erent clusters. In the
aliation model the densities fq` in (4) are of only two kinds; that is, for all
q; ` 2 Q, let
fq` =

fqq = fin(;in) if q = `; the intra-cluster density of edges;
fq` = fout(;out) if q 6= `; the inter-cluster density of edges:
(5)
2.3. The complete likelihood
Having described the modeling of the network, we now focus on the inference
issue.
We denote as X the np matrix that contains the data-set fX1; X2; : : : ; Xng
row-wisely organized, i.e., (Xk)| is the kth row of X. Furthermore, we denote
as Z = fZiqgi2P;q2Q the set of all latent indicator variables for vertices. For the
sake of simplicity, the number of clusters Q and the parameters = ( q)q2Q
and  = fq`gq;`2Q are assumed to be known for the moment.
The data experiments X are the only observations available, and from these
we should like to be able to infer the graph G of conditional dependencies or,
Page 9
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 9
equivalently, nonzero entries of K. As the matrix K has been given a prior
distribution, our aim is to maximize the posterior probability of K, given the
data X, or equivalently, the logarithm of the joint distribution. The estimate is
thus de ned as follows:
bK = arg max
K0
P(KjX) = arg max
K0
logP(X;K);
where K  0 stands for positive-de niteness.
To solve this problem, we place ourselves in the classical complete-data frame-
work. The distribution of K is only known conditionally on the latent structure
described by Z. We denote as Z the set of all possible clusterings over nodes
from P. The marginalization over the latent clusters Z leads to
bK = arg max
K0
log
X
Z2Z
Lc(X;K;Z);
where the so-called complete-data likelihood Lc(X;K;Z) = P(X;K;Z) is the
function we shall develop using an em-like strategy hereafter. For this purpose,
a closed form of Lc is required.
Proposition 1. The following relation holds for the complete-data likelihood
Lc.
logLc(X;K;Z) =
n
2
(log det(K) Tr(SK)) kZ(K)k`1

X
i;j2P;i6=j
q;`2Q
ZiqZj` log(2q`) +
X
i2P;q2Q
Ziq log q + c; (6)
where S = n1(X X)|(X X) is the empirical covariance matrix, c is a
constant term and Z(K) =

ZiZj (Kij)

i;j2P
is de ned by
ZiZj (Kij) =
8
>>>><
>>>>:
X
q;`2Q
ZiqZj`
jKij j
q`
if i 6= j;
jKiij
0
otherwise:
(7)
Proof. Using the Bayes rule, Lc divides into three terms:
logLc(X;K;Z) = logP(X;K;Z) = logP(XjK) + logP(KjZ) + logP(Z);
where we make use of the fact that logP(XjK;Z) = logP(XjK).
The rst term is the likelihood associated with a size-n sample of a multi-
variate Gaussian distribution, since X  N (0p;). Routine computations lead
to
logP(XjK) =
n
2
log det(K)
n
2
Tr(SK)
np
2
log(2):
Page 10
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 10
As regards the second term, using the expression (4), we have
logP(KjZ) =
X
i;j2P;i6=j
q;`2Q
ZiqZj` log fq`(Kij) +
X
i2P
log f0(Kii)
=
X
i;j2P;i6=j
q;`2Q
ZiqZj`

jKij j
q`
+ log(2q`)


X
i2P
jKiij
0
p log(20):
From (2), we have logP(Z) =
P
i;q Ziq log q, and the result follows.
3. Inference strategy by alternate optimization
In the classical em framework developed by Dempster et al. (1977), where X
is the available data, inferring the unknown parameters K spread over a latent
structure Z would make use of the following conditional expectation:
Q

KjK(m)

= E
n
logLc(X;K;Z)

X; K(m)
o
=
X
Z2Z
P

Z

X;K(m)

logLc(X;K;Z) =
X
Z2Z
P

Z

K(m)

logLc(X;K;Z);
(8)
where K(m) is the estimation of K from the previous step of the algorithm.
The usual em strategy would be to alternate an E-step computing the con-
ditional expectation (8) with an M-step maximizing this quantity over the pa-
rameter of interest K. Unfortunately, no closed form of Q

KjK(m)

can be
formulated in the present case. The technical diculty lies in the complex de-
pendency structure contained in the model. Indeed, P(ZjK) cannot be factor-
ized, as argued in Daudin et al. (2008). This makes the direct calculation of
Q

KjK(m)

impossible. To tackle this problem we use a variational approach
(see, e.g., Jaakkola 2000, for elementary results on variational methods). In
this framework, the conditional distribution of the latent variables P(ZjK(m))
is approximated by a more convenient distribution denoted by Rm(Z), which
is chosen carefully in order to be tractable. Hence, our em-like algorithm deals
with the following approximation of the conditional expectation (8)
ERm flogLc(X;K;Z)g =
X
Z2Z
Rm(Z) logLc(X;K;Z): (9)
In the following section we develop a variational argument in order to choose
an approximation Rm(Z) of P(ZjK(m)). This enables us to compute the condi-
tional expectation (9) and proceed to the maximization step.
3.1. Variational estimation of the latent structure (E-step)
In this part, K is assumed to be known, and we are looking for an approximate
distribution R() of the latent variables. The variational approach consists in
maximizing a lower bound J of the log-likelihood logP(X;K), de ned as follows:
J (X;K; R(Z)) = logP(X;K)DKL fR(Z)kP(ZjK)g (10)
Page 11
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 11
where DKL is the Kullback-Leibler divergence. This measures the di erence
between the probability distribution P(jK) in the underlying model and its ap-
proximation R(). An intuitively straightforward choice for R() is a completely
factorized distribution (see Mariadassou and Robin 2007, Zanghi et al. 2008)
R (Z) =
Y
i2P
h i(Zi); (11)
where h i is the density of the multinomial probability distribution M(1;  i),
and  i = (i1; : : : ; iQ) is a random vector containing the variational param-
eters to optimize. The complete set of parameters  = fiqgi2P;q2Q is what
we are seeking to obtain via the variational inference. In the case in hand the
variational approach intuitively operates as follows: each iq must be seen as
an approximation of the probability that vertex i belongs to cluster q, condi-
tional on the data, that is, iq estimates P(Ziq = 1jK), under the constraintP
q iq = 1. In the ideal case where P(ZjK) can be factorized as
Q
i P(ZijK)
and the parameters iq are chosen as iq = P(Ziq = 1jK), the Kullback-Leibler
divergence is null and the bound J reaches the log-likelihood.
The following proposition gives the form of the lower bound J to be maxi-
mized in order to estimate  .
Proposition 2. Let us assume that R can be factorized as in (11), and let us
denote J (X;K) := J (X;K; R (Z)). Then J satis es the following expres-
sion
J (X;K) = c
X
i2P
q2Q
iq log iq +
X
i2P
q2Q
iq log q
k (K)k`1
X
i;j2P;i6=j
q;`2Q
iqj` log 2q`; (12)
where c does not depend on  and  (K) = ( i j (Kij))i;j2P2 is de ned simi-
larly as (7), replacing Ziq by iq.
Proof. Starting from (10), classical results on variational methods show that
J (X;K) = bQ (K) +H(R (Z));
where H(R ()) is the entropy of the distribution R () and bQ (K) is the ap-
proximation of the complete log-likelihood conditional expectation, computed
under the distribution R . Namely,
bQ (K) = ER flogLc(X;K;Z)g =
X
Z2Z
R (Z) logLc(X;K;Z): (13)
In the special case of factorized distribution (11), the entropy is
H(R (Z)) =
X
i2P
H(h i(Zi)) =
X
i2P;q2Q
iq log iq:
Moreover,
bQ (K) = logP(XjK) + ER [logP(KjZ)] + ER [logP(Z)]:
Page 12
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 12
Equation (12) follows via Proposition 1, by using that ER (Ziq) = iq and
ER (ZiqZj`) = iqj`.
The optimal approximate distribution R is then derived by direct maxi-
mization of J . The following proposition gives the estimate b that solves the
problem.
Proposition 3. Let and  be known. The following xed-point relationship
holds for the optimal variational parameters b = arg max J
biq _ q
Y
j2Pnfig
`2Q

1
2q`
exp


jKij j
q`
bj`
; (14)
where _ means that there is a scaling factor such that for any i 2 P, we have
P
q biq = 1.
Proof. This is just an adaptation to the Laplace case of Mariadassou and Robin
(2007, Proposition 3).
The initial value of  is chosen using a classi cation algorithm such as spectral
clustering (see for instance Ng et al. 2002). As a consequence, the initial values
for iq lie in f0; 1g. We then use an iterative procedure setting b
(m+1) = g(b (m)),
where g is the function (implicitly de ned above) for which b is a xed point.
Note that we cannot ensure uniqueness of the xed point for g, nor convergence
of this iterative procedure. In practice, we can always use a maximal number
of iterations, and if convergence has not occurred, we keep the initial value of
 given by the clustering method. In appendix A.2 we explain that at least in
the aliation model (5), if the current values K(m)ij of the precision matrix are
small enough, and if the penalty parameters 1in and 
1
out are well-chosen, then
uniqueness of the xed point is ensured. However, such a result does not hold
in the general case, which is one of the drawbacks of of the variational approach
in this context.
Estimation of and . The parameters and  have been previously
considered as known to keep the statement as clear as possible.
Two di erent strategies may be used with respect to these parameters. The
rst approach is to x their values. Fixing the value of comes down to choos-
ing a priori the proportions of the groups, which is quite a common strategy
in mixture models. As for the choice of , this is equivalent to choosing the
penalty parameter in the classical lasso. Concerning general parameters , a
number of values need to be determined, which might be a problem. However
in the particular aliation model (5), only 2 parameters have to be xed: a pa-
rameter in that corresponds to a light penalty, since many intra-cluster edges
are expected, and another parameter out that ts with a heavier penalty, since
we do not expect many inter-cluster edges. This is typically the kind of strategy
that will be used for numerical applications (see Section 5). More generally, the
matrix penalty can be tuned to obtain a desired quantity of inferred edges, or
to constrain the topology of the graph, e.g. graphs with hubs.
The second strategy is to make use of the current inferred graph to estimate
the parameters. The basic idea is to include this estimation in the variational
Page 13
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 13
method. Unfortunately, the maximization of J given in equation (12) with
respect to  ,  and at the same time is not possible. To tackle this problem,
we use an alternate strategy. The parameter  is computed with the xed-
point relationship (14) for xed values of  and . Then we maximize J with
respect to  and , once R is xed (that is, once  is xed), as in the following
proposition. We successively iterate these two steps until stabilization.
Proposition 4. For xed values of  , the parameters ^, ^ maximizing J are
given by
8q; ` 2 Q; ^q =
1
p
X
i2P
iq and ^q` =
P
i 6=j iqj`jKij j
P
i 6=j iqj`
:
Proof. Once terms that do not depend on the parameters of interest have been
removed from J , the problem becomes
^q = argmax
q
X
i
iq log q and ^q` = argmax
q`

X
i 6=j
iqj`

jKij j
q`
+ log 2q`

:
Null-di erentiation with respect to q (under the constraint
P
q q = 1) and
q` leads straightforwardly to the result.
3.2. A Lasso-like method to estimate the concentration matrix (the
M-step)
Now that we are able to compute the approximate conditional expectation
bQ (K) de ned by (13), we wish to infer the concentration matrix K, assuming
 is known. This is the aim of the m-step of our em{like strategy, that deals
with the maximization problem arg maxK0 bQ (K).
Using Proposition 1 and the equality ER (ZiqZj`) = iqj`, it is a simple
matter to rewrite the problem as follows
bK = argmax
K0
nn
2
(log det(K) Tr(SK)) k (K)k`1
o
: (15)
Hence, our m{step can be seen as a penalized maximum likelihood estimation
problem, exactly like in Friedman et al. (2008), Banerjee et al. (2008). The
likelihood considered here is P(XjK), that is, the likelihood which corresponds
to the n realizations of the Gaussian vector X for a given concentration matrix
K. The di erence of our approach lies in the complexity of the penalty term,
and in slight discrepancies as regards some constant factors.
Remark 1. Since we are using a penalty term 1=0 on matrix K's diagonal
elements, the solution to (15) satis es
8i 2 P; bK1ii = Sii + 2=(n0); (16)
when 10 < njSiij=2 for any i 2 P. Indeed, the sub-gradient equation is
n=2(K1ii Sii) + sgn(Kii)=0 = 0, and Kii  0 since it is the inverse of a
conditional variance.
Let us now look at the solution of the m-step: the following proposition gives
an equivalent formulation of (15) that is more likely to be solved. The result
draws its inspiration from Banerjee et al. (2008).
Page 14
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 14
Proposition 5. The maximization problem (15) over the concentration matrix
K is equivalent to the following, dealing with the covariance matrix 
b = argmax
k(S)=Pk11
log det(); (17)
where 

is the term-by-term division and
P = (P i j )i;j2P with P i j =

2n1
P
q;` iqj`
1
q` i 6= j;
2(n0)1 i = j:
Remark 2. By penalizing the diagonal terms of the concentration matrix K in
the initial problem, the set of matrices  over which we maximize our criterion
contains, for instance, the matrix S+2=(n0)I, (where I stands for the identity
matrix). Thus, provided that the value of penalty parameter 1=0 is set su-
ciently high, this set contains positive de nite matrices. This ensures that our
estimator is always invertible. Obviously, when S is invertible, which is usually
true for n greater or equal than p, penalizing the diagonal terms becomes futile.
In this case 1=0 is set to zero.
Proof. The penalty term in (15) can be written as follows
k (K)k`1 =
X
q;`2Q
X
i;j2P
i 6=j
jKij j
q`
iqj` +
X
i2P
jKiij
0
=
X
q;`2Q
kTq` ?Kk`1 ;
where ? is the term-by-term product. The set fTq`gq;`2Q contains p  p sym-
metric matrices, de ned, for each couple (q; `), by
Tq` = (Tq`;ij)i;j2P with 8i 6= j; Tq`;ij =
iqj`
q`
and Tq`;ii =
1
0Q2
:
Let us now use the fact that kAk`1 = maxkUk11 Tr(AU), for a given matrix
A. The optimization problem (15) can now be written as
max
K0
min
fUq`:kUq`k11g
8
<
:
n
2
log det K Tr
0
@
n
2
SK +
X
q;`2Q
(Tq` ?K) Uq`
1
A
9
=
;
;
since the trace operator is linear. The dual version of the above expression is ob-
tained by swapping max and min. The maximization is solved by di erentiating
with respect to K. To do this, we recall that in our speci c case the matrices
T are symmetrical, and thus Tr ((T ?K)U) = Tr (K(T ?U)). Then, applying
the usual rules for the derivative of the trace operator, null-di erentiation with
respect to K yields
 := K1 = S +
2
n
X
q;`2Q
(Uq` ?Tq`) : (18)
The dual problem therefore becomes
min
fUq`:kUq`k11g
n

n
2
log det()
np
2
o
;
Page 15
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 15
or in other words,
max
fUq`:kUq`k11g
log det():
Finally, we need to write the constraint as a function of  rather than the set
fUq`g. In fact, we simply need to show that
fUq`;8q; ` 2 Q; kUq`k1  1g =

;

( S) 

P


1
 1

;
which is straightforward (see Appendix A.1 for details).
To solve (17) and thus obtain the estimate b, we successively use two coordi-
nate descent methods. The rst corresponds to a block-wise strategy suggested
by Banerjee et al.. The second one is used to solve the resulting Lasso problem
and was suggested by Friedman et al. (2007).
Let us rst explain the block-wise strategy. For this purpose, we introduce
the following notation for b, S and the penalty matrix P
b =
"
b11 b12
b|12 b22
#
; S =

S11 s12
s|12 S22

; P =

P11 p12
p|12 P22

; (19)
where b11, S11 and P11 are (p1)(p1) matrices, b12, s12 and p12 are (p1)
length column vectors and b22, S22 and P22 are real numbers. We have already
remarked (Remark 1) that the solution to (17) satis es b22 = S22 + 2=(n0).
Moreover, using Schur complement, the vector b12 satis es
b12 = argmin
fy:k(ys12)=p12k11g
n
y| b
1
11 y
o
: (20)
We have det(b) = det(b11)(b22 b
|
12
b
1
11 b12). The full matrix b is approx-
imated in the following way: rst, if required when p is greater than n, we
initialize the procedure with S+2=(n0)I, where 0 > 0 is chosen so as to make
S + 2=(n0)I invertible; secondly, we permute the columns (and thus the rows)
of b and iteratively solve problems like (20) until convergence of the procedure.
This convergence is ensured by the following lemma.
Lemma 1. The procedure which starts with a positive de nite matrix and iter-
atively updates the columns and rows of this matrix according to the solutions
of (20) converges to the solution b of (17).
Proof. The proof relies on Banerjee et al. (2008, Theorem 3) and Tseng (2001,
Theorem 4.1). Convergence of block-coordinate descent methods is a well-documented
topic in convex optimization literature. Here, we have to bear in mind that using
`1-norm penalty leads to non-di erentiable functions. Thus, we rely on a result
by Tseng (2001, Theorem 4.1), which in our case ensures the convergence of the
procedure, provided there is at most one solution to each minimization problem
(20). This point is proved in Banerjee et al. (2008, Theorem 3).
Then, starting from a result given in Banerjee et al. (2008), an interpretation
of (20) as an `1{penalized problem is given in Friedman et al. (2008). This `1{
penalized problem is reminiscent of the Lasso and may thus be solved using a
coordinate descent strategy (Friedman et al. 2007). The following proposition
Page 16
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 16
enunciates a result similar to those obtained in Banerjee et al. (2008, equation
(6)) and Friedman et al. (2008, equation (2.4)), although with a more general
penalty term and a factor 12 that di ers. Since none of these articles gives an
explicit proof for this result, it is tting that we provide our own proof here.
Proposition 6. Solving (20) is equivalent to solving the dual problem
b = argmin





1
2
b
1=2
11 b
1=2
11 s12




2
2
+ kp12 ? k`1 ; (21)
where solution b12 to (20) and b to (21) are linked through
b12 = b11b =2: (22)
Proof. Problem (20) can be written as follows, by splitting the constraint:
8
<
:
miny y| b
1
11 y
subject to (p12)i  yi (s12)i (p12)i  0; 8i = 1; : : : ; p 1;
or (p12)i  yi + (s12)i (p12)i  0; 8i = 1; : : : ; p 1:
Let us introduce L the so-called Lagrangian, with vectors of Lagrange coe-
cients denoted by 1 = ( 1i )ip1;
2 = ( 2i )ip1 with nonnegative entries.
Also, let = 2 1. The Lagrange version of the above problem is
min
y

y| b
1
11 y + max

L( )

; (23)
where, in the present case, L is given by
L( ) =
X
i
1i (yi (s12)i (p12)i) +
X
i
2i (yi + (s12)i (p12)i) ;
The coecients 1i and
2
i maximizing L( ) are null when the constraints are
satis ed, and for each index i, at least one coecient among f 1i ;
2
i g is zero.
Then
k k`1 =
X
i
j ij =
X
i

1i +
2
i

:
Meanwhile, consider the dual problem of (23), swapping min and max: the
solution that minimizes the dual problem with respect to y satis es the null-
gradient hypothesis. We obtain 2b
1
11 y = 0, that is y =
1
2
b11 (which
proves equation (22)). Introducing this result in the dual of (23), we get
max


1
4
| b11 + s
|
12
X
i

1i +
2
i

(p12)i;
also equivalent to
min

1
4
| b11 s
|
12 + kp12 ? k`1 :
Expressing this quantity by using the Euclidean norm achieves the proof.
Hence, the column b12 of the estimated covariance matrix b is computed by
solving the Lasso problem (21) using another coordinate descent method.
Page 17
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 17
Lemma 2. The solution to (21) is computed by updating the jth coordinate of
b via
b j = 2S
0
@(s12)j
1
2
X
k 6=j
(b11)jk b k ; (p12)j
1
A =(b11)jj ; (24)
where S(x; ) = sgn(x)(jxj )+ is the soft-thresholding operator.
Moreover, the procedure which iteratively updates the entries of vector b12 =
b11b =2 according to the solutions b of (24) converges to the solution of (20).
Proof. The proof of this lemma is postponed to Appendix A.3.
Finally, the estimate of the matrix of concentration K is recovered by in-
verting b, which can be done at low computational cost (see appendix A.4 for
details). Hence, we solve the initial maximization problem (15) that de nes the
m-step of our algorithm.
Implementation of the full em algorithm is outlined in Algorithm 1.
Algorithm 1: The full em{like algorithm
while bQ (bK(m)) has not stabilized do
//THE E-STEP: LATENT STRUCTURE INFERENCE
if m = 1 then
// First pass
Apply spectral clustering on the empirical covariance S to initialize b
else
Compute b with the xed-point relationship (14), using bK(m1)
//THE M-STEP: NETWORK INFERENCE
Construct the penalty matrix P according to b
while b
(m)
has not stabilized do
for each column of b
(m)
do
Compute b12 by solving the lasso{like problem with path-wise coordinate
optimization
Compute bK(m) by block inversion of b
(m)
m m + 1
3.3. Choice of penalty parameters
As previously stated, the penalty parameters  may be estimated in the e-step
of the algorithm (see subsection 3.1). However, this choice is not necessarily
optimal for the estimation of K, and other choices might in practice lead to a
better solution. A good strategy is to keep the estimated value of  in the e-step
that leads to the estimation of  , and to impose another value of  during the
m-step. In this part, we indicate a possible choice for the penalty parameters to
use in the m-step, ensuring a small error on the connectivity components of the
estimated graph.
Let us rst introduce some notation. For any node i 2 P, let Ci denote the
connectivity component of node i in the true underlying conditional dependency
Page 18
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 18
graph, and bCi the corresponding component resulting from the estimate bK of
this graph structure. The following proposition is based on Meinshausen and
Buhlmann (2006, Theorem 2) and Banerjee et al. (2008, Theorem 2).
Proposition 7. Fix some " > 0 and choose the penalty parameters  such that,
for all q; ` 2 Q,
2p2Fn2
0
@
2
nq`

max
i6=j
SiiSjj
1
2q`
!1=2
(n 2)1=2
1
A  "; (25)
where 1 Fn2 is the c.d.f. of Student's t-distribution with n 2 degrees of
freedom. Then
P(9k; bCk * Ck)  ": (26)
Proof. Here we simply indicate the main di erences between the proof of Baner-
jee et al. (2008, Theorem 2) and what is valid in our context. Note that according
to (15), the estimator bK must satisfy the following sub-gradient equation
8i 6= j;
n
2

bK1ij Sij


0
@
X
q;`
ZiqZj`
q`
1
A ij = 0
where ij 2 sgn( bKij). Following the proof of Banerjee et al. (2008, Theorem 2),
we easily get
P(9k; bCk * Ck)  p2 max
i2P;j =2Ci
P
0
@
n
2
jSij j 
X
q;`
ZiqZj`
q`
1
A :
Performing some computations involving the correlation between variables Xi
and Xj , we also obtain
P(9k; bCk * Ck)  2p2 max
q;`2Q
Fn2
0
@
2(n 2)1=2
nq`

max
i2P;j =2Ci
SiiSjj
1
2q`
!1=2
1
A ;
which entails the conclusion.
Remark 3. Following Banerjee et al. (2008), note that in order to ensure (25),
it is enough to choose the penalty parameter  such that, for all q; ` 2 Q,
q`(") 
2
n

n 2 + t2n2

"
2p2
1=2
max
i 6=j
SiiSjj
1=2
tn2

"
2p2
1
;
where tn2(u) is the (1 u)-quantile of Student's t-distribution with (n 2)
degrees of freedom, i.e. Fn2(tn2(u)) = u.
Remark 4. Inequality (25) does not take into account that di erent penalty
parameters are used for di erent hidden classes q; ` 2 Q. An adaptation of the
preceding strategy is to use current values Z(m) obtained from the probabilities
 (m) of the hidden classes and to choose the current penalty parameters (m)
accordingly. More precisely, let us set, for instance
8i 2 P; Z(m)iq =

1 if q = argmax` 
(m)
i`
0 otherwise:
Page 19
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 19
Then, when
2p2Fn2
0
B
B
B
@
2
n(m)q`
0
B
B
@ maxi6=j
Z(m)iq Z
(m)
j` =1
SiiSjj
1
((m)q` )
2
1
C
C
A
1=2
(n 2)1=2
1
C
C
C
A
 "; (27)
for all q; ` 2 Q, the current estimate bK(m) of the dependency graph will approx-
imately satisfy (26). Moreover, in order to ensure (27), it is enough to choose,
for all q; ` 2 Q,
(m)q` (") 
2
n

n 2 + t2n2

"
2p2
1=2
0
B
B
@ maxi 6=j
Z(m)iq Z
(m)
j` =1
SiiSjj
1
C
C
A
1=2
tn2

"
2p2
1
:
(28)
Typically, the kind of values obtained with (28) will lead to large penalties
and, consequently, to very sparse graphs: practically, more informative networks
can be obtained by replacing the term "=2p2 in (28) by greater values. In any
cases, (28) should be seen as a starting value.
4. Link with Meinshausen and Buhlmann's approach
We should also like to ll the gap between, on the one hand solving (15) and,
on the other, the approach proposed in Meinshausen and Buhlmann (2006),
where p independent penalized regression problems are solved using the Lasso.
In fact, we shall show that Meinshausen and Buhlmann's approach is equivalent
to maximizing the penalized pseudo log-likelihood corresponding to the size-
n sample of the multivariate Gaussian vector X on the set of non symmetric
matrices. Let us denote as eL this pseudo-likelihood, de ned by
log eL(X; K) =
X
i2P

nX
k=1
logP(Xki jX
k
Pni; Ki)
!
;
where XkPni is the kth realization of the Gaussian vector X, once the ith coor-
dinate has been removed. In this section, the `1-norm of matrices is restricted
to o -diagonal elements only, that is, kAk`1 =
P
i 6=j jAij j.
Proposition 8. Consider the solution bKpseudo to the penalized pseudo-likelihood
problem
bKpseudo = argmax
fKij ;i6=jg
log eL(X; K) kP ?Kk`1 ; (29)
(whose diagonal is xed) and the solution bKMB given in Meinshausen and
Buhlmann (2006) to the p di erent regression problems, using the matrix penalty
2P=n. The two solutions have exactly the same null entries.
Proof. Denote by Knini and Snini, respectively, the matrices K and S once
their ith row and ith column have been removed. Moreover, Kini and Sini are
the ith rows of the matrices with the ith term removed. After some routine
Page 20
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 20
computations, and using classical results for Gaussian multivariate vectors (see
Appendix A.5), it can be shown that
log eL(X; K) =
n
2
X
i2P

logKii KiiSii 2SiniKini
1
Kii
KiniSniniK
|
ini

+ c;
(30)
where c does not depend on K. Thus, if we forget the symmetry constraint
on K, maximizing the pseudo-likelihood (30) with respect to the non-diagonal
entries of K is equivalent to p independent maximization problems with respect
to each column K|ini. Consider, for instance, the last column of K, that is, for
i = p, and the relative term in (30). This term can be written as

n
2K22

2K22s
|
12K
|
ini + K
|
iniS11Kini

=
n
2K22


S
1=2
11 K
|
ini +K22S
1=2
11 s12



2
2
+ c0;
where we use the block-wise notation de ned above (19). The term C 0 does not
depend on Kini, which is the current column of the concentration matrix to
infer. Namely, c0 = K222s
|
12S
1
11 s12.
Consider now the penalized version of the log-likelihood (29): we wish to solve
p penalized problems of minimization as de ned above, which can be written as
follows
min



S
1=2
11 +K22S
1=2
11 s12



2
2
+
2K22
n
kp12 ? k`1 : (31)
Meinshausen and Buhlmann wish to solve p Lasso-problems, for instance for
the last variable p,
min

1
n

Xp Xnp

2
2
+

2n1p12 ?


`1
; (32)
where Xp is the pth column of X and Xnp is the matrix of data the pth column
has been removed (note that we adapted the penalization term corresponding
to the framework developed here).
The minimum is reached in (31) for null-di erentiation, and we get
2S11 + 2K22s
|
12 +
2K22
n
p12 ?  = 0;
where  2 sign( ). The same for (32), and we get
2
n
X|npXnp
2
n
X|pXnp + 2n
1p12 ?
= 0;
where
2 sign( ). Now, just note that n1X|npXnp = S11 and n
1X|pXnp =
s|12, and problems (31) and (32) are equivalent, provided that = =K22.
Thus, the columns of the concentration matrix (with a removed diagonal
term) inferred from the penalized maximum pseudo-likelihood problem (29),
and those inferred with Meinshausen and Buhlmann's approach, share exactly
the same null-entries, that is, the same network of conditional dependencies.
Page 21
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 21
5. Numerical experiments
In this section we present numerical experiments on both synthetic data, to in-
vestigate how well the proposed selection procedure behaves, and real data, to
demonstrate the practical use of GGM covariance selection with latent struc-
ture. In the remainder of this section we focus on an aliation model (5), the
choice of the penalty being made in line with Section 3.3. More precisely, we x
the ratio in=out = 1:2 and either let the value 1=in vary when considering
precision/recall curves for synthetic data, or x this parameter according to (28)
when dealing with real data.
5.1. Synthetic data
We perform numerical experiments to assess the performance of our approach
(SIMoNe, Statistical Inference for Modular Network) and compare it to already
existing methods for GGM covariance selection: GLasso (Friedman et al. 2008)
and GeneNet (Schafer and Strimmer 2005).
Data synthesis in our framework requires the simulation of a structured sparse
inverse covariance matrix. To this aim, we rst simulate a graph with an ali-
ation structure. We consider a simple binary aliation model where two types
of edges exist: edges between nodes of the same class and edges between nodes
of di erent classes. The binary incidence matrix of the graph is transformed
by randomly
ipping the sign of some elements in order to simulate both pos-
itively and negatively correlated variables. Positive de niteness of this matrix
is ensured by adding a large enough constant to the diagonal. The matrix is
then further normalized to have a diagonal of ones. A Gaussian sample of size
n with zero mean and the above covariance matrix is then simulated 50 times.
The results we present below are averaged over the 50 samples. At the end of
this section we discuss the performances of our method when there is no latent
structure on the data.
(a) (b) (c)
Fig 1. Simulation of the structured sparse concentration matrix. Adjacency matrix without
(a) and with (b) rows and columns reorganized according the aliation structure and corre-
sponding graph (c).
We simulate sparse graphs with p = 200 and n from 100 to 2000 (n=p 2
f1=2; 2; 3; 6; 10g). We use a probability of intra-cluster connection of 0:125, a
probability of inter-cluster connection of 0:0025, Q = 3 groups and equal group
Page 22
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 22
proportions i = 1=3. With these settings, the theoretical expected number of
edges is about 862 and the total number of potential edges is 19900. A sample
graph is given in Figure 1. The running times of GLasso and SIMoNe are of the
same order. For the settings described above the running time varies from a few
seconds to a few minutes, according to the penalty parameter.
We focus the experiments on the ability to recover existing edges of the net-
work, that is the nonzero entries of the concentration matrix. This is a binary
decision problem where the compared algorithms are considered as classi ers.
The decision made by a binary classi er can be summarized using four num-
bers: True Positives (TP ), False Positive (FP ), True Negatives (TN) and False
Negatives (FN). We have chosen to draw precision/recall curves to display this
information and compare how well the methods perform (Figure 2).
Precision (TP=(TP + FP )) is the ratio of the number of true nonzero el-
ements to the total number of nonzero elements in the estimated concentra-
tion matrix bK. Recall that (TP=(TP + FN)) is the ratio of true nonzero el-
ements in bK to all nonzero entries of the real concentration matrix K. In a
sparse context where the number of actual positives (TP + FN) is small com-
pared to the number of actual negatives (FP + TN), precision/recall curves
give a more informative picture of an algorithm's performance than classical
Receiver Operator Characteristic (ROC) curves. Indeed, ROC curves plot the
False Positive Rate (FPR = FP=(FP + TN)) against the True Positive Rate
(TPR = TP=(TP + FN)). When the number of total positives is small com-
pared to the number of total negatives, small variations of FP and TP will
result in small variations of FPR and large variations of TPR, which is not
relevant for comparing performances. In a statistical framework, the recall is
equivalent to the power and the precision is equivalent to one minus the False
Discovery Proportion.
Additionally to the GLasso (Friedman et al. 2008) and GeneNet (Schafer and
Strimmer 2005) we consider two other procedures:
 When n is greater than p, a straightforward way to obtain an estimate of
the inverse covariance matrix is to invert the empirical covariance matrix.
Although this approach is unlikely to perform well in a selection context
(since it is designed for estimation purposes), it is worth comparing it to
its competitors in order to assess the scale of improvement. We call this
procedure InvCor.
 When the latent structure Z of the concentration matrix is known, our
method can be applied without its E-step and produce a relevant selec-
tion of the nonzero entries of the concentration matrix. This approach
represents the upper limit of our method, since it makes use of an usu-
ally unavailable source of information. This procedure is denoted perfect
SIMoNe.
In some problems the latent structure of the graph is partially known and
this information can be used in the E-step to improve the estimation of the
latent structure. For example, when inferring gene regulation networks, a
subset of identi ed genes may be known to belong to the same functional
module.
The approach of Meinshausen and Buhlmann (2006) was also tested. The
principle of this approach, and the performances obtained are close to those of
Page 23
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 23
GLasso, but it was always slightly outperformed. We have therefore decided, for
the sake of brevity, to report only the four previously described procedures.
For the methods based on penalization (GLasso, SIMoNe and Perfect SIMoNe),
the precision/recall curves are plotted by varying the penalty parameter (namely
1=in in our case). The penalty parameter varies from close to zero to a maxi-
mum value which forces all o -diagonal elements of bK to be null (see Appendix
A.6). The GeneNet and InvCor methods are plotted by sorting the elements of
bK according to their absolute values, and choosing di erent thresholds to nd
nonzero entries.
Even when n is really greater than p (Figures 2 (a-b)) Invcor is always
dominated by the other methods from a selection point of view. This simple
check shows that even in a favorable context with abundant data, penalization
procedures improve the selection of nonzero entries of the concentration matrix,
in comparison with methods based on estimation of these entries.
Although GeneNet and GLasso can provide di erent results on a given run,
both methods perform similarly on average (50 runs for our experiment). The
only parameter we change in this experimental setting is the n=p ratio.
Perfect SIMoNe's curves dominate all other curves for any n=p ratio. This
clearly shows that the knowledge of the structure provides a valuable informa-
tion for selecting the nonzero entries of the concentration matrix. When the
structure is hidden, the main problem of our approach is then to nd a reliable
estimate of this structure from the initial data.
Perfect SIMoNe and SIMoNe perform equivalently when n = 10p and when
the ratio n=p decreases, Perfect SIMoNe tends to outperform SIMoNe more
clearly. This means that SIMoNe is able to recover the latent structure when
there is enough data, but does not nd a substantial structure when n drops
below p.
When p > n, the empirical covariance matrix ceases to be invertible. Thus,
Figures 2 (e-f) do not display the InvCor results. Although it is possible to show
that both GLasso and SIMoNe increase the number of inferred true nonzero
elements with the number of iterations in all settings, precision/recall curves
show the relative poor performances for all tested algorithms when p  n.
Notice that when p > n, the estimated latent structure is not reliable. Never-
theless, the performance of SIMoNe remains comparable to that of GLasso. We
can therefore see that assuming the existence of a latent structure when there
is none does not impair the selection of nonzero entries of the matrix K.
5.2. Breast Cancer data
We tested our algorithm on a gene expression data set provided by Hess et al.
(2006) and concerning 133 patients with stage I III breast cancer. The pa-
tients were treated with chemotherapy prior to surgery. Patient response to the
treatment is classi ed as either a pathologic complete response (pCR) or a resid-
ual disease (not-pCR). Hess et al. (2006) and Natowicz et al. (2008) developed
and tested a multigene predictor for treatment response on this data set. They
focused on a set of 26 genes having a high predictive value (see Table 1). We
thus consider a total of n = 133 cases containing p = 26 gene expression levels.
When dealing with gene regulatory networks, we typically observe n inde-
pendent microarray experiments, each giving the expression levels of the same
Page 25
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 25
p genes. If the same experimental conditions are used for all microarrays, these
may be considered as a sample of the same experiment. In the application in
question, cases from the pCR class (34 cases) and from the not-pCR class (99
cases) clearly do not have the same distribution. We apply our algorithm on
each class of patients. Two distinct gene regulatory networks are thus inferred.
Figure 3 plots the resulting networks obtained for three di erent penaliza-
tions. The penalization parameters were heuristically chosen from the number of
expected nonzero entries. We used Q = 2 latent clusters, and it is interesting to
note that when assuming more than two clusters, the algorithm systematically
produces exactly two non-empty clusters.
The inferred networks exhibit very di erent structures according to the class
of patients. This in itself is interesting and suggests that gene regulation di ers
with respect to the presence or absence of a pCR.
The network obtained with not-pCR cases displays a two-star pattern. Each
star connects to a unique gene, either SCUBE2 or IGFBP4. Almost all the most
signi cant connections imply SCUBE2. This star pattern suggests that further
studies of this particular gene would be of interest for understanding residual
disease.
The network estimated with the pCR cases has a di erent two-cluster struc-
ture. In particular, it groups IGFBP4 and SCUBE2 in the same cluster with a
direct signi cant link. This again indicates a completely di erent relationship
between the genes in pCR versus non-pCR.
Gene symbol Gene name
MAPT Microtubule-associated protein
BBS4 Bardet-Biedl syndrome 4
THRAP2 Thyroid hormone receptor associated protein 2
MBTP-S1 Hypothetical protein
PDGFRA Human clone 23,948 mRNA sequence
ZNF552 Zinc nger protein 552
RAMP1 Receptor (calcitonin) activity modifying protein 1
BECN1 Beclin 1 (coiled-coil, myosin-likeBCL2 interacting protein)
BTG3 BTG family, member 3
SCUBE2 Signal peptide, CUB domain,EGF-like 2
MELK Maternal embryonic leucine zipper kinase
AMFR Autocrine motility factor receptor
CTNND2 Catenin, delta 2
GAMT Guanidinoacetate N-methyl transferase
CA12 Carbonic anhydrase XII
FGFR1OP FGFR1 oncogene partner
KIAA1467 KIAA1467 protein
MTRN Meteorin, glial cell di erentiation regulator
FLJ10916 Hypothetical protein FLJ10916
E2F3 E2F transcription factor 3
ERBB4 V-erb-a erythroblastic leukemiaviral oncogene homolog 4(avian)
JMJD2B Jumonji domain containing 2B
RRM2 Ribonucleotide reductase M2polypeptide
FLJ12650 Hypothetical protein FLJ12650
GFRA1 GDNF family receptor 1
IGFBP4 Insulin-like growth factor binding protein 4
Table 1
The key genes that composed the inferred networks.
Page 26
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 26
not-pCR pCR
L
ow
p
en
al
ty
AMFR
BB_S4
BECNI BTG3
CA12
CTNND2
E2F3
ERBB4
FGFRIOP
FLJ10916
FLJI2650 GAMT
GFRAI
IGFBP4
JMJD2B
KIA1467
MAPT
MBTP_SI
MELK
METRN
PDGFRA
RAMPI
RRM2
SCUBE2
THRAP2
ZNF552
AMFR
BB_S4 BECNI
BTG3
CA12
CTNND2
E2F3
ERBB4
FGFRIOP
FLJ10916
FLJI2650
GAMT
GFRAI
IGFBP4JMJD2B
KIA1467
MAPT
MBTP_SI
MELK
METRN
PDGFRA RAMPI
RRM2SCUBE2
THRAP2
ZNF552
M
ed
iu
m
p
en
al
ty
AMFR
BB_S4
BECNI
BTG3
CA12
CTNND2
E2F3
ERBB4
FGFRIOP
FLJ10916
FLJI2650
GAMT
GFRAI
IGFBP4
JMJD2B
KIA1467
MAPT
MBTP_SI
MELK
METRN
PDGFRA
RAMPI
RRM2
SCUBE2
THRAP2
ZNF552
AMFR
BB_S4
BECNI
BTG3
CA12
CTNND2
E2F3
ERBB4
FGFRIOP
FLJ10916
FLJI2650
GAMT
GFRAI
IGFBP4
JMJD2B
KIA1467
MAPT
MBTP_SI
MELKMETRN
PDGFRA
RAMPI
RRM2
SCUBE2
THRAP2
ZNF552
H
ig
h
p
en
al
ty
AMFR
BB_S4
BECNI
BTG3
CA12
CTNND2
E2F3
ERBB4
FGFRIOP
FLJ10916
FLJI2650
GAMT
GFRAI
IGFBP4
JMJD2B
KIA1467
MAPT
MBTP_SIMELK
METRN
PDGFRA
RAMPI
RRM2
SCUBE2THRAP2
ZNF552 AMFR
BB_S4
BECNI
BTG3
CA12
CTNND2
E2F3
ERBB4
FGFRIOP
FLJ10916
FLJI2650
GAMT
GFRAI
IGFBP4
JMJD2B
KIA1467
MAPT MBTP_SI
MELK
METRN
PDGFRA
RAMPI
RRM2
SCUBE2
THRAP2
ZNF552
Fig 3. Inferred graphs for three di erent penalization's levels.
Page 27
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 27
References
O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse
maximum likelihood estimation for multivariate Gaussian or binary data. J.
Mach. Learn. Res., 9:485{516, 2008.
R. Castelo and Alberto Roverato. A robust procedure for Gaussian graphical
model search from microarray data with p larger than n. J. Mach. Learn.
Res., 7:2621{2650, 2006.
Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic
decomposition by basis pursuit. SIAM Rev., 43(1):129{159 (electronic), 2001.
ISSN 0036-1445. Reprinted from SIAM J. Sci. Comput. 20 (1998), no. 1,
33{61 (electronic) [ MR1639094 (99h:94013)].
J.-J. Daudin, F. Picard, and S. Robin. A mixture model for random graphs.
Stat. Comput., 18(2):173{183, 2008.
A. P. Dempster. Covariance selection. Biometrics, Special Multivariate Issue,
28:157{175, 1972.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from
incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B, 39(1):
1{38, 1977. ISSN 0035-9246. With discussion.
Adrian Dobra, Chris Hans, Beatrix Jones, Joseph R. Nevins, Guang Yao, and
Mike West. Sparse graphical models for exploring gene expression data. J.
Multivariate Anal., 90(1):196{212, 2004.
David L. Donoho and Iain M. Johnstone. Adapting to unknown smoothness via
wavelet shrinkage. J. Amer. Statist. Assoc., 90(432):1200{1224, 1995.
Mathias Drton and Michael D. Perlman. Multiple testing and error control in
gaussian graphical model selection. Statist. Sci., 22:430, 2007.
Mathias Drton and Michael D. Perlman. A SINful approach to gaussian graph-
ical model selection. J. Statist. Plann. Inference, 138(4):1179{1200, 2008.
Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least
angle regression. Ann. Statist., 32(2):407{499, 2004. With discussion, and a
rejoinder by the authors.
Ove Frank and Frank Harary. Cluster inference by using transitivity indices
in empirical graphs. J. Amer. Statist. Assoc., 77(380):835{840, 1982. ISSN
0162-1459.
J. Friedman, T. Hastie, H. Ho
ing, and R. Tibshirani. Pathwise coordinate
optimization. The Annals of Applied Statistics, 1(2):302{332, 2007.
J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation
with the graphical lasso. Biostatistics, 9(3):432{441, 2008.
W.J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Compu-
tational and Graphical Statistics, 7(3):397{416, 1998.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning.
Springer, 2001.
K.R. Hess, K. Anderson, W.F. Symmans, V. Valero, N. Ibrahim, J.A. Mejia,
D. Booser, R.L. Theriault, U. Buzdar, P.J. Dempsey, R. Rouzier, N. Sneige,
J.S. Ross, T. Vidaurre, H.L. Gomez, G.N. Hortobagyi, and L. Pustzai. Phar-
macogenomic predictor of sensitivity to preoperative chemotherapy with pa-
clitaxel and
uorouracil, doxorubicin, and cyclophosphamide in breast cancer.
Journal of Clinical Oncology, 24(26):4236{4244, 2006.
Jan Ihmels, Gilgi Friedlander, Sven Bergmann, Ofer Sarig, Yaniv Ziv, and
Naama Barkai. Revealing modular organization in the yeast transcriptional
Page 28
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 28
network. Nature Genetics, pages 370{377, July 2002.
Jaakkola. Advanced mean eld methods: theory and practice. MIT Press, 2000.
Beatrix Jones, Carlos Carvalho, Adrian Dobra, Chris Hans, Chris Carter, and
Mike West. Experiments in stochastic computation for high-dimensional
graphical models. Statist. Sci., 20(4):388{400, 2005.
Markus Kalisch and Peter Buhlmann. Estimating high-dimensional directed
acyclic graphs with the pc-algorithm. J. Mach. Learn. Res., 8:613{636, Mar
2007.
Ste en L. Lauritzen. Graphical models, volume 17 of Oxford Statistical Science
Series. The Clarendon Press Oxford University Press, New York, 1996. ISBN
0-19-852219-3. Oxford Science Publications.
M. Mariadassou and S. Robin. Uncovering latent structure in valued graphs:
a variational approach. Technical Report 10, Statistics for Systems Biology,
2007.
Nicolai Meinshausen and Peter Buhlmann. High-dimensional graphs and vari-
able selection with the lasso. Ann. Statist., 34(3):1436{1462, 2006.
M. Minoux. Mathematical programming. Theory and algorithms. John Wiley
and Sons, 1986.
R. Natowicz, R. Incitti, E.G. Horta, B. Charles, P. Guinot, K. Yan, C. Coutant,
F. Andre, and R. Pusztai, L. Rouzier. Prediction of the outcome of a preop-
erative chemotherapy in breast cancer using dna probes that provide infor-
mation on both complete and incomplete response. BMC Bioinformatics, 9
(149), 2008.
A.Y. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an
algorithm. In NIPS 14, 2002.
Krzysztof Nowicki and Tom A. B. Snijders. Estimation and prediction for
stochastic blockstructures. J. Amer. Statist. Assoc., 96(455):1077{1087, 2001.
ISSN 0162-1459.
M. R. Osborne, Brett Presnell, and B. A. Turlach. A new approach to variable
selection in least squares problems. IMA J. Numer. Anal., 20(3):389{403,
2000a.
Michael R. Osborne, Brett Presnell, and Berwin A. Turlach. On the LASSO
and its dual. J. Comput. Graph. Statist., 9(2):319{337, 2000b.
Juliane Schafer and Korbinian Strimmer. A shrinkage approach to large-scale
covariance matrix estimation and implications for functional genomics. Sta-
tistical Applications in Genetics and Molecular Biology, 4(1), 2005.
Tom A. B. Snijders and Krzysztof Nowicki. Estimation and prediction for
stochastic blockmodels for graphs with latent block structure. J. Classi -
cation, 14(1):75{100, 1997. ISSN 0176-4268.
C. Tallberg. A Bayesian approach to modeling stochastic blockstructures with
covariates. Journal of Mathematical Sociology, 29(1):1{23, 2005.
Robert Tibshirani. Regression shrinkage and selection via the lasso. J. Roy.
Statist. Soc. Ser. B, 58(1):267{288, 1996.
P. Tseng. Convergence of a block coordinate descent method for nondi eren-
tiable minimization. J. Optim. Theory Appl., 109(3):475{494, 2001.
Anja Wille and Peter Buhlmann. Low-order conditional independence graphs for
inferring genetic networks. Statistical Applications in Genetics and Molecular
Biology, 5(1), 2006.
Tong Tong Wu and Kenneth Lange. Coordinate descent algorithms for lasso
penalized regression. Ann. Appl. Stat., 2(1):224{244, 2008.
Page 29
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 29
Ming Yuan and Yi Lin. Model selection and estimation in the Gaussian graphical
model. Biometrika, 94(1):19{35, 2007.
H. Zanghi, F. Picard, V. Miele, and C. Ambroise. Strategies for online inference
of network mixture. Technical Report 14, Statistics for Systems Biology, 2008.
Hui Zou. The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc.,
101(476):1418{1429, 2006.
Appendix A: Appendix section
A.1. Proof of the equivalence between the constraints
When kUq`k1  1, we have for each couple (i; j) 2 P
2,


( S)ij


=
2
n
X
q;`
j(Uq`)ij  (Tq`)ij j 
2
n
X
q;`
Tq`;ij = P i j :
Thus kUq`k1  1) k( S)  =Pk1  1.
On the other hand, assume that k( S)  =Pk1  1, that is, for all i; j 2
P, we have
P i j  ( S)ij  P i j :
This also means that there exists some ij 2 [0; 1] such that
( S)ij = ijP i j + (1 ij)(P i j ) =
2
n
X
q;`
(2ij 1)Tq`;ij :
We choose Uql such that (Uql)ij = (2ij 1) for all q; ` 2 Q. Then, since
ij 2 [0; 1], we have
1  (Uq`)ij  1; 8i; j 2 P;
which proves that k( S)  =Pk1  1) kUq`k1  1.
A.2. Fixed-point study
Let us rst introduce some notation. For any i; j 2 P and any q; ` 2 Q, consider
the random variables
Lijq` =
jKij j
q`
+ log 2q`:
Let u : RpQ ! RpQ be de ned by its coordinate functions u = (uiq)i2P;q2Q in
the following way
8a = (aiq)i2P;q2Q 2 RpQ;
uiq(a) = q exp
n

X
j 6=i
X
`
aj`Lijq`
o
= q exp
n

X
j 6=i
X
`
aj`

jKij j
q`
+ log 2q`
o
;
Page 30
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 30
and let g = (giq)i2P;q2Q : RpQ ! RpQ satisfy
8a 2 RpQ; giq(a) =
uiq(a)
P
` ui`(a)
:
According to Proposition 3, the optimal parameter b is a xed-point of g.
Now, let
 =
n
a = (aiq)i2P;q2Q 2 RpQ;8i 2 P; q 2 Q; aiq 2 [0; 1] and
X
q
aiq = 1
o
:
We wish to study the xed-points of g in . First, let us note that as  is a
compact state space and as the function g satis es g : !  and is continuous,
the existence of a xed-point of g follows from Brouwer's Theorem.
We now restrict our attention to a smaller set than the whole state space .
For any " > 0, let
" =
n
a 2 ;8i 2 P; q 2 Q; aiq 2 ["; 1 "]
o
:
Note that we do not claim that g : " ! ". However, the existence of a xed-
point of g is ensured in  and if we assume q > 0 for any q 2 Q (which is a
reasonable assumption if the number of classes Q is not too large), it can easily
be seen that any xed-point satis es aiq > 0, for any i 2 P and any q 2 Q.
Thus for suciently small " > 0, the xed-points of g belong to ".
In order to study the behaviour of g in the vicinity of a xed-point, we need
to look at some kind of contraction property for g. To this end we introduce a
distance d on " that will make use of the form of the state space ". For all
a; b 2 ", denote by ai = (aiq)q2Q 2 RQ and bi = (biq)q2Q 2 RQ. Moreover, let
d(a; b) = max
i2P
d0(ai; bi) = max
i2P
log

maxq2Q aiq=biq
minq2Q aiq=biq

= max
i2P
max
q;`2Q
log

aiqbi`
biqai`

:
It is well known that d0 is a distance in ["; 1 "]Q, and it is easy to check that
the resulting d is also a distance in ".
Now, x a; b 2 RpQ and consider the distance d(g(a); g(b)). It is easily checked
that
d(g(a); g(b)) = max
i2P
d0(gi(a); gi(b)) = max
i2P
d0(ui(a); ui(b)) = max
i2P
d0(ui(a); ui(b));
where u = (ui)i2P = (uiq)i2P;q2Q is de ned in the following way
8a = (aiq)i2P;q2Q 2 RpQ;
uiq(a) = exp
nX
j 6=i
X
`
aj`Lijq`
o
= exp
nX
j 6=i
X
`
aj`

jKij j
q`
+ log 2q`
o
:
In the following, x " > 0 and a; b 2 " and denote by
8i 2 P; ci1 = min
q2Q
aiq
biq
; ci2 = max
q2Q
aiq
biq
:
With these notations, we have
d(a; b) = max
i2P
d0(ai; bi) = max
i2P
log

ci2
ci1

: (33)
Page 31
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 31
We only consider the aliation model described in (5). Thus, there are only
two di erent values for q`, namely in and out for intra and extra cluster
connectivity.
Lemma 3. If for any i; j 2 P; i 6= j and any  2 fin; outg, we have
0 <
jKij j

+ log 2 <
"
2(p 1)(1 + ")
almost surely, (34)
then the function g satis es a contraction property on ".
Before proving the lemma, let us explain the consequences of this result.
Consider the function hK de ned on (0;+1) by
hK() =
jKj

+ log 2:
This function rst decreases from +1 to the value 1 + log 2jKj on the interval
(0; jKj) and then increases from 1 + log 2jKj to +1 on (jKj;+1).
At any step of the algorithm, if the current values K(m)ij of the concentration
matrix are small enough, namely smaller than 1=(2e) ' 0:184 then the functions
h
K(m)ij
take all the values between 1+log 2jKj < 0 and +1. Thus, there is room
for choosing in; out such that (34) is satis ed. In such a case, the xed-point
we are looking for is unique and the iterative procedure setting b (s+1) = g(b (s))
converges.
Proof. Using that for any j 2 P and any ` 2 Q, we have cj1bj`  aj`  c
j
2bj`
and Lijq` > 0, we get
exp
0
@
X
j 6=i
cj1
X
`
bj`Lijq`
1
A  uiq(a)  exp
0
@
X
j 6=i
cj2
X
`
bj`Lijq`
1
A :
Thus, it follows
exp
0
@
X
j 6=i
(cj1 1)
X
`
bj`Lijq`
1
A 
uiq(a)
uiq(b)
 exp
0
@
X
j 6=i
(cj2 1)
X
`
bj`Lijq`
1
A :
(35)
In the case of the aliation model, for xed i; j 2 P and q 2 Q, the set of
random variables fLijq`g`2Q is reduced to only two random values, namely
Linij =
jKij j
in
+ log 2in; L
out
ij =
jKij j
out
+ log 2out:
For the sake of simplicity, we assume Q = 2 groups (our arguments may be
easily generalized to 3 groups or more). Now, denoting Lmaxij = max(L
in
ij ; L
out
ij )
and Lminij = min(L
in
ij ; L
out
ij ), it can easily be seen that (for " < 1=2),
sup
b2"
X
`
bj`Lijq` = (1 ")L
max
ij + "L
min
ij
inf
b2"
X
`
bj`Lijq` = (1 ")L
min
ij + "L
max
ij ;
Page 32
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 32
almost surely. Note that if we have Q  3 groups, explicit bounds can also be
obtained (their expression is only slightly more complicated). Coming back to
(35), we get
exp
0
@
X
j 6=i
(cj1 1)f(1 ")L
min
ij + "L
max
ij g
1
A

uiq(a)
uiq(b)
 exp
0
@
X
j 6=i
(cj2 1)f(1 ")L
max
ij + "L
min
ij g
1
A :
This leads to
d0(ui(a); ui(b)) = log
maxq2Q uiq(a)=uiq(b)
minq2Q uiq(a)=uiq(b)

X
j 6=i
(cj2 1)f(1 ")L
max
ij + "L
min
ij g
X
j 6=i
(cj1 1)f(1 ")L
min
ij + "L
max
ij g

X
j 6=i
Lmaxij fc
j
2 1 "(c
j
2 + c
j
1 2)g+ L
min
ij f1 c
j
1 + "(c
j
2 + c
j
1 2)g:
Finally, recall that d(g(a); g(b)) = maxi d0(ui(a); ui(b)), leading to
d(g(a); g(b))  max
i2P
n
ci2 1 "(c
i
2 + c
i
1 2)

_

1 ci1 + "(c
i
2 + c
i
1 2)
o
max
i2P
X
j 6=i
(Lmaxij + L
min
ij ):
Now, using the inverse triangle inequality, and the fact that ci1  1  c
i
2, we get
for any i 2 P,
jci2 + c
i
1 2j =

jci2 1j j1 c
i
1j

 jci2 c
i
1j = c
i
2 c
i
1:
Moreover, we have 0  ci2 1  c
i
2 c
i
1 and 0  1 c
i
1  c
i
2 c
i
1. This leads to
d(g(a); g(b))  (1 + ") max
i2P
(ci2 c
i
1)max
i2P
X
j 6=i
(Lmaxij + L
min
ij )
 (1 + ") max
i2P
(ci2 c
i
1) 2(p 1) max
j 6=i
Lmaxij : (36)
Since a and b belong to ", we get that ci1; c
i
2 2 ["; "
1] and thus
ci2 c
i
1 = exp(log c
i
2) exp(log c
i
1) 
1
"
log

ci2
ci1

:
In particular, recalling (33), we have
0  max
i2P
ci2 c
i
1 
1
"
d(a; b):
Coming back to (36), we get
d(g(a); g(b))  (1 + "1)2(p 1)

max
j 6=i
Lmaxij

d(a; b): (37)
Now, under assumption (34) the multiplicative random factor (1 + "1)2(p
1) maxj 6=i Lmaxij is strictly smaller than 1.
Page 33
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 33
A.3. Proof of Lemma 2 (Lasso with pathwise coordinate
optimization)
The following is partly based on Friedman et al. (2007). There are various al-
gorithms for solving the Lasso problem. When there is just one predictor, the
Lasso solution is simply given by soft-thresholding (Donoho and Johnstone
1995). The approach used here is based on iterative soft-thresholding with a
\partial residual" as a response variable.
The usual formulation of the Lasso problem is the minimization with respect
to of the quantity
1
2
nX
i=1
0
@yi
pX
j=1
xij j
1
A
2
+ k k`1 ; (38)
where (yi)i=1;:::;n is a vector of response and (xij)i=1;:::;n;j=1;:::;p a matrix of
predictors such that
P
i xij = 0, with no loss of generality. Using a coordinate-
descent approach, we simply write the problem (38) in the form
1
2
nX
i=1
0
@yi
X
k 6=j
xik k xij j
1
A
2
+ 
X
k 6=j
j kj+ j j j
and minimizing this function with respect to j will lead to the solution
j() = S

nX
i=1
xij(yi ~y
(j)
i ); 
!
N2j ;
where ~y(j)i =
P
k 6=j xik k(), the normalizing term N
2
j satis es N
2
j =
Pn
i=1 x
2
ij
and the function S(x; ) = sgn(x)(jxj )+ is the soft-thresholding operator.
This leads to an iterative procedure, repeated on each coordinate of until
stabilization of the full vector. Note that as each coordinate-wise solution is
unique, results from Tseng (2001, Theorem 4.1) imply that the procedure con-
verges.
Now, we want to apply this approach to solve the problem (21), which can
be written
min

1
2




1
p
2
b
1=2
11
p
2b
1=2
11 s12




2
2
+ kp12 ? k`1 : (39)
From the previous lines, the solution for jth entry of is
j(p12) = S
0
@
X
i
(b
1=2
11 )ij
0
@(b
1=2
11 s12)i
1
2
X
k 6=j
(b
1=2
11 )ik k(p12)
1
A ; (p12)j
1
AN2j :
Then, using the symmetry of the matrices, it is easy to see that
P
i(b
1=2
11 )ij(b
1=2
11 s12)i =
P
`(b
1=2
11
b
1=2
11 )j`(s12)` = (s12)j ;
P
i(b
1=2
11 )ij
P
k 6=j(b
1=2
11 )ik k(p12) =
P
k 6=j(b11)jk k(p12);
N2j =
P
i

(b
1=2
11 )ijp
2
2
= (b11=2)jj :
Page 34
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 34
Finally, the solution to (21) is computed by updating the jth coordinate of
via
j(p12) = 2S
0
@(s12)j
1
2
X
k 6=j
(b11)jk k(p12); (p12)j
1
A =(b11)jj ;
and permuting the rows of b until convergence.
A.4. Reconstruction of the concentration matrix
At the end of the block-wise resolution algorithm, a solution b is available. In
order to recover bK, we simply use the fact that b bK = I. Block-wisely, we get
bK12 = b
1
11 b12K22 = K22b =2;
bK22 = 1=(b12 b
|
12
b
1
11 b12) = 1=(b12 b
|
12
b =2);
thanks to the fact that b12 = b11b =2.
To perform this inversion, note that we need to stock the successive solutions
b of the penalized regressions along the algorithm.
A.5. Pseudo-likelihood of a Gaussian vector
It is well known that the distribution of Xki conditional on the remaining vari-
ables Xkni is Gaussian with parameters (
k
i ; i) given by
ki = ini
1
niniX
k
ni; i = ii ini
1
nini
|
ini: (40)
Denoting mi = (1i ; : : : ; 
n
i )
|, we get
log ~L(X;K) =
n
2
pX
i=1
log i
pX
i=1
1
2i
(Xi mi)|(Xi mi) + c:
It is easy to see that m|i = ini
1
niniX
|
ni. Then,
log ~L(X;K) =
n
2
pX
i=1
log i

pX
i=1
1
2i
(X|i Xi 2X
|
i Xni
1
nini
|
ini + ini
1
niniX
|
niXni
1
nini
|
ini) + c:
Note that we have n1X|i Xi = Sii, as well as n
1X|i Xni = Sini and n
1X|niXni =
Snini. Thus,
log ~L(X;K) =
n
2
pX
i=1
log i
n
pX
i=1
1
2i
(Sii 2Sini
1
nini
|
ini + ini
1
niniSnini
1
nini
|
ini) + c: (41)
Page 35
hidden
C. Ambroise, J. Chiquet and C. Matias/Inferring sparse GGM with latent structure 35
Recalling that K = 1, and by reordering the rows and columns of the matri-
ces, as well as using a block-wise notation, this becomes

ii ini
|ini nini



Kii Kini
K|ini Knini

=

1 0
0 Ip1

;
where Ip1 is the identity matrix with size p 1. In particular, this leads to the
identity iiKii = 1iniK
|
ini. Thus
ii = (1iniK
|
ini)=Kii: (42)
In the same way, we can easily get that |iniKii = niniK
|
ini and
1nini
|
ini = K
|
ini=Kii: (43)
Using identities (42), (43) and (40), we obtain
i = (1iniK
|
ini)=Kii + iniK
|
ini=Kii = 1=Kii: (44)
Now, coming back to (41) and using the identities (42), (43) and (44), we nally
obtain the desired result
log ~L(X; K) =
n
2
pX
i=1
logKiin
pX
i=1

Kii
2
Sii + SiniKini +
1
2Kii
KiniSniniK
|
ini

+c:
A.6. Penalization upper bound
The following lemma states that if the penalization parameters 1q` and 
1
0
are chosen large enough (according to the observations), then the penalized
estimator obtained from the Lasso-like iteration step has null entries.
Lemma 4. If for any i; j 2 P we have
X
q;`
ZiqZj`
q`

n
2
jSij j; when i 6= j and
1
0

n
2
jSiij; (45)
then the solution b = bK1 of problem (15) satis es bK1 = 0 .
Proof. The sub-gradient equation arising from (15) gives
8i 6= j;
n
2

bK1ij Sij


0
@
X
q;`
ZiqZj`
q`
1
A ij = 0
and 8i 2 P;
n
2

bK1ii Sii


1
0
ii = 0;
where ij 2 sgn( bKij) and thus ij 2 [1; 1]. In particular, we have
8i 6= j;
n
2


bK1ij Sij



0
@
X
q;`
ZiqZj`
q`
1
A and 8i 2 P;
n
2


bK1ii Sij



1
0
:
Now, if the set of penalty parameters satis es the constraint (45), then the
matrix K1 = 0 satis es the sub-gradient equation. Thus, the conclusion comes
from uniqueness of the solution to (15).

Sign up today - FREE

Mendeley saves you time finding and organizing research. Learn more

  • All your research in one place
  • Add and import papers easily
  • Access it anywhere, anytime

Start using Mendeley in seconds!

Already have an account? Sign in

Readership Statistics

12 Readers on Mendeley
by Discipline
 
 
 
by Academic Status
 
58% Ph.D. Student
 
17% Student (Master)
 
8% Post Doc
by Country
 
42% United States
 
17% France
 
17% Netherlands