Social Networks 31 (2009) 204–213

Representing degree distributions, clustering, and homophily in social networks with latent cluster random effects models

Pavel N. Krivitsky *,1,2, Mark S. Handcock 1, Adrian E. Raftery 3, Peter D. Hoff 4

University of Washington, Seattle, United States

Keywords: Bayesian inference; Latent variable; Markov chain Monte Carlo; Model-based clustering; Small world network; Scale-free network

Abstract

Social network data often involve transitivity, homophily on observed attributes, community structure, and heterogeneity of actor degrees. We propose a latent cluster random effects model to represent all of these features, and we develop Bayesian inference for it. The model is applicable to both binary and non-binary network data. We illustrate the model using two real datasets: liking between monks and coreaderships between Slovenian publications. We also apply it to two simulated network datasets with very different network structure but the same highly skewed degree sequence generated from a preferential attachment process. One has transitivity and community structure while the other does not. Models based solely on degree distributions, such as scale-free, preferential attachment and power-law models, cannot distinguish between these very different situations, but the latent cluster random effects model does.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Social network data consist of data about pairs of actors or nodes. Often these data represent the presence, absence, or value of a relationship between pairs of actors, such as liking, respect, familial relationship, shared membership in a group of individuals, or volume of trade for collectivities such as countries or companies.
In this article we primarily consider binary social network data, representing presence or absence of a relationship, and count data, representing the number of times a relationship between a pair of actors was observed. The methods we develop can also be extended to accommodate other types of relational data.

Social network data often share a number of features. One of these is transitivity, for example the fact that if actor A relates to actor B and actor B relates to actor C, then actor A is more likely to relate to actor C. Another is homophily on observed attributes, according to which actors with similar characteristics are more likely to relate. A third feature is clustering, in which actors cluster into groups such that ties are denser within groups than between them. This has also been referred to as community structure (Newman, 2003). Clustering can be due to social self-organization or to homophily on unobserved attributes, such as interest in the same sport, about which the analyst might not have information. A fourth feature is degree heterogeneity, namely the tendency of some actors to send and/or receive links more than others.

* Corresponding author at: University of Washington, Department of Statistics, Box 354322, Seattle, WA 98195-4322, USA. Tel.: +1 206 543 8797; fax: +1 206 685 7419. E-mail address: pavel@stat.washington.edu (P.N. Krivitsky).
1 Supported by NIDA Grant DA012831, DoD ONR MURI award N00014-08-1-1015 and NICHD Grant HD041877.
2 Supported by NIH Grant 8 R01EB 002137-02 and NSF Grant 0729438.
3 Supported by NIH Grant 8 R01EB 002137-02 and NICHD Grant R01 HD054511.
4 Supported by NSF Grant 0631531.

Hoff et al. (2002) proposed the latent space model for social networks. This postulates an unobserved Euclidean social space in which each actor has a position. The probability of a link between a pair of actors depends on the distance between them in the space and on their observed characteristics.
Inference for the model involves estimating both the characteristics of the latent positions and the parameters of the model specifying how the probability of a link depends on distance and observed attributes. The model accounts for transitivity automatically through the latent space and is flexible enough to include the other common features of social network data. It was extended by Handcock et al. (2007), hereafter HRT, to include model-based clustering of the latent space positions, giving a way to detect groups of actors, or so-called community structure. Hoff (2005) added random sender and receiver effects to model inhomogeneity of the actors, similar to those in the $p_2$ model (van Duijn et al., 2004), and described its generalized linear model formulation, applying it to non-binary data.

doi:10.1016/j.socnet.2009.04.001

No model proposed so far captures all four common features of social network data noted above: homophily, transitivity, community structure and heterogeneity in actor degrees. In this paper, we propose the latent cluster random effects model, which explicitly models all four features by adding the random sender and
receiver or sociality effects as proposed by Hoff (2005) to HRT's latent position cluster model. We apply it to count data as well as binary network data.

In Section 2, we introduce the latent cluster random effects model. In Section 3, we describe our Bayesian method for estimating it using Markov chain Monte Carlo, as well as heuristics for prior and starting value selection. In Section 4, we illustrate the model using two real network datasets, one binary and the other consisting of counts. We also apply our method to two simulated networks with the same, highly skewed degree distribution, but very different network behaviors: one unstructured and the other exhibiting transitivity and clustering. Currently popular methods based on degree distributions cannot distinguish between these situations, but our model does.

2. The latent cluster random effects model for social networks

We first review the latent position cluster model of HRT, and then expand it to allow for actor-specific random effects. The data we model consist of $y_{i,j}$, the value of the relation from actor $i$ to actor $j$ for each dyad consisting of two of the $n$ actors. These form the elements of the $n \times n$ sociomatrix $Y$. There may also be dyadic-level covariate information represented by $p$ matrices $x = \{x_k\}_{k=1}^{p} \in \mathbb{R}^{n \times n \times p}$. Both directed and undirected relations can be analyzed with our methods, although the models are slightly different in the two cases.

The model posits that each actor $i$ has an unobserved position, $Z_i$, in a $d$-dimensional Euclidean latent social space, as in Hoff et al. (2002) and HRT. We then assume that the tie values are stochastically independent given the distances between the actors' positions. Specifically, for binary data,

$$\operatorname{logit}\bigl(\Pr(Y_{i,j} = 1 \mid Z, x, \beta)\bigr) \equiv \eta_{i,j} = \sum_{k=1}^{p} \beta_k x_{k,i,j} - \lVert Z_i - Z_j \rVert, \qquad (1)$$

where $\operatorname{logit}(p) = \log(p/(1 - p))$ and $\beta$ denotes a vector of regression parameters to be estimated.
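As a concrete reading of Eq. (1), the following minimal Python sketch computes the tie probability for a single dyad. The function name `tie_probability`, its argument names, and the example values are assumptions made for illustration only; they do not come from the paper or from any software accompanying it.

```python
import math


def tie_probability(z_i, z_j, x_ij, beta):
    """Tie probability for one dyad under Eq. (1).

    z_i, z_j: latent positions (sequences of d floats).
    x_ij:     the p dyadic covariate values x_{k,i,j}.
    beta:     the p regression coefficients.
    """
    linear_term = sum(b * x for b, x in zip(beta, x_ij))
    distance = math.dist(z_i, z_j)  # Euclidean norm ||Z_i - Z_j||
    eta = linear_term - distance    # linear predictor eta_{i,j}
    # Inverse logit, written to avoid overflow for large |eta|.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


# Hypothetical example: one intercept-like covariate, d = 2.
beta = [1.0]
p_near = tie_probability((0.0, 0.0), (0.5, 0.0), [1.0], beta)
p_far = tie_probability((0.0, 0.0), (3.0, 4.0), [1.0], beta)
print(p_near, p_far)
```

With the covariates held fixed, the probability depends on the positions only through their distance, so the nearby pair is more likely to be tied than the distant pair. This is also how the latent space induces transitivity: if two actors are both close to a third, the triangle inequality keeps them close to each other.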
The model accounts for transitivity, homophily on the observed attributes $x$, as well as potential homophily on unobserved attributes via the latent space.

As in HRT, we allow for clustering in the $Z_i$ via a finite spherical multivariate normal mixture:

$$Z_i \overset{\text{i.i.d.}}{\sim} \sum_{g=1}^{G} \lambda_g \, \mathrm{MVN}_d(\mu_g, \sigma_g^2 I_d), \quad i = 1, \ldots, n, \qquad (2)$$

where $\lambda_g$ is the probability that an actor belongs to the $g$th group, so that $\lambda_g \ge 0$ $(g = 1, \ldots, G)$ and $\sum_{g=1}^{G} \lambda_g = 1$, and $I_d$ is the $d \times d$ identity matrix. Thus the position of each actor is drawn from one of $G$ groups, where each group is centered on a different mean and dispersed with a different variance.

To represent heterogeneity in the propensity for actors to form ties not captured by the dyad-level covariates or actor positions, we introduce actor-specific random effects. The nature of the effects differs for directed and undirected relationships. For an undirected relationship, each actor $i$ has a latent "sociality" denoted by $\delta_i$, representing his or her propensity to form ties with other actors. The effect of these random effects on the propensity to form ties is modeled as follows:

$$\eta_{i,j} = \sum_{k=1}^{p} \beta_k x_{k,i,j} - \lVert Z_i - Z_j \rVert + \delta_i + \delta_j. \qquad (3)$$

The sociality $\delta_i$ is then the conditional log-odds ratio of an actor $i$ having a tie with another actor compared to an actor with similar position and covariates but having $\delta = 0$.

This model can also be used for directed relationships. In that case we define both sender and receiver random effects, $\delta_i$ and $\gamma_i$, representing actor $i$'s propensity to send and receive links, respectively. The model then becomes:

$$\eta_{i,j} = \sum_{k=1}^{p} \beta_k x_{k,i,j} - \lVert Z_i - Z_j \rVert + \delta_i + \gamma_j, \qquad (4)$$

where

$$\delta_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma_\delta^2), \quad i = 1, \ldots, n, \qquad \gamma_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma_\gamma^2), \quad i = 1, \ldots, n,$$

and the variances $\sigma_\delta^2$ and $\sigma_\gamma^2$ measure heterogeneity in the propensity to send and receive links, respectively. The use of random effects in the latent space model was proposed by Hoff (2003), and van Duijn et al. (2004) made a similar proposal for the $p_2$ model.

3. Estimation

3.1. Bayesian estimation and prior distributions

We propose a Bayesian approach to estimate the latent cluster random effects model given by (1), (2), and either (3) or (4). The approach estimates the latent positions, the clustering model and the actor-specific effects simultaneously. We implement the methods computationally using a Markov chain Monte Carlo (MCMC) algorithm.

We introduce the new variables $K_i$, equal to $g$ if the $i$th actor belongs to the $g$th group, as is standard in Bayesian estimation of mixture models (Diebolt and Robert, 1994). We specify prior distributions as follows:

$$\begin{aligned}
\beta &\sim \mathrm{MVN}_p(\xi, \Psi), \\
\lambda &\sim \mathrm{Dirichlet}(\nu), \\
\sigma_\delta^2 &\sim \alpha_\delta \sigma_{0,\delta}^2 \, \mathrm{Inv}\chi^2_{\alpha_\delta}, \\
\sigma_\gamma^2 &\sim \alpha_\gamma \sigma_{0,\gamma}^2 \, \mathrm{Inv}\chi^2_{\alpha_\gamma}, \\
\sigma_g^2 &\overset{\text{i.i.d.}}{\sim} \alpha_Z \sigma_{0,Z}^2 \, \mathrm{Inv}\chi^2_{\alpha_Z}, \quad g = 1, \ldots, G, \\
\mu_g &\overset{\text{i.i.d.}}{\sim} \mathrm{MVN}_d(0, \omega^2 I_d), \quad g = 1, \ldots, G,
\end{aligned}$$

where $\xi$, $\Psi$, $\nu = (\nu_1, \ldots, \nu_G)$, $\sigma_{0,Z}^2$, $\alpha_Z$, $\sigma_{0,\delta}^2$, $\alpha_\delta$, $\sigma_{0,\gamma}^2$, $\alpha_\gamma$, and $\omega^2$ are hyperparameters to be specified by the user.

We set $\nu_g$ equal to the smallest group size we are willing to consider for the network of interest, and $\xi = 0$ and $\Psi = 9I$, which allows a wide range of values of $\beta$. The other hyperparameters are not so clear-cut. Heuristically, networks with larger clusters call for greater prior variances, and it is helpful to have slightly stronger priors for larger clusters, but as a network gets larger, the role of the prior variances in determining the posterior variances should decline. The hyperparameter choices we use reflect these intuitions. This is discussed in more detail by Krivitsky and Handcock (2008a), and we use the hyperparameters $\sigma_{0,Z}^2 = \tfrac{1}{8} \sqrt[d/2]{n/G}$, $\alpha_Z = \sqrt{n/G}$, $\omega^2 = \tfrac{1}{4} \sqrt[d/2]{n}$, and $\nu_g = \sqrt{n/G}$.

3.2. Markov chain Monte Carlo algorithm

Our MCMC algorithm iterates over the model parameters with the priors given above, the latent positions $Z_i$, the random effects $\delta_i$ and $\gamma_i$, and the group memberships $K_i$. We update variables in turn, and block-update those we expect to be highly correlated.
For those variables for which a conjugate prior was specified, full conditional updates are used. The others are updated using Metropolis-Hastings. We describe these in turn.
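To make the Metropolis-Hastings step concrete, the sketch below shows one possible random-walk update for a single sender effect $\delta_i$ in the directed binary model (4), using its $N(0, \sigma_\delta^2)$ prior. This is an illustration under stated assumptions, not the algorithm described in this paper: the symmetric Gaussian proposal, the step size `step`, the array `eta_rest` (all terms of Eq. (4) other than $\delta_i$), and every function name are hypothetical, and no block updating is attempted.

```python
import math
import random


def log_sigmoid(eta):
    """Numerically stable log(1 / (1 + exp(-eta)))."""
    if eta >= 0:
        return -math.log1p(math.exp(-eta))
    return eta - math.log1p(math.exp(eta))


def log_full_conditional_delta(delta_i, i, y, eta_rest, sigma2_delta):
    """Log full conditional of delta_i, up to an additive constant.

    y[i][j] is the 0/1 tie from i to j. eta_rest[i][j] is assumed to hold
    sum_k beta_k x_{k,i,j} - ||Z_i - Z_j|| + gamma_j, i.e. Eq. (4) minus delta_i.
    """
    log_prior = -0.5 * delta_i ** 2 / sigma2_delta  # N(0, sigma2_delta) prior
    log_lik = 0.0
    for j in range(len(y)):
        if j == i:
            continue  # self-ties are not modeled
        eta = eta_rest[i][j] + delta_i
        # Bernoulli log-likelihood with logit link: log p or log(1 - p).
        log_lik += log_sigmoid(eta) if y[i][j] else log_sigmoid(-eta)
    return log_prior + log_lik


def mh_update_delta(delta_i, i, y, eta_rest, sigma2_delta, step, rng):
    """One random-walk Metropolis-Hastings step; returns the new delta_i."""
    proposal = delta_i + rng.gauss(0.0, step)  # symmetric proposal
    log_ratio = (
        log_full_conditional_delta(proposal, i, y, eta_rest, sigma2_delta)
        - log_full_conditional_delta(delta_i, i, y, eta_rest, sigma2_delta)
    )
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return proposal  # accept
    return delta_i       # reject: keep the current value


# Hypothetical toy data: actor 0 sends a tie to every other actor.
y = [[0, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
eta_rest = [[0.0] * 4 for _ in range(4)]
rng = random.Random(1)
delta, draws = 0.0, []
for _ in range(3000):
    delta = mh_update_delta(delta, 0, y, eta_rest, 4.0, 1.0, rng)
    draws.append(delta)
print(sum(draws[500:]) / len(draws[500:]))
```

Because the proposal is symmetric, the acceptance ratio reduces to the ratio of full conditionals, so only the prior of $\delta_i$ and the likelihood terms for dyads sent by actor $i$ are needed. In the toy data the actor sends every possible tie, so the sampled values concentrate on positive sender effects.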