The art of data augmentation

  • Dyk D
  • Meng X


The term data augmentation refers to methods for constructing iterative optimization or sampling algorithms via the introduction of unobserved data or latent variables. For deterministic algorithms, the method was popularized in the general statistical community by the seminal article by Dempster, Laird, and Rubin on the EM algorithm for maximizing a likelihood function or, more generally, a posterior density. For stochastic algorithms, the method was popularized in the statistical literature by Tanner and Wong's Data Augmentation algorithm for posterior sampling and in the physics literature by Swendsen and Wang's algorithm for sampling from the Ising and Potts models and their generalizations; in the physics literature, the method of data augmentation is referred to as the method of auxiliary variables. Data augmentation schemes were used by Tanner and Wong to make simulation feasible and simple, while auxiliary variables were adopted by Swendsen and Wang to improve the speed of iterative simulation. In general, however, constructing data augmentation schemes that result in both simple and fast algorithms is a matter of art in that successful strategies vary greatly with the (observed-data) models being considered. After an overview of data augmentation/auxiliary variables and some recent developments in methods for constructing such efficient data augmentation schemes, we introduce an effective search strategy that combines the ideas of marginal augmentation and conditional augmentation, together with a deterministic approximation method for selecting good augmentation schemes. We then apply this strategy to three common classes of models (specifically, multivariate t, probit regression, and mixed-effects models) to obtain efficient Markov chain Monte Carlo algorithms for posterior sampling. 
We provide theoretical and empirical evidence that the resulting algorithms, while requiring similar programming effort, can show dramatic improvement over the Gibbs samplers commonly used for these models in practice. A key feature of all these new algorithms is that they are positive recurrent subchains of nonpositive recurrent Markov chains constructed in larger spaces.
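As a rough illustration of the general idea described above (not the marginal or conditional augmentation algorithms the article proposes), the following minimal sketch shows a standard data augmentation Gibbs sampler for the location of a univariate t model. It assumes known degrees of freedom and scale and a flat prior on the location, and writes the t distribution as a scale mixture of normals with latent weights q_i. The function name t_location_da_sampler and all parameter names are hypothetical.

```python
import numpy as np

def t_location_da_sampler(y, nu=4.0, sigma=1.0, n_iter=2000, seed=0):
    """Sketch of a data augmentation Gibbs sampler for mu in
    y_i ~ t_nu(mu, sigma^2), with nu and sigma assumed known and a
    flat prior on mu (illustrative assumptions, not the article's setup).

    Augmented model: y_i | mu, q_i ~ N(mu, sigma^2 / q_i),
                     q_i ~ Gamma(shape=nu/2, rate=nu/2).
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    mu = np.median(y)              # starting value
    draws = np.empty(n_iter)
    for t in range(n_iter):
        # Step 1: draw the latent weights given mu.
        # q_i | y_i, mu ~ Gamma(shape=(nu+1)/2, rate=(nu + z_i^2)/2),
        # where z_i = (y_i - mu) / sigma. NumPy uses scale = 1 / rate.
        rate = 0.5 * (nu + ((y - mu) / sigma) ** 2)
        q = rng.gamma(shape=0.5 * (nu + 1.0), scale=1.0 / rate)
        # Step 2: draw mu given the weights.
        # mu | y, q ~ N(sum(q*y)/sum(q), sigma^2 / sum(q)).
        q_sum = np.sum(q)
        mu = rng.normal(np.sum(q * y) / q_sum, sigma / np.sqrt(q_sum))
        draws[t] = mu
    return draws
```

Each iteration alternates between imputing the unobserved weights and sampling the parameter from a simple normal conditional, which is what makes the augmented scheme easy to program; how quickly such a chain mixes depends on how the augmentation is chosen.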

Author-supplied keywords

  • Auxiliary variables
  • Conditional augmentation
  • EM algorithm
  • Gibbs sampler
  • Haar measure
  • Hierarchical models
  • Marginal augmentation
  • Markov chain Monte Carlo
  • Mixed-effects models
  • Nonpositive recurrent Markov chain
  • Posterior distributions
  • Probit regression
  • Rate of convergence


  • David A. van Dyk

  • Xiao-Li Meng
