Efficiently summarizing relationships in large samples: A general duality between statistics of genealogies and genomes

Abstract

As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies. We use this correspondence to define a general framework that efficiently computes single-site population genetic statistics using the succinct tree sequence encoding of genealogies and genome sequence. The general approach accumulates sample weights within the genealogical tree at each position on the genome, which are then combined using a summary function; different statistics result from different choices of weight and function. Results can be reported in three ways: by site, which corresponds to statistics calculated as usual from genome sequence; by branch, which gives the expected value of the dual site statistic under the infinite sites model of mutation; and by node, which summarizes the contribution of each ancestor to these statistics. We use the framework to implement many currently defined statistics of genome sequence (making the statistics’ relationship to the underlying genealogical trees concrete and explicit), as well as the corresponding branch statistics of tree shape. We evaluate computational performance using simulated data, and show that calculating statistics from tree sequences using this general framework is several orders of magnitude more efficient than optimized matrix-based methods in terms of both run time and memory requirements. We also explore how well the duality between site and branch statistics holds in practice on trees inferred from the 1000 Genomes Project data set, and discuss ways in which deviations may encode interesting biological signals.
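The weight-and-summary-function idea described in the abstract can be illustrated with a minimal sketch on a single tree. Everything below is an assumption made for illustration and is not the authors' implementation: the toy tree, the helper names (`accumulate_weights`, `branch_stat`, `site_stat`), and the two summary functions are hypothetical, and the naive walk to the root ignores the incremental tree sequence algorithm that gives the method its efficiency. Each sample carries a weight of 1, the weights are summed into every ancestral node, and a summary function of that sum is added up either over branches (scaled by branch length) or over the nodes that carry mutations.

```python
from math import comb

# Hypothetical toy tree (not from the paper): child -> parent, plus node times.
# Samples 0 and 1 coalesce in node 3 at time 1; node 3 and sample 2 coalesce
# in the root, node 4, at time 3.
PARENT = {0: 3, 1: 3, 3: 4, 2: 4}
TIME = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0, 4: 3.0}
SAMPLES = [0, 1, 2]


def accumulate_weights(parent, weights):
    """Add each sample's weight to every node on its path to the root."""
    total = {}
    for sample, w in weights.items():
        node = sample
        while node is not None:
            total[node] = total.get(node, 0.0) + w
            node = parent.get(node)  # None once we step past the root
    return total


def branch_stat(parent, time, weights, summary):
    """Branch mode: sum of branch length times summary(weight below branch)."""
    x = accumulate_weights(parent, weights)
    return sum((time[p] - time[c]) * summary(x.get(c, 0.0))
               for c, p in parent.items())


def site_stat(parent, weights, summary, mutation_nodes):
    """Site mode: sum of summary(weight below) over nodes carrying a mutation."""
    x = accumulate_weights(parent, weights)
    return sum(summary(x.get(node, 0.0)) for node in mutation_nodes)


n = len(SAMPLES)
weights = {s: 1.0 for s in SAMPLES}


def diversity(x):
    # A branch with x samples below it separates x * (n - x) pairs of samples.
    return x * (n - x) / comb(n, 2)


def segregating(x):
    # Counts any branch or mutation that splits the samples.
    return 1.0 if 0 < x < n else 0.0


branch_diversity = branch_stat(PARENT, TIME, weights, diversity)
site_diversity = site_stat(PARENT, weights, diversity, mutation_nodes=[3, 2])
total_branch_length = branch_stat(PARENT, TIME, weights, segregating)

print(branch_diversity)     # mean pairwise path length: 14/3
print(site_diversity)       # mean pairwise differences: 4/3
print(total_branch_length)  # 7.0
```

In this sketch, changing only the summary function switches the statistic from pairwise diversity to a count of segregating branches, while changing only the aggregation switches between the branch and site versions of the same statistic.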

Citation (APA)

Ralph, P., Thornton, K., & Kelleher, J. (2020). Efficiently summarizing relationships in large samples: A general duality between statistics of genealogies and genomes. Genetics, 215(3), 779–797. https://doi.org/10.1534/genetics.120.303253
