Background: Large-scale multi-ethnic DNA sequencing data is increasingly available owing to the decreasing cost of modern sequencing technologies. Inferring population structure from such sequencing data is fundamentally important. However, the ultra-high dimensionality and complicated linkage disequilibrium patterns across the whole genome make it challenging to infer population structure using traditional principal component analysis (PCA)-based methods and software.

Results: We present the ERStruct Python package, which enables the inference of population structure using whole-genome sequencing data. By leveraging parallel computing and GPU acceleration, our package significantly speeds up matrix operations on large-scale data. Additionally, our package features adaptive data splitting to facilitate computation on GPUs with limited memory.

Conclusion: Our Python package ERStruct is an efficient and user-friendly tool for estimating the number of top informative principal components that capture population structure from whole-genome sequencing data.
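To make the idea of data splitting concrete, the following is a minimal sketch and not the ERStruct implementation or its API. It assumes a genotype matrix with individuals in rows and markers in columns, and accumulates the individual-by-individual Gram matrix one block of markers at a time, so that only one block needs to be held in working memory. The function name `gram_in_blocks`, the parameter `block_size`, and the simulated data are hypothetical; plain NumPy on the CPU stands in for a GPU array library.

```python
import numpy as np


def gram_in_blocks(genotypes, block_size):
    """Accumulate the n x n Gram matrix over column blocks of markers.

    genotypes: (n individuals, p markers) array of allele counts.
    block_size: number of markers processed per block (hypothetical knob).
    """
    n, p = genotypes.shape
    gram = np.zeros((n, n), dtype=np.float64)
    for start in range(0, p, block_size):
        # Only this slice of markers is converted and held as floats.
        block = genotypes[:, start:start + block_size].astype(np.float64)
        # Center each marker across individuals; this is exact per block
        # because the mean is taken over rows, not over columns.
        block -= block.mean(axis=0)
        gram += block @ block.T
    return gram / p


# Simulated allele counts in {0, 1, 2}; purely illustrative data.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(20, 1000))

G_blocked = gram_in_blocks(X, block_size=128)

# Reference computation without splitting, for comparison.
Xc = X - X.mean(axis=0)
G_full = Xc @ Xc.T / X.shape[1]

# Eigenvalues of the Gram matrix, sorted from largest to smallest.
eigenvalues = np.linalg.eigvalsh(G_blocked)[::-1]
```

Because the Gram matrix is a sum over markers, the blocked result matches the unsplit computation up to floating-point rounding, and the block size can be chosen to fit the available memory. How the number of informative components is then selected from the eigenvalues is specific to the package and is not shown here.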
CITATION STYLE
Yang, J., Xu, Y., Yao, M., Wang, G., & Liu, Z. (2023). ERStruct: a fast Python package for inferring the number of top principal components from whole genome sequencing data. BMC Bioinformatics, 24(1). https://doi.org/10.1186/s12859-023-05305-0