A unified genealogy of modern and ancient genomes: Unified, inferred tree sequences of 1000 Genomes, Human Genome Diversity, and Simons Genome Diversity Projects with ancient samples

DOI: 10.5281/zenodo.5512994 | Zenodo: 5512994 | MaRDI QID: Q6682907 | FDO: Q6682907

Dataset published at the Zenodo repository.

David Reich, Nick Patterson, Ali Akbari, Ron Pinhasi, Gilean McVean, Anthony W Wohns, Jerome Kelleher, Yan Wong, Ben Jeffery, Swapan Mallick

Publication date: 16 September 2021

Copyright license: Creative Commons Attribution 4.0 International

Description

Unified, inferred tree sequences built from the 1000 Genomes phase 3, Human Genome Diversity, and Simons Genome Diversity Projects, together with high-coverage sequenced ancient samples. The ancient samples are the Altai, Chagyrskaya, and Vindija Neanderthals, the Denisovan, and a high-coverage family of four from the Afanasievo Culture. Each tree sequence covers one arm of an autosome (the short arms of acrocentric chromosomes are not included).

Tree sequences were inferred with tsinfer version 0.2.1 and tsdate version 0.1.4, as described in Wohns et al. (2021). The files were compressed using tszip. All data use the GRCh38 reference assembly. The full data pipeline used to generate these tree sequences and the associated metadata is available on GitHub. A description can be found in the Supplementary Material of Wohns et al. (2021).

Tree sequences can be decompressed as follows:

    $ tsunzip hgdp_tgp_sgdp_high_cov_ancients_chr1_p.dated.trees.tsz

Once decompressed, trees files can be loaded and processed in Python using tskit:

    import tskit

    ts = tskit.load("hgdp_tgp_sgdp_high_cov_ancients_chr1_p.dated.trees")
    # ts is an instance of tskit.TreeSequence
    print("The short arm of chromosome 1 contains {} trees".format(ts.num_trees))

Accessing variant sites in the tree sequence provides the position and ID of variants:

    import json

    site = ts.site(1000)
    site_metadata = json.loads(site.metadata)
    print("The position of site 1000 is {} and its ID is {}.".format(site.position, site_metadata["ID"]))

Metadata associated with individuals and populations was derived from the original sources (TGP, HGDP, and SGDP) and converted to JSON form. For example, to access individual metadata we can use:

    ind = ts.individual(0)
    metadata_dict = json.loads(ind.metadata)

The metadata_dict variable will now contain all the metadata for the individual with ID 0 as a dictionary.

Metadata associated with populations can be found in a similar way. Population IDs are associated with individuals via their constituent nodes. For example:

    pop_metadata = [json.loads(pop.metadata) for pop in ts.populations()]
    ind_node = ts.node(ind.nodes[0])
    ind_pop_metadata = pop_metadata[ind_node.population]

After this, the ind_pop_metadata variable will contain the population-level metadata for individual ID 0.
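The individual-to-population lookup described above can be wrapped in small helpers, for instance to tally how many individuals fall in each population. The following is a minimal sketch, not part of the published pipeline. It assumes metadata is stored as JSON, as the description states, and that an individual's population is the one recorded on its first node. The helper names and the "name" metadata key are hypothetical; inspect the decoded population metadata to find the key actually used in these files.

```python
import json
from collections import Counter


def decode_metadata(raw):
    """Decode a JSON metadata field; return {} when the field is empty."""
    if isinstance(raw, dict):
        return raw  # already decoded, e.g. when a metadata schema is set
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return json.loads(raw) if raw else {}


def population_of_individual(ts, individual_id):
    """Return population metadata for an individual, via its first node.

    `ts` is expected to behave like a tskit.TreeSequence. Returns None if
    the individual has no nodes or its node has no population assigned.
    """
    ind = ts.individual(individual_id)
    if len(ind.nodes) == 0:
        return None
    node = ts.node(ind.nodes[0])
    if node.population == -1:  # -1 is tskit.NULL (no population)
        return None
    return decode_metadata(ts.population(node.population).metadata)


def count_individuals_per_population(ts, key="name"):
    """Count individuals per population, labelled by a metadata key.

    `key` is an assumed (hypothetical) field name in the population metadata.
    """
    counts = Counter()
    for ind in ts.individuals():
        pop_md = population_of_individual(ts, ind.id)
        label = "unknown" if pop_md is None else pop_md.get(key, "unknown")
        counts[label] += 1
    return counts
```

With a decompressed file loaded through tskit.load(...) as ts, calling population_of_individual(ts, 0) should give the same dictionary as ind_pop_metadata in the example above, and count_individuals_per_population(ts) returns a Counter keyed by population label.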
