A unified genealogy of modern and ancient genomes: Unified, inferred tree sequences of 1000 Genomes, Human Genome Diversity, and Simons Genome Diversity Projects

DOI: 10.5281/zenodo.5495535, Zenodo: 5495535, MaRDI QID: Q6682900, FDO: Q6682900

Dataset published at Zenodo repository.

Swapan Mallick, Ron Pinhasi, Ben Jeffery, David Reich, Ali Akbari, Anthony W Wohns, Nick Patterson, Jerome Kelleher, Gilean McVean, Yan Wong

Publication date: 8 September 2021

Copyright license: Creative Commons Attribution 4.0 International

Description

Unified, inferred tree sequences built from the 1000 Genomes phase 3, Human Genome Diversity, and Simons Genome Diversity Projects. Each tree sequence covers one arm of an autosome (the short arms of acrocentric chromosomes are not included). Tree sequences were inferred using tsinfer version 0.2.1, dated using tsdate version 0.1.4, and compressed using tszip. All data is in GRCh38. The full data pipeline used to generate these tree sequences and associated metadata is available on GitHub. A description can be found in the Supplementary Material of Wohns et al. (2021).

Tree sequences can be decompressed as follows:

    $ tsunzip hgdp_tgp_sgdp_chr1_p.dated.trees.tsz

Once decompressed, trees files can be loaded and processed in Python using tskit:

    import tskit
    ts = tskit.load("hgdp_tgp_sgdp_chr1_p.dated.trees")
    # ts is an instance of tskit.TreeSequence
    print("The short arm of chromosome 1 contains {} trees".format(ts.num_trees))

Metadata associated with nodes contains the mean and variance of tsdate's posterior distribution on node time. To access these values, we can use:

    import json
    node = ts.node(10000)
    metadata_dict = json.loads(node.metadata)
    print("The mean of the posterior distribution on the age of node 10000 is {} generations".format(metadata_dict["mn"]))
    print("The variance of the posterior distribution on the age of node 10000 is {} generations".format(metadata_dict["vr"]))

Age estimates for each variant site can be derived from the mean of the age estimates of the upper and lower bounding nodes of the oldest mutation associated with a site. tsdate includes a function to find the age estimates of all sites in the tree sequence:

    import tsdate
    site_times = tsdate.sites_time_from_ts(ts, node_selection='arithmetic')

This returns a numpy array which has a length equal to the number of sites.
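The averaging described above, where a site's age is taken from the ages of the two nodes bounding its oldest mutation, can be illustrated with a minimal sketch. This is not tsdate's implementation: site_age is a hypothetical helper written for illustration, and its option names are only modelled on the node_selection values that tsdate documents ("child", "parent", "arithmetic", "geometric").

```python
import math


def site_age(child_time, parent_time, node_selection="arithmetic"):
    """Illustrative only: combine the ages (in generations) of the nodes
    below (child) and above (parent) the branch carrying a mutation.

    Hypothetical helper, not part of tsdate or tskit.
    """
    if parent_time < child_time:
        raise ValueError("the parent node must be at least as old as the child")
    if node_selection == "child":
        return child_time
    if node_selection == "parent":
        return parent_time
    if node_selection == "arithmetic":
        # Midpoint of the branch on a linear time scale.
        return (child_time + parent_time) / 2
    if node_selection == "geometric":
        # Midpoint of the branch on a logarithmic time scale.
        return math.sqrt(child_time * parent_time)
    raise ValueError("unknown node_selection: {}".format(node_selection))


# Example with made-up node ages of 100 and 300 generations.
print(site_age(100, 300))
```

The arithmetic option places the mutation halfway along its branch; the geometric option gives a younger estimate whenever the two node ages differ.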
Accessing variant sites in the tree sequence provides the position and ID of variants:

    site = ts.site(1000)
    site_metadata = json.loads(site.metadata)
    print("The position of site 1000 is {} and its ID is {}.".format(site.position, site_metadata["ID"]))

Metadata associated with individuals and populations was derived from the original sources (TGP, HGDP, and SGDP) and converted to JSON form. For example, to access individual metadata we can use:

    ind = ts.individual(0)
    metadata_dict = json.loads(ind.metadata)

The metadata_dict variable will now contain all the metadata for the individual with ID 0 as a dictionary. Metadata associated with populations can be found in a similar way. Population IDs are associated with individuals via their constituent nodes. For example:

    pop_metadata = [json.loads(pop.metadata) for pop in ts.populations()]
    ind_node = ts.node(ind.nodes[0])
    ind_pop_metadata = pop_metadata[ind_node.population]

After this, the ind_pop_metadata variable will contain the population-level metadata for individual ID 0.
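Because population IDs are reached through an individual's nodes, the same lookup can be repeated over all individuals to tally how many belong to each population. The following is a minimal sketch under stated assumptions: individuals_per_population is a hypothetical helper, it assumes that all nodes of an individual share one population (so only the first node is inspected), and it only relies on the individuals() and node() methods of a loaded tree sequence.

```python
from collections import Counter


def individuals_per_population(ts):
    """Count individuals per population ID (hypothetical helper).

    `ts` is expected to behave like a tskit.TreeSequence. Individuals
    without nodes are skipped; nodes with no assigned population are
    counted under whatever ID the node reports for that case.
    """
    counts = Counter()
    for ind in ts.individuals():
        if len(ind.nodes) == 0:
            continue
        # Assumption: every node of an individual has the same population.
        first_node = ts.node(int(ind.nodes[0]))
        counts[first_node.population] += 1
    return counts


# Usage, once a tree sequence has been loaded as `ts`:
#   counts = individuals_per_population(ts)
#   for pop_id, n in sorted(counts.items()):
#       print(pop_id, n)
```

The keys of the returned counter are population IDs, so they can be used to index a list of decoded population metadata such as pop_metadata.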
