Data for Extremely sparse models of linkage disequilibrium in ancestrally diverse association studies

From MaRDI portal



DOI: 10.5281/zenodo.8157131
Zenodo: 8157131
MaRDI QID: Q6682973
FDO: Q6682973

Dataset published in the Zenodo repository.

Luke O'Connor, Jenna Ballard, Eric Lander, Anthony W Wohns, Pouria Salehi Nowbandegani, Alex Bloemendal, Ben Neale

Publication date: 18 July 2023

Copyright license: Creative Commons Attribution 4.0 International



Data from Extremely sparse models of linkage disequilibrium in ancestrally diverse association studies (2023). This includes linkage disequilibrium graphical models (LDGMs) created from high-coverage 1000 Genomes Project sequencing data. This dataset consists of LDGM precision matrices, LDGM graphical models of SNPs, and lists of SNPs, all split into 1,361 approximately independent LD blocks across the genome. The dataset additionally contains genotype information from chromosomes 21 and 22, and inferred tree sequences of high-coverage 1000 Genomes Project data, summary statistics from four traits in the UK Biobank, and UK Biobank correlation matrices from chromosomes 21 and 22. All genomic data is in the GRCh38 build.

The data can be cited as follows: Pouria Salehi Nowbandegani, Anthony Wilder Wohns, Jenna L. Ballard, Eric S. Lander, Alex Bloemendal, Benjamin M. Neale, and Luke J. O'Connor. Extremely sparse models of linkage disequilibrium in ancestrally diverse association studies. Nat Genet. (2023) DOI: 10.1038/s41588-023-01487-8

The directory contains `.tar.gz` files, which can be extracted and unzipped with:

$ tar -xvf FILENAME.tar.gz

All LD block files are named by chromosome and start/end basepair coordinates.

1kg_nygc_trios_removed_All_pops_geno_ids_pops.csv: The file contains 5008 rows, 2 for each individual in the 1000 Genomes Project. Each row contains the individual ID of the 1000 Genomes individual, and the ancestry group and continental ancestry group that individual was assigned to. Rows correspond to columns in `.genos` files.

AFR/AMR/EAS/EUR/SAS.precision.tar.gz: Precision matrices for the relevant ancestry group for each LD block. Edge lists contain one row for each non-zero entry of the precision matrix. There are no column names.

genos_chr21_22.tar.gz: For the 40 LD blocks on chromosomes 21-22, `.genos` files are 0/1 matrices, with dimension number-of-SNPs by number-of-samples.
Each LD matrix contains one column for each row in the SNP list files, and one row for each row in the sample ID files.

ldgms.tar.gz: 1361 LDGMs (*.edgelist files). Edge lists contain one row for each non-zero entry of the LDGM adjacency matrix. There is one LDGM edge list for each LD block. Each row represents an edge, as a tuple (index_1, index_2, entry). For the LDGM adjacency matrices, the entry is the edge weight, where 0 represents a strong dependency and e.g. 6 represents a weak dependency.

snplists_GRch38positions.tar.gz: 1361 *.snplist files, each of which contains information on the SNPs in each LD block. Each SNP list is an n x 11 table (n = number of SNPs), one for each LD block. The columns are:
  index: these non-unique indices, starting at zero, correspond to rows and columns of the LDGMs. There can be multiple SNPs for a single index, which occurs when the corresponding mutations occur on the same brick of the bricked tree sequence. SNPs with the same index have high (nearly perfect) LD.
  anc_alleles: ancestral allele
  deriv_alleles: derived allele
  EUR: allele frequency of derived allele in EUR samples
  EAS: allele frequency of derived allele in EAS samples
  AMR: allele frequency of derived allele in AMR samples
  SAS: allele frequency of derived allele in SAS samples
  AFR: allele frequency of derived allele in AFR samples
  site_ids: unique identifier of each SNP, mostly as RSIDs
  position: GRCh38 position of SNP
  swap: indicates strandness swap

ukb.tar: Correlation matrices and SNP lists for SNPs in the UK Biobank.
  correlation_matrices/: Correlation matrices for SNPs in the UK Biobank, computed by Weissbrod et al. 2020 Nat Genet; they can be downloaded by following the instructions here.
  snplists/: List of SNPs in the *.snplist format included in the UK Biobank.

tree_seqs.tar: Contains 22 tree sequences inferred by tsinfer from the 30x 1000 Genomes Project data. Tree sequences can be unzipped with tszip.
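As a rough illustration of how the edge lists and SNP lists described above fit together, the following Python sketch reads (index_1, index_2, entry) rows into a symmetric lookup and groups SNP identifiers by their shared index. It is not taken from the dataset's own tooling. The comma delimiter, the presence of a header row in the SNP list, and every value in the sample text are assumptions made for illustration only; the function names `read_edgelist` and `read_snplist` are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_edgelist(handle, delimiter=","):
    """Read (index_1, index_2, entry) rows into a symmetric dict-of-dicts.

    Assumption: rows are delimiter-separated and have no header line.
    """
    matrix = defaultdict(dict)
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        i, j, value = int(row[0]), int(row[1]), float(row[2])
        matrix[i][j] = value
        matrix[j][i] = value  # store both directions so lookups are symmetric
    return dict(matrix)


def read_snplist(handle, delimiter=","):
    """Group SNP identifiers by LDGM index (several SNPs may share one index).

    Assumption: the file starts with a header naming the 11 columns.
    """
    snps_by_index = defaultdict(list)
    for record in csv.DictReader(handle, delimiter=delimiter):
        snps_by_index[int(record["index"])].append(record["site_ids"])
    return dict(snps_by_index)


# Made-up sample content, NOT real data from the archives.
EDGELIST_TEXT = "0,0,1.5\n0,1,-0.4\n1,1,2.0\n"
SNPLIST_TEXT = (
    "index,anc_alleles,deriv_alleles,EUR,EAS,AMR,SAS,AFR,site_ids,position,swap\n"
    "0,A,G,0.10,0.20,0.15,0.12,0.30,rs0000001,1000,0\n"
    "0,C,T,0.10,0.20,0.15,0.12,0.30,rs0000002,1010,0\n"
    "1,G,A,0.40,0.35,0.38,0.41,0.25,rs0000003,1200,0\n"
)

matrix = read_edgelist(io.StringIO(EDGELIST_TEXT))
snps = read_snplist(io.StringIO(SNPLIST_TEXT))

# Entry linking index 0 and index 1, and the SNPs that share index 0.
print(matrix[1][0], snps[0])
```

With real files, the same functions would be called on open file handles instead of `io.StringIO` objects. A dict-of-dicts is used here only to keep the sketch dependency-free; a sparse-matrix library would likely be a better fit for full LD blocks.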
Summary statistics: there are four summary statistics files, obtained from https://alkesgroup.broadinstitute.org/UKBB/, and computed by Loh et al. 2018 Nat Genet.

| Phenotype | Heritability estimate | Effective sample size | Number of SNPs |
| Height | 0.570 | 650K | 12 Million |
| Body mass index | 0.303 | 500K | 12 Million |
| Cardiovascular disease | 0.155 | 450K | 12 Million |
| Type 2 diabetes | 0.073 | 450K | 12 Million |
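For readers who prefer to unpack the `.tar.gz` and `.tar` archives from a script rather than with the `tar` command, a minimal Python sketch using the standard-library `tarfile` module might look like the following. The helper name `extract_archive` and the stand-in archive built in the demonstration are hypothetical and contain no real data from this dataset.

```python
import tarfile
import tempfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar or .tar.gz archive and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression (plain tar or gzip) itself.
    with tarfile.open(archive_path, "r:*") as archive:
        names = archive.getnames()
        # Use the safer "data" extraction filter where this Python supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        archive.extractall(dest, **kwargs)
    return names


# Self-contained demonstration with a tiny stand-in archive (not real data).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    sample = tmp / "example.edgelist"
    sample.write_text("0,1,2\n")
    archive_path = tmp / "example.tar.gz"
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(sample, arcname="example.edgelist")
    extracted = extract_archive(archive_path, tmp / "out")
    extracted_text = (tmp / "out" / "example.edgelist").read_text()
```

The same helper should handle the uncompressed `ukb.tar` and `tree_seqs.tar` archives, since the open mode does not assume gzip compression.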






