Re-identification of Individuals in Genomic Datasets Using Public Face Images

DOI 10.5281/zenodo.5522953 · Zenodo · MaRDI QID Q6710390 · FDO

Authors Bradley A. Malin, Rajagopal Venkatesaramani, Yevgeniy Vorobeychik

Publication date 22 September 2021

Copyright license Creative Commons Attribution 4.0 International

The image-genome pairs in these synthetic datasets were created by combining a subset of the publicly available face image dataset CelebA with genotypes from OpenSNP. The genome in a given pair does not belong to the individual in the image (taken from CelebA); it comes instead from an OpenSNP individual with the same set of phenotypes. An artificial genotype was created for each image using all available OpenSNP data for which self-reported phenotypes are present. Here, genotype refers only to the small subset of SNPs we are interested in.

In the Synthetic-Ideal dataset, each image was assigned a genotype from OpenSNP that corresponds to an individual with the same phenotypes, chosen so that the probability of the selected phenotypes given the genotype is maximized. In other words, we picked the genotype from the OpenSNP data that is most representative of an individual with a given set of phenotypes. In the Synthetic-Realistic dataset, each image was assigned a genotype from OpenSNP that corresponds to an individual with the same phenotypes, but drawn at random according to the empirical distribution of phenotypes for particular SNPs in our data.

Since CelebA does not have labels for all considered phenotypes, 1000 images from this dataset were manually labeled by one of the authors. After cleaning and removing ambiguous cases, the resulting datasets consist of 456 records.
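The two assignment schemes described above can be illustrated with a minimal Python sketch. It is not the authors' code. The donor records, the SNP identifiers (`rs_eye`, `rs_hair`), the one-SNP-per-trait mapping, and all function names are hypothetical. The sketch also assumes that the probability of a phenotype set given a genotype factorizes over SNPs, and it reads the Synthetic-Realistic scheme as sampling a matching donor with probability proportional to that same empirical score; the description does not spell out either detail.

```python
import math
import random
from collections import Counter

# Hypothetical mapping: each (made-up) SNP id is tied to one phenotype.
SNP_TO_TRAIT = {"rs_eye": "eye_color", "rs_hair": "hair_color"}

# Hypothetical toy donors standing in for OpenSNP records.
DONORS = [
    {"genotype": {"rs_eye": "GG", "rs_hair": "CC"},
     "phenotypes": {"eye_color": "blue", "hair_color": "blond"}},
    {"genotype": {"rs_eye": "GG", "rs_hair": "CT"},
     "phenotypes": {"eye_color": "blue", "hair_color": "blond"}},
    {"genotype": {"rs_eye": "AG", "rs_hair": "CC"},
     "phenotypes": {"eye_color": "blue", "hair_color": "blond"}},
    {"genotype": {"rs_eye": "AG", "rs_hair": "CT"},
     "phenotypes": {"eye_color": "brown", "hair_color": "brown"}},
    {"genotype": {"rs_eye": "AA", "rs_hair": "TT"},
     "phenotypes": {"eye_color": "brown", "hair_color": "brown"}},
    {"genotype": {"rs_eye": "AG", "rs_hair": "CT"},
     "phenotypes": {"eye_color": "brown", "hair_color": "blond"}},
]


def conditional_probs(donors):
    """Empirical P(trait value | genotype at SNP), keyed by (snp, genotype, value)."""
    pair_counts, genotype_counts = Counter(), Counter()
    for donor in donors:
        for snp, trait in SNP_TO_TRAIT.items():
            g = donor["genotype"][snp]
            pair_counts[(snp, g, donor["phenotypes"][trait])] += 1
            genotype_counts[(snp, g)] += 1
    return {key: n / genotype_counts[key[:2]] for key, n in pair_counts.items()}


def score(donor, probs):
    """P(donor's phenotypes | donor's genotype), assuming independence across SNPs."""
    return math.prod(
        probs[(snp, donor["genotype"][snp], donor["phenotypes"][trait])]
        for snp, trait in SNP_TO_TRAIT.items()
    )


def candidates(phenotypes, donors):
    """Donors whose self-reported phenotypes equal the image's labels."""
    return [d for d in donors if d["phenotypes"] == phenotypes]


def assign_ideal(phenotypes, donors, probs):
    """Pick the matching donor whose genotype makes the phenotypes most probable."""
    pool = candidates(phenotypes, donors)
    if not pool:
        raise ValueError("no donor shares this phenotype set")
    return max(pool, key=lambda d: score(d, probs))["genotype"]


def assign_realistic(phenotypes, donors, probs, rng):
    """Sample a matching donor, weighted by the empirical score (an assumption)."""
    pool = candidates(phenotypes, donors)
    if not pool:
        raise ValueError("no donor shares this phenotype set")
    weights = [score(d, probs) for d in pool]
    return rng.choices(pool, weights=weights, k=1)[0]["genotype"]


probs = conditional_probs(DONORS)
image_labels = {"eye_color": "blue", "hair_color": "blond"}  # labels for one image
ideal_genotype = assign_ideal(image_labels, DONORS, probs)
realistic_genotype = assign_realistic(image_labels, DONORS, probs, random.Random(0))
```

In this toy setup the ideal assignment is deterministic, while the realistic assignment can return any donor genotype that shares the image's phenotype labels, which is the sense in which the second dataset is noisier than the first.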
