Deciphering polymorphism in 61,157 Escherichia coli genomes via epistatic sequence landscapes




DOI: 10.5281/zenodo.5774192; Zenodo: 5774192; MaRDI QID: Q6708207; FDO: Q6708207

Dataset published in the Zenodo repository.

Olivier Tenaillon, M. Weigt, Marie Petitjean, Etienne Ruppé, Lucile Vigué, Giancarlo Croce

Publication date: 11 December 2021

Copyright license: Creative Commons Attribution 4.0 International



We use computational models based on Direct Coupling Analysis (DCA), trained on Pfam domains of distant homologues, to accurately predict the polymorphisms segregating in a panel of 61,157 Escherichia coli genomes. We show that the genetic context (i.e. the rest of the protein sequence) strongly constrains the tolerable amino acids in 30% to 50% of amino-acid sites. Our study also suggests that genetic context builds up gradually over long evolutionary timescales through the accumulation of small epistatic contributions. Please refer to the README file for additional information on the structure of this dataset. Code to analyse this dataset is available at https://github.com/GiancarloCroce/DCA_polymorphism_Ecoli.
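
To illustrate how a DCA-type model can make the effect of a substitution depend on the rest of the sequence, the following is a minimal sketch of a generic Potts model with per-site fields h and pairwise couplings J, using the convention that lower energy means a more favourable sequence. It is not the code from the repository above: the function names, the toy alphabet size, and the randomly drawn parameters are all assumptions made for illustration, and no values from the dataset are used.

```python
import random


def potts_energy(seq, h, J):
    """E(s) = -sum_i h[i][s_i] - sum_{i<j} J[i][j][s_i][s_j] (lower is more favourable)."""
    n = len(seq)
    energy = -sum(h[i][seq[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            energy -= J[i][j][seq[i]][seq[j]]
    return energy


def substitution_delta_e(seq, pos, new_state, h, J):
    """Energy change E(mutant) - E(seq) for one substitution, given the context seq."""
    old_state = seq[pos]
    # Context-independent part: the field at the mutated site.
    delta = h[pos][old_state] - h[pos][new_state]
    # Context-dependent (epistatic) part: couplings to every other site.
    for j, other in enumerate(seq):
        if j == pos:
            continue
        if pos < j:
            block = J[pos][j]
            delta += block[old_state][other] - block[new_state][other]
        else:
            block = J[j][pos]
            delta += block[other][old_state] - block[other][new_state]
    return delta


def random_toy_model(n, q, rng):
    """Hypothetical helper: random fields and couplings, only J[i][j] with i < j is filled."""
    h = [[rng.gauss(0, 1) for _ in range(q)] for _ in range(n)]
    J = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            J[i][j] = [[rng.gauss(0, 0.3) for _ in range(q)] for _ in range(q)]
    return h, J


if __name__ == "__main__":
    rng = random.Random(0)
    h, J = random_toy_model(n=4, q=3, rng=rng)  # toy sizes, not real protein dimensions
    # The same substitution (site 2 -> state 0) scored in two different contexts.
    print(substitution_delta_e([0, 1, 2, 1], 2, 0, h, J))
    print(substitution_delta_e([2, 0, 2, 2], 2, 0, h, J))
```

In this sketch the two printed values generally differ because the coupling terms depend on the residues at the other sites. If all couplings were zero, the score of a substitution would be the same in every context.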






