Undinarchaeota illuminate DPANN phylogeny and the impact of gene transfer on archaeal evolution
DOI: 10.5281/zenodo.3839790 | Zenodo: 3839790 | MaRDI QID: Q6710694 | FDO: Q6710694
Dataset published at Zenodo repository.
Tom A. Williams, Bui Quang Minh, Chris Rinke, Sun Jiarui, Anja Spang, Jun-Hoe Lee, Nina Dombrowski, Benjamin Woodcroft
Publication date: 18 February 2020
Copyright license: Creative Commons Attribution 4.0 International
General Description

Repository with all analyses described in our paper: Undinarchaeota illuminate DPANN phylogeny and the impact of gene transfer on archaeal evolution. If you find this work useful for your own analyses, please cite this work.

Abstract

The evolution and diversification of Archaea is central to the history of life on Earth. Cultivation-independent approaches have revealed the existence of the DPANN archaea: a radiation of organisms with small cell and genome sizes. Currently, the placement of the various DPANN lineages and in turn the early evolution of metabolism and symbiosis are debated. Here, we reconstructed genomes of a thus far uncharacterized archaeal phylum-level lineage UAP2 (Candidatus Undinarchaeota). Comparative genomics revealed that members of the Undinarchaeota have small estimated genome sizes and, while potentially being able to conserve energy through fermentation, likely depend on partner organisms for the acquisition of vitamins, amino acids and other metabolites. In contrast to previous indications, our phylogenomic analyses robustly placed the Undinarchaeota as an independent lineage between two major and highly supported clans of DPANN. Furthermore, our work suggests that DPANN archaea have exchanged core genes with their hosts by horizontal gene transfer (HGT), adding to the difficulty of placing DPANN in the tree of life (ToL). In several cases, this pattern is sufficiently dominant that known symbiont-host clades can be identified by inferring routes of HGT across the ToL. Together, our findings provide crucial insights into the origins and evolution of DPANN archaea and their hosts.
The annotation workflow for archaeal/bacterial genomes that was used for this paper is also available on GitHub, and an updated version that includes the COG search is available at: https://github.com/ndombrowski/Annotation_workflow

Repository Contents

1_Genome_files.tar.gz includes all Undinarchaeota (original name UAP2) metagenome-assembled genomes (MAGs). This includes:
- The original contigs for each UAP2 MAG (fna files)
- The Prokka output for each UAP2 MAG (faa files)
- A concatenated file of all proteins from each UAP2 MAG and all archaeal reference genomes (364 genomes in total)
This folder also includes a list of the archaeal genomes investigated.

2_Phylogenies.tar.gz includes all files for the phylogenetic analyses. This includes the following folders:

1. Files for the concatenated species trees for different taxa sets. These files are related to the following parts of the manuscript: Supplementary Table 6, Figure 1 and Supplementary Figures S8-S58. The folder includes the following:
   - Folder 1_unaligned_sequences includes the individual protein sequences extracted from the different taxa sets.
   - Folder 2_alignments includes the alignment files generated by MAFFT.
   - Folder 3_alignments_trimmed includes the alignments trimmed with BMGE.
   - Folder 4_phylogenies includes the IQ-TREE output for all phylogenies as well as a color-annotation file for FigTree. Additionally, files rooted using minimal ancestor deviation (MAD) rooting (*.rooted) are provided. Note that for the final figures the *treefile_renamed files (i.e. the IQ-TREE files with the full taxa string) were artificially rooted using the DPANN archaea. The numbering corresponds to Supplementary Table S6 of the main manuscript.
   - Folder 5_pdfs includes the PDFs for each tree.

2. Files for single gene trees, which include:
   - The folder 1_arcogs includes the unaligned proteins, alignments, trimmed alignments, trees and PDFs for the single gene trees based on the arCOG identifiers. The arCOGs were extracted from 12 UAP2 MAGs + 352 archaeal + 3020 bacterial + 100 eukaryotic genomes. ArCOGs were only considered if they occurred in at least 3 UAP2 genomes. Note that these files were used to investigate UAP2 for HGT events and correspond to the following parts of the manuscript: Figure 4 and Supplementary Tables 4, 5, 20-22. Additionally, the folder 0_parsing includes some information on how to generate count tables for each marker gene.
   - The folder 151_markers includes the proteins, alignments, trimmed alignments, trees and PDFs for evaluating the 151 marker set used for the concatenated species tree. Files are provided for the 127 and 364 taxa sets. These files were used as a basis for the concatenated species trees that were used to generate Supplementary Figures S8-S58. Additionally, the trees were used for ranking marker proteins and generating Supplementary Tables 4-5. For the 364 taxa set, the folder also includes a subfolder 0_parsing that provides scripts to investigate some statistics for each marker protein, including the average protein length, average alignment length and average bootstrap support.
   - The folder 3_other_individual_trees includes the proteins, alignments and phylogenies for the 16S_23S, RubisCO and primase analyses. The data were used to generate the following parts of the manuscript: Supplementary Table 11, Supplementary Figures 3-5, 57 and 59.

3_Scripts.tar.gz includes the scripts and workflows used for the annotations and phylogenetic analyses. This includes the following folders:

1. The files for the main workflow for the annotations and phylogenies. This folder includes the workflow to generate annotations for archaeal genomes as well as an example script that was used to generate phylogenies. These analyses were typically run on an in-house bioinformatics cluster with 4x Xeon Gold 6140 2.3 GHz processors using Bash, Python and Perl. The system runs Linux (Red Hat Enterprise 7.5).

2. A folder providing any required dependencies, which include:
   - Any Python or Perl scripts that were used during this study and/or that are mentioned in the methods section.
   - Databases used for the annotations, especially if these were slightly modified. Note that changes typically include parsing of the mapping files or modifications of the sequence headers for easier parsing.
   - Mapping files needed to link the genome accession IDs to the taxonomy string, as well as lists of protein IDs used for different phylogenies (i.e. 14 + 48 arCOGs used for protein phylogenies).

3. R scripts (including all needed input files) used to:
   - Generate tables and figures for the annotations, i.e. Figures 2 and 3, Supplementary Tables 7, 8, 9, 12, 13-15 and Supplementary Figures 60, 62-64. The input folder includes the raw output from the annotation workflow and includes annotations for the 12 UAP2 MAGs as well as 352 archaeal reference genomes.
   - Generate tables and figures for the HGT analyses, i.e. Figure 4 and Supplementary Tables S20-22. Here, proteins based on arCOGs were extracted from 364 archaeal, 3020 bacterial and 98 eukaryotic genomes and used to generate single protein phylogenies. The resulting trees were used to investigate horizontal gene transfer events, and the necessary scripts are provided in this folder.
   - Generate tables and figures for the amino acid identity (AAI) comparisons, i.e. Supplementary Table S3 and Supplementary Figure S2.
   - Rank the marker genes for concatenated species trees for the 127 and 364 taxa sets. These were used to generate Supplementary Tables S4 and S5.

General comment: In contrast to the previous version, this dataset includes some small additional scripts generated during the revision process of the corresponding manuscript.
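The description of the 1_arcogs folder states that arCOGs were only considered if they occurred in at least 3 UAP2 genomes, and that count tables were generated per marker gene. The following is a minimal Python sketch of that kind of filter, not the scripts shipped in 0_parsing. The function names, the input format (pairs of arCOG ID and genome ID) and the example identifiers are all assumptions made for illustration.

```python
from collections import defaultdict


def build_count_table(hits):
    """Count protein hits per arCOG and genome.

    `hits` is an iterable of (arcog_id, genome_id) pairs; this input format
    is an assumption, not the format used in the repository.
    """
    table = defaultdict(lambda: defaultdict(int))
    for arcog_id, genome_id in hits:
        table[arcog_id][genome_id] += 1
    return table


def filter_by_uap2_presence(table, uap2_genomes, min_genomes=3):
    """Keep arCOGs present in at least `min_genomes` UAP2 genomes.

    Returns a dict mapping each retained arCOG to the number of UAP2
    genomes in which it occurs at least once.
    """
    kept = {}
    for arcog_id, counts in table.items():
        n_present = sum(1 for genome in uap2_genomes if counts.get(genome, 0) > 0)
        if n_present >= min_genomes:
            kept[arcog_id] = n_present
    return kept


# Hypothetical example data: identifiers are made up for illustration.
example_hits = [
    ("arCOG00001", "UAP2_a"),
    ("arCOG00001", "UAP2_a"),  # a second copy in the same genome
    ("arCOG00001", "UAP2_b"),
    ("arCOG00001", "UAP2_c"),
    ("arCOG00002", "UAP2_a"),
    ("arCOG00002", "UAP2_b"),
    ("arCOG00002", "Ref_1"),   # a reference genome, not counted as UAP2
]
example_uap2 = {"UAP2_a", "UAP2_b", "UAP2_c"}

example_table = build_count_table(example_hits)
example_kept = filter_by_uap2_presence(example_table, example_uap2)
print(example_kept)
```

In this sketch, multiple copies within one genome raise the count in the table but count only once toward the presence threshold, which is one reasonable reading of "occurred in at least 3 UAP2 genomes".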