Protein language model embeddings and predictions for the fly proteome (FlyBase)

DOI: 10.5281/zenodo.6322184, Zenodo: 6322184, MaRDI QID: Q6718863, FDO: Q6718863

Dataset published at Zenodo repository.

Céline Marquet, Christian Dallago, Burkhard Rost

Publication date: 2 March 2022

Residue and sequence embeddings of the fly (Drosophila melanogaster) proteome (FlyBase for organism Drosophila melanogaster, downloaded on 2022.03.01), computed using bio_embeddings (bioembeddings.com) with the ProtT5 embedder at full precision (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3). To open the embeddings file, please see this notebook. The embeddings are indexed by numbers according to the mapping file (mapping_file.csv) in this dataset. All following results share the same mapping (for instance, accessing index 0 of the variation prediction results returns the results for the sequence FBpp0304622).

Additionally:
- Sequence-level predictions of subcellular localization in 10 classes using LA (https://www.biorxiv.org/content/10.1101/2021.04.25.441334v1)
- Residue-level three-state secondary structure prediction (alpha, sheet or other) using models reported in the ProtTrans paper (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3)
- Residue-level prediction of conservation (in 9 states) and of variation effect (from 0 [no effect] to 1 [effect]) using VESPAl (https://doi.org/10.1007/s00439-021-02411-y)

Files included:
- dmel-all-translation-r6.44.fasta -- FASTA-formatted sequences of Drosophila melanogaster from FlyBase.
- mapping_file.csv -- A CSV file mapping the identifiers used in the following files (from 0 to 30737) to the identifiers in the FlyBase FASTA file (dmel-all-translation-r6.44.fasta).
- DSSP3_fly_ProtT5Sec.fasta -- Secondary structure predictions in three states for each residue of each protein in dmel-all-translation-r6.44.fasta. H stands for Helix; E stands for Sheet; C stands for Other.
- subcell_fly_LA_ProtT5.csv -- Subcellular location (10 states) and membrane-boundness (2 states) for each protein in dmel-all-translation-r6.44.fasta.
- embeddings_file.h5 -- Per-residue embeddings of sequences in dmel-all-translation-r6.44.fasta. Each dataset in the .h5 file represents a protein sequence and contains a matrix of shape L x 1024, with L being the length of the protein sequence. Datasets are indexed using integers. The original sequence identifier (from the FASTA header) can be accessed through the original_id attribute. See https://docs.bioembeddings.com/v0.2.0/notebooks/open_embedding_file.html for information on how to open the file.
- reduced_embeddings_file.h5 -- Per-sequence embeddings of sequences in dmel-all-translation-r6.44.fasta (obtained by mean-pooling the residue embeddings along the length dimension of the protein sequence). Each dataset in the .h5 file represents a protein sequence and contains a vector of size 1024 (meaning that every sequence has the same dimension).
- conspred_probs.h5 -- Per-residue conservation class probabilities (softmax, 9 classes) for the sequences in dmel-all-translation-r6.44.fasta. Each dataset in the .h5 file represents a protein sequence and contains a matrix of shape 9 x L, with L being the length of the protein sequence and 9 being the number of conservation classes (index 0 = very variable; index 8 = very conserved).
- vespal_SAVeffect_fly.zip -- Zipped .h5 file of variation effect predictions for the sequences in dmel-all-translation-r6.44.fasta on a scale from 0 (neutral) to 1 (effect). -1 indicates WT substitution. Each dataset in the .h5 file represents a protein sequence and contains a matrix of shape 20 x L, with L being the length of the protein sequence and 20 being the number of possible residue substitutions (AAs in the following order: ALGVSREDTIPKFQNYMHWC, meaning that index 0 = substitution of the residue to A, index 1 = substitution to L, and so on).
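
The row ordering described for the prediction matrices can be illustrated with a small lookup sketch in Python. This is a minimal illustration and not part of the dataset. The function names (substitution_score, wild_type_residue, conservation_class) and the toy matrices are hypothetical. The sketch assumes a matrix has already been loaded into memory as a 20 x L or 9 x L array-like, for example from the .h5 files. It also assumes that -1 marks the wild-type residue at a position, which is how the description reads.

```python
# Order of substitution targets along the 20-row axis, as given in the description.
AA_ORDER = "ALGVSREDTIPKFQNYMHWC"


def substitution_score(matrix, position, target_aa):
    """Return the predicted effect of mutating `position` (0-based) to `target_aa`.

    `matrix` is any 20 x L array-like indexed as matrix[row][column].
    """
    row = AA_ORDER.index(target_aa)  # raises ValueError for an unknown letter
    return matrix[row][position]


def wild_type_residue(matrix, position):
    """Return the residue whose row holds -1 at `position`, or None if absent.

    Assumption: -1 marks the wild-type residue, as the description suggests.
    """
    for row, aa in enumerate(AA_ORDER):
        if matrix[row][position] == -1:
            return aa
    return None


def conservation_class(probs, position):
    """Return the most probable class (0 = very variable, 8 = very conserved).

    `probs` is any 9 x L array-like of class probabilities.
    """
    column = [probs[cls][position] for cls in range(9)]
    return max(range(9), key=column.__getitem__)


# Toy inputs with made-up values (NOT taken from the dataset), L = 2 and L = 1.
toy_effects = [[0.5, 0.25] for _ in AA_ORDER]
toy_effects[AA_ORDER.index("A")][0] = -1   # pretend position 0 is A in the wild type
toy_effects[AA_ORDER.index("W")][1] = 0.9  # pretend mutating position 1 to W is severe

toy_probs = [[0.1] for _ in range(9)]
toy_probs[8][0] = 0.2                      # class 8 is the most probable here
```

With these toy values, substitution_score(toy_effects, 1, "W") returns 0.9, wild_type_residue(toy_effects, 0) returns "A", and conservation_class(toy_probs, 0) returns 8. Positions are 0-based here; a 1-based residue number from a variant annotation would need to be shifted by one before the lookup.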
