Protein language model embeddings and predictions of the human proteome

DOI: 10.5281/zenodo.5047020 | Zenodo: 5047020 | MaRDI QID: Q6718845 | FDO: Q6718845

Dataset published at Zenodo repository.

Publication date: 30 June 2021

Residue and sequence embeddings of the human proteome (SwissProt for organism Human, downloaded on 2021.06.09), computed using bio_embeddings (bioembeddings.com) with the ProtT5 embedder at full precision (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3).

Additionally:
- Sequence-level predictions of subcellular localization in 10 classes using LA (https://www.biorxiv.org/content/10.1101/2021.04.25.441334v1)
- Residue-level three-state secondary structure prediction (helix, sheet or other) using models reported in the ProtTrans paper (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3)

Files included:
- human.fasta -- FASTA-formatted human sequences from SwissProt
- DSSP3_human_ProtT5Sec.fasta -- Secondary structure predictions in three states for each residue of each protein in human.fasta. H stands for Helix; E stands for Sheet; C stands for Other.
- subcell_human_LA_ProtT5.csv -- Subcellular location (10 states) and membrane-boundness (2 states) for each protein in human.fasta
- embeddings_file.h5 -- per-residue embeddings of sequences in human.fasta. Each dataset in the .h5 file represents a protein sequence and contains a matrix of shape L x 1024, with L being the length of the protein sequence. Datasets are indexed using integers. The original sequence identifier (from the FASTA header) can be accessed through the original_id attribute. See https://docs.bioembeddings.com/v0.2.0/notebooks/open_embedding_file.html for information on how to open the file.
- reduced_embeddings_file.h5 -- per-sequence embeddings of sequences in human.fasta (obtained by mean-pooling the residue embeddings along the length dimension of the protein sequence). Each dataset in the .h5 file represents a protein sequence and contains a vector of size 1024 (that is, every sequence is represented by a vector of the same dimension, regardless of its length).
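Since the secondary structure file holds one H/E/C character per residue, a quick sanity check after download is to confirm that each prediction string is as long as its sequence and to summarize the state composition. The following is a minimal sketch using only the Python standard library. The helper names (read_fasta, state_fractions, check_lengths) are hypothetical and not part of bio_embeddings, and the sketch assumes, without confirmation from the dataset description, that DSSP3_human_ProtT5Sec.fasta uses the same header lines as human.fasta.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines.

    The header is returned without the leading '>'; sequence lines that
    belong to one record are concatenated.
    """
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def state_fractions(states: str) -> Dict[str, float]:
    """Return the fraction of residues predicted as H, E and C."""
    counts = Counter(states)
    total = len(states)
    return {s: (counts[s] / total if total else 0.0) for s in "HEC"}


def check_lengths(seq_lines: Iterable[str], ss_lines: Iterable[str]) -> List[str]:
    """Return headers whose prediction length differs from the sequence length.

    Assumption: both files use identical header lines, so records can be
    matched by header. Headers missing from the sequence file are ignored.
    """
    sequences = dict(read_fasta(seq_lines))
    mismatched = []
    for header, states in read_fasta(ss_lines):
        if header in sequences and len(sequences[header]) != len(states):
            mismatched.append(header)
    return mismatched
```

To apply it to the downloaded files, open human.fasta and DSSP3_human_ProtT5Sec.fasta and pass the two file objects to check_lengths; an empty result means every matched record has one predicted state per residue. Passing file objects keeps the helpers independent of where the files are stored.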
