High-quality large curated dataset of protein sequences (1.83 million) and their corresponding Position Specific Scoring Matrices

DOI: 10.5281/zenodo.4300971 | Zenodo: 4300971 | MaRDI QID: Q6718838 | FDO: Q6718838

Dataset published at the Zenodo repository.

Michael Heinzinger, Violetta Cavalli-Sforza, Issar Arab, Burkhard Rost

Publication date: 15 September 2020

Copyright license: Creative Commons Attribution 4.0 International

As part of his master's thesis at the Rostlab, which is located at the Technical University of Munich (TUM), Mr. Issar Arab developed the first language model that encodes evolutionary information of proteins explicitly. The pre-training involved the creation of a novel high-quality dataset of protein sequences (around 1.83 million proteins, or ~0.8 billion amino acids) with their corresponding Position Specific Scoring Matrices (PSSMs). Those matrices reflect the relative frequency of each amino acid at each position in a protein and are derived from evolutionarily related proteins. Mr. Arab makes this work publicly available to help other researchers speed up their work on leveraging AI to learn the representation of protein evolutionary information more explicitly.

The set of sequences was derived by extracting all PSSMs from the PredictProtein (PP) cache that were also part of the UniProt Reference Cluster with 50% sequence identity (UniRef50 2019_12). The overlap between PP and UniRef50 was further filtered to include only high-quality samples, e.g. only multiple sequence alignments with a certain number of aligned sequences were considered. The processing led to a training set of 1.83 million sequences, a validation set of 879 instances, and a test set of 879 entries. The training data is reduced to 40% sequence identity with respect to the validation/test sets and contains sequences ranging between 18 and 9858 residues in length.

Refer to the Jupyter notebook for a detailed description of the files' structure and a Python code snippet to correctly manipulate this data. To access the full original work, please visit the following link: Manuscript

Note: The dataset was recently used to fine-tune a protein sequence language model (PEvoLM). The work was presented at the CIBCB'23 conference.

If you use PEvoLM or this dataset in your work, please cite the following publication:
- Issar Arab, PEvoLM: Protein Sequence Evolutionary Information Language Model, IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Eindhoven, Netherlands, (2023), pp. 1-8, doi:10.1109/CIBCB56990.2023.10264890
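
To make concrete what a PSSM encodes, the following is a minimal sketch of how per-position relative frequencies and log-odds scores can be derived from a multiple sequence alignment. It is not the pipeline used to build this dataset, and it does not read the dataset's files (see the Jupyter notebook for that). The toy alignment, the function names `column_frequencies` and `log_odds_pssm`, the pseudocount of 1, and the uniform background distribution are all assumptions chosen for illustration; profile tools typically apply sequence weighting and empirical background frequencies instead.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def column_frequencies(msa, pseudocount=1.0):
    """Return, for each alignment column, the relative frequency of each residue.

    msa: list of aligned sequences of equal length; gap characters ('-')
    and non-standard residues are ignored. The pseudocount (an assumption
    of this sketch) keeps every frequency strictly positive.
    """
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    freqs = []
    for pos in range(length):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in msa:
            residue = seq[pos]
            if residue in counts:
                counts[residue] += 1
        total = sum(counts.values())
        freqs.append({aa: counts[aa] / total for aa in AMINO_ACIDS})
    return freqs


def log_odds_pssm(freqs, background=None):
    """Convert column frequencies into log2-odds scores (rows: positions, columns: residues)."""
    if background is None:
        # Assumption: uniform background; real tools use empirical frequencies.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    return [
        [math.log2(col[aa] / background[aa]) for aa in AMINO_ACIDS]
        for col in freqs
    ]


# Hypothetical toy alignment of four evolutionarily related sequences.
toy_msa = ["MKV", "MRV", "MKI", "M-V"]
freqs = column_frequencies(toy_msa)
pssm = log_odds_pssm(freqs)
```

Each row of `pssm` corresponds to one residue position and each column to one of the 20 amino acids, so a protein of length L yields an L x 20 matrix. Positive scores mark residues that occur more often at that position than the background would suggest; negative scores mark residues that occur less often.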
