Data for 'VespaG: Expert-guided protein language models enable accurate and blazingly fast fitness prediction'

DOI: 10.5281/zenodo.11085958 (Zenodo)
MaRDI QID: Q6718908

Publication date 29 April 2024

Datasets used for the development of VespaG, and VespaG predictions generated with https://github.com/JSchlensok/VespaG.

Uploads contain:

- Performance summaries for ProteinGym [1]:
  - Spearman and Pearson correlation for VespaG: proteingym_performance_vespag.csv (columns: 'DMS_id', 'Spearman', 'Pearson')
  - Spearman correlation for the evaluated methods VespaG, GEMME [2], VESPA [3], TranceptEVE [4], AlphaMissense [5], and PoET [6]: proteingym_spearman_allmethods.csv (columns: 'DMS_id', 'Trancept EVE-L', 'VESPA', 'VespaG', 'GEMME', 'AlphaMissense', 'PoET', 'UniProt_ID', 'coarse_selection_type' (function), 'taxon')
- FASTA files with sequences for all training sets (vespag_fasta_training_datasets.zip with seq_all9k.fasta, seq_human5k.fasta, seq_droso4k.fasta, seq_ecoli2k.fasta, seq_virus1k.fasta) and for the test set (proteingym_217.fasta)
- VespaG predictions for the test set: vespag_proteingym_rawpreds_by_training_dataset.zip with raw_preds_ecoli.csv, raw_preds_human.csv, raw_preds_virus.csv, raw_preds_all.csv, raw_preds_droso.csv (columns: 'DMS_id', 'mutation', 'DMS_score', 'VespaG'). Each file holds predictions from a model trained on a different training set. The final VespaG model was trained on a subset of the human proteome, so the raw VespaG predictions for the ProteinGym benchmark are in raw_preds_human.csv (used to calculate the performances above).
- GEMME predictions for the training sets: vespag_proteingym_rawpreds_by_training_dataset.zip with folders 'human', 'droso', 'ecoli', 'virus', 'all' for the respective FASTA file (each containing GEMME mutational landscape output files named 'ID' + '_normPred_evolCombi.txt')
- ESM-2 embeddings [7] for the test set (proteingym_217_esm2.h5)

For details on VespaG see:
VespaG: Expert-guided protein Language Models enable accurate and blazingly fast fitness prediction. Céline Marquet, Julius Schlensok, Marina Abakarova, Burkhard Rost, Elodie Laine. bioRxiv 2024.04.24.590982; doi: https://doi.org/10.1101/2024.04.24.590982

For more information on data usage and generation please see https://github.com/JSchlensok/VespaG.

Abstract: Exhaustive experimental annotation of the effect of all known protein variants remains daunting and expensive, stressing the need for scalable effect predictions. We introduce VespaG, a blazingly fast single amino acid variant effect predictor, leveraging embeddings of protein Language Models as input to a minimal deep learning model. To overcome the sparsity of experimental training data, we created a dataset of 39 million single amino acid variants from the human proteome applying the multiple sequence alignment-based effect predictor GEMME as a pseudo standard-of-truth. Assessed against the ProteinGym Substitution Benchmark (217 multiplex assays of variant effect with 2.5 million variants), VespaG achieved a mean Spearman correlation of 0.48 +/- 0.01, matching state-of-the-art methods such as GEMME, TranceptEVE, PoET, AlphaMissense, and VESPA. VespaG reached its top-level performance several orders of magnitude faster, predicting all mutational landscapes of the human proteome in 30 minutes on a consumer laptop (12-core CPU, 16 GB RAM).

[1] Notin, Pascal, et al. "ProteinGym: large-scale benchmarks for protein fitness prediction and design." Advances in Neural Information Processing Systems 36 (2024).
[2] Laine, Elodie, Yasaman Karami, and Alessandra Carbone. "GEMME: a simple and fast global epistatic model predicting mutational effects." Molecular biology and evolution 36.11 (2019): 2604-2619.
[3] Marquet, Céline, et al. "Embeddings from protein language models predict conservation and variant effects." Human genetics 141.10 (2022): 1629-1647.
[4] Notin, Pascal, et al. "TranceptEVE: Combining family-specific and family-agnostic models of protein sequences for improved fitness prediction." bioRxiv (2022): 2022-12.
[5] Cheng, Jun, et al. "Accurate proteome-wide missense variant effect prediction with AlphaMissense." Science 381.6664 (2023): eadg7492.
[6] Truong Jr, Timothy, and Tristan Bepler. "PoET: A generative model of protein families as sequences-of-sequences." Advances in Neural Information Processing Systems 36 (2024).
[7] Lin, Zeming, et al. "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science 379.6637 (2023): 1123-1130.

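The description above states that the per-assay performance values were calculated from raw_preds_human.csv, whose columns are 'DMS_id', 'mutation', 'DMS_score', and 'VespaG'. The following is a minimal sketch of how such a per-assay Spearman correlation could be recomputed with the Python standard library alone. It is not the authors' evaluation code. The toy rows, the assay names, the mutation strings, and the helper names (`average_ranks`, `pearson`, `spearman`, `spearman_per_assay`) are all assumptions made for illustration, and any filtering or aggregation applied in the paper is not reproduced here.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_per_assay(csv_file):
    """Group rows by 'DMS_id' and correlate 'DMS_score' with 'VespaG'."""
    scores = defaultdict(lambda: ([], []))
    for row in csv.DictReader(csv_file):
        measured, predicted = scores[row["DMS_id"]]
        measured.append(float(row["DMS_score"]))
        predicted.append(float(row["VespaG"]))
    return {dms_id: spearman(m, p) for dms_id, (m, p) in scores.items()}


# Toy rows standing in for raw_preds_human.csv (all values are made up).
toy = io.StringIO(
    "DMS_id,mutation,DMS_score,VespaG\n"
    "assay_A,A1C,0.1,-3.0\n"
    "assay_A,A1D,0.5,-2.0\n"
    "assay_A,A1E,0.9,-1.0\n"
    "assay_B,M1C,0.2,-1.0\n"
    "assay_B,M1D,0.4,-2.0\n"
    "assay_B,M1E,0.6,-3.0\n"
)
per_assay = spearman_per_assay(toy)
for dms_id, rho in per_assay.items():
    print(dms_id, rho)
```

To run this on the real data, one would replace the toy input with an opened file, for example `open("raw_preds_human.csv", newline="")`, assuming the archive has been unpacked into the working directory. The results can then be compared against the 'Spearman' column of proteingym_performance_vespag.csv; small differences may arise if the published values were computed with different tie handling or preprocessing.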