Main dataset for the Large-scale analysis of the β-lactamase sequence space with protein language models

DOI: 10.5281/zenodo.14743325
Zenodo: 14743325
MaRDI QID: Q6706224
FDO: Q6706224

Dataset published in the Zenodo repository.

Lorenzo Segovia, Miguel Ángel González Arias, Alejandro Garciarrubio

Publication date: 26 January 2025

Copyright license: Creative Commons Attribution 4.0 International

The main dataset for the publication "Large-scale analysis of the β-lactamase sequence space with protein language models". This dataset contains 29,445 rows and 82 columns and is provided in Parquet format. The rows represent all sequences retrieved from the BLDB. The columns contain information processed from the BLDB, including the taxonomy of each sequence annotated against the Genome Taxonomy Database (GTDB RS207), the per-protein embeddings derived from five protein language models (ESM-1b, ESM2-650, ESM2-3b, CARP-640M, ProtTrans-t5-xl-u50), functional annotations estimated with Biopython, the sequence quality filters applied to select sequences for the analysis, annotations from the AlphaFold Database (AFDB) for the available structures, and secondary structure annotations generated with pyDSSP from the structures predicted by AlphaFold2.

The 2-dimensional PCA, t-SNE, and UMAP representations for the evaluated protein language models are provided as separate datasets in CSV and Parquet formats. The beginning of each filename indicates the algorithm used and the set of β-lactamases covered: sbl for serine β-lactamases and mbl for metallo-β-lactamases.

For more information, consult the following GitHub repository: https://github.com/miangoar/Betalactamase-analysis-with-machine-learning
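
As a rough illustration of working with these files, the Python sketch below guesses the algorithm and enzyme family from a projection filename and loads a Parquet or CSV file with pandas. The filenames in the usage note, the token separators, and the helper names `describe_projection_file` and `load_table` are assumptions made for illustration; the actual filenames and column names should be checked against the Zenodo record and the GitHub repository.

```python
from pathlib import Path
import re

# Prefixes described on this page: sbl and mbl for the enzyme family,
# plus the three projection algorithms.
FAMILIES = {"sbl": "serine beta-lactamases", "mbl": "metallo-beta-lactamases"}
ALGORITHMS = {"pca": "PCA", "tsne": "t-SNE", "umap": "UMAP"}


def describe_projection_file(filename):
    """Guess (algorithm, family) from a projection filename.

    Assumption: tokens are separated by '-', '_', '.' or spaces.
    Returns None for any part that cannot be recognised.
    """
    stem = Path(filename).stem.lower()
    # Normalise "t-sne" / "t_sne" to a single token before splitting.
    stem = re.sub(r"t[-_]sne", "tsne", stem)
    tokens = re.split(r"[-_.\s]+", stem)
    family = next((FAMILIES[t] for t in tokens if t in FAMILIES), None)
    algorithm = next((ALGORITHMS[t] for t in tokens if t in ALGORITHMS), None)
    return algorithm, family


def load_table(path):
    """Load a Parquet or CSV file into a pandas DataFrame."""
    # Imported lazily so the filename helper works without pandas installed.
    # Reading Parquet additionally requires pyarrow or fastparquet.
    import pandas as pd

    path = Path(path)
    if path.suffix.lower() == ".parquet":
        return pd.read_parquet(path)
    return pd.read_csv(path)
```

For example, `describe_projection_file("umap_sbl_esm1b.csv")` (a hypothetical filename) reports a UMAP projection of serine beta-lactamases. After loading the main table with `load_table`, its `shape` attribute can be compared against the 29,445 rows and 82 columns stated above as a quick integrity check.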
