Data Sets and Results for "Improved data sets and evaluation methods for the automatic prediction of DNA-binding proteins"




DOI: 10.5281/zenodo.5153906
Zenodo: 5153906
MaRDI QID: Q6702189
FDO: Q6702189

Dataset published at Zenodo repository.

Alexander Zaitzeff, Jedediah Singer, Steven Haase, Francis Motta, Nick Leiby

Publication date: 2 August 2021

Copyright license: Creative Commons Attribution 4.0 International



Data sets and results for "Improved data sets and evaluation methods for the automatic prediction of DNA-binding proteins".

The file dna_binding_protein_sequences.zip has the training and testing sets from the paper:

RLL - random_train/test_full_1000.csv
RSL - random_train/test_40.csv
RSLL - random_train/test_40_1000.csv
RLL where included positive examples have verified DNA binding activity - random_train/test_hq_1000.csv
The 10 RSLL data sets - random_train/test_40_1000.csv + random_train/test_40_1000_cv_0-8.csv

The results files are named similarly. See see_results.ipynb in the codebase that supplements these data sets.

The species data sets are derived from uniprot_data_bac.tab and uniprot_data_not_bac.tab; see the code.

The ESM embeddings used by the XGBoost model are in dna_binding_protein_esm.zip.
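Since the exact member paths and column names inside the archives are not spelled out here, it may help to inspect an archive before loading anything. The following is a minimal sketch using only the Python standard library; the helper names list_csv_members and read_csv_member are hypothetical (they are not part of the published codebase), and the sketch assumes the CSV files are UTF-8 encoded with a header row.

```python
import csv
import io
import zipfile


def list_csv_members(zip_path):
    """Return the sorted names of all .csv members in a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(
            name for name in archive.namelist() if name.lower().endswith(".csv")
        )


def read_csv_member(zip_path, member, limit=None):
    """Read one CSV member into a list of dicts keyed by its header row.

    Assumption: the member is UTF-8 text whose first line is a header.
    If `limit` is given, at most that many data rows are returned.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            rows = []
            for index, row in enumerate(csv.DictReader(text)):
                if limit is not None and index >= limit:
                    break
                rows.append(row)
            return rows
```

For example, after downloading dna_binding_protein_sequences.zip, calling list_csv_members on it shows the actual file names, and read_csv_member with a small limit gives a quick look at the columns of one training or testing set without extracting the archive.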






