Data Sets and Results for "Improved data sets and evaluation methods for the automatic prediction of DNA-binding proteins"
DOI: 10.5281/zenodo.5153906
Zenodo: 5153906
MaRDI QID: Q6702189
FDO: Q6702189
Dataset published in the Zenodo repository.
Alexander Zaitzeff, Jedediah Singer, Steven Haase, Francis Motta, Nick Leiby
Publication date: 2 August 2021
Copyright license: Creative Commons Attribution 4.0 International
The file dna_binding_protein_sequences.zip has the training and testing sets from the paper:

RLL - random_train/test_full_1000.csv
RSL - random_train/test_40.csv
RSLL - random_train/test_40_1000.csv
RLL where included positive examples have verified DNA binding activity - random_train/test_hq_1000.csv
The 10 RSLL data sets - random_train/test_40_1000.csv plus random_train/test_40_1000_cv_0-8.csv

The results files are named similarly. See see_results.ipynb in the codebase that supplements these data sets.

The species data sets are derived from uniprot_data_bac.tab and uniprot_data_not_bac.tab. See code.

The ESM embeddings used by the XGBoost model are in dna_binding_protein_esm.zip.
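As a quick way to inspect the archive, the following minimal sketch lists the CSV files inside a downloaded zip and reads one of them using only the Python standard library. It is not taken from the authors' codebase. It assumes the CSV files are UTF-8 encoded and begin with a header row, neither of which is stated in this record, and the helper names list_csv_members and read_csv_member are hypothetical.

```python
import csv
import io
import zipfile


def list_csv_members(zip_path):
    """Return the sorted names of all .csv members in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(
            name for name in archive.namelist() if name.lower().endswith(".csv")
        )


def read_csv_member(zip_path, member):
    """Read one CSV member into a list of dicts keyed by its header row.

    Assumption: the file is UTF-8 text whose first line is a header.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))
```

Calling list_csv_members on the downloaded dna_binding_protein_sequences.zip first shows the exact member names, which avoids guessing how the train/test shorthand above expands into individual file names. Any name from that list can then be passed to read_csv_member.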