Datasets for evaluating scalable supervised learning for synthesize-on-demand chemical libraries

DOI 10.5281/zenodo.8051351 (Zenodo); MaRDI QID Q6694114

Authors Gene E. Ananiev, Anthony Gitter, Shengchao Liu, Scott A. Wildman, Song Guo, Moayad Alnammi, Spencer S. Ericksen, Andrew F. Voter, F. Michael Hoffmann, James L. Keck

Publication date 17 June 2023

Copyright license Creative Commons Attribution 4.0 International

Description

This repository contains datasets for the manuscript Evaluating scalable supervised learning for synthesize-on-demand chemical libraries:

ams_all_preds.csv.gz: The AMS dataset predictions when using an RF or baseline model trained on the training dataset. Includes the predicted score and rank from each model for each compound. We started with 8,434,707 AMS compounds and detected that 247,025 were in the LC or MLPCN training data. These were removed from the AMS list, leaving 8,187,682 compounds to score. The compound matching was done on the SMILES that we canonicalized in RDKit.

ams_order_results.csv.gz: Information about the 1,024 compounds purchased from the AMS library. Excludes the 4 AMS compounds that were incompletely dissolved. Includes the chemical feature representation, information from the vendor, RF and baseline model predictions, screening results, and clustering results.

baseline_weight.npy: The saved Similarity Baseline model, which consists of the active compounds in the training data. This model was used to score the AMS library. See the GitHub repository for code to load the model and make predictions on new compounds.

cdd_training_data.tar.gz: The LC1234 and MLPCN PriA-SSB screening data exported from CDD.

enamine_costs_clustered_v3_with_nneighbor.csv.gz: Contains 5,620 Enamine compounds that were selected based on the RF prediction score and availability. This file also contains the Taylor-Butina cluster ID when clustering the training compounds, 1,024 tested AMS compounds, and top-ranked Enamine compounds at a 0.4 threshold. The nearest neighbor compounds in the training and AMS sets are also included along with compound information from Enamine, RF model scores, and chemical feature representations.

enamine_dose_response_curve_plots.xlsx: Images of the dose response curves from all three runs on the 68 Enamine compounds. If a compound was tested multiple times, multiple curves are shown in the same plot. The compound structure images and SMILES are exported from CDD, not generated with RDKit.

enamine_dose_response_curves.tsv: The dose response curve summaries from all three runs on the 68 Enamine compounds. If a compound was tested multiple times, only the highest-quality dose response curve was used.

enamine_final_list.csv.gz: The final 100 filtered compounds from enamine_top_10000.csv.gz. Contains compound information from Enamine as well as RF model scores, chemical feature representations, and clustering results.

enamine_PriA-SSB_dose_response_data.tar.gz: The dose response screening data from all three runs on the 68 Enamine compounds. The 2021-06-16 run was originally screened on 2020-08-24; 2021-06-16 is the date the compound identities were corrected. This run contains two 1,536-well plates.

enamine_top_10000.csv.gz: Top 10,000 predictions from the Enamine REAL dataset using the selected RF model. Contains compound information from Enamine as well as RF model scores, chemical feature representations, and clustering results.

master_df.csv.gz: The output of preprocessing the files in cdd_training_data.tar.gz. Contains 441,900 rows.

random_forest_classification_139.pkl: The saved RF classification model with hyperparameter ID 139. This model was used to score the AMS and Enamine REAL libraries. See the GitHub repository for code to load the model and make predictions on new compounds.

train_ams_real_cluster.csv.gz: Contains cluster IDs for Taylor-Butina clustering at a 0.4 threshold applied to the training compounds, 1,024 tested AMS compounds, and top-ranked compounds from Enamine. Includes the chemical features, dataset to which the compound belongs, leader compound for each cluster, and whether the compound is a known hit.

training_df_single_fold.csv.gz: All ten folds in training_folds.tar.gz merged for convenience. Contains 427,300 compounds.

training_df_single_fold_with_ams_clustering.csv.gz: Contains cluster IDs for Taylor-Butina clustering applied to the 427,300 training compounds and the 1,024 tested AMS compounds. Different clustering results are shown at the 0.2, 0.3, and 0.4 thresholds. Includes the leader compound for each cluster. Although the training and AMS compounds were clustered jointly, only the training compounds' clusters are shown. The AMS compounds' clusters are in ams_order_results.csv.gz.

training_folds.tar.gz: The LC1234 and MLPCN training data split into ten folds. This dataset with 427,300 compounds was used for cross validation and model selection. This dataset is derived from master_df.csv.gz.

If you use these datasets in a publication, please cite:

Moayad Alnammi, Shengchao Liu, Spencer S. Ericksen, Gene E. Ananiev, Andrew F. Voter, Song Guo, James L. Keck, F. Michael Hoffmann, Scott A. Wildman, Anthony Gitter. Evaluating scalable supervised learning for synthesize-on-demand chemical libraries. Journal of Chemical Information and Modeling 2023.

See PubChem AID 1272365, AID 1918986, and the associated publications for details about the PriA-SSB screening data. The screening datasets were compiled from three separate sources that should all be cited if the training dataset is used in a publication:

Moayad Alnammi, Shengchao Liu, Spencer S. Ericksen, Gene E. Ananiev, Andrew F. Voter, Song Guo, James L. Keck, F. Michael Hoffmann, Scott A. Wildman, Anthony Gitter. Evaluating scalable supervised learning for synthesize-on-demand chemical libraries. Journal of Chemical Information and Modeling 2023.

Shengchao Liu+, Moayad Alnammi+, Spencer S. Ericksen, Andrew F. Voter, Gene E. Ananiev, James L. Keck, F. Michael Hoffmann, Scott A. Wildman, Anthony Gitter. Practical model selection for prospective virtual screening. Journal of Chemical Information and Modeling 2018.

Andrew F. Voter+, Michael P. Killoran+, Gene E. Ananiev, Scott A. Wildman, F. Michael Hoffmann, James L. Keck. A high-throughput screening strategy to identify inhibitors of SSB protein-protein interactions in an academic screening facility. SLAS Discovery 2018.
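The description states that the ten training folds merge into a single file of 427,300 compounds and that master_df.csv.gz has 441,900 rows. The following is a minimal sketch of how such counts could be checked using only the Python standard library. It assumes that the folds have already been extracted from the archive as gzip-compressed CSV files, that every file has one header row, and that all folds share the same header. The archive layout and column names are not documented on this page, and the helper names count_data_rows and merge_gzipped_csvs are hypothetical, not taken from the authors' GitHub repository.

```python
import csv
import gzip


def count_data_rows(path):
    """Count the rows after the header in a gzip-compressed CSV file."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row (assumed to be present)
        return sum(1 for _ in reader)


def merge_gzipped_csvs(fold_paths, out_path):
    """Concatenate gzip-compressed CSVs that share one header into one file.

    Returns the number of data rows written. Raises ValueError if a fold's
    header differs from the first header seen.
    """
    written = 0
    header = None
    with gzip.open(out_path, "wt", newline="") as out_handle:
        writer = csv.writer(out_handle)
        for path in fold_paths:
            with gzip.open(path, "rt", newline="") as in_handle:
                reader = csv.reader(in_handle)
                fold_header = next(reader, None)
                if fold_header is None:
                    continue  # empty file: nothing to copy
                if header is None:
                    header = fold_header
                    writer.writerow(header)  # write the header only once
                elif fold_header != header:
                    raise ValueError(f"header mismatch in {path}")
                for row in reader:
                    writer.writerow(row)
                    written += 1
    return written
```

If those assumptions hold, calling count_data_rows on master_df.csv.gz would be expected to report the 441,900 rows stated in the description, and merging the ten extracted folds would be expected to yield 427,300 rows. A different result would point to a layout that differs from what this sketch assumes, for example a missing header row or an index column with quoted line breaks.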
