Chemical structures, Cell Painting and transcriptional profiles for compound bioactivity prediction.
DOI: 10.5281/zenodo.7729583
Zenodo: 7729583
MaRDI QID: Q6692719
FDO: Q6692719
Dataset published at Zenodo repository.
Shantanu Singh, Tim Becker, Kevin Yang, Anne E Carpenter, Paul A. Clemons, Nikita Moshkov, Juan C. Caicedo, Peter Horvath, Bridget K. Wagner, Vlado Dancik
Publication date: 16 March 2023
Copyright license: Creative Commons Attribution 4.0 International
This is the data related to the paper "Predicting compound activity from phenotypic profiles and chemical structures", covering both the inputs and the outputs produced. It can be merged with the paper's GitHub repository for reproduction. Folders and files are described below:

├── assay_data
│   ├── assay_matrix_discrete_270_assays.csv   Assay matrix with hits for 270 assays and 16170 compounds. This is the final file that was used to produce the splits.
│   ├── assay_metadata.csv   Assay metadata.
│   ├── broad_ids.txt   List of Broad IDs used in this study. This is an unfiltered list of compounds required by some analysis scripts.
│   ├── smiles.txt   Same as broad_ids.txt, but as SMILES strings.
├── feature_data   (for 16978 compounds; can be masked with ./misc/compounds16978to16170.npy)
│   ├── cp.npz   Classical chemical features.
│   ├── ge.npz   Gene expression features.
│   ├── ge_scale.npz   Scaled gene expression features.
│   ├── mo.npz   Morphology features (not batch corrected).
│   ├── mobc.npz   Morphology features (batch corrected).
├── misc
│   ├── compound_analysis.npz   Compounds in the dataset identified as PAINS.
│   ├── compounds16978to16170.npy   Used to filter features from the larger set of compounds down to the final one.
│   ├── fingerprints.npz   Calculated fingerprints of the compounds, which were then used to calculate similarity.
│   ├── similarity_fingerprints.npz   Similarity matrix for the compounds (16978).
│   ├── population_normalized.csv.gz   Well-level morphological profiles that were used for batch correction.
│   ├── Table for PUMA   Excel file with additional data and plots.
├── predictions
│   ├── scaffold_median(mean)_AUC.csv   Median (mean) AUC scores aggregated over the scaffold-based cross-validation splits. The paper reports the median results.
│   ├── scaffold_median(mean)_EF.csv   Median (mean) enrichment factor (EF) aggregated over the scaffold-based cross-validation splits. The paper reports the median results.
│   ├── toprank_chemical_cv{}_hitsnorm.csv   These files are needed to create the enrichment plots; they contain the hit rate and the top-rank hit rate.
│   ├── Each folder here stands for an experiment type; the number in the folder name is the number of the split. Each folder contains the following elements:
│   │   ├── predictions   Folder with predictions for each assay-compound pair for each modality.
│   │   ├── 2022_01_evaluation_all_data.csv   AUC scores for each assay on the test set of the split.
│   │   ├── 2022_01_evaluation_all_data_EF.csv   Enrichment factor (EF) values for each assay on the test set of the split. These files exist only for *chemical* folders.
│   │   ├── assay_matrix_discrete_train(test)_old_scaff.csv   Training and test subsets of the data for the split. The first column contains broad_id.
│   │   ├── assay_matrix_discrete_train(test)_old_scaff.csv   Same, but with SMILES strings in the first column. These files are used as input to ChemProp.
│   Experiments in this folder are the following:
│   - chemical   Scaffold-based 5-fold cross-validation splits. The main results in the paper are reported with this series of experiments.
│   - chemical_bal   Same splits as in chemical, but training was run with ChemProp's built-in data balancing.
│   - chemical_st   Same splits as in chemical, but a separate model was trained for each assay.
│   - CV   Random 5-fold cross-validation splits.
│   - GE   5-fold cross-validation splits based on same-size clustering of gene expression features.
│   - MOBC   5-fold cross-validation splits based on same-size clustering of batch-corrected morphology features.
│   - random   10 random splits, with ~80% of compounds in the training set and the rest in the test set.
├── splitting   NumPy files that help to match compounds and features when creating the training and test sets for a split. They can be reused in the analysis notebook for data preparation.
│   ├── scaffold_based_split.npz   Splitting for the scaffold-based splits.
│   ├── random_split_{}.npz   Random split indices of test set compounds (10 files).
│   ├── cross_validation_indicies.npz   Indices for the random cross-validation splits.
│   ├── GE_clusters_size_constrained.npz   Indices of clusters from same-size clustering of gene expression features.
│   ├── MOBC_clusters_size_constrained.npz   Indices of clusters from same-size clustering of batch-corrected morphology features.
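
The description above says that the feature files cover 16978 compounds and can be masked down to the final 16170 with ./misc/compounds16978to16170.npy. The following is a minimal sketch of that filtering step on synthetic data, not the authors' code. The helper name `select_compounds` is hypothetical, and the sketch assumes that features are stored with one row per compound; because the record does not state whether the .npy file holds a boolean mask or integer indices, the helper accepts either.

```python
import numpy as np


def select_compounds(features, selector):
    """Reduce a feature matrix with one row per compound to a subset of rows.

    `selector` may be a boolean mask (one entry per row) or an array of
    integer row indices; NumPy indexing handles both cases.
    (Hypothetical helper, not part of the dataset or the paper's repository.)
    """
    features = np.asarray(features)
    selector = np.asarray(selector)
    if selector.dtype == bool and selector.shape[0] != features.shape[0]:
        raise ValueError("boolean mask length must match the number of rows")
    return features[selector]


# Synthetic stand-in data: 6 compounds with 3 features each.
all_features = np.arange(18, dtype=float).reshape(6, 3)
keep_mask = np.array([True, False, True, True, False, True])

subset = select_compounds(all_features, keep_mask)
print(subset.shape)  # (4, 3)

# With the real files, the array names inside each .npz archive are not
# documented here, so inspect them before indexing (assumed workflow):
# selector = np.load("misc/compounds16978to16170.npy")
# archive = np.load("feature_data/mobc.npz")
# print(archive.files)
```

Listing `archive.files` first avoids guessing the names of the arrays stored in each archive.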