SDDF Energy Dataset

From MaRDI portal
Dataset:6701004



DOI: 10.5281/zenodo.14008357
Zenodo: 14008357
MaRDI QID: Q6701004
FDO: Q6701004

Dataset published at Zenodo repository.

Garegin A. Papoian, Khachik Smbatyan, Tigran Aghajanyan, Garik Petrosyan, Vahagn Altunyan, Tsolak Ghukasyan, Aram Bughdaryan

Publication date: 29 October 2024

Copyright license: Creative Commons Attribution 4.0 International



This conformational energy dataset, developed as part of the Smart Distributed Data Factory (SDDF) project, contains over 2.17 million molecular conformations of drug-like molecules sourced from the ENAMINE database. Energies were calculated using DFT with the ωB97X density functional and the 6-31G(d) basis set. The conformations were generated from SMILES using RDKit, MMFF94 optimization, and molecular dynamics (MD) simulations, providing a diverse set of molecular structures and energy states.

Conformations by generation method:
- RDKit Conformations: 535,338
- RDKit + MMFF94 Optimized: 1,151,936
- MD-Generated: 483,279

This dataset serves as a benchmark for energy prediction models, with training (638,617 examples), validation (134,732 examples), and test (24,890 examples) subsets created using a strict scaffold-based split to ensure no overlap and less than 70% similarity between the training and test sets.

Dataset contents:
- data.tar.gz: contains the conformations in Structured Data File (SDF) format, grouped into separate folders by molecule ID.
- INDEX.smi: specifies the molecule IDs and their corresponding SMILES.
- SOURCES.csv: specifies the generation method for each conformation.
- SDDF_train.tsv, SDDF_validation.tsv, and SDDF_test.tsv: specify the molecule IDs and conformations for each subset of the benchmark.

A detailed description is provided in the accompanying paper.
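As a quick sanity check after downloading, one might confirm that no molecule ID appears in more than one split file and that the per-method conformation counts add up to the stated total. The sketch below is illustrative only: it assumes the split files are tab-separated with the molecule ID in the first column and no header row, which the description does not confirm, and the IDs in the demo are made up. Disjoint IDs are a necessary condition for the scaffold-based split, not proof of it, since scaffold and similarity checks would need a cheminformatics toolkit.

```python
import csv
import io

# Conformation counts per generation method, as stated in the description.
GENERATION_COUNTS = {
    "RDKit": 535_338,
    "RDKit + MMFF94": 1_151_936,
    "MD": 483_279,
}


def read_molecule_ids(lines, id_column=0, has_header=False):
    """Collect molecule IDs from tab-separated lines.

    ASSUMPTION: the molecule ID sits in column `id_column`; the real
    layout of the SDDF_*.tsv files may differ.
    """
    reader = csv.reader(lines, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row
    return {row[id_column] for row in reader if row}


def shared_ids(*id_sets):
    """Return IDs that occur in more than one of the given sets."""
    seen, shared = set(), set()
    for ids in id_sets:
        shared |= seen & ids
        seen |= ids
    return shared


if __name__ == "__main__":
    # Toy stand-ins for the three split files (hypothetical IDs and layout).
    train = io.StringIO("mol_1\tconf_0\nmol_1\tconf_1\nmol_2\tconf_0\n")
    valid = io.StringIO("mol_3\tconf_0\n")
    test = io.StringIO("mol_4\tconf_0\nmol_4\tconf_1\n")

    splits = [read_molecule_ids(f) for f in (train, valid, test)]
    print("shared molecule IDs:", shared_ids(*splits))
    print("total conformations:", sum(GENERATION_COUNTS.values()))
```

With the real files, the `io.StringIO` objects would be replaced by opened file handles, for example `open("SDDF_train.tsv", newline="")`, with `id_column` and `has_header` adjusted to the actual layout.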






