NADA: A synthetic shape benchmark for testing probabilistic deep learning models

DOI: 10.5281/zenodo.14361221, Zenodo: 14361221, MaRDI QID: Q6702953, FDO: Q6702953

Dataset published at Zenodo repository.

Volpini Federico, Claudia Caudai, Giulio del Corso, D. Moroni, Sara Colantonio

Publication date: 10 December 2024

NADA (Not-A-Database) is an easy-to-use geometric shape data generator that allows users to define non-uniform multivariate parameter distributions to test novel methodologies. The full open-source package is provided at GIT:NA_DAtabase. See the Technical Report for details on how to use the provided package.

This database includes 3 repositories:

NADA_Dis: Is the model able to correctly characterize/disentangle a complex latent space? The repository contains 3x100,000 synthetic black and white images to test the ability of models (e.g., autoencoders) to define a proper latent space and disentangle it. The first 100,000 images contain 4 shapes and uniform parameter space distributions, while the other images have a more complex underlying distribution (truncated Gaussian and correlated marginal variables).

NADA_OOD: Does the model identify Out-Of-Distribution images? The repository contains 100,000 training images (4 different shapes with 3 possible colors, located in the upper left corner of the canvas) and 6x100,000 images in sets that differ increasingly from the training set (changing the color class balance, reducing the radius of the shape, moving the shape to the lower left corner), providing increasingly challenging out-of-distribution images. This helps to test not only the capability of a model, but also methods that produce reliability estimates, which should classify OOD elements as "unreliable" because they are far from the original distribution.

NADA_AlEp: Does the model distinguish between different types (Aleatoric/Epistemic) of uncertainty? The repository contains 5x100,000 images with different types of noise/uncertainty:

NADA_AlEp_0_Clean: Noise-free dataset to use as a possible training set.
NADA_AlEp_1_White_Noise: Dataset with Epistemic white noise. Each image is perturbed with an amount of white noise randomly sampled from 0% to 90%.
NADA_AlEp_2_Deformation: Dataset with Epistemic deformation noise. Each image is deformed by an amount uniformly sampled between 0% and 90%. On this scale, 0% corresponds to the original image, while 100% is a full deformation to the circumscribing circle.
NADA_AlEp_3_Label: Dataset with label noise. Formally, 20% of Triangles of a given color are misclassified as a Square with a random color (among Blue, Orange, and Brown) and vice versa (Squares to Triangles). Label noise introduces Aleatoric Uncertainty because it is inherent in the data and cannot be reduced.
NADA_AlEp_4_Combined: Combined dataset with all previous sources of uncertainty.

Each image can be used for classification (shape/color) or regression (radius/area) tasks. All datasets can be modified and adapted to the user's research question using the included open source data generator.
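
The label-noise rule described for NADA_AlEp_3_Label can be illustrated with a minimal Python sketch. This is not the code of the NADA generator: the function name add_label_noise, the (shape, color) tuple format, and the lowercase label strings are assumptions made for illustration only. The sketch flips 20% of each Triangle/Square color group to the other shape and assigns a random color among Blue, Orange, and Brown.

```python
import random
from collections import defaultdict

# Hypothetical label encoding: (shape, color) tuples with lowercase names.
SWAP = {"triangle": "square", "square": "triangle"}
COLORS = ["blue", "orange", "brown"]


def add_label_noise(labels, flip_fraction=0.2, seed=0):
    """Return a noisy copy of a list of (shape, color) labels.

    Within each (shape, color) group of triangles or squares, a fraction
    `flip_fraction` of the labels is replaced by the other shape with a
    randomly drawn color. All other shapes are left untouched.
    """
    rng = random.Random(seed)

    # Group indices by their ORIGINAL label so no item is flipped twice.
    groups = defaultdict(list)
    for i, (shape, color) in enumerate(labels):
        if shape in SWAP:
            groups[(shape, color)].append(i)

    noisy = list(labels)
    for (shape, _color), indices in groups.items():
        n_flip = round(flip_fraction * len(indices))
        for i in rng.sample(indices, n_flip):
            noisy[i] = (SWAP[shape], rng.choice(COLORS))
    return noisy


if __name__ == "__main__":
    clean = ([("triangle", "blue")] * 50
             + [("square", "orange")] * 50
             + [("circle", "brown")] * 20)
    noisy = add_label_noise(clean)
    changed = sum(a[0] != b[0] for a, b in zip(clean, noisy))
    print(f"{changed} of {len(clean)} shape labels were flipped")
```

Because the flipped labels are chosen at random and carry no information about the image, a model trained on such data cannot remove the resulting error by collecting more samples, which is why this kind of noise is treated as aleatoric.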
