codrna

From MaRDI portal
Dataset:6033107



OpenML: 351 | MaRDI QID: Q6033107 | FDO: Q6033107 | RO-Crate: Q6033107

OpenML dataset with id 351

Andrew V Uzilov, David H Mathews, Joshua M Keegan

Full work available at URL: https://api.openml.org/data/v1/download/52254/codrna.sparse_arff

Upload date: 29 August 2014



Dataset Characteristics

Number of classes: 2
Number of features: 9 (numeric: 8, symbolic: 1; binary in total: 1)
Number of instances: 488,565
Number of instances with missing values: 0
Number of missing values: 0

Author: Andrew V Uzilov, Joshua M Keegan, David H Mathews. Source: original. Please cite: [AVU06a] Andrew V Uzilov, Joshua M Keegan, and David H Mathews. Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinformatics, 7(173), 2006.

This is the cod-rna dataset, retrieved 2014-11-14 from the LibSVM site. In addition to the preprocessing done there (see the LibSVM site for details), this dataset was created as follows:

  • Join the test, train, and rest datasets.
  • Normalize each file column-wise according to the following rules:
      • If a column contains only one value (constant feature), it is set to zero and thus removed by sparsity.
      • If a column contains two values (binary feature), the value occurring more often is set to zero and the other to one.
      • If a column contains more than two values (multinary/real feature), the column is divided by its standard deviation.
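The normalization rules above can be illustrated with a minimal Python sketch. This is not the script used to build the dataset. The function names `normalize_column` and `normalize_columns` are hypothetical, the toy rows are made up, and two details are assumptions because the description does not specify them: the population standard deviation is used, and a tie in a binary column is resolved in favor of the value seen first.

```python
from collections import Counter
from statistics import pstdev


def normalize_column(values):
    """Normalize one feature column following the three rules described above."""
    counts = Counter(values)
    if len(counts) == 1:
        # Constant feature: every entry becomes zero.
        return [0.0] * len(values)
    if len(counts) == 2:
        # Binary feature: the more frequent value maps to 0, the other to 1.
        # Assumption: on a tie, the value seen first counts as the majority.
        majority = counts.most_common(1)[0][0]
        return [0.0 if v == majority else 1.0 for v in values]
    # Multinary/real feature: divide by the standard deviation.
    # Assumption: population standard deviation (the source does not say which).
    sd = pstdev(values)
    return [v / sd for v in values]


def normalize_columns(rows):
    """Apply normalize_column to every column of a list of equal-length rows."""
    columns = [normalize_column(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


# Made-up toy data: one constant, one binary, and one real-valued column.
rows = [
    [5.0, 1.0, 2.0],
    [5.0, 1.0, 4.0],
    [5.0, 7.0, 6.0],
]
normalized = normalize_columns(rows)
```

In this toy input the first column becomes all zeros, the second becomes 0/0/1, and the third is rescaled to unit standard deviation without being centered.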

NOTE: Please keep in mind that cod-rna has many duplicated data points, both within each file (train, test, rest) and across these files. These duplicated points have not been removed!
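Since the duplicates are left in place, it may be worth measuring them before splitting the data for evaluation. The following is a minimal sketch of one way to count exact duplicate rows; `count_duplicates` is a hypothetical helper, and the sample rows are made up rather than taken from cod-rna.

```python
from collections import Counter


def count_duplicates(rows):
    """Return (number of distinct rows, number of redundant copies)."""
    # Rows are converted to tuples so they can be used as dictionary keys.
    counts = Counter(tuple(row) for row in rows)
    distinct = len(counts)
    redundant = sum(counts.values()) - distinct
    return distinct, redundant


# Made-up sample: the first row appears three times.
sample = [
    (1.0, 0.5),
    (1.0, 0.5),
    (2.0, 0.1),
    (1.0, 0.5),
]
distinct, redundant = count_duplicates(sample)
```

This only detects exact matches; rows that differ by floating-point rounding would be counted as distinct.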





ROCrate

What is a RO-Crate?

A RO-Crate is a standardized research object package used to bundle data together with rich machine-readable metadata. Each RO-Crate contains:

  • the files belonging to the dataset (e.g. CSVs, images, code, documentation)
  • a ro-crate-metadata.json file describing the content, provenance, and context
  • persistent identifiers and references to related research objects (e.g. software, publications)

This ensures that the dataset can be easily reused, cited, validated, and interpreted in a reproducible manner.
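To make the structure of a ro-crate-metadata.json file more concrete, here is a minimal Python sketch that builds one in the general shape defined by the RO-Crate 1.1 specification: a metadata descriptor, a root dataset, and one entry per file. The helper `build_ro_crate_metadata` is hypothetical, and the result is an illustration only; the crate that this portal actually generates may contain different or additional entities.

```python
import json


def build_ro_crate_metadata(name, description, data_files):
    """Build a minimal RO-Crate 1.1 style metadata document as a dict."""
    graph = [
        {
            # Metadata descriptor: identifies this file and the root dataset.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # Root dataset: describes the crate as a whole.
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "hasPart": [{"@id": f} for f in data_files],
        },
    ]
    # One File entity per payload file referenced from hasPart.
    graph += [{"@id": f, "@type": "File"} for f in data_files]
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


# Values taken from this page; the file list is an assumption for illustration.
metadata = build_ro_crate_metadata(
    "codrna",
    "OpenML dataset with id 351",
    ["codrna.sparse_arff"],
)
metadata_json = json.dumps(metadata, indent=2)
```

Writing `metadata_json` to a file named ro-crate-metadata.json next to the data files would give a directory laid out like the crate described above.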

Download

You can download a RO-Crate for this dataset here: Download RO-Crate

HINT: The RO-Crate is created dynamically, so it can take up to 30 seconds until the download starts.

