webdata_wXa

From MaRDI portal
Dataset:6033105



OpenML: 350
MaRDI QID: Q6033105
FDO: Q6033105
RO-Crate: Q6033105

OpenML dataset with id 350

John Platt

Full work available at URL: https://api.openml.org/data/v1/download/52253/webdata_wXa.sparse_arff

Upload date: 29 August 2014



Dataset Characteristics

Number of classes: 2
Number of features: 124 (numeric: 123, symbolic: 1, binary in total: 1)
Number of instances: 36,974
Number of instances with missing values: 0
Number of missing values: 0
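The download URL above points to a sparse ARFF file, a format in which each data row lists only its non-zero entries as `{index value, index value, ...}` with zero-based attribute indices. The following is a minimal sketch of expanding one such row into a dense list. The function name `parse_sparse_arff_row` and the sample rows in the test are hypothetical, quoted values containing commas are not handled, and omitted entries are filled with 0.0, which suits the numeric features but not necessarily the symbolic class attribute.

```python
def parse_sparse_arff_row(line, n_attributes, default=0.0):
    """Expand one sparse ARFF data row such as '{0 1, 3 2.5}' into a dense list.

    Assumption: values do not contain commas or quoted whitespace.
    """
    body = line.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("not a sparse ARFF data row")

    dense = [default] * n_attributes
    inner = body[1:-1].strip()
    if not inner:
        return dense  # a row with no non-zero entries

    for entry in inner.split(","):
        # Each entry is '<zero-based index> <value>'
        index_text, value_text = entry.strip().split(None, 1)
        value = value_text.strip()
        try:
            value = float(value)
        except ValueError:
            pass  # symbolic (nominal) value: keep it as a string
        dense[int(index_text)] = value
    return dense
```

For this dataset, `n_attributes` would be 124 according to the characteristics listed above.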

Author: John Platt
Source: libSVM (date unknown)
Please cite: John C. Platt. Fast training of support vector machines using sequential minimal optimization. In Bernhard Schölkopf, Christopher J. C. Burges, and Alexander J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.

This is the famous webdata dataset w[1-8]a in its binary version, retrieved 2014-11-14 from the libSVM site. In addition to the preprocessing done there (see the libSVM site for details), this dataset was created as follows:

  • load all web data datasets, train and test, e.g. w1a, w1a.t, w2a, w2a.t, w3a, ...
  • join test and train for each subset, e.g. w1a and w1a.t, w2a and w2a.t
  • normalize each file column-wise according to the following rules:
      • If a column contains only one value (constant feature), it is set to zero and thus removed by sparsity.
      • If a column contains two values (binary feature), the value occurring more often is set to zero, the other to one.
      • If a column contains more than two values (multinary/real feature), the column is divided by its standard deviation.
  • afterwards, merge all 8 files into one and sort the rows randomly.
  • finally, remove duplicate lines.
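The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the original R script: the function names `normalize_columns` and `merge_shuffle_dedupe` are hypothetical, the data is treated as dense lists of numbers, the sample standard deviation is used (the source does not say which variant applies), and ties between the two values of a binary column are broken by first occurrence.

```python
import random
import statistics
from collections import Counter


def normalize_columns(rows):
    """Apply the three column-wise rules to a dense list of numeric rows."""
    normalized = []
    for col in zip(*rows):  # iterate over columns
        counts = Counter(col)
        if len(counts) == 1:
            # Constant feature: set to zero everywhere
            new_col = [0.0] * len(col)
        elif len(counts) == 2:
            # Binary feature: more frequent value -> 0, the other -> 1
            # (assumption: on a tie, the value seen first counts as the majority)
            majority = counts.most_common(1)[0][0]
            new_col = [0.0 if v == majority else 1.0 for v in col]
        else:
            # Multinary/real feature: divide by the standard deviation
            # (assumption: sample standard deviation)
            sd = statistics.stdev(col)
            new_col = [v / sd for v in col]
        normalized.append(new_col)
    return [list(row) for row in zip(*normalized)]  # back to rows


def merge_shuffle_dedupe(parts, seed=0):
    """Merge several row lists, shuffle them, and drop duplicate rows."""
    merged = [tuple(row) for part in parts for row in part]
    random.Random(seed).shuffle(merged)
    seen = set()
    unique = []
    for row in merged:
        if row not in seen:  # keep only the first occurrence of each row
            seen.add(row)
            unique.append(list(row))
    return unique
```

In this sketch, each of the 8 joined train/test files would be passed through `normalize_columns` separately, and the 8 results would then be handed to `merge_shuffle_dedupe` as `parts`.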

An R script which does all of these steps can be found here: https://github.com/openml/data_scripts/blob/master/webdata_wXa/dataDownloader.R





RO-Crate

What is an RO-Crate?

An RO-Crate is a standardized research object package used to bundle data together with rich machine-readable metadata. Each RO-Crate contains:

  • the files belonging to the dataset (e.g. CSVs, images, code, documentation)
  • a ro-crate-metadata.json file describing the content, provenance, and context
  • persistent identifiers and references to related research objects (e.g. software, publications)
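As an illustration of how the metadata file can be read, the sketch below loads a `ro-crate-metadata.json` document and looks up the root dataset and its listed files. It assumes the usual RO-Crate layout (a JSON-LD document with an `@graph` list, and a metadata descriptor entity whose `about` property points to the root dataset). The function name `summarize_crate` and the sample metadata are hypothetical and do not reproduce the crate generated for this dataset.

```python
import json


def summarize_crate(metadata_text):
    """Return the root dataset's name and the ids of its parts."""
    doc = json.loads(metadata_text)
    # Index all entities in the flattened JSON-LD graph by their @id
    entities = {e["@id"]: e for e in doc.get("@graph", [])}

    # The metadata descriptor points to the root dataset via "about"
    descriptor = entities.get("ro-crate-metadata.json", {})
    root_id = descriptor.get("about", {}).get("@id", "./")
    root = entities.get(root_id, {})

    parts = root.get("hasPart", [])
    if isinstance(parts, dict):  # a single part may appear without a list
        parts = [parts]
    return {"name": root.get("name"), "parts": [p["@id"] for p in parts]}


# Hypothetical sample metadata, for illustration only
example = json.dumps({
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
         "about": {"@id": "./"}},
        {"@id": "./", "@type": "Dataset", "name": "webdata_wXa",
         "hasPart": [{"@id": "webdata_wXa.sparse_arff"}]},
        {"@id": "webdata_wXa.sparse_arff", "@type": "File"},
    ],
})
summary = summarize_crate(example)
```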

This ensures that the dataset can be easily reused, cited, validated, and interpreted in a reproducible manner.

Download

You can download an RO-Crate for this dataset here: Download RO-Crate

HINT: The RO-Crate is created dynamically, so it can take up to 30 seconds until the download starts.

