webdata_wXa

OpenML: 350 · MaRDI QID: Q6033105 · FDO: Q6033105 · RO-Crate: Q6033105

OpenML dataset with id 350

John Platt

Full work available at URL: https://api.openml.org/data/v1/download/52253/webdata_wXa.sparse_arff

Upload date: 29 August 2014

Dataset Characteristics

Number of classes: 2
Number of features: 124 (numeric: 123, symbolic: 1, binary in total: 1)
Number of instances: 36,974
Number of instances with missing values: 0
Number of missing values: 0

Description

Author: John Platt

Source: libSVM, date unknown

Please cite: John C. Platt. Fast training of support vector machines using sequential minimal optimization. In Bernhard Schölkopf, Christopher J. C. Burges, and Alexander J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.

This is the famous webdata dataset w[1-8]a in its binary version, retrieved 2014-11-14 from the libSVM site. In addition to the preprocessing done there (see the LibSVM site for details), this dataset was created as follows:

Load all web data datasets, train and test, e.g. w1a, w1a.t, w2a, w2a.t, w3a, ...
Join test and train for each subset, e.g. w1a and w1a.t, w2a and w2a.t.
Normalize each file columnwise according to the following rules:
If a column contains only one value (constant feature), it is set to zero and thus removed by the sparse encoding.
If a column contains two values (binary feature), the value occurring more often is set to zero, the other to one.
If a column contains more than two values (multinary/real feature), the column is divided by its standard deviation.
Afterwards, all 8 files are merged into one and shuffled into random order.
Finally, duplicate lines are removed.
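The column rules and the final merge step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the R script linked below: the function names are hypothetical, the data is held as dense lists of rows, the sample standard deviation (n - 1 denominator) is assumed, and a tie between the two values of a binary column is broken in favour of the value seen first.

```python
import random
from collections import Counter
from statistics import stdev


def normalize_column(values):
    """Apply the three column rules to one column given as a list of numbers."""
    counts = Counter(values)
    if len(counts) == 1:
        # Constant feature: all zeros, so it disappears in a sparse format.
        return [0.0] * len(values)
    if len(counts) == 2:
        # Binary feature: the more frequent value becomes 0, the other 1.
        # Assumption: on a tie, the value encountered first counts as majority.
        majority = counts.most_common(1)[0][0]
        return [0.0 if v == majority else 1.0 for v in values]
    # More than two distinct values: divide by the standard deviation.
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd = stdev(values)
    return [v / sd for v in values]


def normalize_columns(rows):
    """Normalize a dense table (list of equally long rows) column by column."""
    columns = [normalize_column(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


def merge_shuffle_deduplicate(files, seed=0):
    """Merge normalized files, shuffle the rows, then drop duplicate rows."""
    merged = [row for rows in files for row in rows]
    random.Random(seed).shuffle(merged)  # seed is an illustrative choice
    seen = set()
    unique = []
    for row in merged:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

Each file would be passed through normalize_columns on its own before the merge, because the rules are applied per file rather than on the combined data.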

An R script which does all of these steps can be found here: https://github.com/openml/data_scripts/blob/master/webdata_wXa/dataDownloader.R

RO-Crate

What is a RO-Crate?

A RO-Crate is a standardized research object package used to bundle data together with rich machine-readable metadata. Each RO-Crate contains:

the files belonging to the dataset (e.g. CSVs, images, code, documentation)
a ro-crate-metadata.json file describing the content, provenance, and context
persistent identifiers and references to related research objects (e.g. software, publications)

This ensures that the dataset can be easily reused, cited, validated, and interpreted in a reproducible manner. More information can be found here.
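As a rough illustration of what "machine-readable metadata" means in practice, the sketch below reads a ro-crate-metadata.json document and lists the entities it describes. It relies only on the JSON-LD layout in which entities sit in a top-level "@graph" array. The embedded example document is hypothetical and is not the actual crate generated for this dataset, and the function name is made up for the example.

```python
import json


def list_crate_entities(metadata_text):
    """Return (id, type) pairs for every entity in an RO-Crate metadata document."""
    metadata = json.loads(metadata_text)
    return [(e.get("@id"), e.get("@type")) for e in metadata.get("@graph", [])]


# Hypothetical minimal metadata document, for illustration only.
example = """
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork", "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset", "name": "webdata_wXa"}
  ]
}
"""

for entity_id, entity_type in list_crate_entities(example):
    print(entity_id, entity_type)
```

In a downloaded crate, the same function could be applied to the contents of the ro-crate-metadata.json file at the crate root.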

Download

You can download a RO-Crate for this dataset here: Download RO-Crate

HINT: The RO-Crate is created dynamically, so it may take up to 30 seconds until the download starts.
