webdata_wXa

OpenML dataset with id 350

No author found.

Full work available at URL: https://api.openml.org/data/v1/download/52253/webdata_wXa.sparse_arff

Upload date: 29 August 2014

Dataset Characteristics

Number of classes: 2
Number of features: 124 (numeric: 123, symbolic: 1; binary in total: 1)
Number of instances: 36,974
Number of instances with missing values: 0
Number of missing values: 0

Description

Author: John Platt
Source: libSVM - Date unknown
Please cite: John C. Platt. Fast training of support vector machines using sequential minimal optimization. In Bernhard Schölkopf, Christopher J. C. Burges, and Alexander J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.

This is the famous webdata dataset w[1-8]a in its binary version, retrieved 2014-11-14 from the libSVM site. In addition to the preprocessing done there (see the LibSVM site for details), this dataset was created as follows:

1. Load all web data datasets, train and test, e.g. w1a, w1a.t, w2a, w2a.t, w3a, ...
2. Join test and train for each subset, e.g. w1a and w1a.t, w2a and w2a.t.
3. Normalize each file column-wise according to the following rules:
   - If a column contains only one value (constant feature), it is set to zero and thus removed by sparsity.
   - If a column contains two values (binary feature), the value occurring more often is set to zero, the other to one.
   - If a column contains more than two values (multinary/real feature), the column is divided by its standard deviation.
4. Afterwards, all 8 files are merged into one and randomly sorted.
5. Finally, duplicate lines are removed.
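
The column-wise rules and the final merge step can be illustrated with a short Python/NumPy sketch. This is not the R script linked from this page, only a minimal illustration under stated assumptions: the function names `normalize_columns` and `merge_shuffle_dedup` are hypothetical, the input is a dense array holding feature columns only, the standard deviation is the sample standard deviation (as R's `sd()` computes it), and ties between the two values of a binary column are resolved in favour of the smaller value. The original script may differ on any of these points.

```python
import numpy as np


def normalize_columns(X):
    """Apply the three column-wise rules to a dense 2-D feature array."""
    X = np.asarray(X, dtype=float)
    out = np.zeros_like(X)
    for j in range(X.shape[1]):
        col = X[:, j]
        values, counts = np.unique(col, return_counts=True)
        if len(values) == 1:
            # Constant feature: stays all zero (vanishes in a sparse format).
            continue
        if len(values) == 2:
            # Binary feature: more frequent value -> 0, the other -> 1.
            # Assumption: on a tie, argmax picks the smaller value as majority.
            majority = values[np.argmax(counts)]
            out[:, j] = (col != majority).astype(float)
        else:
            # Multinary/real feature: divide by the standard deviation.
            # Assumption: sample standard deviation (ddof=1).
            out[:, j] = col / np.std(col, ddof=1)
    return out


def merge_shuffle_dedup(parts, seed=0):
    """Stack the per-file arrays, shuffle the rows, drop duplicate rows."""
    merged = np.vstack(parts)
    rng = np.random.default_rng(seed)  # seed is an illustrative choice
    shuffled = merged[rng.permutation(len(merged))]
    # Keep the first occurrence of each distinct row, in shuffled order.
    _, first_idx = np.unique(shuffled, axis=0, return_index=True)
    return shuffled[np.sort(first_idx)]
```

Under these assumptions, each of the 8 joined files would be passed through `normalize_columns` separately, and the resulting arrays would then be handed to `merge_shuffle_dedup` as a list.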

An R script which does all of these steps can be found here: https://github.com/openml/data_scripts/blob/master/webdata_wXa/dataDownloader.R
