Red-Wine-Quality

OpenML43695MaRDI QIDQ6036789FDOQ6036789RO-CrateQ6036789

OpenML dataset with id 43695

Author name not available (Why is that?)

Full work available at URL: https://api.openml.org/data/v1/download/22102520/Red-Wine-Quality.arff

Upload date: 24 March 2022

Dataset Characteristics

Number of features: 12 (numeric: 12, symbolic: 0 and in total binary: 0 )
Number of instances: 1,599
Number of instances with missing values: 0
Number of missing values: 0

Description

Context The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.). These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are much more normal wines than excellent or poor ones).

This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality , I just shared it to kaggle for convenience. (If I am mistaken and the public license type disallowed me from doing so, I will take this down if requested.) Content For more information, read [Cortez et al., 2009]. Input variables (based on physicochemical tests): 1 - fixed acidity 2 - volatile acidity 3 - citric acid 4 - residual sugar 5 - chlorides 6 - free sulfur dioxide 7 - total sulfur dioxide 8 - density 9 - pH 10 - sulphates 11 - alcohol Output variable (based on sensory data): 12 - quality (score between 0 and 10) Tips What might be an interesting thing to do, is aside from using regression modelling, is to set an arbitrary cutoff for your dependent variable (wine quality) at e.g. 7 or higher getting classified as 'good/1' and the remainder as 'not good/0'. This allows you to practice with hyper parameter tuning on e.g. decision tree algorithms looking at the ROC curve and the AUC value. Without doing any kind of feature engineering or overfitting you should be able to get an AUC of .88 (without even using random forest algorithm) KNIME is a great tool (GUI) that can be used for this. 1 - File Reader (for csv) to linear correlation node and to interactive histogram for basic EDA. 2- File Reader to 'Rule Engine Node' to turn the 10 point scale to dichtome variable (good wine and rest), the code to put in the rule engine is something like this:

quality 6.5 = "good" TRUE = "bad" 3- Rule Engine Node output to input of Column Filter node to filter out your original 10point feature (this prevent leaking) 4- Column Filter Node output to input of Partitioning Node (your standard train/tes split, e.g. 75/25, choose 'random' or 'stratified') 5- Partitioning Node train data split output to input of Train data split to input Decision Tree Learner node and 6- Partitioning Node test data split output to input Decision Tree predictor Node 7- Decision Tree learner Node output to input Decision Tree Node input 8- Decision Tree output to input ROC Node.. (here you can evaluate your model base on AUC value)

Inspiration Use machine learning to determine which physiochemical properties make a wine 'good'! Acknowledgements This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality , I just shared it to kaggle for convenience. (I am mistaken and the public license type disallowed me from doing so, I will take this down at first request. I am not the owner of this dataset. Please include this citation if you plan to use this database: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. Relevant publication P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

Cited In (1)

An information criterion for auxiliary variable selection in incomplete data analysis

ROCrate

What is a RO-Crate?

A RO-Crate is a standardized research object package used to bundle data together with rich machine-readable metadata. Each RO-Crate contains:

the files belonging to the dataset (e.g. CSVs, images, code, documentation)
a ro-crate-metadata.json file describing the content, provenance, and context
persistent identifiers and references to related research objects (e.g. software, publications)

This ensures that the dataset can be easily reused, cited, validated, and interpreted in a reproducible manner. More information can be found here.

Download

You can download a RO-Crate for this dataset here: Download RO-Crate

HINT: The RO-Crate is created dynamically, so it could take up to 30 seconds until the downloads starts.

This page was built for dataset: Red-Wine-Quality