Project:OpenMLDatamodels: Difference between revisions

Latest revision as of 12:03, 11 April 2024

Dataset

Dataset imported from OpenML. Currently, these datasets can only be associated with papers if there is a DOI match.

label: OpenML dataset name
description: OpenML dataset with id {ID}
instance of (Property:P31) data set (MaRDI: Item:Q56885)
OpenML dataset ID (Property:P1473): unique dataset ID from OpenML
dataset version: version of dataset
author name string: free-text input that currently cannot be associated with an ID of any kind; both creators and contributors are recorded here
collection date: taken from a free-form text field; the string is used as is
upload date: this is an automatic timestamp
license: the license name is matched to License items in the KG, as in the CRAN importer ("Public" is the default in OpenML)
full work available at url: both the fields "url" and "original data url" are used for this
default target attribute (e.g. class)
row id attribute
OpenML semantic tag (Property:P1465): these were automatically tagged in OpenML; only tags in this list are considered: Agriculture, Astronomy, Chemistry, Computational Universe, Computer Systems, Culture, Demographics, Earth Science, Economics, Education, Geography, Government, Health, History, Human Activities, Images, Language, Life Science, Machine Learning, Manufacturing, Mathematics, Medicine, Meteorology, Physical Sciences, Politics, Social Media, Sociology, Statistics, Text & Literature, Transportation
cites work: points to a Publication item if a DOI or arXiv ID could be extracted from the citation text; in that case the importer first tries to find an existing paper with that ID, and creates a new paper only if none is found
citation text: raw citation text
has feature: features and their data types, such as the feature "width" with the data type "numeric"
number of binary features
number of classes
number of features
number of instances
number of instances with missing values
number of missing values
number of numeric features
number of symbolic features
file format: ARFF or Sparse ARFF
MaRDI profile type: MaRDI dataset profile
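
The "cites work" entry above depends on pulling a DOI or arXiv ID out of the raw citation text. The following is a minimal sketch of how such an extraction could look; the regular expressions and the names `DOI_RE`, `ARXIV_RE`, `CitationIds` and `extract_ids` are assumptions made for illustration and are not taken from the actual importer.

```python
import re
from typing import NamedTuple, Optional

# Assumed patterns (illustrative only, not the importer's own):
# a DOI starts with "10.", a registrant code of 4-9 digits, then "/" and a suffix.
DOI_RE = re.compile(r"\b(10\.\d{4,9}/[^\s\"<>]+)")
# New-style arXiv IDs such as "arXiv:1407.7722" or "arxiv.org/abs/1407.7722v2".
ARXIV_RE = re.compile(
    r"arxiv(?:\.org/abs/|:)\s*(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE
)


class CitationIds(NamedTuple):
    """Identifiers found in a citation string (hypothetical helper type)."""
    doi: Optional[str]
    arxiv_id: Optional[str]


def extract_ids(citation_text: str) -> CitationIds:
    """Return the first DOI and arXiv ID found in the text, or None for each."""
    doi_match = DOI_RE.search(citation_text)
    arxiv_match = ARXIV_RE.search(citation_text)
    # Strip punctuation that commonly trails a DOI at the end of a sentence.
    doi = doi_match.group(1).rstrip(".,;)") if doi_match else None
    arxiv_id = arxiv_match.group(1) if arxiv_match else None
    return CitationIds(doi, arxiv_id)
```

If neither identifier is found, only the raw "citation text" statement would be written and no "cites work" link would be created, in line with the description above.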

Publication

If no publication is found for the identifier, a publication item is created that consists only of the identifier, without a label

label: None
description: scientific article about an OpenML dataset
arxiv ID (if present)
doi (if present)
MaRDI profile type: MaRDI publication profile
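
The find-or-create behaviour described for Publication items can be sketched as follows. This is a simplified illustration that uses an in-memory dictionary in place of the knowledge graph; the class `InMemoryGraph` and its method `find_or_create` are hypothetical and do not reflect the importer's real API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Publication:
    """Fields follow the Publication data model listed above."""
    label: Optional[str] = None  # new items are created without a label
    description: str = "scientific article about an OpenML dataset"
    doi: Optional[str] = None
    arxiv_id: Optional[str] = None
    profile_type: str = "MaRDI publication profile"


class InMemoryGraph:
    """Hypothetical stand-in for the knowledge graph, indexed by identifier."""

    def __init__(self) -> None:
        self.by_doi: Dict[str, Publication] = {}
        self.by_arxiv: Dict[str, Publication] = {}

    def find_or_create(
        self, doi: Optional[str] = None, arxiv_id: Optional[str] = None
    ) -> Optional[Publication]:
        # Without an identifier there is nothing to link to.
        if doi is None and arxiv_id is None:
            return None
        # Reuse an existing publication that carries one of the identifiers.
        existing = self.by_doi.get(doi) if doi is not None else None
        if existing is None and arxiv_id is not None:
            existing = self.by_arxiv.get(arxiv_id)
        if existing is not None:
            return existing
        # Otherwise create an item that holds only the identifier(s).
        pub = Publication(doi=doi, arxiv_id=arxiv_id)
        if doi is not None:
            self.by_doi[doi] = pub
        if arxiv_id is not None:
            self.by_arxiv[arxiv_id] = pub
        return pub
```

Looking up the same identifier twice returns the same item, so repeated imports in this sketch do not create duplicate publications.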

Sample item

Item:Q6032831 (anneal)

@@ Line 1: / Line 1: @@
-=== Dataset ===
+== Dataset ==
 Dataset from OpenML; currently, there is no way to associate these datasets with papers, unless there is a doi match
 * label: OpenML dataset name
 * description: OpenML dataset with id {ID}
-* instance of data set (wikidata: Q1172284)
+* instance of ([[Property:P31]]) data set (MaRDI: [[Item:Q56885]])
-* OpenML dataset ID: unique dataset ID from OpenML
+* OpenML dataset ID ([[Property:P1473]]): unique dataset ID from OpenML
 * dataset version: version of dataset
 * author name string: This input was free text and can't be associated to an ID of any kind at the moment - creators and contributors are both used here
 * collection date: from freeform text field --> string is used as is
 * upload date: this is an automatic timestamp
-* license: license name gets matched to License items in KG like in the CRAN importer
+* license: license name gets matched to License items in KG like in the CRAN importer ("Public" is the default in  OpenML)
 * full work available at url: both the fields "url" and "original data url" are used for this
 * default target attribute (e.g. class)
 * row id attribute
-* OpenML semantic tag: these were automatically tagged in OpenML; only tags in this list are considered:  Agriculture, Astronomy, Chemistry, Computational Universe, Computer Systems, Culture, Demographics, Earth Science, Economics, Education, Geography, Government, Health, History, Human Activities, Images, Language, Life Science, Machine Learning, Manufacturing, Mathematics, Medicine, Meteorology, Physical Sciences, Politics, Social Media, Sociology, Statistics, Text & Literature, Transportation
+* OpenML semantic tag ([[Property:P1465]]): these were automatically tagged in OpenML; only tags [https://query.portal.mardi4nfdi.de/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ6032783%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D in this list] are considered:  Agriculture, Astronomy, Chemistry, Computational Universe, Computer Systems, Culture, Demographics, Earth Science, Economics, Education, Geography, Government, Health, History, Human Activities, Images, Language, Life Science, Machine Learning, Manufacturing, Mathematics, Medicine, Meteorology, Physical Sciences, Politics, Social Media, Sociology, Statistics, Text & Literature, Transportation
-* citation text: this is a Publication item; if there is a doi or an arxiv ID, the importer tries to find existing papers with that ID; else, a new paper gets created
+* cites work: this points to a Publication item if a doi or arxiv id could get extracted from the citation text; if there is a doi or an arxiv ID, the importer tries to find existing papers with that ID; else, a new paper gets created
+* citation text: raw citation text
 * has feature: features and their data types, such as the feature "width" with the data type "numeric"
 * number of binary features
@@ Line 28: / Line 29: @@
 * MaRDI profile type: MaRDI dataset profile
-=== Publication ===
+== Publication ==
 If no publication is found for the identifier, a publication item which just consists of the identifier without a label is created
@@ Line 36: / Line 37: @@
 * doi (if present)
 * MaRDI profile type: MaRDI publication profile
+== Sample item ==
+* [[Item:Q6032831]] (anneal)