sylvine_seed_4_nrows_2000_nclasses_10_ncols_100_stratify_True (Q6037676)
From MaRDI portal
OpenML dataset with id 44682
Language | Label | Description | Also known as |
---|---|---|---|
English | sylvine_seed_4_nrows_2000_nclasses_10_ncols_100_stratify_True | OpenML dataset with id 44682 | |
Statements
1
Subsampling of the dataset sylvine (41146) with

seed=4
args.nrows=2000
args.ncols=100
args.nclasses=10
args.no_stratify=True

Generated with the following source code:

```python
def subsample(
    self,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
    stratified: bool = True,
) -> Dataset:
    rng = np.random.default_rng(seed)

    x = self.x
    y = self.y

    # Uniformly sample
    classes = y.unique()
    if len(classes) > nclasses_max:
        vcs = y.value_counts()
        selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
        )

        # Select the indices where one of these classes is present
        idxs = y.index[y.isin(classes)]
        x = x.iloc[idxs]
        y = y.iloc[idxs]

    # Uniformly sample columns if required
    if len(x.columns) > ncols_max:
        columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
        )
        sorted_column_idxs = sorted(columns_idxs)
        selected_columns = list(x.columns[sorted_column_idxs])
        x = x[selected_columns]
    else:
        sorted_column_idxs = list(range(len(x.columns)))

    if len(x) > nrows_max:
        # Stratify accordingly
        target_name = y.name
        data = pd.concat((x, y), axis="columns")
        _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
        )
        x = subset.drop(target_name, axis="columns")
        y = subset[target_name]

    # We need to convert categorical columns to string for openml
    categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
    columns = list(x.columns)

    return Dataset(
        # Technically this is not the same but it's where it was derived from
        dataset=self.dataset,
        x=x,
        y=y,
        categorical_mask=categorical_mask,
        columns=columns,
    )
```
2022-11-17
17 November 2022
class
1
2
21
2,000
0
0
20
1
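The row-subsampling step in the source code quoted in the description relies on scikit-learn's `train_test_split` with a `stratify` argument. As a rough illustration of what stratifying down to a fixed row count means, the following dependency-free sketch draws rows from each class in proportion to its frequency. The function name `stratified_subsample` and the toy labels are hypothetical, and the largest-remainder rounding is an assumption made for this sketch, not a description of scikit-learn's exact allocation.

```python
import random
from collections import Counter, defaultdict


def stratified_subsample(labels, n_rows, seed):
    """Return sorted row indices of a class-proportional subsample (hypothetical helper)."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    total = len(labels)
    # Ideal fractional quota per class, then round down.
    quotas = {c: n_rows * len(idxs) / total for c, idxs in by_class.items()}
    counts = {c: int(q) for c, q in quotas.items()}

    # Assumption: hand the leftover rows to the classes with the largest remainders.
    leftover = n_rows - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - counts[c], reverse=True)
    for c in by_remainder[:leftover]:
        counts[c] += 1

    # Sample without replacement inside each class.
    chosen = []
    for c, idxs in by_class.items():
        chosen.extend(rng.sample(idxs, counts[c]))
    return sorted(chosen)


# Toy target: 5,000 rows, two classes at a 70/30 split (made-up data).
labels = ["a"] * 3500 + ["b"] * 1500
picked = stratified_subsample(labels, n_rows=2000, seed=4)
class_counts = Counter(labels[i] for i in picked)
```

With the toy labels, the subsample keeps the 70/30 class ratio of the input, and reusing the same seed reproduces the same row indices.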