christine_seed_2_nrows_2000_nclasses_10_ncols_100_stratify_True (Q6037679)

From MaRDI portal

Latest revision as of 12:29, 16 April 2024

OpenML dataset with id 44685
Language Label Description Also known as
English
christine_seed_2_nrows_2000_nclasses_10_ncols_100_stratify_True
OpenML dataset with id 44685

    Statements

    Subsampling of the dataset christine (41142) with

    seed=2
    args.nrows=2000
    args.ncols=100
    args.nclasses=10
    args.no_stratify=True

    Generated with the following source code:

```python
def subsample(
    self,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
    stratified: bool = True,
) -> Dataset:
    rng = np.random.default_rng(seed)

    x = self.x
    y = self.y

    # Uniformly sample
    classes = y.unique()
    if len(classes) > nclasses_max:
        vcs = y.value_counts()
        selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
        )

        # Select the indices where one of these classes is present
        idxs = y.index[y.isin(classes)]
        x = x.iloc[idxs]
        y = y.iloc[idxs]

    # Uniformly sample columns if required
    if len(x.columns) > ncols_max:
        columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
        )
        sorted_column_idxs = sorted(columns_idxs)
        selected_columns = list(x.columns[sorted_column_idxs])
        x = x[selected_columns]
    else:
        sorted_column_idxs = list(range(len(x.columns)))

    if len(x) > nrows_max:
        # Stratify accordingly
        target_name = y.name
        data = pd.concat((x, y), axis="columns")
        _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
        )
        x = subset.drop(target_name, axis="columns")
        y = subset[target_name]

    # We need to convert categorical columns to string for openml
    categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
    columns = list(x.columns)

    return Dataset(
        # Technically this is not the same but it's where it was derived from
        dataset=self.dataset,
        x=x,
        y=y,
        categorical_mask=categorical_mask,
        columns=columns,
    )
```
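In the listing above, the class-selection branch filters rows with `classes` (every label) rather than `selected_classes`, and it pairs labels from `y.unique()` with probabilities from `y.value_counts()`, whose orderings need not match. That branch only runs when there are more than `nclasses_max` classes. The following is a minimal sketch of what the step appears intended to do, not the code that generated this dataset; the function name `select_classes` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def select_classes(x: pd.DataFrame, y: pd.Series, nclasses_max: int, seed: int):
    """Keep only rows whose label is among `nclasses_max` sampled classes.

    Hypothetical helper for illustration; not part of the original source.
    """
    rng = np.random.default_rng(seed)

    # value_counts() returns labels and counts in one consistent order, so
    # the probabilities passed to rng.choice line up with the labels.
    vcs = y.value_counts()
    if len(vcs) <= nclasses_max:
        return x, y, list(vcs.index)

    # Sample classes without replacement, weighted by class frequency.
    selected = rng.choice(
        vcs.index.to_numpy(),
        size=nclasses_max,
        replace=False,
        p=(vcs / vcs.sum()).to_numpy(),
    )

    # Filter on the *selected* classes with a boolean mask, which avoids
    # mixing index labels with positional .iloc indexing.
    mask = y.isin(selected)
    return x.loc[mask], y.loc[mask], list(selected)


# Toy data (made up for illustration): four classes, keep two.
y_toy = pd.Series(["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"] * 1, name="class")
x_toy = pd.DataFrame({"f0": range(len(y_toy))})
x_sub, y_sub, kept = select_classes(x_toy, y_toy, nclasses_max=2, seed=2)
```

After the call, `y_sub` contains exactly the labels listed in `kept`, and `x_sub` shares its index with `y_sub`.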
    Eddie Bergman
    2022-11-17
    17 November 2022
    class
    aaa8e49d7d4931b78611f346df3c48be
    1
    2
    101
    2,000
    0
    97

    Identifiers
