splice

Splice (OpenML dataset with id 46)



OpenML ID: 46, MaRDI QID: Q6032902, FDO: Q6032902, RO-Crate: Q6032902

Genbank

Full work available at URL: https://api.openml.org/data/v1/download/46/splice.arff

Upload date: 6 April 2014



Dataset Characteristics

Number of classes: 3
Number of features: 61 (numeric: 0, symbolic: 61, binary: 0)
Number of instances: 3,190
Number of instances with missing values: 0
Number of missing values: 0
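
The characteristics above can be checked against the downloaded file. The following is a minimal sketch, assuming the file at the URL given earlier is a dense ARFF file with unquoted or simply quoted attribute names and `?` as the missing-value marker. The helper names `parse_arff`, `summarize`, and `fetch_splice` are hypothetical and only serve this illustration; a full ARFF reader would also have to handle sparse rows and quoted values containing commas.

```python
import urllib.request

ARFF_URL = "https://api.openml.org/data/v1/download/46/splice.arff"


def parse_arff(text):
    """Parse a dense ARFF document into (attribute_names, rows).

    Simplified on purpose: no sparse rows, no commas inside quoted values,
    no attribute names containing spaces.
    """
    names, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        if in_data:
            rows.append([v.strip().strip("'\"") for v in line.split(",")])
        elif line.lower().startswith("@attribute"):
            names.append(line.split(None, 2)[1].strip("'\""))
        elif line.lower().startswith("@data"):
            in_data = True
    return names, rows


def summarize(names, rows):
    """Count attributes, instances, and '?' entries (ARFF missing values)."""
    missing = sum(value == "?" for row in rows for value in row)
    return {"attributes": len(names), "instances": len(rows), "missing_values": missing}


def fetch_splice(url=ARFF_URL):
    """Download the ARFF file and parse it (requires network access)."""
    with urllib.request.urlopen(url) as response:
        return parse_arff(response.read().decode("utf-8"))
```

Calling `summarize(*fetch_splice())` should then give counts that can be compared with the figures listed above; the attribute count in the file may include the class and identifier columns.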

Author: Genbank. Donated by G. Towell, M. Noordewier, and J. Shavlik
Source: UCI
Please cite: None

Primate splice-junction gene sequences (DNA) with associated imperfect domain theory. Splice junctions are points on a DNA sequence at which 'superfluous' DNA is removed during the process of protein creation in higher organisms. The problem posed in this dataset is to recognize, given a sequence of DNA, the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out). This problem consists of two subtasks: recognizing exon/intron boundaries (referred to as EI sites), and recognizing intron/exon boundaries (IE sites). (In the biological community, IE borders are referred to as acceptors while EI borders are referred to as donors.)

All examples were taken from Genbank 64.1. Categories "ei" and "ie" include every "split-gene" for primates in Genbank 64.1. Non-splice examples were taken from sequences known not to include a splicing site.

Attribute Information

             1   One of {n, ei, ie}, indicating the class.
             2   The instance name.
          3-62   The remaining 60 fields are the sequence, starting at
                 position -30 and ending at position +30. Each of
                 these fields is almost always filled by one of
                 {a, g, t, c}. Other characters indicate ambiguity among
                 the standard characters according to the following table:

                 character: meaning
                     D: A or G or T
                     N: A or G or C or T
                     S: C or G
                     R: A or G
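
One way to feed the ambiguity characters to a numeric model is to spread each position's weight evenly over the bases it may stand for. The sketch below is only an illustration of that idea, not a prescribed encoding for this dataset: the mapping covers exactly the four codes listed in the table above, while the function names, the column order A, G, C, T, and the equal-weight scheme are assumptions made here.

```python
# Ambiguity codes exactly as listed in the table above.
AMBIGUITY = {
    "D": "AGT",
    "N": "AGCT",
    "S": "CG",
    "R": "AG",
}
BASES = "AGCT"  # column order is an arbitrary choice for this sketch


def possible_bases(char):
    """Return the set of standard bases a sequence character may stand for."""
    upper = char.upper()
    if upper in BASES:
        return {upper}
    if upper in AMBIGUITY:
        return set(AMBIGUITY[upper])
    raise ValueError(f"unexpected sequence character: {char!r}")


def encode_position(char):
    """Encode one position as weights over A, G, C, T that sum to 1."""
    candidates = possible_bases(char)
    weight = 1.0 / len(candidates)
    return [weight if base in candidates else 0.0 for base in BASES]


def encode_sequence(chars):
    """Concatenate the per-position weights of a whole sequence."""
    return [w for char in chars for w in encode_position(char)]
```

With this scheme a 60-field sequence becomes a vector of 240 weights, and an unambiguous character reduces to an ordinary one-hot encoding.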

Notes:

  • Instance_name is an identifier and should be ignored for modelling
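
The note above can be applied when preparing records for a model. The following sketch assumes each record is a dictionary keyed by column name; the class column name `Class`, the helper names, and the position labels are assumptions for illustration. In particular, labelling the 60 sequence fields as -30 to -1 and +1 to +30 (no position 0) is inferred from the attribute description, not stated in it.

```python
def position_labels(n_left=30, n_right=30):
    """Labels for the sequence fields, assuming there is no position 0."""
    left = [str(p) for p in range(-n_left, 0)]
    right = [f"+{p}" for p in range(1, n_right + 1)]
    return left + right


def split_record(record, id_key="Instance_name", class_key="Class"):
    """Separate the class label from the features and drop the identifier."""
    features = {
        key: value
        for key, value in record.items()
        if key not in (id_key, class_key)
    }
    return features, record[class_key]
```

Keeping the identifier out of the feature set avoids a model memorising instance names instead of learning from the sequence.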







