molecular-biology_promoters

From MaRDI portal
Dataset:6032977



OpenML ID: 164, MaRDI QID: Q6032977

OpenML dataset with id 164


Full work available at URL: https://api.openml.org/data/v1/download/3585/molecular-biology_promoters.arff

Upload date: 23 April 2014


Dataset Characteristics

Number of classes: 2
Number of features: 58 (numeric: 0, symbolic: 58, of which binary: 1)
Number of instances: 106
Number of instances with missing values: 0
Number of missing values: 0

Author: C. Harley, R. Reynolds, M. Noordewier, J. Shavlik
Source: UCI, 1990
Please cite: UCI

E. coli promoter gene sequences (DNA): a compilation of promoters with known transcriptional start points for E. coli genes. The task is to recognize promoters in strings that represent nucleotides (one of A, G, T, or C). A promoter is a genetic region that initiates the first step in the expression of an adjacent gene (transcription).

The input features are 57 sequential DNA nucleotides. The dataset contains 53 sample promoters and 53 non-promoter sequences. The 53 sample promoters were obtained from a compilation produced by Hawley and McClure (1983). The negative training examples were derived by selecting contiguous substrings from a 1.5 kilobase sequence provided by Prof. T. Record of the Univ. of Wisconsin’s Chemistry Dept. This sequence is a fragment from E. coli bacteriophage T7 isolated with the restriction enzyme HaeIII. Because the fragment does not bind RNA polymerase, it is believed not to contain any promoter sites.
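As an illustration of how negative examples of this kind can be cut from a longer sequence, the following is a minimal sketch only. The description does not say how the substring start positions were chosen, so the random choice of offsets here is an assumption, and the 1500-nucleotide string is a randomly generated placeholder, not the actual T7 fragment. The helper name `contiguous_windows` is hypothetical.

```python
import random

WINDOW = 57  # nucleotides per instance, as stated in the description


def contiguous_windows(sequence, starts, width=WINDOW):
    """Return (start, substring) pairs for every start offset that fits."""
    windows = []
    for start in starts:
        if 0 <= start and start + width <= len(sequence):
            windows.append((start, sequence[start:start + width]))
    return windows


# Placeholder sequence of 1500 nucleotides (NOT the real T7 fragment).
rng = random.Random(0)
fragment = "".join(rng.choice("agtc") for _ in range(1500))

# Assumption: 53 distinct start offsets picked at random; the original
# selection procedure is not described in the source.
starts = rng.sample(range(len(fragment) - WINDOW + 1), 53)
negatives = contiguous_windows(fragment, starts)

print(len(negatives), "windows of length", len(negatives[0][1]))
```

Keeping the start offset alongside each substring mirrors the way non-promoter instances are named by their position in the 1500-long sequence.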

This dataset has been developed to help evaluate a "hybrid" learning algorithm ("KBANN") that uses examples to inductively refine preexisting knowledge.

Attribute Description

  • 1. One of {+/-}, indicating the class ("+" = promoter).
  • 2. The instance name (non-promoters named by position in the 1500-long nucleotide sequence provided by T. Record).
  • 3-59. The remaining 57 fields are the sequence, starting at position -50 (p-50) and ending at position +7 (p7). Each of these fields is filled by one of {a, g, t, c}.
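The field layout above can be sketched as a small parser. This is an illustration under stated assumptions rather than the dataset's official loader: it assumes a record arrives as a list of 59 fields in the order listed, and it assumes the position labels run from p-50 to p7 with no position 0, which is the reading under which the range contains exactly 57 positions (50 upstream plus 7 downstream). The function names and the example record are made up.

```python
NUCLEOTIDES = {"a", "g", "t", "c"}


def position_labels(first=-50, last=7):
    """Labels p-50 ... p-1, p1 ... p7; skipping 0 is an assumption."""
    return [f"p{i}" for i in range(first, last + 1) if i != 0]


def parse_record(fields):
    """Split 59 fields into (class, instance name, {position: nucleotide})."""
    if len(fields) != 59:
        raise ValueError(f"expected 59 fields, got {len(fields)}")
    label, name, sequence = fields[0], fields[1], fields[2:]
    if label not in {"+", "-"}:
        raise ValueError(f"unknown class {label!r}")
    sequence = [base.lower() for base in sequence]
    if not set(sequence) <= NUCLEOTIDES:
        raise ValueError("sequence contains characters outside {a, g, t, c}")
    return label, name, dict(zip(position_labels(), sequence))


# Made-up record for illustration; not taken from the dataset.
example = ["+", "EXAMPLE"] + list("acgt" * 14 + "a")
label, name, by_position = parse_record(example)
print(label, name, by_position["p-50"], by_position["p7"])
```

Returning the sequence as a mapping keyed by position label keeps each nucleotide tied to its offset relative to the transcriptional start point.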

Relevant papers

  • Harley, C. and Reynolds, R. 1987. "Analysis of E. Coli Promoter Sequences." Nucleic Acids Research, 15:2343-2361.
  • Towell, G., Shavlik, J. and Noordewier, M. 1990. "Refinement of Approximate Domain Theories by Knowledge-Based Artificial Neural Networks." In Proceedings of the Eighth National Conference on Artificial Intelligence (AAAI-90).