ecoli (Q6032894)

OpenML dataset with id 39

Language	Label	Description	Also known as
English	ecoli	OpenML dataset with id 39

Statements

instance of

data set

dataset version identifier

1

description

**Author**:
**Source**: Unknown
**Please cite**:

1. Title: Protein Localization Sites

2. Creator and Maintainer:
   Kenta Nakai
   Institute of Molecular and Cellular Biology
   Osaka University
   1-3 Yamada-oka, Suita 565 Japan
   nakai@imcb.osaka-u.ac.jp
   http://www.imcb.osaka-u.ac.jp/nakai/psort.html
   Donor: Paul Horton (paulh@cs.berkeley.edu)
   Date: September, 1996
   See also: yeast database

3. Past Usage.
   Reference: "A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins", Paul Horton & Kenta Nakai, Intelligent Systems in Molecular Biology, 109-115. St. Louis, USA 1996.
   Results: 81% for E.coli with an ad hoc structured probability model. Also similar accuracy for Binary Decision Tree and Bayesian Classifier methods applied by the same authors in unpublished results.

   Predicted Attribute: Localization site of protein (non-numeric).

4. The references below describe a predecessor to this dataset and its development. They also give results (not cross-validated) for classification by a rule-based expert system with that version of the dataset.

   Reference: "Expert System for Predicting Protein Localization Sites in Gram-Negative Bacteria", Kenta Nakai & Minoru Kanehisa, PROTEINS: Structure, Function, and Genetics 11:95-110, 1991.

   Reference: "A Knowledge Base for Predicting Protein Localization Sites in Eukaryotic Cells", Kenta Nakai & Minoru Kanehisa, Genomics 14:897-911, 1992.

5. Number of Instances: 336 for the E.coli dataset

6. Number of Attributes: 8 for the E.coli dataset (7 predictive, 1 name)

7. Attribute Information.
   1. Sequence Name: Accession number for the SWISS-PROT database.
   2. mcg: McGeoch's method for signal sequence recognition.
   3. gvh: von Heijne's method for signal sequence recognition.
   4. lip: von Heijne's Signal Peptidase II consensus sequence score. Binary attribute.
   5. chg: Presence of charge on N-terminus of predicted lipoproteins. Binary attribute.
   6. aac: score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins.
   7. alm1: score of the ALOM membrane spanning region prediction program.
   8. alm2: score of ALOM program after excluding putative cleavable signal regions from the sequence.

   NOTE - the sequence name has been removed

8. Missing Attribute Values: None.

9. Class Distribution. The class is the localization site. Please see Nakai & Kanehisa referenced above for more details.

   cp   (cytoplasm)                                     143
   im   (inner membrane without signal sequence)         77
   pp   (periplasm)                                      52
   imU  (inner membrane, uncleavable signal sequence)    35
   om   (outer membrane)                                 20
   omL  (outer membrane lipoprotein)                      5
   imL  (inner membrane lipoprotein)                      2
   imS  (inner membrane, cleavable signal sequence)       2
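The class distribution above can be sanity-checked against the stated instance count, and it also gives a simple reference point for the 81% accuracy reported under Past Usage. The following is a minimal sketch, not part of the original dataset documentation: the counts are copied by hand from the list above, and the names `CLASS_COUNTS` and `majority_baseline` are illustrative only.

```python
# Class counts copied by hand from the class distribution in the description.
CLASS_COUNTS = {
    "cp": 143,   # cytoplasm
    "im": 77,    # inner membrane without signal sequence
    "pp": 52,    # periplasm
    "imU": 35,   # inner membrane, uncleavable signal sequence
    "om": 20,    # outer membrane
    "omL": 5,    # outer membrane lipoprotein
    "imL": 2,    # inner membrane lipoprotein
    "imS": 2,    # inner membrane, cleavable signal sequence
}


def majority_baseline(counts):
    """Return (label, accuracy) for always predicting the most frequent class."""
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / total


total_instances = sum(CLASS_COUNTS.values())
label, accuracy = majority_baseline(CLASS_COUNTS)

print(f"instances: {total_instances}")
print(f"majority class: {label}, baseline accuracy: {accuracy:.1%}")
```

The counts sum to 336, matching the stated number of instances. Always predicting `cp` would be correct for 143 of 336 proteins, roughly 42.6%. Note also how imbalanced the classes are: `imL` and `imS` have only two instances each, which matters for any stratified evaluation.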

Kenta Nakai

1996-09-01

upload date

6 April 2014
