PFAM Protein Families Dataset for Machine Learning




DOI: 10.5281/zenodo.8167436
Zenodo: 8167436
MaRDI QID: Q6691214
FDO: Q6691214

Dataset published at Zenodo repository.

Andreas Dominik

Publication date: 10 July 2023

Copyright license: Creative Commons Attribution 4.0 International



A cleaned dataset of protein sequences and protein families for classification. The dataset was exported from PFAM as of June 2023 and curated to have the following characteristics:

- only protein families with at least 100 sequences are included
- families with more than 2000 sequences are truncated and represented by only 2000 sequences (chosen randomly)
- only proteins with sequence lengths between 100 and 1000 amino acids are included
- sequences are from PDB; chains are concatenated only if they are not similar

The dataset is not balanced. The numbers of sequences per family in PFAM and in the dataset are:

families: 62, sequences: 46872

total (in PFAM) - included (in dataset)

Number in family ALLERGEN: 122 - 122
Number in family APOPTOSIS: 381 - 381
Number in family BIOSYNTHETIC PROTEIN: 346 - 346
Number in family BIOTIN BINDING PROTEIN: 165 - 165
Number in family BLOOD CLOTTING: 138 - 138
Number in family CALCIUM BINDING PROTEIN: 135 - 135
Number in family CELL ADHESION: 1116 - 1116
Number in family CELL CYCLE: 511 - 511
Number in family CHAPERONE: 964 - 964
Number in family CONTRACTILE PROTEIN: 158 - 158
Number in family CYTOKINE: 191 - 191
Number in family DE NOVO PROTEIN: 253 - 253
Number in family DNA BINDING PROTEIN: 1008 - 1008
Number in family ELECTRON TRANSPORT: 841 - 841
Number in family FLUORESCENT PROTEIN: 348 - 348
Number in family GENE REGULATION: 607 - 607
Number in family HORMONE: 272 - 272
Number in family HORMONE GROWTH FACTOR: 159 - 159
Number in family HORMONE RECEPTOR: 121 - 121
Number in family HYDROLASE: 19551 - 2000
Number in family HYDROLASE ANTIBIOTIC: 120 - 120
Number in family HYDROLASE HYDROLASE INHIBITOR: 2890 - 2000
Number in family HYDROLASE INHIBITOR: 315 - 315
Number in family IMMUNE SYSTEM: 3333 - 2000
Number in family IMMUNOGLOBULIN: 155 - 155
Number in family ISOMERASE: 2457 - 2000
Number in family ISOMERASE ISOMERASE INHIBITOR: 139 - 139
Number in family LECTIN: 139 - 139
Number in family LIGASE: 1780 - 1780
Number in family LIGASE LIGASE INHIBITOR: 163 - 163
Number in family LIPID BINDING PROTEIN: 421 - 421
Number in family LIPID TRANSPORT: 115 - 115
Number in family LUMINESCENT PROTEIN: 221 - 221
Number in family LYASE: 4150 - 2000
Number in family LYASE LYASE INHIBITOR: 298 - 298
Number in family MEMBRANE PROTEIN: 1338 - 1338
Number in family METAL BINDING PROTEIN: 951 - 951
Number in family METAL TRANSPORT: 409 - 409
Number in family MOTOR PROTEIN: 195 - 195
Number in family OXIDOREDUCTASE: 11531 - 2000
Number in family OXIDOREDUCTASE OXIDOREDUCTASE INHIBITOR: 766 - 766
Number in family OXYGEN STORAGE: 127 - 127
Number in family OXYGEN STORAGE TRANSPORT: 260 - 260
Number in family OXYGEN TRANSPORT: 414 - 414
Number in family PHOTOSYNTHESIS: 173 - 173
Number in family PLANT PROTEIN: 255 - 255
Number in family PROTEIN BINDING: 1613 - 1613
Number in family PROTEIN TRANSPORT: 693 - 693
Number in family RECEPTOR: 108 - 108
Number in family REPLICATION: 161 - 161
Number in family RNA BINDING PROTEIN: 546 - 546
Number in family SIGNALING PROTEIN: 2312 - 2000
Number in family STRUCTURAL PROTEIN: 869 - 869
Number in family SUGAR BINDING PROTEIN: 1250 - 1250
Number in family TOXIN: 546 - 546
Number in family TRANSCRIPTION REGULATION: 3283 - 2000
Number in family TRANSFERASE: 14724 - 2000
Number in family TRANSFERASE INHIBITOR: 126 - 126
Number in family TRANSFERASE TRANSFERASE INHIBITOR: 2465 - 2000
Number in family TRANSLATION: 370 - 370
Number in family TRANSPORT PROTEIN: 2782 - 2000
Number in family VIRAL PROTEIN: 2150 - 2000

Files:

families.csv: list of protein families with frequencies

pfam_46872x62.csv: full dataset with amino acid sequences as strings (one-letter code)

pfam-trn-xy.csv: training dataset with amino acid sequences as tokens (1..25), padded to a common length of 1000 with padding token 0:

Amino acid | Token | Description
--------------------------------
C | 1 | Cysteine
S | 2 | Serine
T | 3 | Threonine
A | 4 | Alanine
G | 5 | Glycine
P | 6 | Proline
D | 7 | Aspartic acid
E | 8 | Glutamic acid
Q | 9 | Glutamine
N | 10 | Asparagine
H | 11 | Histidine
R | 12 | Arginine
K | 13 | Lysine
M | 14 | Methionine
I | 15 | Isoleucine
L | 16 | Leucine
V | 17 | Valine
W | 18 | Tryptophan
Y | 19 | Tyrosine
F | 20 | Phenylalanine
B | 21 | Aspartic acid or Asparagine
Z | 22 | Glutamic acid or Glutamine
J | 23 | Leucine or Isoleucine
U | 24 | Selenocysteine
X | 25 | Unknown amino acid
. | 0 | padding token

pfam-trn-labels.csv: plain-text labels for training data

pfam-tst-xy.csv, pfam-tst-labels.csv: test data

pfam-balanced-trn-xy.csv, pfam-balanced-trn-labels.csv, pfam-balanced-tst-xy.csv, pfam-balanced-tst-labels.csv: balanced datasets, created by oversampling
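The token encoding described for pfam-trn-xy.csv can be reproduced with a short Python sketch. This is an illustration only, not the author's export code. The names ALPHABET, encode and decode are hypothetical. It also assumes two things the description does not state: that padding is appended at the end of the sequence, and that letters outside the table are mapped to X (25).

```python
# Amino acids in the order of the token table: position + 1 is the token.
ALPHABET = "CSTAGPDEQNHRKMILVWYFBZJUX"
PAD_TOKEN = 0      # padding token from the table
MAX_LEN = 1000     # common sequence length stated in the description

TOKEN_OF = {aa: i for i, aa in enumerate(ALPHABET, start=1)}
AA_OF = {i: aa for aa, i in TOKEN_OF.items()}


def encode(sequence, max_len=MAX_LEN):
    """Map a one-letter sequence to tokens 1..25 and pad with 0.

    Assumptions: padding goes at the end (right-padding), and any letter
    not in the table is treated as X (unknown amino acid).
    """
    if len(sequence) > max_len:
        raise ValueError(f"sequence longer than {max_len} residues")
    tokens = [TOKEN_OF.get(aa, TOKEN_OF["X"]) for aa in sequence.upper()]
    return tokens + [PAD_TOKEN] * (max_len - len(tokens))


def decode(tokens):
    """Invert encode(): drop padding and map tokens back to letters."""
    return "".join(AA_OF[t] for t in tokens if t != PAD_TOKEN)
```

For example, encode("MKV") returns a list of length 1000 that starts with 14, 13, 17 and is followed by zeros, and decode(encode("MKV")) returns "MKV". If the files turn out to be left-padded, only the concatenation order in encode needs to change.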






