PreTIS2 Positive and Negative Training Data (Q6697258)

From MaRDI portal





Dataset published at Zenodo repository.

      Statements

      We retrieved experimental data on translation initiation sites in HEK293T cells from Table S2 of (https://doi.org/10.1093/nar/gkab549); the authors of that study determined these sites using a protocol termed TISCA. Because we wanted to focus on non-canonical translation, we filtered the data and kept only the truncation, extension, uORF, and overlap.uORF initiation types that involve either ATG or one of the nine near-cognate start codons.

      Using the translated amino acid sequences and transcript version IDs provided by (https://doi.org/10.1093/nar/gkab549), we fetched the corresponding 5'UTR and cDNA nucleotide sequences from Ensembl BioMart using human GENCODE GRCh38.p13. We then scanned each cDNA sequence in all three reading frames and identified the longest matching nucleotide sequence that encodes the respective amino acid chain. For extension, uORF, and overlap.uORF initiation sites, the start site had to lie inside the 5'UTR; for truncation initiation sites, it had to lie inside the coding sequence.

      We also needed to construct negative cases in which translation initiation presumably does not occur. To this end, we scanned the 5'UTR sequences for all 10 possible start codons. As negatives, we considered those start codons that were not detected by TISCA but have enough upstream nucleotides to meet the feature criteria. This resulted in a strong class imbalance, with 17 times as many negative as positive samples.

      As features, we used the identity of the nucleotides at the 20 positions upstream of a putative start codon, denoted "U", and at the 20 positions downstream, denoted "D". Because each start codon may have its own optimal sequence context, we additionally included the identity of the start codon itself (ATG, CTG, GTG, TTG, AAG, ACG, AGG, ATA, ATC, or ATT).

      The positive and negative datasets can be found below. Each row is a possible translation initiation site; the columns give the flanking nucleotide at each position and the start codon used.
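
      The feature layout and the negative-site scan described above could be sketched roughly as follows. This is an illustrative sketch, not the authors' pipeline. The function names, the column names `U1`..`U20` and `D1`..`D20`, and the 0-based coordinates are all assumptions. It also assumes that `U1` and `D1` are the positions closest to the codon, and that a candidate needs full context on both sides (the description only mentions upstream nucleotides).

```python
from typing import Dict, List, Optional, Set

# The ten start codons named in the description (ATG plus nine near-cognates).
START_CODONS = ("ATG", "CTG", "GTG", "TTG", "AAG", "ACG", "AGG", "ATA", "ATC", "ATT")
FLANK = 20  # nucleotides considered on each side of the start codon


def context_features(seq: str, pos: int, flank: int = FLANK) -> Optional[Dict[str, str]]:
    """Return one feature row for the codon starting at 0-based index `pos`.

    Returns None if the triplet is not one of the ten start codons or if
    fewer than `flank` nucleotides are available on either side.
    """
    codon = seq[pos:pos + 3]
    if codon not in START_CODONS:
        return None
    if pos < flank or pos + 3 + flank > len(seq):
        return None
    upstream = seq[pos - flank:pos]
    downstream = seq[pos + 3:pos + 3 + flank]
    row = {"codon": codon}
    for i in range(1, flank + 1):
        # Assumed numbering: U1 / D1 are the positions adjacent to the codon.
        row[f"U{i}"] = upstream[-i]
        row[f"D{i}"] = downstream[i - 1]
    return row


def negative_candidates(cdna: str, utr5_len: int, detected: Set[int],
                        flank: int = FLANK) -> List[int]:
    """List 5'UTR start-codon positions that were not experimentally detected.

    `detected` holds 0-based start positions of sites reported as positive
    (hypothetical input format). Only codons lying entirely within the
    5'UTR and having full flanking context are returned.
    """
    negatives = []
    for pos in range(max(utr5_len - 2, 0)):
        if pos in detected:
            continue
        if context_features(cdna, pos, flank) is not None:
            negatives.append(pos)
    return negatives
```

      Each returned row then corresponds to one line of a table with one column per flanking position plus the start codon, mirroring the layout described for the published files.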
      12 June 2024
