PreTIS2 Positive and Negative Training Data (Q6697258)

Dataset published at Zenodo repository.

Language	Label	Description	Also known as
default for all languages	No label defined
English	PreTIS2 Positive and Negative Training Data	Dataset published at Zenodo repository.

Statements

instance of

data set

0 references

description

We retrieved experimental data on translation initiation sites in HEK293T cells from Table S2 of (https://doi.org/10.1093/nar/gkab549); those authors determined the sites using a protocol termed TISCA. Because we wanted to focus on non-canonical translation, we filtered the data and kept only the truncation, extension, uORF, and overlap.uORF translation initiation types that involve either ATG or one of the nine near-cognate start codons. Using the translated amino acid sequences and transcript version IDs provided by (https://doi.org/10.1093/nar/gkab549), we fetched the corresponding 5'UTR and cDNA nucleotide sequences from Ensembl BioMart using human GENCODE GRCh38.p13. We then scanned each cDNA sequence in all three reading frames and identified the longest matching nucleotide sequence that encodes the respective amino acid chain. For extension, uORF, and overlap.uORF initiation sites, the start site had to lie inside the 5'UTR. For truncation initiation sites, the start site had to lie inside the coding sequence. We also needed to construct negative cases, where translation initiation supposedly does not occur. To this end, we scanned the 5'UTR sequences for all 10 possible start codons. As negatives, we then took those start codons that were not detected by TISCA but have enough nucleotides upstream to meet the feature criteria. This resulted in a strong class imbalance, with 17 times as many negative as positive samples. As features, we used the identity of the nucleotide at each of the 20 positions upstream of a putative start codon, represented as "U", and at each of the 20 positions downstream, represented as "D". Because each start codon may have its own optimal sequence context, we additionally included the identity of the start codon itself (ATG, CTG, GTG, TTG, AAG, ACG, AGG, ATA, ATC, or ATT). The positive and negative datasets can be found below.
In each dataset, every row is a possible translation initiation site, and the columns give the flanking nucleotide at each position as well as the start codon used.
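The candidate scan and the feature layout described above can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the column names (U1..U20, D1..D20, codon), the numbering of positions by distance from the start codon, and the requirement of a full 20 nt downstream window are all assumptions made for illustration. The sketch also assumes that the cDNA string begins with the 5'UTR, so that a position in the 5'UTR is also a position in the cDNA.

```python
# ATG plus the nine near-cognate start codons listed in the description.
START_CODONS = ("ATG", "CTG", "GTG", "TTG", "AAG",
                "ACG", "AGG", "ATA", "ATC", "ATT")
FLANK = 20  # nucleotides taken on each side of a putative start codon


def candidate_sites(utr5, cdna, flank=FLANK):
    """Return (position, codon) for each start codon lying fully inside the
    5'UTR that has `flank` nt upstream and (assumption) `flank` nt downstream."""
    utr5, cdna = utr5.upper(), cdna.upper()
    sites = []
    # Starting at `flank` guarantees enough upstream nucleotides.
    for pos in range(flank, len(utr5) - 2):
        codon = utr5[pos:pos + 3]
        if codon in START_CODONS and pos + 3 + flank <= len(cdna):
            sites.append((pos, codon))
    return sites


def negative_sites(utr5, cdna, positive_positions, flank=FLANK):
    """Candidates whose position is not among the experimentally detected ones."""
    return [(pos, codon) for pos, codon in candidate_sites(utr5, cdna, flank)
            if pos not in positive_positions]


def site_features(cdna, pos, flank=FLANK):
    """Build one feature row: flanking nucleotides plus the start codon.
    Hypothetical naming: U1/D1 are the positions adjacent to the codon."""
    cdna = cdna.upper()
    upstream = cdna[pos - flank:pos]
    downstream = cdna[pos + 3:pos + 3 + flank]
    row = {f"U{i}": nt for i, nt in enumerate(reversed(upstream), start=1)}
    row.update({f"D{i}": nt for i, nt in enumerate(downstream, start=1)})
    row["codon"] = cdna[pos:pos + 3]
    return row
```

Calling `negative_sites` with the set of detected start positions for a transcript, then `site_features` on each remaining position, yields one row per negative sample with 41 columns under these assumptions.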

0 references

publication date

12 June 2024

0 references

Creative Commons Attribution 4.0 International

0 references