MetaTIS: A tool to predict eukaryotic translation initiation sites (Q6697265)

From MaRDI portal





Dataset published at Zenodo repository.

      Statements

      As the ribosome moves along an mRNA, it typically commences translation at a methionine-encoding AUG codon flanked by a so-called Kozak region, a short nucleic acid motif serving as an initiation site in many eukaryotes. However, the characteristic AUG start codon of an mRNA is not always effective in initiating translation. In rarer cases, near-cognate codon sequences may also be recognized as start sites. Ribosome profiling techniques, which stall translating mRNA-ribosome complexes by chemical treatment, are able to elucidate active translation start sites. Historically, several bioinformatics classifiers have been trained on translation start site data, gleaned from ribosome profiling, to predict putative translation initiation sites from mRNA sequence features. A stacking approach was formulated for the MetaTIS tool that can differentiate spurious and true translation initiation sites. The tool was trained on experimental data for translation initiation in HEK293 cells produced by the TISCA protocol, a method allowing for accurate translation initiation site identification. Our classifier delivers a ROC-AUC of 0.93 on its own test set, as well as on multiple external validation sets. Moreover, it was able to almost quantitatively predict whether overlapping open reading frames suppress translation from the main ORF for 11 genes in HeLa cells, as validated by experimental luciferase assays. The MetaTIS tool is publicly available as a web server.
      The FlanksERF, KmersERF, and MetaTIS models with their training data can be found below. For information on how these models are used, please refer to GitHub.
      The datasets are composed of 229 columns. The first 168 represent the upstream (U) and downstream (D) k-mers of sizes 1 to 3. These are followed by the start codon used and the normalized Noderer et al. efficiency values based on the flanking region. Next come 40 columns representing the 20 upstream (U) and 20 downstream (D) nucleotides with respect to the initiation site. The final 19 features represent the relative binding scores of the 9 RNA-binding proteins (RBPs) considered. Note that some RBPs have multiple binding motifs.
      FlanksERF and KmersERF are each composed of 40 random forest classifiers, which are stored as a dictionary. scikit-learn version 1.5.1 was used to create these models. The DownstreamNegatives and Positives datasets were used to train the KmersERF model, while the UpstreamNegatives and Positives datasets were used to train the FlanksERF model.
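The column layout described above can be checked with a short sketch. It is illustrative only: the column names, the column order within each group, the DNA alphabet, and the use of raw overlapping counts for the k-mer features are all assumptions, since the record does not specify them. That the start codon and the Noderer et al. efficiency each occupy a single column is inferred from the arithmetic (168 + 1 + 1 + 40 + 19 = 229).

```python
from itertools import product

NUCLEOTIDES = "ACGT"  # assumption: DNA alphabet; the real files may use U instead of T


def kmers(k):
    """Return all k-mers over the alphabet in lexicographic order."""
    return ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]


def count_kmers(seq, k):
    """Count overlapping k-mers of size k in seq (raw counts are an assumption)."""
    counts = {km: 0 for km in kmers(k)}
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # windows with ambiguous bases such as N are skipped
            counts[window] += 1
    return counts


# Hypothetical column names; only the group sizes come from the description.
kmer_columns = [
    f"{side}_{km}"
    for side in ("U", "D")   # upstream and downstream
    for k in (1, 2, 3)       # k-mer sizes 1 to 3
    for km in kmers(k)
]
flank_columns = [f"{side}{i}" for side in ("U", "D") for i in range(1, 21)]

layout = {
    "kmer_features": len(kmer_columns),       # 2 * (4 + 16 + 64)
    "start_codon": 1,                         # inferred single column
    "noderer_efficiency": 1,                  # inferred single column
    "flank_nucleotides": len(flank_columns),  # 20 upstream + 20 downstream
    "rbp_scores": 19,                         # 9 RBPs, some with several motifs
}
total_columns = sum(layout.values())

if __name__ == "__main__":
    for name, width in layout.items():
        print(f"{name}: {width}")
    print("total:", total_columns)
```

Enumerating the k-mers makes the figure of 168 easy to verify: sizes 1 to 3 give 4 + 16 + 64 = 84 features per side, doubled for the upstream and downstream regions.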
      5 February 2025
