1,500 simulated transcriptomic variants for MINTIE paper
DOI: 10.5281/zenodo.4876713; Zenodo: 4876713; MaRDI QID: Q6683929; FDO: Q6683929
Dataset published in the Zenodo repository.
Breon Schmidt, Ian Majewski, Marek Cmero, Paul Ekert, Alicia Oshlack, Nadia Davidson
Publication date: 14 May 2020
Copyright license: Creative Commons Attribution 4.0 International
Contains an RNA-seq data set of 1,500 simulated heterozygous transcriptomic variants (500 fusions, 500 splice variants and 500 transcribed structural variants, or TSVs) used in the MINTIE paper. An additional 100 unmodified background genes were also added. The control set contains unmodified sequences of all variant genes included in the case sample. Variant information and paired-end reads, as well as the FASTA files from which they were generated, are provided. Code used to generate these samples can be found under https://github.com/Oshlack/MINTIE/tree/master/simu.
Simulations were generated by extracting sequence from the transcripts listed in the hg38 UCSC RefSeq reference and simulating reads from the resulting sequence. 100 variants were generated for each of 15 variant types: five fusion types (canonical, extended exon, novel exon, with insertion and unpartnered), five TSV types (insertions, deletions, ITDs, PTDs and inversions) and five novel splice variant types (extended exons, novel exons, truncated exons, skipped exons and retained introns). Only transcripts from genes that did not overlap any other genes were used in the simulation. Additionally, each transcript had to have at least 3 exons to be considered as a simulation transcript.
All fusions were simulated by selecting the first two and the last two exons from two random transcripts from different genes, and inserting any intervening sequence between them. Canonical fusions contained no intervening sequence, while fusions with extended exons inserted 30-200bp of intronic sequence from the end of the second exon of the first transcript. Similarly, fusions with novel exons contained an intronic sequence of 30-200bp located 30-200bp downstream. Non-canonical fusions with insertions were generated by inserting 7-50bp of randomly generated sequence between the two fusion transcripts.
Small TSVs were generated by inserting, duplicating or deleting sequence within randomly selected exons from randomly selected transcripts.
These small variant types were between 7 and 50bp long and had to reside at least 10bp within the exon. Inversions and partial-tandem duplications were generated by selecting 1-3 random exons within a transcript and either inverting or duplicating their sequence in tandem.
Lastly, splice variants were generated by extending a randomly selected exon or placing a novel exon downstream of it. To ensure that novel or extended exons did not overlap exons from other transcripts (or downstream exons of the same transcript), each candidate exon was checked for these potential overlaps, which would otherwise obscure the variant or create the wrong variant type. Novel junction variants were created by selecting a random pair of exons and checking whether a junction already existed between them (a transcript with this junction was created if not). For truncated exons, two randomly selected neighbouring exons were both truncated at their facing ends (end and start respectively) by 30-200bp. Retained introns included a random intronic sequence from a given transcript that was 30bp. The presence of correct splicing motifs was not considered for the simulation.
In addition to each variant gene, the sequence of the unaltered wild-type gene was added to the simulated case sample's reference. An additional 100 unaltered background genes were also added to the case sample. A control sample reference was also generated, which included only the unaltered wild-type sequence for all simulated transcripts.
ART-illumina (doi:10.1093/bioinformatics/btr708) v2.5.8 was run on the corresponding references to simulate 100bp paired-end reads with a fragment size of 300bp and a coverage of 50 (transcripts should thus have an effective coverage of 100, given the bi-allelic reference containing variant and wild-type transcripts). We also include three down-sampled versions of the simulation files used in the MINTIE paper (40x, 20x and 10x; note that the variant coverage will be half the sequence coverage).
The down-sampled files were produced using seqtk v1.0 (https://github.com/lh3/seqtk).
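The coverage figures above can be checked with a short calculation. The following is a minimal sketch, not the authors' pipeline. It assumes that the 40x, 20x and 10x labels refer to sequence coverage and that the files were sampled from the full data set at an effective coverage of 100. The file names and the seed are hypothetical, and the exact seqtk invocation used for this data set is not stated in the description; the command strings only illustrate the general `seqtk sample -s<seed> <in.fq> <fraction>` form.

```python
"""Sketch: sampling fractions for the down-sampled simulation files."""

ART_COVERAGE = 50  # coverage passed to ART per reference sequence (from the description)
ALLELES = 2        # variant and wild-type transcript are both in the case reference


def downsample_plan(targets, art_coverage=ART_COVERAGE, alleles=ALLELES):
    """Return sequence coverage, variant coverage and read fraction per target.

    Assumption: each target is a sequence coverage relative to the full
    effective coverage (art_coverage * alleles = 100 with the defaults).
    """
    full = art_coverage * alleles
    plan = []
    for target in targets:
        if not 0 < target <= full:
            raise ValueError(f"target {target}x must be in (0, {full}]")
        plan.append({
            "sequence_coverage": target,
            # Heterozygous variants receive half of the sequence coverage.
            "variant_coverage": target / alleles,
            # Fraction of read pairs to keep when sub-sampling.
            "fraction": target / full,
        })
    return plan


def seqtk_commands(fraction, seed=42, r1="case_R1.fastq", r2="case_R2.fastq"):
    """Build illustrative command strings (hypothetical file names).

    Using the same seed for both mate files keeps read pairs in sync.
    """
    return [f"seqtk sample -s{seed} {fq} {fraction} > sub_{fq}" for fq in (r1, r2)]


if __name__ == "__main__":
    for row in downsample_plan([40, 20, 10]):
        print(row)
        for cmd in seqtk_commands(row["fraction"]):
            print("  ", cmd)
```

Under these assumptions, the 40x, 20x and 10x files correspond to variant coverages of 20x, 10x and 5x. The commands are only built as strings and never executed, so the sketch runs without seqtk installed.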