Enhanced Protein Isoform Characterization Through Long-Read Proteogenomics - Workflow Results

DOI 10.5281/zenodo.5987905 (Zenodo)

MaRDI QID Q6724604

Authors Madison Mehlferber, Rachel Miller, Simran Kaur, Ben Jordan, Ana Conesa, Erin Jeffery, Michael Shortreed, Lloyd Smith, Anne Deslattes Mays, Christina Chatzipantsiou, Simone Tiberi, Gloria Sheynkman, Robert Millikin

Publication date 30 January 2022

Copyright license Creative Commons Attribution 4.0 International

Description (long)

The detection of physiologically relevant protein isoforms encoded by the human genome is critical to biomedicine. Mass spectrometry (MS)-based proteomics is the preeminent method for protein detection, but isoform-resolved proteomic analysis relies on accurate reference databases that match the sample; neither a subset nor a superset database is ideal. Long-read RNA sequencing (e.g. PacBio, Oxford Nanopore) provides full-length transcript sequencing, which can be used to predict full-length proteins. Here, we describe a long-read proteogenomics approach for integrating matched long-read RNA-seq and MS-based proteomics data to enhance isoform characterization. We introduce a classification scheme for protein isoforms, discover novel protein isoforms, and present the first protein inference algorithm for the direct incorporation of long-read transcriptome data in protein inference to enable detection of protein isoforms that are intractable to MS detection. We have released an open-source Nextflow pipeline that integrates long-read sequencing in a proteomic workflow for isoform-resolved analysis.

Companion Repositories:
- Long-Read-Proteogenomics Workflow GitHub Repository Release
- Long-Read-Proteogenomics Analysis GitHub Repository Release

Companion Datasets:
- Long-Read-Proteogenomics Workflow Sample and Reference Data
- TEST Data for Long-Read-Proteogenomics Workflow GitHub Actions

This repository contains the complete output from the execution of the Long-Read-Proteogenomics Workflow, using the input from Jurkat Samples and Reference Data.

The file jurkat.flnc.bam was 6.5 GB and had to be split into 13 separate files; it should be rejoined before use. The following steps were used to split the file:

1. Convert jurkat.flnc.bam (binary format) to a SAM file (text format) without the header: samtools view jurkat.flnc.bam > jurkat.flnc.sam
2. Capture the header: samtools view -H jurkat.flnc.bam > jurkat.flnc.header.sam
3. Split jurkat.flnc.sam into smaller files (aiming for a final size under 2 GB): split -l 400000 jurkat.flnc.sam jurkat.flnc.chunk.
4. Convert each of these files back to BAM for uploading: samtools view -b jurkat.flnc.chunk.a* -o jurkat.flnc.chunk.a*.bam (*=a,b,c,d,e,f,g,h,i,j,k,l,m)

After downloading, reverse this process, using the header file found in LRPG-Manuscript-Results-results-results-jurkat-isoseq3-companion-files.tar.gz:

1. Convert the BAM files back to SAM files: samtools view jurkat.flnc.chunk.a*.bam > jurkat.flnc.chunk.a*.sam (*=a,b,c,d,e,f,g,h,i,j,k,l,m)
2. Combine the header together with the SAM files: cat jurkat.flnc.chunk.a*.sam > jurkat.flnc.sam (verified that the number of lines in the combined SAM files is identical to the number of lines of the original without header: 4,956,761; the header file is 13 lines)
3. Convert to a BAM file if desired: samtools view -b jurkat.flnc.sam -o jurkat.flnc.bam
4. Rehead with the header file: samtools reheader -P -i jurkat.flnc.header.sam jurkat.flnc.bam
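
The rejoin procedure described above can be scripted. The following is a minimal sketch, not the authors' script. It assumes the 13 chunks are named jurkat.flnc.chunk.aa.bam through jurkat.flnc.chunk.am.bam, that jurkat.flnc.header.sam has already been extracted from the companion tar.gz into the same directory, and that samtools is on the PATH. The function names chunk_suffixes and rejoin_flnc are hypothetical. As a simplification, the sketch streams the header text ahead of the records into a single BAM conversion instead of converting first and running samtools reheader afterwards, and it checks the result against the record count stated in the description.

```shell
#!/usr/bin/env bash
# Sketch only: function names and file locations are assumptions,
# not part of the published dataset.

# Print the 13 two-letter suffixes (aa..am) that `split` gave the chunks.
chunk_suffixes() {
  local c
  for c in a b c d e f g h i j k l m; do
    printf 'a%s\n' "$c"
  done
}

# Rebuild jurkat.flnc.bam from the chunk BAMs plus the saved header.
# The body runs in a subshell so `set -e` does not leak into the caller.
rejoin_flnc() (
  set -euo pipefail
  prefix="jurkat.flnc"
  header="${prefix}.header.sam"   # assumed extracted from the companion tar.gz
  expected=4956761                # record count reported in the description

  {
    # Header lines first, then the records of every chunk in order.
    cat "$header"
    for s in $(chunk_suffixes); do
      # Without -h, samtools view prints the records only.
      samtools view "${prefix}.chunk.${s}.bam"
    done
  } | samtools view -b -o "${prefix}.bam" -

  # Compare the rebuilt file against the documented record count.
  actual=$(samtools view -c "${prefix}.bam")
  if [ "$actual" -ne "$expected" ]; then
    echo "record count mismatch: got ${actual}, expected ${expected}" >&2
    exit 1
  fi
  echo "rebuilt ${prefix}.bam with ${actual} records"
)
```

To use it, source the file in the directory that holds the chunk BAMs and the header file, then run rejoin_flnc. Streaming avoids writing the intermediate SAM files, which together are larger than the original 6.5 GB BAM.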
