Data mining antibody sequences for database searching in bottom-up proteomics
DOI: 10.5281/zenodo.11045596 | Zenodo: 11045596 | MaRDI QID: Q6705708 | FDO: Q6705708
Dataset published at Zenodo repository.
Xuan-Tung Trinh, Veit Schwämmle, Konrad Krawczyk, Rebecca Freitag
Publication date: 22 April 2024
Copyright license: Creative Commons Attribution 4.0 International
Mass spectrometry (MS)-based proteomics is a powerful method for identifying and quantifying antibodies. Among the various MS approaches, bottom-up proteomics is especially effective for analyzing thousands of antibodies in complex mixtures. In this method, proteins are enzymatically digested into smaller peptides, typically with the protease trypsin, and the peptides are then analyzed by mass spectrometry. The resulting peptides are matched to sequences in standard databases such as UniProt or NCBI-RefSeq for identification.
A major limitation of this approach is the absence of comprehensive disease-specific antibody databases. Current databases, such as UniProt, include only a fraction of the antibody sequences present in the human body. For instance, as of January 2024, UniProt contains just 38,800 immunoglobulin sequences, far short of the billions of antibodies the human immune system can produce. Relying on such limited databases can therefore lead to under-detection of antibodies, particularly those associated with specific diseases. Expanding antibody databases with disease-specific sequences is crucial for improving the accuracy of MS-based proteomics in identifying antibodies relevant to human health.
Recently, next-generation sequencing of antibody gene repertoires has made it possible to obtain billions of antibody sequences (in amino acid format) by annotating, translating, and numbering antibody gene sequences. These sequences are now available in public databases such as the Observed Antibody Space. We hypothesize that using these theoretical antibody sequences as new databases for bottom-up proteomics could address the current lack of antibody coverage in standard databases. We developed a workflow to create disease-specific antibody peptide databases for bottom-up proteomics. The workflow details are available on GitHub.
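The tryptic digestion step mentioned above can be illustrated with a minimal in-silico sketch. It is not the authors' workflow code: the function name `tryptic_digest`, the `min_length` parameter, and the choice of no missed cleavages are assumptions made for illustration. The cleavage rule used is the commonly cited one for trypsin (cut after K or R unless the next residue is P).

```python
import re
from typing import List

# Zero-width split point: right after K or R, but not when followed by P.
# This is the commonly used simplified trypsin rule, assumed here for illustration.
_TRYPSIN_SITE = re.compile(r"(?<=[KR])(?!P)")


def tryptic_digest(sequence: str, min_length: int = 1) -> List[str]:
    """Return fully cleaved tryptic peptides (no missed cleavages).

    `min_length` is a hypothetical filter; real workflows usually also
    apply a maximum length and allow missed cleavages.
    """
    peptides = _TRYPSIN_SITE.split(sequence.upper())
    # Drop the empty string produced when the sequence ends in K or R,
    # and any peptide shorter than the requested minimum.
    return [p for p in peptides if p and len(p) >= min_length]


if __name__ == "__main__":
    # Toy sequence, not a real antibody.
    for peptide in tryptic_digest("MKRPAAKGGR"):
        print(peptide)
```

Each peptide produced this way would correspond to one candidate entry in a search database; the actual filtering and deduplication rules of the published workflow are described in its GitHub repository.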
The database and metadata files generated by this workflow are stored in this Zenodo dataset. They are used in DAT-DB, a web application that allows researchers to obtain FASTA files of disease-specific antibody peptides for direct use in bottom-up proteomics (see Demo version).
Each database file in this dataset is in .duckdb format and contains tables with 10 columns: Sequence, Filename, Patient, BSource, BType, Isotype, N_patient, N_antibody, Length_aa, and CDR3. The "Sequence" column contains tryptic peptides. "Filename" is the file from which the data was collected. "Patient" refers to the patient number as listed in metadata2.csv. "BSource" refers to the source of the B-cells, and "BType" refers to the type of B-cells. "Isotype" specifies the antibody isotype (IgA, IgD, IgE, IgG, IgM, or Bulk). "N_patient" indicates the number of patients having this peptide, and "N_antibody" specifies the number of antibodies containing this peptide. "Length_aa" indicates the number of amino acids in the peptide, while "CDR3" shows whether the peptide is found in the CDR3 region.
The file metadata1.csv contains information about each database file, while metadata2.csv provides details about the sources of the collected antibodies.
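As a rough sketch of how one of the .duckdb files might be queried and turned into a FASTA file, the following uses the third-party `duckdb` Python package. Only the column names come from the description above; the table names are not stated here, so the sketch lists them first. The helper names, the FASTA header layout, the `min_patients` threshold, and the assumption that `N_patient` is stored as a number are all hypothetical.

```python
from typing import Iterable, List, Tuple


def to_fasta(rows: Iterable[Tuple[str, str, int]], prefix: str = "pep") -> str:
    """Format (Sequence, Isotype, N_patient) rows as FASTA text.

    The header layout is an illustrative choice, not the DAT-DB format.
    """
    lines: List[str] = []
    for i, (sequence, isotype, n_patient) in enumerate(rows, start=1):
        lines.append(f">{prefix}{i}|isotype={isotype}|n_patient={n_patient}")
        lines.append(sequence)
    return "\n".join(lines) + ("\n" if lines else "")


def list_tables(db_path: str) -> List[str]:
    """Return the table names in a .duckdb file (names are not documented here)."""
    import duckdb  # third-party: pip install duckdb

    con = duckdb.connect(db_path, read_only=True)
    try:
        return [row[0] for row in con.execute("SHOW TABLES").fetchall()]
    finally:
        con.close()


def export_peptides(db_path: str, table: str, min_patients: int = 2) -> str:
    """Select peptides seen in at least `min_patients` patients and return FASTA.

    Assumes N_patient is numeric; `table` should be one of the names
    returned by list_tables(), since it is interpolated into the query.
    """
    import duckdb  # third-party: pip install duckdb

    con = duckdb.connect(db_path, read_only=True)
    try:
        rows = con.execute(
            f'SELECT Sequence, Isotype, N_patient FROM "{table}" '
            "WHERE N_patient >= ?",
            [min_patients],
        ).fetchall()
    finally:
        con.close()
    return to_fasta(rows)
```

A typical use would be to call `list_tables("some_file.duckdb")` on a downloaded file (the file name here is a placeholder), pick a table, and write the string returned by `export_peptides` to a `.fasta` file for the search engine.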