Missing proteins identified from the haploid cell lines
Abstract
The goal of the Chromosome-centric Human Proteome Project (C-HPP) is to identify and characterize the proteins encoded on each human chromosome. As of December 2, 2016, the neXtProt database lists 476 protein-coding genes on chromosome 22, 65 of which correspond to missing proteins. Missing proteins are generally believed to be of low abundance in many cell types or expressed in only a few cell types. We therefore hypothesized that a cell with a haploid karyotype might express some of these missing proteins. HAP1 is a fibroblast-like cell line derived from the male chronic myelogenous leukemia (CML) cell line KBM-7, which is nearly haploid. In this study, we searched for missing proteins of chromosome 22 in HAP1 and KBM-7 proteomic data. Raw mass spectrometry data for HAP1 and KBM-7, each acquired in duplicate with a total of 50 fractions, were searched against the annotated neXtProt protein database using SEQUEST. After applying a 1% false discovery rate (FDR) at both the peptide-spectrum match (PSM) and protein levels, we identified a few missing proteins. We developed an in-house Python script that matches the sequences of the missing proteins on chromosome 22 against the peptide sequences identified from HAP1 and KBM-7. We are currently analyzing transcriptomic data to integrate them with the proteomic data and to carry out a proteogenomic analysis.
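The abstract does not describe how the in-house script works, so the following is only a minimal sketch of one way such peptide-to-protein matching could be done. The function name `match_peptides`, the accessions `NX_HYPO1` and `NX_HYPO2`, and all sequences are hypothetical. Treating isoleucine and leucine as equivalent is an added assumption, motivated by the fact that the two residues have the same mass.

```python
def match_peptides(missing_proteins, peptides):
    """Map each missing-protein accession to the identified peptides found in its sequence.

    missing_proteins: dict of accession -> protein sequence (hypothetical input format)
    peptides: iterable of identified peptide sequences
    """
    # Assumption: I and L are treated as identical because they are isobaric.
    def normalize(seq):
        return seq.upper().replace("I", "L")

    hits = {}
    for accession, sequence in missing_proteins.items():
        normalized = normalize(sequence)
        found = sorted({p for p in peptides if normalize(p) in normalized})
        if found:
            hits[accession] = found
    return hits


# Hypothetical toy data, not taken from neXtProt or from the HAP1/KBM-7 results.
missing = {"NX_HYPO1": "MKTAYIAKQRQISFVK", "NX_HYPO2": "MSLLPEPTIDEK"}
identified = ["TAYLAK", "PEPTIDEK", "GGGGK"]
matches = match_peptides(missing, identified)
```

A peptide that also occurs in a non-missing protein would not be evidence for the missing one, so a real workflow would additionally need a uniqueness check against the full protein database.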
Discovery of Protein Variants from Human Brain Tissues using Customized Databases from GENCODE and neXtProt with Mass Spectrometry
Abstract
The goal of the Chromosome-centric Human Proteome Project (C-HPP) is to provide complete proteomic information for each human chromosome, including missing proteins and novel protein variants such as products expressed from non-coding genomic regions, alternative splicing variants (ASVs), and single amino-acid variants (SAAVs). Using a customized database built from GENCODE and neXtProt, we developed a workflow that includes an integrated proteomic pipeline (IPP). The IPP combines three search engines (Mascot, SEQUEST, and MS-GF+) with statistical evaluation tools and an in-house program that normalizes the scores of the three engines. We compared the performance of the IPP with that of a conventional proteomic pipeline (CPP) for protein identification at a controlled FDR of less than 1% at the protein level. Using the IPP, a total of 5,756 proteins (vs. 4,453 using the CPP), including 477 ASVs (vs. 182 using the CPP), were identified from human hippocampal tissues. From RAW files of human brain tissues, including adult and fetal brains, deposited in ProteomeXchange, we also discovered several novel variants using the IPP that could be applicable to C-HPP studies.
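The abstract does not say how the FDR was estimated. As an illustration only, the sketch below assumes the widely used target-decoy approach, in which the FDR at a score cutoff is estimated as the number of accepted decoy identifications divided by the number of accepted target identifications. The function name `score_threshold_at_fdr` and the toy scores are hypothetical, and tied scores are not handled specially.

```python
def score_threshold_at_fdr(identifications, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR stays at or below max_fdr.

    identifications: iterable of (score, is_decoy) pairs; a higher score is a better match.
    Estimated FDR at a cutoff = accepted decoys / accepted targets (target-decoy assumption).
    Returns None when no cutoff satisfies the limit.
    """
    ranked = sorted(identifications, key=lambda item: item[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Keep extending the cutoff while the running estimate is within the limit.
        if targets and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


# Hypothetical toy scores: three target hits and one decoy hit.
toy = [(9.0, False), (8.0, False), (7.0, True), (6.0, False)]
strict_cutoff = score_threshold_at_fdr(toy)       # default 1% limit
loose_cutoff = score_threshold_at_fdr(toy, 0.5)   # 50% limit, for illustration
```

With three search engines, a step like this would only be meaningful after the engine-specific scores have been placed on a common scale, which is the role the abstract assigns to the score-normalization program.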
The current state of C-HPP and development of bioinformatic tools for identifying missing proteins
Abstract
The Chromosome-Centric Human Proteome Project (C-HPP) has progressed steadily since it was officially launched in 2012. The goals of the C-HPP are to map and annotate the entire set of human proteins encoded on each chromosome in a coordinated manner (1, 2). In a chromosome-wide approach that includes chromosome 13, our team previously identified 4,239 proteins, including 33 missing proteins, from placenta tissue (3). We also constructed a genome-wide protein database in which proteome data obtained from diverse human clinical samples can be deposited and their distribution presented as a traffic-light map showing the up-to-date status of protein characterization for each gene. We improved the features and utility of this database, named GenomewidePDB (http://genomewidepdb.proteomix.org/), by integrating transcriptomic information (e.g., alternatively spliced transcripts), annotated peptide information, and an advanced search interface that can find proteins of interest when applying a targeted proteomics strategy (4,5). We assume that missing proteins may be present only in rare samples, at very low abundance, or be only transiently expressed, which makes them difficult to detect and profile. A strategy with greater sensitivity and accuracy is therefore needed in the search for missing proteins. We used a new strategy that combines a reference spectral library search with a simulated spectral library search to identify missing proteins. We showed that this strategy can significantly increase the number of matches and can help identify peptides that have not been detected by conventional sequence database searches, with improved sensitivity and a low error rate (6). Another goal of the C-HPP is to characterize the post-translational modifications (PTMs) of each protein, such as glycosylation.
The identification and characterization of glycoproteins and glycosylation sites by mass spectrometry (MS) remain challenging tasks, and great effort has been devoted to developing proteome informatics tools that facilitate the MS analysis of glycans and glycopeptides. We recently developed gFinder (http://gFinder.proteomix.org/), a web-based bioinformatics tool that analyzes mixtures of native N-glycopeptides profiled by tandem MS. gFinder characterizes peptide backbone sequences and possible N-glycan structures, with assigned scores, using both collision-induced dissociation (CID) and high-energy collisional dissociation (HCD) fragmentation data. We also show that gFinder can be used to identify potential missing proteins that carry glycans (7).
Code / Date
/ March 31 (FRI) 10:40-11:10
Speaker
Young-Ki Paik, Je-Yoel Cho, Jong Shin Yoo, Akhilesh Pandey, Min-Sik Kim
Affiliation
Title
Roundtable discussion on the current status of the C-HPP and its future
Rm#422, Industry-University Research Bldg., Yonsei University
50 Yonsei-ro, Seodeamoon-gu, Seoul 03722 Korea
Tel : +82-2-393-8328 Fax : +82-2-393-6589 E-mail : admin@khupo.org