Peptide and Protein ID using SearchGUI and PeptideShaker
Author(s): Florian Christoph Sigloch, Björn Grüning
Overview
Questions:
- How to convert LC-MS/MS raw files?
- How to identify peptides?
- How to identify proteins?
- How to evaluate the results?
Objectives:
- Protein identification from LC-MS/MS raw files.
Time estimation: 45 minutes
Level: Introductory
Published: Jun 12, 2017
Last modification: Nov 9, 2023
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.
PURL: https://gxy.io/GTN:T00229
Revision: 27
Identifying the proteins contained in a sample is an important step in any proteomic experiment. However, in most experimental setups, proteins are digested to peptides before the LC-MS/MS analysis. In this so-called “bottom-up” procedure, only peptide masses are measured. Therefore, protein identification cannot be performed directly from raw data, but is a multi-step process:
- Raw data preparation
- Peptide-to-Spectrum matching
- Peptide inference
- Protein inference
A plethora of software solutions exist for each step. In this tutorial, we will show how to use the ProteoWizard tool msconvert and the OpenMS tool PeakPickerHiRes for step 1, and the Compomics tools SearchGUI and PeptideShaker for steps 2-4.
For an alternative identification pipeline using only tools provided by the OpenMS software suite, please consult this tutorial.
Input data
As an example dataset, we will use an LC-MS/MS analysis of HeLa cell lysate published in Vaudel et al., 2014, Proteomics. Detailed information about the dataset can be found on PRIDE. For step 2, we will use a validated human UniProt FASTA database without appended decoy sequences. If you already completed the tutorial on Database Handling, you can use the database constructed there prior to the DecoyDatabase step. You can find a prepared database, as well as the input proteomics data in different file formats, on Zenodo.
Preparing Raw Data
Raw data conversion is the first step of any proteomic data analysis. The most common converter is msconvert from the ProteoWizard software suite, and the usual target format is mzML. SearchGUI needs MGF format as input, but as we need the mzML format for several other tasks, we will convert to mzML first. Due to licensing reasons, msconvert can read vendor raw files only on Windows systems and will therefore not work on most Galaxy servers.
Depending on your machine settings, raw data will be generated either in profile mode or centroid mode. For most peptide search engines, the tandem mass spectrometry (MS2) data have to be converted to centroid mode, a process called “peak picking” or “centroiding”. Machine vendors offer algorithms to extract peaks from profile raw data. These are implemented in msconvert and can be run together with the mzML conversion. However, the OpenMS tool PeakPickerHiRes is reported to generate slightly better results (Lange et al., 2006, Pac Symp Biocomput) and is therefore recommended for quantitative studies (Vaudel et al., 2010, Proteomics). If your data were generated on a low resolution mass spectrometer, use PeakPickerWavelet instead.
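To illustrate what centroiding does, here is a minimal Python sketch. It is not the PeakPickerHiRes or vendor algorithm: it simply assumes that every local intensity maximum is a peak and places it at the intensity-weighted mean m/z of the apex and its two neighbours. The function name and the example values are made up for illustration.

```python
def centroid_peaks(mz, intensity, min_intensity=0.0):
    """Reduce a profile spectrum to centroided (m/z, intensity) peaks.

    Simplified illustration: each local maximum becomes one peak whose
    m/z is the intensity-weighted mean of the apex and its two neighbours.
    Intensities are assumed to be non-negative.
    """
    peaks = []
    for i in range(1, len(mz) - 1):
        left, apex, right = intensity[i - 1], intensity[i], intensity[i + 1]
        # Local maximum: higher than the left point, at least as high as the right.
        if apex > left and apex >= right and apex >= min_intensity:
            total = left + apex + right
            weighted_mz = (mz[i - 1] * left + mz[i] * apex + mz[i + 1] * right) / total
            peaks.append((weighted_mz, apex))
    return peaks


# Hypothetical profile data: one symmetric peak sampled at five points.
profile_mz = [500.00, 500.01, 500.02, 500.03, 500.04]
profile_intensity = [0.0, 50.0, 100.0, 50.0, 0.0]
print(centroid_peaks(profile_mz, profile_intensity))
```

The five profile points collapse into a single centroid near m/z 500.02, which is the kind of reduced peak list that search engines expect for MS2 spectra.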
Hands-on: File Conversion and Peak Picking
We provide the input data in the original raw format and also already converted to MGF and mzML file formats. If msconvert does not run on your Galaxy instance, please download the preconverted mzML as an input.
- Create a new history for this Peptide and Protein ID exercise. To create a new history, simply click the new-history icon at the top of the history panel.
- Load the example dataset into your history from Zenodo (raw or mzML).
- Rename the dataset to something meaningful.
- (optional) Run msconvert on the test data to convert to the mzML format.
- Run PeakPickerHiRes on the resulting file. Click + Insert param.algorithm_ms_levels and change the entry to “2”. Thus, peak picking will only be performed on MS2 level.
- Run FileConverter on the picked mzML. In the Advanced Options set the Output file type to MGF.
Comment: Local Use of MSConvert
The vendor libraries used by msconvert are only licensed for Windows systems and are therefore rarely implemented in Galaxy instances. If msconvert is not available in your Galaxy instance, please install the software on a Windows computer and run the conversion locally. A detailed description of the necessary steps is available (“Peak List Generation”). Afterwards, upload the resulting mzML file to your Galaxy history.
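After the conversion, a quick sanity check of the MGF file can be useful. The following Python sketch counts the spectra in an MGF file and reports how many lack a precursor mass entry. It assumes the common MGF layout, in which each spectrum is enclosed by BEGIN IONS and END IONS lines and carries a PEPMASS= line; the function name and the example content are made up for illustration.

```python
def count_mgf_spectra(lines):
    """Return (number of spectra, number of spectra without PEPMASS)."""
    n_spectra = 0
    n_missing_pepmass = 0
    in_spectrum = False
    has_pepmass = False
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            in_spectrum, has_pepmass = True, False
        elif in_spectrum and line.startswith("PEPMASS="):
            has_pepmass = True
        elif in_spectrum and line == "END IONS":
            n_spectra += 1
            if not has_pepmass:
                n_missing_pepmass += 1
            in_spectrum = False
    return n_spectra, n_missing_pepmass


# Hypothetical MGF content with two spectra, the second lacking PEPMASS.
example = """BEGIN IONS
TITLE=scan=1
PEPMASS=500.25 12345.0
CHARGE=2+
100.1 10.0
200.2 20.0
END IONS
BEGIN IONS
TITLE=scan=2
100.1 5.0
END IONS
""".splitlines()

print(count_mgf_spectra(example))
```

For a real file, pass an open file handle instead of the example list, for instance `count_mgf_spectra(open("picked.mgf"))`, where the file name is a placeholder.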
Peptide and Protein Identification
Mass spectrometry experiments identify peptides by isolating them, ionizing them, and subsequently colliding them with a gas for fragmentation. This method generates a spectrum of peptide fragment masses for each isolated peptide - an MS2 spectrum. To find out the peptide sequences, the MS2 spectrum is compared to theoretical spectra generated from a protein database. This step is called peptide-to-spectrum (also: spectrum-to-sequence) matching. Accordingly, a spectrum that is successfully matched to a peptide sequence is termed a PSM (Peptide-Spectrum-Match). There can be multiple PSMs per peptide if the peptide was fragmented several times. Different peptide search engines have been developed to perform the matching procedure.
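The idea of comparing a measured MS2 spectrum to a theoretical one can be sketched in a few lines of Python. This is a strongly simplified illustration, not the scoring used by X!Tandem, MS-GF+ or any other search engine: it computes singly charged b- and y-ion m/z values of an unmodified peptide from monoisotopic residue masses and counts how many of them are found in the observed peak list. The function names and the fixed tolerance of 0.02 Da are assumptions made for this example.

```python
# Monoisotopic residue masses in Da (unmodified amino acids).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        # b-ion: N-terminal fragment; y-ion: complementary C-terminal fragment.
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[i:]) + WATER + PROTON)
    return b_ions, y_ions


def count_matches(observed_mz, peptide, tolerance=0.02):
    """Count theoretical fragments that have an observed peak within tolerance (Da)."""
    b_ions, y_ions = fragment_ions(peptide)
    return sum(
        any(abs(obs - theo) <= tolerance for obs in observed_mz)
        for theo in b_ions + y_ions
    )


# Hypothetical observed peaks: two fit the peptide PEPTIDE, one does not.
print(count_matches([98.06, 148.06, 300.0], "PEPTIDE"))
```

A real search engine scores every candidate peptide from the database in a comparable but statistically far more refined way, and also accounts for modifications, charge states and peak intensities.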
It is generally recommended to use more than one peptide search engine and use the combined results for the final peptide inference (Shteynberg et al., 2013, Mol. Cell. Proteomics). Again, there are several software solutions for this, e.g. iProphet (TPP) or ConsensusID (OpenMS). In this tutorial we will use Search GUI, as it can automatically search the data using several search engines. Its partner tool Peptide Shaker is then used to combine and evaluate the search engine results.
In bottom-up proteomics, it is necessary to assemble the identified peptides into proteins. This is not a trivial task, as protein sequences are redundant in most eukaryotic organisms. Thus, not every peptide can be assigned to only one protein. Luckily, Peptide Shaker already takes care of protein inference and even gives us some information on the validity of the protein identifications. We will discuss validation in a later step of this tutorial.
Hands-on: Peptide and Protein Identification
- Copy the prepared protein database from the tutorial Database Handling into your current history by using the multiple history view or upload the ready-made database from Zenodo.
- Open Search GUI to search the MGF file against the protein database. In the Search Engine Options select X!Tandem and MS-GF+. In the Protein Modification Options add the Fixed Modifications: Carbamidomethylation of C and the Variable Modifications: Oxidation of M.
- Run Peptide Shaker on the Search GUI output. Enable the following outputs: Zip File for import to Desktop App, mzidentML File, PSM Report, Peptide Report, Protein Report.
Comment: Search GUI Parameters
We ran Search GUI with default settings. When you are processing files of a different experiment, you may need to adjust some of the parameters. Search GUI bundles numerous peptide search engines for matching MS/MS spectra to peptide sequences within a database. In practice, using 2-3 different search engines offers high confidence while keeping analysis time reasonable. In our hands, the X!Tandem, MS-GF+, OMSSA and Comet search algorithms offer good results. The Precursor Options have to be adjusted to the mass spectrometer which was used to generate the files. The default settings fit a high resolution Orbitrap instrument. In the Advanced Options you can adjust much more detailed settings for each of the search engines used. When using X!Tandem, we recommend switching off the advanced X!Tandem options Noise suppression, Quick Pyrolidone and Quick Acetyl. When using MS-GF+, we recommend selecting the correct Instrument type.
Comment: PeptideShaker Outputs
Peptide Shaker offers a variety of outputs. The Zip File for import to Desktop App can be downloaded to view and evaluate the search results in the Peptide Shaker viewer. The several Reports contain tabular, human-readable information. Also, an mzidentML (= mzid) file can be created that contains all peptide sequence matching information and can be utilized by compatible downstream software. The Certificate of Analysis provides details on all parameter settings of both Search GUI and Peptide Shaker used for the analysis.
Question
- How many peptides were identified? How many proteins?
- How many peptides with oxidized methionine were identified?
- You should have identified 3,325 peptides and 1,170 proteins.
- 328 peptides contain an oxidized methionine (MeO). To get to this number, you can use Select tool on the Peptide Report and search for either “Oxidation of M” or “M<ox>”.
Analysis of Contaminants
The FASTA database used for the peptide-to-spectrum matching contained some entries that were not expected to stem from the HeLa cell lysate, but are common contaminants in LC-MS/MS samples. The main reason to add those is to prevent spectra of contaminants from being wrongly assigned to other proteins. However, it also enables you to check for contaminants in your samples. Caution: in human samples, many proteins that are common contaminants may also stem from the real sample. Determining the real source of such human proteins might require advanced investigation.
Hands-on: Analysis of Contaminants
- Run Select tool on the Peptide Shaker Protein Report to select all lines that match the pattern “CONTAMINANT”.
- Remove all contaminants from your protein list by running Select tool on the Peptide Shaker Protein Report. Select only those lines that do not match the pattern “CONTAMINANT”.
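The two Select steps above boil down to keeping the lines that match a pattern or, inversely, the lines that do not match it. A minimal Python sketch of that logic is shown below. The function name is made up, and the three report rows are hypothetical stand-ins: a real Peptide Shaker Protein Report has many more columns and may format its entries differently.

```python
import re


def select_lines(lines, pattern, keep_matching=True):
    """Keep lines that match the pattern, or those that do not match it."""
    regex = re.compile(pattern)
    return [line for line in lines if bool(regex.search(line)) == keep_matching]


# Hypothetical, shortened report rows (tab-separated).
report = [
    "1\tPROT_A_HUMAN\tConfident",
    "2\tCONTAMINANT_TRY_BOVIN\tConfident",
    "3\tCONTAMINANT_ALBU_BOVIN\tDoubtful",
]

contaminants = select_lines(report, "CONTAMINANT")
cleaned = select_lines(report, "CONTAMINANT", keep_matching=False)
print(len(contaminants), len(cleaned))
```

Running both selections on the same input gives two complementary lists, so every line of the report ends up in exactly one of them.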
Question
- Which contaminants did you identify? Where do these contaminations come from?
- What other sources of contaminants exist?
- How many mycoplasma proteins did you identify? Does this mean that the analyzed HeLa cells were infected with mycoplasma?
- How many false positives do we expect in our list? How many of these are expected to match mycoplasma proteins?
- TRY_BOVIN is bovine trypsin. It was used to digest the proteins to peptides. ALBU_BOVIN is bovine serum albumin. It is added to cell culture medium in high amounts.
- Contaminants often stem from the experimenter. These are typically keratins or other highly abundant human proteins. Basically any protein present in the room of the mass spectrometer might get into the ion source if it is airborne. As an example, sheep keratins are sometimes found in proteomic samples, stemming from clothing made of sheep wool.
- There should be five Mycoplasma proteins in your protein list. However, all of them stem from different Mycoplasma species. Also, every protein was identified by one peptide only. You can see this in columns 17-19 of your output. These observations make it quite likely that we have identified false positives here.
- As we were allowing for a false discovery rate of 1 %, we would expect 12 false positive proteins in our list. False positives are expected to be randomly assigned to peptides in the FASTA database. Our database consists of about 20,000 human proteins and 4,000 mycoplasma proteins. Therefore, we would expect 17 % (= 2) of all false positives matching to mycoplasma proteins.
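The estimate in the last answer can be reproduced with a short calculation. The Python sketch below uses only the numbers given in this tutorial (1,170 identified proteins, 1 % FDR, about 20,000 human and 4,000 mycoplasma database entries) and assumes, as stated above, that false positives are distributed randomly over the database entries. The function name is made up for illustration.

```python
def expected_false_positives(n_identified, fdr, n_db_subset, n_db_total):
    """Return (expected false positives, subset fraction, expected false positives in subset)."""
    expected_fp = n_identified * fdr
    subset_fraction = n_db_subset / n_db_total
    return expected_fp, subset_fraction, expected_fp * subset_fraction


n_human, n_mycoplasma = 20000, 4000
fp, fraction, fp_mycoplasma = expected_false_positives(
    n_identified=1170,
    fdr=0.01,
    n_db_subset=n_mycoplasma,
    n_db_total=n_human + n_mycoplasma,
)
print(f"expected false positives: {fp:.1f}")
print(f"mycoplasma share of database: {fraction:.1%}")
print(f"expected false positive mycoplasma hits: {fp_mycoplasma:.2f}")
```

Rounded to whole proteins, this gives about 12 false positives overall and about 2 of them matching mycoplasma entries, which is in the same range as the five single-peptide mycoplasma hits observed.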
Evaluation of Peptide and Protein IDs
Peptide Shaker provides you with validation results for the identified PSMs, peptides and proteins. It classifies all these IDs in the categories “Confident” or “Doubtful”. On each level, the meaning of these terms differs to some extent:
- PSMs are marked as “Doubtful” when the measured MS2 spectrum did not fit well to the theoretical spectrum.
- Peptides have a combined scoring of their PSMs. They are marked as “Doubtful”, when the score is below a set threshold. The threshold is defined by the false discovery rate (FDR).
- Proteins are marked as “Doubtful”, when they were identified by only a single peptide or when they were identified solely by “Doubtful” peptides.
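How an FDR can define a score threshold is easiest to see with a generic target-decoy estimate. The following Python sketch is not Peptide Shaker's actual scoring or validation procedure, which is more elaborate. It assumes that every hit has a score (higher is better) and a flag telling whether it matched a decoy sequence, that scores are distinct, and that the FDR among accepted hits is estimated as the number of decoy hits divided by the number of target hits. The function name and example scores are made up.

```python
def score_threshold_at_fdr(scored_hits, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is within max_fdr.

    scored_hits: iterable of (score, is_decoy) tuples, higher score = better.
    The FDR at a cutoff is estimated as decoys / targets among all hits
    scoring at or above the cutoff. Returns None if no cutoff qualifies.
    """
    ranked = sorted(scored_hits, key=lambda hit: hit[0], reverse=True)
    targets = decoys = 0
    threshold = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            threshold = score
    return threshold


# Hypothetical hits: four target matches and one decoy match.
hits = [(10.0, False), (9.0, False), (8.0, False), (7.0, True), (6.0, False)]
print(score_threshold_at_fdr(hits, max_fdr=0.2))
print(score_threshold_at_fdr(hits, max_fdr=0.25))
```

With a stricter FDR the cutoff moves up and fewer hits pass; hits below the cutoff would correspond to the “Doubtful” category in this simplified picture.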
Hands-on: Evaluation of Peptide and Protein IDs
- Remove all “Doubtful” proteins from your protein list by running Select tool on the Peptide Shaker Protein Report. Select only those lines that do not match the pattern “Doubtful”.
Question
- How to exclude mycoplasma proteins?
- How many “Confident” non-contaminant proteins were identified?
- Add another Select tool matching the pattern “HUMAN”.
- You should have identified 582 human non-contaminant proteins that were validated to be “Confident”.
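The complete filtering chain used in this tutorial (no contaminants, no “Doubtful” entries, only human proteins) can be written as one pass over the report. The Python sketch below uses plain substring checks and hypothetical, shortened report rows, so it only approximates what the chained Select steps do on a real Peptide Shaker Protein Report.

```python
def keep_confident_human(lines):
    """Keep lines that are not contaminants, not Doubtful, and contain HUMAN."""
    kept = []
    for line in lines:
        if "CONTAMINANT" in line or "Doubtful" in line:
            continue
        if "HUMAN" not in line:
            continue
        kept.append(line)
    return kept


# Hypothetical, shortened report rows (tab-separated).
rows = [
    "1\tPROT_A_HUMAN\tConfident",
    "2\tPROT_B_HUMAN\tDoubtful",
    "3\tCONTAMINANT_TRY_BOVIN\tConfident",
    "4\tPROT_C_MYCXX\tConfident",
]
print(len(keep_confident_human(rows)))
```

Only the first row passes all three conditions; counting the remaining lines of a real report gives the number of “Confident” human non-contaminant proteins.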
Premade Workflow
A premade workflow for this tutorial is available