Clinical Metaproteomics 3: Verification

Author(s)	Subina Mehta Katherine Do Dechen Bhuming
Editor(s)	Pratik Jagtap Timothy J. Griffin
Reviewers

Overview
Questions:

Why do we need to verify our identified peptides

What is the importance of making a new database for quantification

Objectives:

Verification of peptides helps in confirming the presence of the peptides in our samplle

Requirements:

Introduction to Galaxy Analyses

Proteomics

Time estimation: 3 hours

Supporting Materials:

Datasets

Workflows

FAQs

video Recordings

video Tutorial (June 2024) - 16m

video View All

instances Available on these Galaxies

Known Working

UseGalaxy.eu ✅ ⭐️

UseGalaxy.be ✅

UseGalaxy.cz ✅

Possibly Working

UseGalaxy.org.au

UseGalaxy.org (Main)

Published: Feb 6, 2024

Last modification: Aug 10, 2025

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

purl PURL: https://gxy.io/GTN:T00415

version Revision: 7

In proteomic research, the primary goal is to obtain accurate and meaningful insights into the proteome of a biological system. Verifying the presence of peptides or proteins is a critical step in achieving this goal, ensuring the quality and reliability of the data and the biological relevance of the findings. This tutorial is a sequel to the clinical metaproteomics discovery workflow. Once you have identified microbial peptides, the next step is to verify these peptides, for which we use PepQuery.

The PepQuery tool is used to validate the identified microbial peptides from SearchGUI/PeptideShaker and MaxQuant, to ensure that they are indeed of microbial origin and that human peptides were not misassigned. To do this, all confident microbial peptides from the two database search algorithms were merged and searched against the Human UniProt Reference proteome (with Isoforms) and cRAP databases.

Interestingly, the PepQuery tool does not rely on searching peptides against a reference protein sequence database as “traditional” shotgun proteomics does, which enables it to identify novel, disease-specific sequences with sensitivity and specificity in its protein validation (Figure A). Then we extract microbial protein sequences that are assigned to the PepQuery verified peptides. To this, we again add the Human UniProt Reference proteome (with Isoforms) and cRAP databases for creating a database for quantitation purposes (Figure B).

Peptide Verification.

Database generation from verified peptides.

Agenda

In this tutorial, we will cover:

Get data

Import Workflow

Extraction of Microbial Peptides from SearchGUI/PeptideShaker and MaxQuant

Concatenate peptides from MaxQuant and SGPS for PepQuery2

Creating input database for PepQuery2

Peptide verification

Collapsing all the data

Filtering out confident peptides

Querying verified peptides

Retrieve UniProt IDs for distinct peptides

Generate FASTA database from UniProt IDs

Generating compact database

Conclusion

Get data

Hands On: Data Upload
Create a new history for this tutorial
Import the files from Zenodo or from the shared data library (GTN - Material -> proteomics -> Clinical Metaproteomics 3: Verification):
https://zenodo.org/records/10105821/files/PTRC_Skubitz_Plex2_F10_9Aug19_Rage_Rep-19-06-08.mgf
https://zenodo.org/records/10105821/files/PTRC_Skubitz_Plex2_F11_9Aug19_Rage_Rep-19-06-08.mgf
https://zenodo.org/records/10105821/files/PTRC_Skubitz_Plex2_F13_9Aug19_Rage_Rep-19-06-08.mgf
https://zenodo.org/records/10105821/files/PTRC_Skubitz_Plex2_F15_9Aug19_Rage_Rep-19-06-08.mgf
https://zenodo.org/records/10105821/files/SGPS_Peptide_Report.tabular
https://zenodo.org/records/10105821/files/MaxQuant_Peptide_Report.tabular
https://zenodo.org/records/10105821/files/Distinct_Peptides_for_PepQuery.tabular
Copy the link location

Click galaxy-upload Upload at the top of the activity panel

Select galaxy-wf-edit Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window

As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

Go into Libraries (left panel)

Navigate to the correct folder as indicated by your instructor.

On most Galaxies tutorial data will be provided in a folder named GTN - Material –> Topic Name -> Tutorial Name.

Select the desired files

Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu

In the pop-up window, choose

“Select history”: the history you want to import the data to (or create a new one)

Click on Import
Rename the datasets

Check that the datatype

Click on the galaxy-pencil pencil icon for the dataset to edit its attributes

In the central panel, click galaxy-chart-select-data Datatypes tab on the top

In the galaxy-chart-select-data Assign Datatype, select datatypes from “New Type” dropdown

Tip: you can start typing the datatype into the field to filter the dropdown menu

Click the Save button

Add to each database a tag corresponding to input files.

Users can create a database collection of the MGF files.

Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

To tag a dataset:

Click on the dataset to expand it

Click on Add Tags galaxy-tags

Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).

Press Enter

Check that the tag appears below the dataset name

Tags beginning with # are special!

They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parenthesis correspond to dataset numbers in the figure below):

a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;

dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);

datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;

datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

More information is in a dedicated #nametag tutorial.

Import Workflow

Hands On: Running the Workflow

Import the workflow into Galaxy:

Hands On: Importing and launching a GTN workflow

Launch Verification Workflow (View on GitHub, Download workflow) workflow.

Click to Launch Verification Workflow (View on GitHub, Download workflow)

Click on galaxy-workflows-activity Workflows in the Galaxy activity bar (on the left side of the screen, or in the top menu bar of older Galaxy instances). You will see a list of all your workflows

Click on galaxy-upload Import at the top-right of the screen

Paste the following URL into the box labelled “Archived Workflow URL”: https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/clinical-mp-3-verification/workflows/WF3_Verification_Workflow.ga

Click the Import workflow button

Below is a short video demonstrating how to import a workflow from GitHub using this procedure:

Video: Importing a workflow from URL

Run Workflow workflow using the following parameters:

“Send results to a new history”: No

param-file ” Input MGFs Dataset Collection “: MGF dataset collection

param-file ” SGPS_peptide-report”: SGPS_Peptide_Report.tabular

param-file ” Distinct Peptides for PepQuery”: Distinct_Peptides_for_PepQuery.tabular

param-file ” MaxQuant-peptide-report “: MaxQuant_Peptide_Report.tabular

Click on Workflows on the Activity Bar on the left.

At the top of the resulting page you will have the option to switch between the My workflows, Workflows shared with me and Public workflows tabs.

Select the tab you want to see all workflows in that category

Search for your desired workflow.

Click on the workflow name: a pop-up window opens with a preview of the workflow.

To run it directly: click Run (top-right).

Recommended: click Import (left of Run) to make your own local copy under Workflows / My Workflows.

Extraction of Microbial Peptides from SearchGUI/PeptideShaker and MaxQuant

Now that we have identified microbial peptides from SearchGUI/PeptideShaker and MaxQuant, we need to extract the microbial peptide sequences and group them to obtain a list of distinct microbial peptides. This list of distinct peptides will be used as input for PepQuery2 to verify confident microbial peptides.

First, we will use the Cut tool to select the peptide and protein columns from the SearchGUI/PeptideShaker and MaxQuant Peptide Reports. Then we use Remove header lines from SGPS and MaxQuant to prepare for concatenation with Remove beginning.

Hands On: Extracting peptides

Cut with the following parameters:

“Cut columns”: c6,c2

param-file “From”: output (Input dataset)

Cut with the following parameters:

“Cut columns”: c1,c35

param-file “From”: output (Input dataset)

Remove beginning with the following parameters:

param-file “from”: out_file1 (output of Cut tool)

Remove beginning with the following parameters:

param-file “from”: out_file1 (output of Cut tool)

Concatenate peptides from MaxQuant and SGPS for PepQuery2

We will now concatenate the peptide and protein datasets from SearchGUI/PeptideShaker and MaxQuant. Later, we will generate a list of confident peptides using PepQuery2. The list of confident peptides will be searched against the concatenated peptide-protein datasets from SearchGUI/PeptideShaker and MaxQuant to generate a list of verified peptides.

Hands On: Concatenate SGPS and MaxQuant peptides

Concatenate datasets ( Galaxy version 0.1.1) with the following parameters:

param-files “Datasets to concatenate”: out_file1 (output of Remove beginning tool), out_file1 (output of Remove beginning tool)

Creating input database for PepQuery2

We generate and merge Human UniProt (with Isoforms) and contaminants (cRAP) to make an input database for PepQuery2.

Hands On: FASTA Merge Files and Filter Unique Sequences

FASTA Merge Files and Filter Unique Sequences ( Galaxy version 1.2.0) with the following parameters:

“Run in batch mode?”: Merge individual FASTAs (output collection if input is collection)

In “Input FASTA File(s)”:

param-repeat “Insert Input FASTA File(s)”

param-file “FASTA File”: Human UniProt+Isoforms FASTA (output of Protein Database Downloader tool)

param-file “FASTA File”: cRAP database (output of Protein Database Downloader tool)

Peptide verification

The PepQuery2 tool will be used to validate the identified microbial peptides from SearchGUI/PeptideShaker and MaxQuant to ensure that they are indeed of microbial origin and that human peptides were not misassigned. We will use the list of Distinct Peptides (from the Discovery Module), Human UniProt+Isoforms+cRAP database, and our MGF file collection as inputs for PepQuery2. The outputs we are interested in are the four PSM Rank (txt) files (one for each MGF file).

Interestingly, the PepQuery2 tool does not rely on searching peptides against a reference protein sequence database as “traditional” shotgun proteomics does, which enables it to identify novel, disease-specific sequences with sensitivity and specificity in its protein validation. More information about PepQuery is available, including the first Wen et al. 2019 and second iterations Wen and Zhang 2023.

Hands On: Peptide verification

PepQuery2 ( Galaxy version 2.0.2+galaxy0) with the following parameters:

“Validation Task Type”: novel peptide/protein validation

In “Input Data”:

“Input Type”: peptide

“Peptides?”: Peptide list from your history

param-file “Peptide Sequences (.txt)”: output (Input dataset)

“Protein Reference Database from”: history

param-file “Protein Reference Database File”: output (output of FASTA Merge Files and Filter Unique Sequences tool)

“MS/MS dataset to search”: ` Spectrum Datasets from history`

param-collection “Spectrum File”: output (Input dataset collection)

“Report Spectrum Scan as”: spectrum title in MGF

In “Modifications”:

“Fixed modification(s)”: 1: Carbamidomethylation of C [57.02146372057] 13: TMT 11-plex of K [229.16293213472] 14: TMT 11-plex of peptide N-term [229.16293213472]

“Variable modification(s)”: 2: Oxidation of M [15.99491461956]

“Use more stringent criterion for unrestricted modification searching”: Yes

“Consider amino acid substitution modifications?”: Yes

In “Digestion”:

“Enzyme”: Trypsin

“Max Missed Cleavages”: 2

In “Mass spectrometer”:

In “Tolerance”:

“Precursor Tolerance”: 10

“Precursor Unit”: ppm

“Tolerance”: 0.6

In “PSM”:

“Fragmentation Method”: CID/HCD

“Scoring Method”: HyperScore

“Minimum Charge”: 2

“Maximum Charge”: 6

Collapsing all the data

Remember that PepQuery2 generates a PSM Rank file for each input MGF file, so we will have four PSM Rank files. To make the analysis more efficient, we will collapse these four PSM Rank files into one dataset.

Hands On: Collasping PSM rank files into a singular dataset using Collapse Collection

Collapse Collection ( Galaxy version 5.1.0) with the following parameters:

param-file “Collection of files to collapse into single dataset”: psm_rank_txt (output of PepQuery2 tool)

“Keep one header line”: Yes

Filtering out confident peptides

Now, we want to filter for confident peptides from PepQuery2 and prepare them for the Query Tabular tool.

Hands On: Filter

Filter with the following parameters:

param-file “Filter”: output (output of Collapse Collection tool)

“With following condition”: c20=='Yes'

“Number of header lines to skip”: 1

Hands On: Remove header line from filtered PepQuery peptides with Remove beginning

Remove beginning with the following parameters:

param-file “from”: out_file1 (output of Filter tool)

Hands On: Cut (select out) peptide sequences from PepQuery output with Cut

Cut with the following parameters:

“Cut columns”: c1

param-file “From”: out_file1 (output of Remove beginning tool)

Querying verified peptides

We will use the Query Tabular tool Johnson et al. 2019 to search the PepQuery-verified peptides against the concatenated dataset that contains peptides and proteins from SearchGUI/Peptide and MaxQuant. This step ensures all the PepQuery-verified peptides are assigned to their protein/protein groups.

Hands On: Querying verified peptides

Query Tabular ( Galaxy version 3.3.0) with the following parameters:

In “Database Table”:

param-repeat “Insert Database Table”

param-file “Tabular Dataset for Table”: out_file1 (output of Cut tool)

In “Table Options”:

“Specify Name for Table”: pep

“Specify Column Names (comma-separated list)”: mpep

param-repeat “Insert Database Table”

param-file “Tabular Dataset for Table”: out_file1 (output of Concatenate datasets tool)

In “Table Options”:

“Specify Name for Table”: prot

“Specify Column Names (comma-separated list)”: pep,prot

“SQL Query to generate tabular output”: select pep.mpep, prot.protFROM pep INNER JOIN prot on pep.mpep=prot.pep `

“include query result column headers”: Yes `

Comment: SQL Query information

The query input files are the list of peptides and the peptide report we obtained from MaxQuant and SGPS. The query is matching each peptide (m.pep) from the PepQuery results to the peptide reports so that each verified peptide has its protein/protein group assigned to it.

Hands On: Remove Header with Remove beginning

Remove beginning with the following parameters:

param-file “from”: output (output of Query Tabular tool)

Using the Group tool, we can select distinct (unique) peptides and proteins from the Query Tabular tool.

Hands On: Extract distinct peptides with Group

Group with the following parameters:

param-file “Select data”: out_file1 (output of Remove beginning tool)

“Group by column”: c1

In “Operation”:

param-repeat “Insert Operation”

“Type”: Concatenate Distinct

“On column”: c2

Retrieve UniProt IDs for distinct peptides

Again, we will use the Query Tabular tool to retrieve UniProt IDs (accession numbers) for the distinct (grouped) peptides.

Hands On: Query Tabular

Query Tabular ( Galaxy version 3.3.0) with the following parameters:

In “Database Table”:

param-repeat “Insert Database Table”

param-file “Tabular Dataset for Table”: out_file1 (output of Group tool)

In “Filter Dataset Input”:

In “Filter Tabular Input Lines”:

param-repeat “Insert Filter Tabular Input Lines”

“Filter By”: normalize list columns, replicates row for each item in list

“enter column numbers to normalize”: 2

“List item delimiter in column”: ;

param-repeat “Insert Filter Tabular Input Lines”

“Filter By”: regex replace value in column

“enter column number to replace”: 2

“regex pattern”: (tr|sp)[|]

param-repeat “Insert Filter Tabular Input Lines”

“Filter By”: regex replace value in column

“enter column number to replace”: 2

“regex pattern”: [ ]+

param-repeat “Insert Filter Tabular Input Lines”

“Filter By”: regex replace value in column

“enter column number to replace”: 2

“regex pattern”: [|].*$

In “Table Options”:

“Specify Name for Table”: t1

“Use first line as column names”: Yes

“Specify Column Names (comma-separated list)”: pep,prot ` “SQL Query to generate tabular output”: SELECT distinct(prot) AS Accession from t1

“include query result column headers”: No

Question

What is the accession number of a protein?

Can there be multiple accession numbers for one peptide or protein?

An accession number of a protein, also called a protein accession number, is a unique identifier assigned to a specific protein sequence in a protein sequence database. These accession numbers are used to reference and catalog proteins in a standardized and systematic manner

Yes, a single peptide or protein can have multiple accession numbers, particularly when dealing with different protein sequence databases, databases for specific species, or different versions of the same database. That’s the reason in our workflow we merge both accession and sequences.

Generate FASTA database from UniProt IDs

Using the UniProt IDs from Query Tabular, we will be able to generate a FASTA database for our PepQuery-verified peptides.

Hands On: UniprotXML-downloader

UniProt ( Galaxy version 2.4.0) with the following parameters:

“Select”: A history dataset with a column containing Uniprot IDs

param-file “Dataset (tab separated) with ID column”: output (Input dataset)

“Column with ID”: c1

“Field”: Accession

“uniprot output format”: fasta

Generating compact database

Lastly, we will merge the Human UniProt (with isoforms), contaminants (cRAP) and the PepQuery-verified FASTA databases into one Quantitation Database that will be used as input for the Quantification Module.

Hands On: Generation of Compact Verified Database with UniProt

FASTA Merge Files and Filter Unique Sequences ( Galaxy version 1.2.0) with the following parameters:

“Run in batch mode?”: Merge individual FASTAs (output collection if input is collection)

In “Input FASTA File(s)”:

param-repeat “Insert Input FASTA File(s)”

param-file “FASTA File”: proteome (output of UniProt tool)

Conclusion

A peptide verification workflow is a critical step in proteomic research that enhances data reliability, quantitative accuracy, and biological understanding by confirming the presence and validity of selected peptides. It is a pivotal quality control process that ensures the trustworthiness of proteomic findings and supports downstream investigations. By completing this tutorial, you have not only verified the microbial peptides but also created a database consisting of protein sequences from the PepQuery-verified peptides. The need of such a database is to ensure that when we quantify our proteins and peptides we are reducing the introduction of false positives. This database will be now used for quantitation purposes.

You've Finished the Tutorial

Key points

Perform verification

Extraction of accession numbers for getting protein sequences

Frequently Asked Questions

Have questions about this tutorial? Have a look at the available FAQ pages and support channels

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

References

Johnson, J. E., P. Kumar, C. Easterly, M. Esler, S. Mehta et al., 2019 Improve your Galaxy text life: The Query Tabular Tool. F1000Research 7: 1604. 10.12688/f1000research.16450.2
Wen, B., X. Wang, and B. Zhang, 2019 PepQuery enables fast, accurate, and convenient proteomic validation of novel genomic alterations. Genome Research 29: 485–493. 10.1101/gr.235028.118
Wen, B., and B. Zhang, 2023 PepQuery2 democratizes public MS proteomics data for rapid peptide searching. Nature Communications 14: 10.1038/s41467-023-37462-4

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Subina Mehta, Katherine Do, Dechen Bhuming, Clinical Metaproteomics 3: Verification (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/clinical-mp-3-verification/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{proteomics-clinical-mp-3-verification,
author = "Subina Mehta and Katherine Do and Dechen Bhuming",
	title = "Clinical Metaproteomics 3: Verification (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/clinical-mp-3-verification/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Congratulations on successfully completing this tutorial!

Do you want to extend your knowledge?
Follow one of our recommended follow-up trainings:

tutorial Hands-on: Clinical Metaproteomics 4: Quantitation

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/proteomics/tutorials/clinical-mp-3-verification/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: text_processing
  owner: bgruening
  revisions: d698c222f354
  tool_panel_section_label: Text Manipulation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: dbbuilder
  owner: galaxyp
  revisions: 983bf725dfc2
  tool_panel_section_label: Get Data
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: fasta_merge_files_and_filter_unique_sequences
  owner: galaxyp
  revisions: f546e7278f04
  tool_panel_section_label: FASTA/FASTQ
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: pepquery2
  owner: galaxyp
  revisions: a07976bbc4d9
  tool_panel_section_label: Proteomics
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: uniprotxml_downloader
  owner: galaxyp
  revisions: a371252a2cf6
  tool_panel_section_label: Get Data
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: query_tabular
  owner: iuc
  revisions: cf34c344508d
  tool_panel_section_label: Text Manipulation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: collapse_collections
  owner: nml
  revisions: 90981f86000f
  tool_panel_section_label: Collection Operations
  tool_shed_url: https://toolshed.g2.bx.psu.edu/

No feedback has been recieved yet for this training. Be the first one by filling in the feedback form.