MaxQuant and MSstats for the analysis of label-free data

Author(s)	Melanie Föll Matthias Fahrner
Reviewers

Overview
Questions:

How to perform label-free shotgun (DDA) data analysis in MaxQuant and MSstats?

Which proteins are differentially abundant in the two types of cutaneous squamous cell carcinomas?

Objectives:

Learn how to use MaxQuant and MSstats for the analysis of label-free shotgun (DDA) data

Requirements:

Introduction to Galaxy Analyses

tutorial Hands-on: Label-free data analysis using MaxQuant

Time estimation: 2 hours

Level: Intermediate Intermediate

Supporting Materials:

Workflows

galaxy-history-answer Answer Histories

UseGalaxy.eu - MaxQuant and MSstats
2025-09-21

help How to Use This

FAQs

video Recordings

video Tutorial (February 2021) - 1h

video View All

instances Available on these Galaxies

Possibly Working

UseGalaxy.cz

UseGalaxy.eu

UseGalaxy.no

UseGalaxy.org (Main)

UseGalaxy.org.au

Published: Feb 17, 2021

Last modification: Sep 22, 2025

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

purl PURL: https://gxy.io/GTN:T00219

rating Rating: 5.0 (2 recent ratings, 10 all time)

version Revision: 19

Modern mass spectrometry-based proteomics enables the identification and quantification of thousands of proteins. Therefore, quantitative mass spectrometry represents an indispensable technology for biological and clinical research. Statistical analyses are required for the unbiased answering of scientific questions and to uncover all important information in the proteomic data. Classical statistical approaches and methods from other omics technologies are not ideal because they do not take into account the speciality of mass spectrometry data that include several thousands of proteins but often only a few dozens of samples (referred to as ‘curse of dimensionality’) and stochastic data properties that reflect sample preparation and spectral acquisition (Choi 2014).

In this training we will cover the full analysis workflow from label-free, data dependent acquisition (DDA) raw data to statistical results. We’ll use two popular quantitative proteomics software: MaxQuant and MSstats. MaxQuant allows protein identification and quantification for many different kinds of proteomics data (Cox and Mann 2008). In case you have no previous experience with MaxQuant, we recommend to go through the MaxQuant beginners tutorial before. MSstats provides statistical functionalities to find differentially abundant peptides or proteins from data dependent acquisition (DDA), data independent acquisition (DIA) or single reaction monitoring (SRM) proteomic experiments. The training dataset consists of a skin cancer cohort of 19 patients, which is a subset of a published study Föll et al. 2018. One fifth of all non melanoma skin cancers are cutaneous squamous cell carcinomas (cSCC) that mainly derive from exposure to ultraviolet light. Most cSCC have a good prognosis but the few metastasizing cSCC have dramatically increased mortality. Here, we compare these metastasizing cSCC to cSCC in patients with the genetic disease recessive dystrophic epidermolysis bullosa (RDEB). RDEB is a genetic skin blistering and extracellular matrix disease caused by collagen VII deficiency. To investigate molecular differences between these two aggressive cSCCs with different origin, we used global proteomic analysis of formalin-fixed paraffin-embedded human cSCC tissues.

Agenda

In this tutorial, we will cover:

Get data

MaxQuant analysis

MSstats analysis

More details on MSstats

Follow up on MSstats results

Filtering MSstats results

Finding differentially abundant proteins

Follow up on differentially abundant proteins

Get data

The annotation file, group comparison file and FASTA file for this training is deposited at Zenodo. It is of course possible to use another FASTA file with human proteome sequences, but to ensure that the results are compatible we recommend to use the provided FASTA file. MaxQuant not only adds known contaminants to the FASTA file, but also generates the “decoy” hits for false discovery rate estimation itself, therefore the FASTA file is not allowed to have decoy entries. To learn more about FASTA files, have a look at Protein FASTA Database Handling tutorial. The raw data is available via the PRIDE repository. As this is a real life study, the raw data sizes are large and computation time in MaxQuant is long. To save time and storage capacity, you can skip downloading the raw data and the MaxQuant run and instead continue with the MaxQuant outputs which we provide later on. In this case skip the data upload steps 5-8 which are only necessary for the MaxQuant run.

Hands On: Data upload
Create a new history for this tutorial and give it a meaningful name

To create a new history simply click the new-history icon at the top of the history panel:
Import the FASTA database, annotation file and comparison matrix from Zenodo
https://zenodo.org/records/17177151/files/input_protein_database.fasta
https://zenodo.org/records/17177151/files/input_annotation_file.tabular
https://zenodo.org/records/17177151/files/input_comparison_matrix.tabular
Copy the link location

Click galaxy-upload Upload at the top of the activity panel

Select galaxy-wf-edit Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window
Once the files are green, rename the fasta file into ‘protein database’, the annotation file into ‘annotation file’ and the comparison matrix file into ‘comparison matrix’.

Click on the galaxy-pencil pencil icon for the dataset to edit its attributes

In the central panel, change the Name field

Click the Save button
You can save time and storage capacity when running this tutorial by skipping the first step and not running MaxQuant. In that case, upload the MaxQuant results listed in the MaxQuant analysis section. If you want to run MaxQuant, import the raw data from PRIDE in a collection:
Open the upload menu and slect the “Rule Based” tab.
Paste the data urls:
https://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment105_metast_cSCC1.raw
https://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment106_metast_cSCC2.raw
https://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment107_metast_cSCC3.raw
https://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment109_metast_cSCC4.raw
https://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment116_metast_cSCC5.raw
https://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment117_metast_cSCC6.raw
https://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment118_metast_cSCC7.raw
https://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment119_metast_cSCC8.raw
https://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment120_metast_cSCC9.raw
https://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment121_metast_cSCC10.raw
https://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment122_metast_cSCC11.raw
https://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment123_metast_cSCC12.raw
https://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment124_metast_cSCC13.raw
https://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment110_RDEB_cSCC1.raw
https://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment111_RDEB_cSCC2.raw
https://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment112_RDEB_cSCC3.raw
https://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment113_RDEB_cSCC4.raw
https://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment114_RDEB_cSCC5.raw
https://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment115_RDEB_cSCC6.raw
Upload Type Collections

Click on Build

We need our datasets to be named ‘metast_cSCC1.raw’, ‘metast_cSCC2.raw’, etc.. The naming for the raw files have to be exactly this way to later match the file names provided in the MSstats annotation file.

Select + Column, using a Regular Expression, then “Create columns matching expression groups”. Use the expression Experiment[0-9]+_(.+) and Click Apply at the end. This will extract the file names from the url.

Add column with the fixed value thermo.raw to specify the format of the files

Add column with the fixed value raw_files. This will be the collection name.

We will now add columns definitions: Select + Rules, then Add/Modify Column definitions. Select + Add Defenitions and Set the URL as column A, List Identifiers as column B, Collection Name as column D, and Type as column C. Then Apply.

Click on Upload to start the upload.
Open image in new tab

Figure 1: Rule Based Upload

MaxQuant analysis

The run time of MaxQuant tool depends on the number and size of the input files and on the chosen parameters. The run of the training datasets will take a few hours, but the training can be directly continued with the MaxQuant result files from Zenodo. We start the MaxQuant run with the default parameters, with a few adjustments. Protein level quantification parameters do not really matter here, because MSstats will use feature quantifications and perform protein summarization based on them. A quality control report is generated with the PTXQC functionality Bielow et al. 2015 that is directly implemented in the MaxQuant Galaxy tool. To continue with statistical analysis in MSstats, the Protein Groups and the Evidence files are needed from MaxQuant.

Hands On: Optional: MaxQuant analysis

MaxQuant ( Galaxy version 2.0.3.0+galaxy0) with the following parameters:

In “Input Options”:

param-file “FASTA files”: protein database

In “Search Options”:

“minimum unique peptides”: 1

“Match between runs”: yes

In “Parameter Group”:

param-collection “Infiles”: raw_files

“Generate PTXQC (proteomics quality control pipeline) report? (experimental setting)”: True

In “Output Options”:

“Select the desired outputs.”: Protein Groups Evidence

Because the MaxQuant run takes really long, we recommend to download the MaxQuant results from Zenodo and continue with the tutorial.

Hands On: Load MaxQuant results from Zenodo
Import the files from Zenodo
https://zenodo.org/records/17177151/files/MaxQuant%20Evidence.tabular
https://zenodo.org/records/17177151/files/MaxQuant%20Protein%20Groups.tabular
https://zenodo.org/records/17177151/files/PTXQC%20report.pdf
Rename the files Evidence, Protein Groups, and PTXQC report respectively.

Question

How many proteins and features were identified in total?

In which columns (number) are the potential contaminants in the protein group and evidence file respectively?

How large is the proportion of potential contaminants?

2512 protein groups and ~240000 features were found in total (number of lines of protein group and evidence files)

They are in column 138 (protein groups) and 50 (evidence)

Up to ~60% of the samples intensities derive from potential contaminants (PTXQC plots page 6)

MSstats analysis

The protein groups and evidence files of MaxQuant can directly be input into MSstats. MSstats automatically removes all proteins that are labelled as contaminants (‘+’ sign in the column ‘potential contaminant’ of both MaxQuant outputs). However, in this skin dataset we expect that the skin proteins are part of the sample and not a contamination. Therefore, we keep all human contaminants by first removing non human proteins (by selecting only lines that contain the word ‘HUMAN’ or a word from the header line) and then for the human potential contaminants remove the ‘+’ (replacing ‘+’ with ‘‘(empty field) in the ‘potential contaminant’ column) to keep them for the analysis.

We use the modified MaxQuant protein groups and evidence files as input in MSstats. In addition, an annotation file that describes the experimental design and a comparison matrix is needed. Please start the MSstats run first and while it is running you can find more details on its parameters below.

Hands On: MSstats Analysis

Select with the following parameters:

param-file “Select lines from”: Protein Groups (output of MaxQuant tool)

“the pattern”: (HUMAN)|(Majority)

Once finished, rename the output select protein groups

Select with the following parameters:

param-file “Select lines from”: Evidence (output of MaxQuant tool)

“the pattern”: (HUMAN)|(Sequence)

Once finished, rename the output select evidence

Replace Text in a specific column ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “File to process”: select protein groups (output of Select tool)

In “Replacement”: - “in column”: Column: 138 - “Find pattern”: +

Once finished, rename the file into protein groups input for MSstats

Replace Text in a specific column ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “File to process”: select evidence (output of Select tool)

In “Replacement”: - “in column”: Column: 50 - “Find pattern”: +

Once finished, rename the file into evidence input for MSstats

MSstats ( Galaxy version 4.0.0+galaxy1) with the following parameters:

“input source”: MaxQuant

param-file “evidence.txt - feature-level data”: evidence input for MSstats (output of Replace tool)

param-file “proteinGroups.txt - protein-level data”: protein groups input for MSstats (output of Replace tool)

param-file “annotation file”: annotation file

“Select Protein ID in evidence.txt”: Leading razor protein column

In “MaxQtoMSstatsFormat Options”:

“Remove the proteins which have only 1 peptide and charge”: Yes

In “dataProcess Options”:

“Select outputs”: MSstats log MSstats FeatureLevelData MSstats ProteinLevelData Sample Quantification Matrix Table Group Quantification Matrix Table

In “dataProcess Plot Options”:

“Select visualization outputs”: MSstats QCPlot

“Select protein IDs to draw plots”: Option for QC plot: "allonly" will generate one QC plot with all proteins

In “Advanced visualization parameters”:

“Angle of labels represented each condition at the top of graph”: 0

“Compare Groups”: Yes

param-file “Comparison Matrix”: comparison_matrix

“Select outputs”: MSstats ComparisonResult

In “Comparison Visualization Options”:

“Select visualization outputs”: MSstats VolcanoPlot

In “Advanced visualization parameters”:

“Involve fold change cutoff or not for volcano plot or heatmap”: 1.5

“For volcano plot or heatmap, logarithm transformation of adjusted p-valuewith base 2 or 10”: 2

“Display protein names in Volcano Plot”: No

Question

How many proteins were removed as potential non human contaminants?

How many proteins were included into the statistical analysis?

27 (2518 lines in protein group file minus 2491 lines after select)

1693 (MSstats log)

More details on MSstats

MSstats is designed for statistical modelling of mass spectrometry based proteomic data Choi et al. 2014. Proteomic data analysis requires statistical approaches that reduce bias and inefficiencies and distinguish systematic variation from random artifacts Käll and Vitek 2011.

MSstats is directly compatible with the output of several quantitative proteomics software. In addition to the results of the proteomics software an annotation file is needed as input. The annotation file describes the experimental design such as the conditions, biological and technical replicates. To be compatible with MaxQuant results, an additional column with the label type is needed, which only contains L (light) in DDA experiments. A wrong setup of the annotation file is the most common source of errors in MSstats, thus we collected more information in the box below to allow you to adjust the annotation file when analyzing your own experiments.

For label-free MaxQuant data, the annotation file should have 5 columns with exactly these headers: Raw.file, Condition, BioReplicate, Run; IsotopeLabelType

Raw.file: The names must match exactly to the file names in the MaxQuant evidence.txt “Raw file” column. (e.g. “file1.raw.thermo”).

Condition: The conditions which will be compared in the statistical modelling. They are not allowed to start with a number or contain any special characters except for ‘_’.

BioReplicate: This column should contain a unique identifier for each biological replicate in the experiment. For example, in a clinical proteomic investigation this should be a unique patient id. If technical replicates are present, all samples from the same biological replicate should have the same id but different run ids. MSstats automatically detects the presence of technical replicates and accounts for them in the model-based analysis.

Run: This column contains the identifier of a mass spectrometry run. Each mass spectrometry run should have a unique identifier (number or name), regardless of the origin of the biological sample.

IsotopeLabelType: This is L (light) for all MaxQuant DDA experiments.

MSstats will compare all conditions that are indicated in the comparison matrix. The comparison matrix has to be setup correctly to avoid errors and wrong statistical modelling. It contains a first column that gives the comparisons a name and one column per condition. This matrix is filled with 1 and -1 to specify the conditions that are compared in each comparison and with 0 for conditions that are not part of the comparison.

The first column of the comparison matrix contains the names of the comparisons and should have ‘names’ as header. These names will be used in all MSstats output files, therefore it is important that the names are meaningful and reflect the actual comparison (see below)

An additional column for each condition that is present in the data. This means each condition present in the annotation file has to be a separate column even when the condition will not be used for any comparison. The header should contain the condition name exactly as written in the annotation file.

Fill the matrix: Use 1 and -1 to indicate the conditions to compare and 0 for conditions that are not compared. Multiple groups can be combined by using 0.5.

Example: to compare condition1 with condition2: write 1 into the column of condition1 and write -1 in the column of condition2. In the first column of the matrix name this comparison condition1-condition2. The naming of the comparison should reflect the direction of the comparison and thus always have the condition that is set to 1 first and the condition that is set to -1 second (condition1-condition2 and NOT condition2-condition1).

The first analysis step in MSstats is the conversion of the input data into an MSstats compatible table. For this step several parameters to filter and adjust the input data can be selected. We keep the default parameters and only change one parameter in order to remove proteins which have only a single peptide measurement. Next, data processing optimizes the data for statistical modelling via log-transformation, and normalization of intensities, feature selection, missing value imputation, and run-level summarization. Log- transformation is performed to transform multiplicative signals to additive signals which are compatible with linear statistical models and bring the intensity distribution close to a normal distribution. Furthermore, it changes the dependence of variances from the intensity values: in the raw data larger intensities have larger variances but after log transformation lower intensities have larger variances.

Normalization aims to make the intensities of different runs more comparable to each other. The default normalization method, equalize medians, assumes that the majority of proteins do not change across runs and shifts all intensities of a run by a constant to obtain equal median intensities across runs. A feature in label-free DDA data corresponds to a peptide at a given charge state (m/z value), resulting from the identification of MS2 spectra combined with the quantitative information from the MS1 scans. Feature selection allows the use of either all, only the most abundant features or only high quality peptides for protein summarization.

Missing values and noisy features with outliers are typical in label-free DDA datasets but influence protein summarization. Therefore, it is recommended to perform missing value imputation. Missing values are reported differently in different Softwares. MaxQuant reports them as NA and MSstats assumes that missing intensity values from MaxQuant mean that the intensity was below the limit of quantification. This means the values are not missing for random but for the reason of low abundance. Therefore, the values are only partially known and called “censored”. This may also be the case for very low intensity values, which might not be reliable. The percentile that is not trusted and should be considered a censored value is defined via the “Maximum quantile for deciding censored missing values” parameter. Censored values are replaced by an intensity that is generated via an accelerated failure time model (AFT). Alternatively censored values may be replaced by the minimum value of the features, runs or both as defined in the “Cutoff value for censoring”. Runs with no intensity measurement for a protein will be removed for any further calculation on this protein.

Protein summarization is by default performed via Tukey’s median polish for robust parameter estimation with median across rows and columns. Run-level summaries are later used for statistical group comparison.

Any two groups can be compared to find differentially abundant proteins between them. MSstats uses a family of linear mixed models that are automatically adjusted for the comparison type according to the information in the annotation file, such as conditions, biological and technical replicates and runs. This allows comparison of groups with different sizes; comparison of the mean of some groups, paired designs and time course experiments.

Follow up on MSstats results

We obtain several output files from MSstats. MSstats log file contains the MSstats report with warnings and information about the analysis steps. The MSstats QCPlot visualizes the log transformed intensities for all proteins and runs. The volcano plot plots the statistical result as transformed p-values vs. the log2 fold change. A fold change of 1.5 means that a protein is 50% more abundand in one condition than the other. The log2 fold change is 0.58.

qc plot. — **Figure 2**: QC plot of all proteins

volcano plot. — **Figure 3**: Volcano plot showing p-values and log10 fold changes for all proteins. Dashed line indicates p-value of 0.05 and log2 fold change of ± 0.58

The FeatureLevelData file contains the transformed, normalized and imputed intensities for each peptide in each run. ProteinLevelData data summarizes intensities per run and protein. We’ll count and visualize the number of features per run and calculate the distribution of proteins per sample.

Hands On: Follow up on MSstats results

Summary Statistics with the following parameters:

param-file “Summary statistics on”: ProteinLevelData (output of MSstats tool)

“Column or expression”: c8

Datamash ( Galaxy version 1.9+galaxy0) with the following parameters:

param-file “Input tabular dataset”: ProteinLevelData (output of MSstats tool)

“Group by fields”: 4

“Sort input”: Yes

“Input file has a header line”: Yes

“Print header line”: Yes

In “Operation to perform on each group”:

“Type” : count

“On column”: Column: 1

Click on galaxy-barchart “Visualize this data” on the Datamash tool result.

Select Bar, Line and Scatter

Click Expand galaxy-vis-config at the top right of the chart.

“Provide a title”: Number of features per sample

“Column of x-axis values”: Column: 1

Save galaxy-save (file is saved under “Visualization” –> “Saved Visualizations”)

Question

Which sample has the lowest amount of proteins after protein summarization?

In the complete experiment, how many features has a protein on average?

RDEB cSCC4

Open image in new tab

Figure 4: Number of proteins per sample (run)

Around 6 features per protein (mean in summary statistics).

Filtering MSstats results

The comparison result table summarizes the statistical results per protein and comparison. First, we keep only the Uniprot ID in column 1 to make the ID less cluttered. This is done by deleting everything before the first pipe ‘|’ and everything after the second pipe ‘|’. Then we keep only statistically significant proteins that means they have an adjusted p-value below 0.05. Next, we separate up- and down-regulated proteins by filtering for a positive and negative log2FC. The Sample Quantification Matrix Table contains the summarized intensities per protein and sample. In order to make its IDs compatible with the ones from the comparison result at a later step, we keep only the Uniprot ID as well.

Hands On: Filtering MSstats results

Replace Text in a specific column ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “File to process”: Comparison Result (output of MSstats tool)

In “Replacement”:

“in column”: Column: 1

“Find pattern”: sp\|

param-repeat “Insert Replacement”

“in column”: Column: 1

“Find pattern”: \|.*

Once finished, rename the output replaced comparison result

Filter data on any column using simple expressions with the following parameters:

param-file “Filter”: replaced comparison result (output of Replace Text tool)

“With following condition”: c8<0.05

“Number of header lines to skip”: 1

Once finished, rename the file into significant proteins

Filter on any column using simple expressions with the following parameters:

param-file “Filter”: significant proteins (output of Filter tool)

“With following condition”: c3>0.58

“Number of header lines to skip”: 1

Add a tag #metastasized to the filtered file and rename it into metastasized filtered

Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

To tag a dataset:

Click on the dataset to expand it

Click on Add Tags galaxy-tags

Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).

Press Enter

Check that the tag appears below the dataset name

Tags beginning with # are special!

They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parenthesis correspond to dataset numbers in the figure below):

a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;

dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);

datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;

datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

More information is in a dedicated #nametag tutorial.

Filter on any column using simple expressions with the following parameters:

param-file “Filter”: significant proteins (output of the first Filter tool)

“With following condition”: c3<-0.58

“Number of header lines to skip”: 1

Add a tag #rdeb to the filtered file and rename it into rdeb filtered

Question

Why do we filter for the adjusted p-value?

How many proteins have a p-value below 0.05?

Adjusted p-values control for the multiplicity of testing. Since we fit a separate model, and conduct a separate comparison for each protein, the number of tests equals the number of comparisons. A 0.05 cutoff of an adjusted p-value controls the False Discovery Rate in the collection of tests over all the proteins at 5%. Since they account for the multiplicity, adjusted p-values are more conservative (i.e. it is more difficult to detect a change).

198 in total (first filtering step); 178 are upregulated in metastasized cSCC (metastasized filtered) and 15 are upregulated in RDEB cSCC (rdeb filtered).

Finding differentially abundant proteins

For each condition we select only the significant proteins, which are proteins with a p-value above 0 and below 0.05. Proteins with a p-value of 0 are missing in one condition and are therfore discarded in the next steps. We’ll keep only the column with the Uniprot ID and extract the average protein intensities per sample from the sample quantification matrix file and vizualize them at heatmap. We do the exact same steps for both conditions, therefore, each time you start a tool (except for the replace step) you can use the multiple input file to start the step for the metastasized and rdeb files at the same time.

Hands On: filter differentially abundant proteins

Filter on any column using simple expressions with the following parameters:

param-file “Filter”: metastasized filtered (output of Filter tool)

“With following condition”: c8>0

“Number of header lines to skip”: 1

Rename the file into significant metastasized

Cut with the following parameters:

“Cut columns”: c1

param-file “From”: significant metastasized (output of last Filter tool)

Rename the file into metastasized cut

Filter on any column using simple expressions with the following parameters:

param-file “Filter”: rdeb filtered (output of Filter tool)

“With following condition”: c8>0

“Number of header lines to skip”: 1

Rename the file into significant rdeb

Cut with the following parameters:

“Cut columns”: c1

param-file “From”: significant rdeb (output of last Filter tool)

Rename the file into rdeb cut

Replace Text in a specific column ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “File to process”: Sample Quantification Matrix (output of MSstats tool)

In “Replacement”:

“in column”: Column: 1

“Find pattern”: sp\|

param-repeat “Insert Replacement”

“in column”: Column: 1

“Find pattern”: \|.*

Rename the file into replaced sample quantification matrix

Join two files ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “1st file”: replaced sample quantification matrix (output of Replace text tool)

“Column to use from 1st file”: Column: 1

param-file “2nd File”: metastasized cut (output of Cut tool)

“Column to use from 2nd file”: Column: 1

“First line is a header line”: Yes

Rename the file into metastasized join

Join two files ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “1st file”: replaced sample quantification matrix (output of Replace Text tool)

“Column to use from 1st file”: Column: 1

param-file “2nd File”: rdeb cut (output of Cut tool)

“Column to use from 2nd file”: Column: 1

“First line is a header line”: Yes

Rename the file into rdeb join

heatmap2 ( Galaxy version 3.2.0+galaxy1) with the following parameters:

param-file “Input should have column headers - these will be the columns that are plotted”: metastasized join (output of Join tool)

“Plot title”: Upregulated proteins in metastasized cSCC

“Scale data on the plot (after clustering)”: Scale my data by row

“Enable data clustering”: No

“Output format “: PNG

heatmap2 ( Galaxy version 3.2.0+galaxy1) with the following parameters:

param-file “Input should have column headers - these will be the columns that are plotted”: rdeb join (output of Join tool)

“Plot title”: Upregulated proteins in RDEB cSCC

“Scale data on the plot (after clustering)”: Scale my data by row

“Enable data clustering”: No

“Output format “: PNG

Question

How many proteins are differentially abundant?

125 are upregulated in metastasized cSCC and 14 are upregulated in RDEB cSCC (number of lines minus 1 in the first filtering step per condition).

Follow up on differentially abundant proteins

In addition we retrieve for each Uniprot ID the corresponding protein names from uniprot to allow an easier interpretation.

Hands On: MSstats visualizations

UniProt ID mapping and retrieval ( Galaxy version 0.7) (Do not use 0.6, you may need to change the version on some instances) with the following parameters:

param-file “Input file with IDs”: metastasized join (output of Join tool)

“ID column”: Column: 1

“Do you want to map IDs or retrieve data from UniProt”: Retrieve: request entries by uniprot accession using batch retrieval

Rename the file into metastasized uniprot

UniProt ID mapping and retrieval ( Galaxy version 0.7) (Do not use 0.6, you may need to change the version on some instances) with the following parameters:

param-file “Input file with IDs”: rdeb join (output of Join tool)

“ID column”: Column: 1

“Do you want to map IDs or retrieve data from UniProt”: Retrieve: request entries by uniprot accession using batch retrieval

Rename the file into rdeb uniprot

FASTA-to-Tabular ( Galaxy version 1.1.1) with the following parameters:

param-file “Convert these sequences”: metastasized uniprot (output of UniProt tool)

“How many columns to divide title string into?”: 2

FASTA-to-Tabular ( Galaxy version 1.1.1) with the following parameters:

param-file “Convert these sequences”: rdeb uniprot (output of UniProt tool)

“How many columns to divide title string into?”: 2

Three of the differentially abundant proteins found here were also found and stained with antibodies in the original publication: Collagen XIV which is higher in RDEB cSCC than in metastasizing cSCC and Serum amyloid P-component as well as X-ray repair cross-complementing protein 6, which are both higher in metastasizing cSCC than in RDEB cSCC. Collagen XIV is a fibril associated collagen which may have tissue stabilizing function in the dermis. The upregulation of collagen XIV as well as other collagens in RDEB cSCC could be a compensation effort for the impaired collagen VII in RDEB tissues. Collagen VII is indeed only found in some RDEB samples and with such low intensities that it appears as upregulated in metastasized cSCC.

col14 staining. — **Figure 5**: Immunoflourescence staining of collagen XIV in RDEB and metastasizing cSCC skin tissues

You've Finished the Tutorial

Key points

MaxQuant offers a single tool solution for protein identification and quantification.

Label-free quantitation reveals the most abundant proteins in serum samples.

Frequently Asked Questions

Have questions about this tutorial? Have a look at the available FAQ pages and support channels

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

References

Käll, L., and O. Vitek, 2011 Computational Mass Spectrometry–Based Proteomics (F. Lewitter, Ed.). PLoS Computational Biology 7: e1002277. 10.1371/journal.pcbi.1002277
Choi, M., C.-Y. Chang, T. Clough, D. Broudy, T. Killeen et al., 2014 MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments. Bioinformatics 30: 2524–2526. 10.1093/bioinformatics/btu305 https://academic.oup.com/bioinformatics/bioinformatics/article/2748156/MSstats: ISBN: 1367-4811 (Electronic) 1367-4803 (Linking)
Bielow, C., G. Mastrobuoni, and S. Kempa, 2015 Proteomics Quality Control: Quality Control Software for MaxQuant Results. Journal of Proteome Research 15: 777–787. 10.1021/acs.jproteome.5b00780
Föll, M. C., M. Fahrner, C. Gretzmeier, K. Thoma, M. L. Biniossek et al., 2018 Identification of tissue damage, extracellular matrix remodeling and bacterial challenge as common mechanisms associated with high-risk cutaneous squamous cell carcinomas. Matrix Biology 66: 1–21. 10.1016/j.matbio.2017.11.004

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Melanie Föll, Matthias Fahrner, MaxQuant and MSstats for the analysis of label-free data (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/maxquant-msstats-dda-lfq/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{proteomics-maxquant-msstats-dda-lfq,
author = "Melanie Föll and Matthias Fahrner",
	title = "MaxQuant and MSstats for the analysis of label-free data (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/maxquant-msstats-dda-lfq/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/proteomics/tutorials/maxquant-msstats-dda-lfq/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: text_processing
  owner: bgruening
  revisions: 6073bb457ec0
  tool_panel_section_label: Text Manipulation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: text_processing
  owner: bgruening
  revisions: 6073bb457ec0
  tool_panel_section_label: Text Manipulation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: uniprot_rest_interface
  owner: bgruening
  revisions: 468c71dac78a
  tool_panel_section_label: Convert Formats
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: fasta_to_tabular
  owner: devteam
  revisions: e7ed3c310b74
  tool_panel_section_label: FASTA/FASTQ
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: maxquant
  owner: galaxyp
  revisions: 9c52362ae7bb
  tool_panel_section_label: Proteomics
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: msstats
  owner: galaxyp
  revisions: de7593c29499
  tool_panel_section_label: Proteomics
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: datamash_ops
  owner: iuc
  revisions: 4c07ddedc198
  tool_panel_section_label: Join, Subtract and Group
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: ggplot2_heatmap2
  owner: iuc
  revisions: 48b42ee61bc5
  tool_panel_section_label: Graph/Display Data
  tool_shed_url: https://toolshed.g2.bx.psu.edu/

5 stars 4

4 stars 5

3 stars 1

January 2026

5 stars: Liked: The support material was very understandable and precise Disliked: Everything looked perfect

May 2025

5 stars: Liked: it is easy to follow the instruction Disliked: adding the output to each step so we can check

June 2023

4 stars: Liked: It is easy to follow and explains the relevant background well Disliked: The MSstats volcano plot from the tutorial results in two proteins at the extreme ends (top left and right corners). This is seen in other datasets as well. The tutorial does not explain why this artefact occurs or whether these are true protein hits

May 2023

4 stars: Liked: Very clear presentation and rather easy to follow. I really enjoyed it. Disliked: The filtering part was a extremely fast. ;)

March 2022

5 stars: Liked: Great introduction to proteomics tools

July 2021

4 stars: Liked: the interactive tools Disliked: the tools were difficult to find. please mention the tools very specifically in the tutorial