MaxQuant and MSstats for the analysis of label-free data

Author(s) orcid logoMelanie Föll avatar Melanie Föllorcid logoMatthias Fahrner avatar Matthias Fahrner
Reviewers Björn Grüning avatarMelanie Föll avatarHelena Rasche avatar
Overview
Creative Commons License: CC-BY Questions:
  • How to perform label-free shotgun (DDA) data analysis in MaxQuant and MSstats?

  • Which proteins are differentially abundant in the two types of cutaneous squamous cell carcinomas?

Objectives:
  • Learn how to use MaxQuant and MSstats for the analysis of label-free shotgun (DDA) data

Requirements:
Time estimation: 2 hours
Level: Intermediate Intermediate
Supporting Materials:
Published: Feb 17, 2021
Last modification: Sep 3, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
purl PURL: https://gxy.io/GTN:T00219
rating Rating: 4.0 (0 recent ratings, 5 all time)
version Revision: 14

Modern mass spectrometry-based proteomics enables the identification and quantification of thousands of proteins. Therefore, quantitative mass spectrometry represents an indispensable technology for biological and clinical research. Statistical analyses are required for the unbiased answering of scientific questions and to uncover all important information in the proteomic data. Classical statistical approaches and methods from other omics technologies are not ideal because they do not take into account the speciality of mass spectrometry data that include several thousands of proteins but often only a few dozens of samples (referred to as ‘curse of dimensionality’) and stochastic data properties that reflect sample preparation and spectral acquisition (Choi 2014).

In this training we will cover the full analysis workflow from label-free, data dependent acquisition (DDA) raw data to statistical results. We’ll use two popular quantitative proteomics software: MaxQuant and MSstats. MaxQuant allows protein identification and quantification for many different kinds of proteomics data (Cox and Mann 2008). In case you have no previous experience with MaxQuant, we recommend to go through the MaxQuant beginners tutorial before. MSstats provides statistical functionalities to find differentially abundant peptides or proteins from data dependent acquisition (DDA), data independent acquisition (DIA) or single reaction monitoring (SRM) proteomic experiments. The training dataset consists of a skin cancer cohort of 19 patients, which is a subset of a published study Föll et al. 2018. One fifth of all non melanoma skin cancers are cutaneous squamous cell carcinomas (cSCC) that mainly derive from exposure to ultraviolet light. Most cSCC have a good prognosis but the few metastasizing cSCC have dramatically increased mortality. Here, we compare these metastasizing cSCC to cSCC in patients with the genetic disease recessive dystrophic epidermolysis bullosa (RDEB). RDEB is a genetic skin blistering and extracellular matrix disease caused by collagen VII deficiency. To investigate molecular differences between these two aggressive cSCCs with different origin, we used global proteomic analysis of formalin-fixed paraffin-embedded human cSCC tissues.

Agenda

In this tutorial, we will cover:

  1. Get data
  2. MaxQuant analysis
  3. MSstats analysis
    1. More details on MSstats
  4. Follow up on MSstats results
  5. Filtering MSstats results
  6. Finding differentially abundant proteins
  7. Follow up on differentially abundant proteins

Get data

The annotation file, group comparison file and FASTA file for this training is deposited at Zenodo. It is of course possible to use another FASTA file with human proteome sequences, but to ensure that the results are compatible we recommend to use the provided FASTA file. MaxQuant not only adds known contaminants to the FASTA file, but also generates the “decoy” hits for false discovery rate estimation itself, therefore the FASTA file is not allowed to have decoy entries. To learn more about FASTA files, have a look at Protein FASTA Database Handling tutorial. The raw data is available via the PRIDE repository. As this is a real life study, the raw data sizes are large and computation time in MaxQuant is long. To save time and storage capacity, you can skip downloading the raw data and the MaxQuant run and instead continue with the MaxQuant outputs which we provide later on. In this case skip the data upload steps 5-8 which are only necessary for the MaxQuant run.

Hands-on: Data upload
  1. Create a new history for this tutorial and give it a meaningful name

    To create a new history simply click the new-history icon at the top of the history panel:

    UI for creating new history

  2. Import the FASTA database, annotation file and comparison matrix from Zenodo

    https://zenodo.org/record/4896554/files/input_protein_database.fasta
    https://zenodo.org/record/4896554/files/input_annotation_file.tabular
    https://zenodo.org/record/4896554/files/input_comparison_matrix.tabular
    
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

  3. Once the files are green, rename the fasta file into ‘protein database’, the annotation file into ‘annotation file’ and the comparison matrix file into ‘comparison matrix’.

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field
    • Click the Save button

  4. Steps 4 to 7 can be skipped to save time and storage capacity by not running MaxQuant. To run MaxQuant, import the raw data from PRIDE.

    ftp://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment105_metast_cSCC1.raw
    ftp://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment106_metast_cSCC2.raw
    ftp://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment107_metast_cSCC3.raw
    ftp://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment109_metast_cSCC4.raw
    ftp://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment116_metast_cSCC5.raw
    ftp://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment117_metast_cSCC6.raw
    ftp://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment118_metast_cSCC7.raw
    ftp://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment119_metast_cSCC8.raw
    ftp://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment120_metast_cSCC9.raw
    ftp://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment121_metast_cSCC10.raw
    ftp://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment122_metast_cSCC11.raw
    ftp://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment123_metast_cSCC12.raw
    ftp://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment124_metast_cSCC13.raw
    ftp://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment110_RDEB_cSCC1.raw
    ftp://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment111_RDEB_cSCC2.raw
    ftp://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment112_RDEB_cSCC3.raw
    ftp://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment113_RDEB_cSCC4.raw
    ftp://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment114_RDEB_cSCC5.raw
    ftp://ftp.pride.ebi.ac.uk/pride-archive/2017/11/PXD006914/Experiment115_RDEB_cSCC6.raw
    
    
  5. Rename the raw datasets into ‘metast_cSCC1.raw’, ‘metast_cSCC2.raw’, etc.. The naming for the raw files have to be exactly this way to later match the file names provided in the MSstats annotation file.

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field
    • Click the Save button

  6. Control that the data type of the raw files is ‘thermo.raw’ otherwise change the datatype into ‘thermo.raw’

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click galaxy-chart-select-data Datatypes tab on the top
    • In the galaxy-chart-select-data Assign Datatype, select thermo.raw from “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

  7. Generate a collection for all raw files and name it ‘raw_files’, hide the individual raw files

    • Click on galaxy-selector Select Items at the top of the history panel Select Items button
    • Check all the datasets in your history you would like to include
    • Click n of N selected and choose Build Dataset List

      build list collection menu item

    • Enter a name for your collection
    • Click Create collection to build your collection
    • Click on the checkmark icon at the top of your history again

MaxQuant analysis

The run time of MaxQuant tool depends on the number and size of the input files and on the chosen parameters. The run of the training datasets will take a few hours, but the training can be directly continued with the MaxQuant result files from Zenodo. We start the MaxQuant run with the default parameters, with a few adjustments. Protein level quantification parameters do not really matter here, because MSstats will use feature quantifications and perform protein summarization based on them. A quality control report is generated with the PTXQC functionality Bielow et al. 2015 that is directly implemented in the MaxQuant Galaxy tool. To continue with statistical analysis in MSstats, the Protein Groups and the Evidence files are needed from MaxQuant.

Hands-on: Optional: MaxQuant analysis
  1. MaxQuant ( Galaxy version 1.6.10.43+galaxy3) with the following parameters:
    • In “Input Options”:
      • param-file “FASTA files”: protein database
    • In “Search Options”:
    • “minimum unique peptides”: 1
    • “Match between runs”: yes
    • In “Parameter Group”:
      • param-collection “Infiles”: raw_files
    • “Generate PTXQC (proteomics quality control pipeline) report?”: Yes
    • In “Output Options”:
      • “Select the desired outputs.”: Protein Groups Evidence

Because the MaxQuant run takes really long, we recommend to download the MaxQuant results from Zenodo and continue with the tutorial.

Hands-on: Load MaxQuant results from Zenodo
  1. Import the files from Zenodo

    https://zenodo.org/record/4896554/files/MaxQuant_Evidence.tabular
    https://zenodo.org/record/4896554/files/MaxQuant_proteingroups.tabular
    https://zenodo.org/record/4896554/files/PTXQC_report.pdf
    
Question
  1. How many proteins and features were identified in total?
  2. In which columns (number) are the potential contaminants in the protein group and evidence file respectively?
  3. How large is the proportion of potential contaminants?
  1. 2622 protein groups and ~240000 features were found in total (number of lines of protein group and evidence files)
  2. They are in column 118 (protein groups) and 54 (evidence)
  3. Up to 60% of the samples intensities derive from potential contaminants (PTXQC plots page 7)

MSstats analysis

The protein groups and evidence files of MaxQuant can directly be input into MSstats. MSstats automatically removes all proteins that are labelled as contaminants (‘+’ sign in the column ‘potential contaminant’ of both MaxQuant outputs). However, in this skin dataset we expect that the skin proteins are part of the sample and not a contamination. Therefore, we keep all human contaminants by first removing non human proteins (by selecting only lines that contain the word ‘HUMAN’ or a word from the header line) and then for the human potential contaminants remove the ‘+’ (replacing ‘+’ with ‘‘(empty field) in the ‘potential contaminant’ column) to keep them for the analysis.

We use the modified MaxQuant protein groups and evidence files as input in MSstats. In addition, an annotation file that describes the experimental design and a comparison matrix is needed. Please start the MSstats run first and while it is running you can find more details on its parameters below.

Hands-on: MSstats Analysis
  1. Select with the following parameters:
    • param-file “Select lines from”: Protein Groups (output of MaxQuant tool)
    • “the pattern”: (HUMAN)|(Majority)
  2. Select with the following parameters:
    • param-file “Select lines from”: Evidence (output of MaxQuant tool)
    • “the pattern”: (HUMAN)|(Sequence)
  3. Replace Text ( Galaxy version 1.1.3) with the following parameters:
    • param-file “File to process”: select protein groups (output of Select tool)
    • In “Replacement”: - “in column”: c118 - “Find pattern”: +
    • Once finished, rename the file into protein groups input for MSstats
  4. Replace Text ( Galaxy version 1.1.3) with the following parameters:
    • param-file “File to process”: select evidence (output of Select tool)
    • In “Replacement”: - “in column”: c54 - “Find pattern”: +
    • Once finished, rename the file into evidence input for MSstats
  5. MSstats ( Galaxy version 4.0.0.0) with the following parameters:
    • “input source”: MaxQuant
      • param-file “evidence.txt - feature-level data”: evidence input for MSstats (output of Replace tool)
      • param-file “proteinGroups.txt - protein-level data”: protein groups input for MSstats (output of Replace tool)
      • param-file “annotation file”: annotation file
      • “Select Protein ID in evidence.txt”: Leading razor protein column
      • In “MaxQtoMSstatsFormat Options”:
        • “Remove the proteins which have only 1 peptide and charge”: Yes
    • In “dataProcess Options”:
      • “Select outputs”: MSstats log MSstats FeatureLevelData ProteinLevelData Group Quantification Matrix Table Sample Quantification Matrix Table
      • In “dataProcess Plot Options”:
        • “Select visualization outputs”: QCPlot
        • “Select protein IDs to draw plots”: Option for QC plot: "allonly" will generate one QC plot with all proteins
        • In “Advanced visualization parameters”:
          • “Angle of labels represented each condition at the top of graph”: 0
    • “Compare Groups”: Yes
      • param-file “Comparison Matrix”: comparison_matrix
      • “Select outputs”: MSstats ComparisonResult
      • In “Comparison Visualization Options”:
        • “Select visualization outputs”: MSstats VolcanoPlot
        • In “Advanced visualization parameters”:
          • “Involve fold change cutoff or not for volcano plot or heatmap”: 1.5
          • “Display protein names in Volcano Plot”: No
Question
  1. How many proteins were removed as potential non human contaminants?
  2. How many proteins were included into the statistical analysis?
  1. 28 (2622 lines in protein group file minus 2594 lines after select)
  2. 1764 (MSstats log)

More details on MSstats

MSstats is designed for statistical modelling of mass spectrometry based proteomic data Choi et al. 2014. Proteomic data analysis requires statistical approaches that reduce bias and inefficiencies and distinguish systematic variation from random artifacts Käll and Vitek 2011.

MSstats is directly compatible with the output of several quantitative proteomics software. In addition to the results of the proteomics software an annotation file is needed as input. The annotation file describes the experimental design such as the conditions, biological and technical replicates. To be compatible with MaxQuant results, an additional column with the label type is needed, which only contains L (light) in DDA experiments. A wrong setup of the annotation file is the most common source of errors in MSstats, thus we collected more information in the box below to allow you to adjust the annotation file when analyzing your own experiments.

For label-free MaxQuant data, the annotation file should have 5 columns with exactly these headers: Raw.file, Condition, BioReplicate, Run; IsotopeLabelType

  1. Raw.file: The names must match exactly to the file names in the MaxQuant evidence.txt “Raw file” column. (e.g. “file1.raw.thermo”).
  2. Condition: The conditions which will be compared in the statistical modelling. They are not allowed to start with a number or contain any special characters except for ‘_’.
  3. BioReplicate: This column should contain a unique identifier for each biological replicate in the experiment. For example, in a clinical proteomic investigation this should be a unique patient id. If technical replicates are present, all samples from the same biological replicate should have the same id but different run ids. MSstats automatically detects the presence of technical replicates and accounts for them in the model-based analysis.
  4. Run: This column contains the identifier of a mass spectrometry run. Each mass spectrometry run should have a unique identifier (number or name), regardless of the origin of the biological sample.
  5. IsotopeLabelType: This is L (light) for all MaxQuant DDA experiments.

MSstats will compare all conditions that are indicated in the comparison matrix. The comparison matrix has to be setup correctly to avoid errors and wrong statistical modelling. It contains a first column that gives the comparisons a name and one column per condition. This matrix is filled with 1 and -1 to specify the conditions that are compared in each comparison and with 0 for conditions that are not part of the comparison.

  1. The first column of the comparison matrix contains the names of the comparisons and should have ‘names’ as header. These names will be used in all MSstats output files, therefore it is important that the names are meaningful and reflect the actual comparison (see below)
  2. An additional column for each condition that is present in the data. This means each condition present in the annotation file has to be a separate column even when the condition will not be used for any comparison. The header should contain the condition name exactly as written in the annotation file.
  3. Fill the matrix: Use 1 and -1 to indicate the conditions to compare and 0 for conditions that are not compared. Multiple groups can be combined by using 0.5.
  4. Example: to compare condition1 with condition2: write 1 into the column of condition1 and write -1 in the column of condition2. In the first column of the matrix name this comparison condition1-condition2. The naming of the comparison should reflect the direction of the comparison and thus always have the condition that is set to 1 first and the condition that is set to -1 second (condition1-condition2 and NOT condition2-condition1).

The first analysis step in MSstats is the conversion of the input data into an MSstats compatible table. For this step several parameters to filter and adjust the input data can be selected. We keep the default parameters and only change one parameter in order to remove proteins which have only a single peptide measurement. Next, data processing optimizes the data for statistical modelling via log-transformation, and normalization of intensities, feature selection, missing value imputation, and run-level summarization. Log- transformation is performed to transform multiplicative signals to additive signals which are compatible with linear statistical models and bring the intensity distribution close to a normal distribution. Furthermore, it changes the dependence of variances from the intensity values: in the raw data larger intensities have larger variances but after log transformation lower intensities have larger variances.

Normalization aims to make the intensities of different runs more comparable to each other. The default normalization method, equalize medians, assumes that the majority of proteins do not change across runs and shifts all intensities of a run by a constant to obtain equal median intensities across runs. A feature in label-free DDA data corresponds to a peptide at a given charge state (m/z value), resulting from the identification of MS2 spectra combined with the quantitative information from the MS1 scans. Feature selection allows the use of either all, only the most abundant features or only high quality peptides for protein summarization.

Missing values and noisy features with outliers are typical in label-free DDA datasets but influence protein summarization. Therefore, it is recommended to perform missing value imputation. Missing values are reported differently in different Softwares. MaxQuant reports them as NA and MSstats assumes that missing intensity values from MaxQuant mean that the intensity was below the limit of quantification. This means the values are not missing for random but for the reason of low abundance. Therefore, the values are only partially known and called “censored”. This may also be the case for very low intensity values, which might not be reliable. The percentile that is not trusted and should be considered a censored value is defined via the “Maximum quantile for deciding censored missing values” parameter. Censored values are replaced by an intensity that is generated via an accelerated failure time model (AFT). Alternatively censored values may be replaced by the minimum value of the features, runs or both as defined in the “Cutoff value for censoring”. Runs with no intensity measurement for a protein will be removed for any further calculation on this protein.

Protein summarization is by default performed via Tukey’s median polish for robust parameter estimation with median across rows and columns. Run-level summaries are later used for statistical group comparison.

Any two groups can be compared to find differentially abundant proteins between them. MSstats uses a family of linear mixed models that are automatically adjusted for the comparison type according to the information in the annotation file, such as conditions, biological and technical replicates and runs. This allows comparison of groups with different sizes; comparison of the mean of some groups, paired designs and time course experiments.

Follow up on MSstats results

We obtain several output files from MSstats. MSstats log file contains the MSstats report with warnings and information about the analysis steps. The MSstats QCPlot visualizes the log transformed intensities for all proteins and runs. The volcano plot plots the statistical result as transformed p-values vs. the log2 fold change. A fold change of 1.5 means that a protein is 50% more abundand in one condition than the other. The log2 fold change is 0.58.

qc plot. Open image in new tab

Figure 1: QC plot of all proteins
volcano plot. Open image in new tab

Figure 2: Volcano plot showing p-values and log2 fold changes for all proteins. Dashed line indicates p-value of 0.05 and log2 fold change of ± 0.58

The FeatureLevelData file contains the transformed, normalized and imputed intensities for each peptide in each run. ProteinLevelData data summarizes intensities per run and protein. We’ll count and visualize the number of features per run and calculate the distribution of proteins per sample.

Hands-on: Follow up on MSstats results
  1. Summary Statistics with the following parameters:
    • param-file “Summary statistics on”: ProteinLevelData (output of MSstats tool)
    • “Column or expression”: c8
  2. Datamash ( Galaxy version 1.0.6) with the following parameters:
    • param-file “Input tabular dataset”: ProteinLevelData (output of MSstats tool)
    • “Group by fields”: 4
    • “Input file has a header line”: Yes
    • “Print header line”: Yes
    • “Sort input”: Yes
    • In “Operation to perform on each group”:
      • param-repeat “Insert Operation to perform on each group”
        • “On column”: c1
  3. Click on galaxy-barchart “Visualize this data” on the Datamash tool result.
    • Select Bar diagram (NVD3)
    • “Provide a title”: Number of features per sample
    • Click Select data galaxy-chart-select-data
    • “Data point labels”: Column: 1
    • Save galaxy-save (file is saved under “User” –> “Visualizations”)
Question
  1. Which sample has the lowest amount of proteins after protein summarization?
  2. In the complete experiment, how many features has a protein on average?
  1. RDEB cSCC4
Number of proteins per sample. Open image in new tab

Figure 3: Number of proteins per sample (run)
  1. Around 6 features per protein (mean in summary statistics).

Filtering MSstats results

The comparison result table summarizes the statistical results per protein and comparison. First, we keep only the Uniprot ID in column 1 to make the ID less cluttered. This is done by deleting everything before the first pipe ‘|’ and everything after the second pipe ‘|’. Then we keep only statistically significant proteins that means they have an adjusted p-value below 0.05. Next, we separate up- and down-regulated proteins by filtering for a positive and negative log2FC. The Sample Quantification Matrix Table contains the summarized intensities per protein and sample. In order to make its IDs compatible with the ones from the comparison result at a later step, we keep only the Uniprot ID as well.

Hands-on: Filtering MSstats results
  1. Replace Text ( Galaxy version 1.1.3) with the following parameters:
    • param-file “File to process”: Comparison Result (output of MSstats tool)
    • In “Replacement”: - “in column”: c1 - “Find pattern”: sp\|
      • param-repeat “Insert Replacement”
        • “in column”: c1
        • “Find pattern”: \|.*
  2. Filter with the following parameters:
    • param-file “Filter”: replaced comparison result (output of Replace Text tool)
    • “With following condition”: c8<0.05
    • “Number of header lines to skip”: 1
    • Once finished, rename the file into significant proteins
  3. Filter with the following parameters:
    • param-file “Filter”: significant proteins (output of Filter tool)
    • “With following condition”: c3>0.58
    • “Number of header lines to skip”: 1
  4. Add a tag #metastasized to the filtered file and rename it into metastasized filtered

    Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

    To tag a dataset:

    1. Click on the dataset to expand it
    2. Click on Add Tags galaxy-tags
    3. Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
    4. Press Enter
    5. Check that the tag appears below the dataset name

    Tags beginning with # are special!

    They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parenthesis correspond to dataset numbers in the figure below):

    1. a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
    2. dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
    3. datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
    4. datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

    A history without name tags versus history with name tags

    Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

    The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

    More information is in a dedicated #nametag tutorial.

  5. Filter with the following parameters:
    • param-file “Filter”: significant proteins (output of the first Filter tool)
    • “With following condition”: c3<-0.58
    • “Number of header lines to skip”: 1
  6. Add a tag #rdeb to the filtered file and rename it into rdeb filtered
Question
  1. Why do we filter for the adjusted p-value?
  2. How many proteins have a p-value below 0.05?
  1. Adjusted p-values control for the multiplicity of testing. Since we fit a separate model, and conduct a separate comparison for each protein, the number of tests equals the number of comparisons. A 0.05 cutoff of an adjusted p-value controls the False Discovery Rate in the collection of tests over all the proteins at 5%. Since they account for the multiplicity, adjusted p-values are more conservative (i.e. it is more difficult to detect a change).
  2. 148 in total (first filtering step); 133 are upregulated in metastasized cSCC (metastasized filtered) and 13 are upregulated in RDEB cSCC (rdeb filtered).

Finding differentially abundant proteins

For each condition we select only the significant proteins, which are proteins with a p-value above 0 and below 0.05. Proteins with a p-value of 0 are missing in one condition and are therfore discarded in the next steps. We’ll keep only the column with the Uniprot ID and extract the average protein intensities per sample from the sample quantification matrix file and vizualize them at heatmap. We do the exact same steps for both conditions, therefore, each time you start a tool (except for the replace step) you can use the multiple input file to start the step for the metastasized and rdeb files at the same time.

Hands-on: filter differentially abundant proteins
  1. Filter with the following parameters:
    • param-file “Filter”: metastasized filtered (output of Filter tool)
    • “With following condition”: c8>0
    • “Number of header lines to skip”: 1
    • Rename the file into significant metastasized
  2. Cut with the following parameters:
    • “Cut columns”: c1
    • param-file “From”: significant metastasized (output of last Filter tool)
  3. Filter with the following parameters:
    • param-file “Filter”: rdeb filtered (output of Filter tool)
    • “With following condition”: c8>0
    • “Number of header lines to skip”: 1
    • Rename the file into significant rdeb
  4. Cut with the following parameters:
    • “Cut columns”: c1
    • param-file “From”: significant rdeb (output of last Filter tool)
  5. Replace Text ( Galaxy version 1.1.3) with the following parameters:
    • param-file “File to process”: Sample Quantification Matrix Table (output of MSstats tool)
    • In “Replacement”: - “in column”: c1 - “Find pattern”: sp\|
      • param-repeat “Insert Replacement”
        • “in column”: c1
        • “Find pattern”: \|.*
  6. Join ( Galaxy version 1.1.2) with the following parameters:
    • param-file “1st file”: replaced sample quantification matrix (output of Replace text tool)
    • “Column to use from 1st file”: c1
    • param-file “2nd File”: metastasized cut (output of Cut tool)
    • “Column to use from 2nd file”: c1
    • “First line is a header line”: Yes
  7. Join ( Galaxy version 1.1.2) with the following parameters:
    • param-file “1st file”: replaced sample quantification matrix (output of Replace Text tool)
    • “Column to use from 1st file”: c1
    • param-file “2nd File”: rdeb cut (output of Cut tool)
    • “Column to use from 2nd file”: c1
    • “First line is a header line”: Yes
  8. Heatmap2 ( Galaxy version 3.0.1) with the following parameters:
    • param-file “Input should have column headers - these will be the columns that are plotted”: metastasized join (output of Join tool)
    • “Plot title”: Upregulated proteins in metastasized cSCC
    • “Enable data clustering”: No
    • “Data scaling”: Scale my data by row
  9. Heatmap2 ( Galaxy version 3.0.1) with the following parameters:
    • param-file “Input should have column headers - these will be the columns that are plotted”: rdeb join (output of Join tool)
    • “Plot title”: Upregulated proteins in RDEB cSCC
    • “Enable data clustering”: No
    • “Data scaling”: Scale my data by row
Question
  1. How many proteins are differentially abundant?
  1. 85 are upregulated in metastasized cSCC and 12 are upregulated in RDEB cSCC (number of lines minus 1 in the first filtering step per condition).

Follow up on differentially abundant proteins

In addition we retrieve for each Uniprot ID the corresponding protein names from uniprot to allow an easier interpretation.

Hands-on: MSstats visualizations
  1. UniProt ID mapping and retrieval ( Galaxy version 0.5) with the following parameters:
    • param-file “Input file with IDs”: metastasized join (output of Join tool)
    • “ID column”: c1
    • “Do you want to map IDs or retrieve data from UniProt”: Retrieve: request entries by uniprot accession using batch retrieval
  2. UniProt ID mapping and retrieval ( Galaxy version 0.5) with the following parameters:
    • param-file “Input file with IDs”: rdeb join (output of Join tool)
    • “ID column”: c1
    • “Do you want to map IDs or retrieve data from UniProt”: Retrieve: request entries by uniprot accession using batch retrieval
  3. FASTA-to-Tabular ( Galaxy version 1.1.1) with the following parameters:
    • param-file “Convert these sequences”: metastasized uniprot (output of UniProt tool)
    • “How many columns to divide title string into?”: 2
  4. FASTA-to-Tabular ( Galaxy version 1.1.1) with the following parameters:
    • param-file “Convert these sequences”: rdeb uniprot (output of UniProt tool)
    • “How many columns to divide title string into?”: 2

Three of the differentially abundant proteins found here were also found and stained with antibodies in the original publication: Collagen XIV which is higher in RDEB cSCC than in metastasizing cSCC and Serum amyloid P-component as well as X-ray repair cross-complementing protein 6, which are both higher in metastasizing cSCC than in RDEB cSCC. Collagen XIV is a fibril associated collagen which may have tissue stabilizing function in the dermis. The upregulation of collagen XIV as well as other collagens in RDEB cSCC could be a compensation effort for the impaired collagen VII in RDEB tissues. Collagen VII is indeed only found in some RDEB samples and with such low intensities that it appears as upregulated in metastasized cSCC.

col14 staining. Open image in new tab

Figure 4: Immunoflourescence staining of collagen XIV in RDEB and metastasizing cSCC skin tissues