Peptide Library Data Analysis
OverviewQuestions:Objectives:
How to utilize quantitative properties of amino acids and peptide sequence to analyse peptide data?
Requirements:
Calculate descriptors
Quantitative analysis of peptide sequence properties
Time estimation: 20 minutesLevel: Intermediate IntermediateSupporting Materials:Published: Jan 22, 2021Last modification: Nov 9, 2023License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MITpurl PURL: https://gxy.io/GTN:T00227rating Rating: 4.0 (0 recent ratings, 1 all time)version Revision: 5
Several computational methods have been proven very useful in the initial screening and prediction of peptides for various biological properties. These methods have emerged as effective alternatives to the lengthy and expensive traditional experimental approaches. Properties associated with a group of peptide sequences such as overall charge, hydrophobicity profile, or k-mer composition can be utilized to compare peptide sequences and libraries. In this tutorial, we will be discussing how peptide-based properties like charge, hydrophobicity, the composition of amino acids, etc. can be utilized to analyze the biological properties of peptides. Additionally, we will learn how to use different utilities of the Peptide Design and Analysis Under Galaxy (PDAUG) package to calculate various peptide-based descriptors, and use these descriptors and feature spaces to build informative plots.
Easy access to tools, workflows and data from the docker image
An easy way to install and use the PDAUG toolset, and follow this tutorial is via a prebuilt docker image equipped with a PDAUG toolset, workflow, and data library. A prebuilds docker image can be downloaded and run by typing a simple command at the terminal after installing docker software on any operating system.
Hands-on: Easy access of tools, workflows and data from docker image
- Downloading the docker image from the docker hub using
docker pull jayadevjoshi12/galaxy_pdaug:latest
command.- Running the container with latest PDAUG tools
docker run -i -t -p 8080:80 jayadevjoshi12/galaxy_pdaug:latest
.- Workflow is available under the workflow section, use
admin
as username andpassword
as a password to login as an administrator of your galaxy instance.- Use
admin
as username andpassword
as a password to login galaxy instance, which is available at localhost to access workflow and data.
AgendaIn this tutorial, we will cover:
Peptide Data
Several inbuilt data sets have been provided with the toolPDAUG Peptide Data Access
. The antimicrobial peptides (AMPs) versus transmembrane peptides (TMPs) dataset was used as an example data set to understand the overall relation between features and biological properties of peptides. AMPs consist of an intersection of all activity annotations of the APD2 and CAMP databases, where gram-positive, gram-negative, and antifungal exact matches were observed. TMPs were extracted from alpha-helical transmembrane regions of proteins for classification.
Hands-on: Fetching inbuild data
- PDAUG Peptide Data Access ( Galaxy version 0.1.0) with the following parameters:
- “Datasets”:
AMPvsTMP
Converting tabular data into fasta format
PDAUG Peptide Data Access
tool returns data as a tabular file that contains sequences from both the classes. In order to utilize this data in the next steps, first we need to convert tabular data into fasta format. If data contains sequences from two different classes PDAUG TSVtoFASTA
tool converts and splits data into two separate files for each of the class, AMPs, and TMPs. The reason behind converting and splitting the data is that all the downstream tools require two separate files if we are comparing two different peptide classes or calculating features.
Hands-on: Converting tabular data into fasta formate
- PDAUG TSVtoFASTA ( Galaxy version 0.1.0) with the following parameters:
- param-file “Input file”:
PDAUG Peptide Data Access - AMPvsTMP (tabular)
(output of PDAUG Peptide Data Access tool)- “Peptide Column”:
name
- “Method to convert data”:
Split Data By Class Label
- “Column with the class label”:
class label
Analyzing peptide libraries (AMPs and TMPs) based on features and feature space
Summary Plot for peptide libraries
In this step, we utilize PDAUG Peptide Sequence Analysis
tool to compare peptide sequences based on hydrophobicity, hydrophobic movement, charge, amino acid fraction, and sequence length and create a summary plot.
Hands-on: Generating a summary plot to assess peptide dataset
- PDAUG Peptide Sequence Analysis ( Galaxy version 0.1.0) with the following parameters:
- “Analysis options”:
Plot Summary
- param-file “First input file”:
PDAUG TSVtoFASTA on data 1 - first (fasta)
(first output of PDAUG TSVtoFASTA tool)- param-file “Second input file”:
PDAUG TSVtoFASTA on data 1 - second (fasta)
(second output of PDAUG TSVtoFASTA tool)- “first input file”:
TMPs
- “Second input file”:
AMPs
QuestionWhat can be concluded from the summary plot based on different properties?
The summary plot represents differences between two sets of peptides based on an amino acid fraction, global charge, sequence length, global hydrophobicity, glocal hydrophobic movement. Additionally, 3D scattered plot shows the clustering of peptides based on three features.
- Leucine and Valine show relatively higher differences in terms of their fraction within both groups.
- TMPs show a global charge in the range of 0-5 in comparison to AMPs which show a global charge in a range of 0-14.
- AMPs show higher variability in terms of their length, global hydrophobic movement, and hydrophobicity in comparison to TMPs.
- Hydrophobic properties are important in determining transmembrane properties of proteins and peptides which is evident with this summary plot.
- Clustering of two different kinds of peptides can be observed with a 3D scattered plot based on their properties, however, we can also observe a few peptides with overlapping feature space.
Assessing feature space distribution
In this tool, we have used PDAUG Fisher's Plot
that compares two peptide libraries based on the feature space using the Fisher test.
Hands-on: Generating a Fisher's plot to assess peptide dataset
- PDAUG Fisher’s Plot ( Galaxy version 0.1.0) with the following parameters:
- param-file “First fasta file”:
PDAUG TSVtoFASTA on data 1 - first (fasta)
(first output of PDAUG TSVtoFASTA tool)- param-file “Second fasta file”:
PDAUG TSVtoFASTA on data 1 - second (fasta)
(second output of PDAUG TSVtoFASTA tool)- “Label for first population”:
TMPs
- “Label for second population”:
AMPs
QuestionWhat does Fisher’s plot represents?
Fisher’s plot represents the difference between two groups of peptides based on their feature space. Each tiny square in this plot represents the feature space. Based on the sliding window Fisher’s test was performed for each feature space to assess the presence of peptides from two different groups on each of the tiny squares. The AMPs and TMPs in the feature space represented by their mean hydropathy and amino acid volume. Fisher’s plot shows that the sequences with larger hydrophobic amino acids are more frequent in TMPs in comparison to AMPs.
The AMPs and TMPs in the feature space represented by their mean hydropathy and amino acid volume. Fisher’s plot shows that the sequences with larger hydrophobic amino acids are more frequent in TMPs in comparison to AMPs.
Assessing the relation between peptide features by 3D scatter plot
Calculating Sequence Property-Based Descriptors
In this step we will calculate Composition, Transition and Distribution (CTD) descriptos. Composition describptors are defined as the number of amino acids of a particular property divided by total number of amino acids. Transition descriptors are representd as the number of transition from a particular property to different property divided by (total number of amino acids − 1). Distribution descriptors are derived by chain length and the amino acids of a particular property located on this length Govindan and Nair 2013.
Hands-on: Calculating descriptors for the peptide dataset
- PDAUG Sequence Property Based Descriptors ( Galaxy version 0.1.0) with the following parameters:
- param-file “Input fasta file”:
PDAUG TSVtoFASTA on data 1 - first (fasta)
(first output of PDAUG TSVtoFASTA tool)- “Descriptor Type”:
CTD
- PDAUG Sequence Property Based Descriptors ( Galaxy version 0.1.0) with the following parameters:
- param-file “Input fasta file”:
PDAUG TSVtoFASTA on data 1 - second (fasta)
(second output of PDAUG TSVtoFASTA tool)- “Descriptor Type”:
CTD
Adding the Class Label in both AMPs and TMPs
Class labels or target labels usually represents the class of peptides. Here in our data set, we have peptides, either as AMP or TMP. Since we have two classes we can represent these two classes with their actual labels AMPs and TMPs.
- Adding Class Label (target labels) in AMPs and TMPs data
Hands-on: Adding Class Labels (target labels) to the tabular data
- PDAUG Add Class Label ( Galaxy version 0.1.0) with the following parameters:
- param-file “Input file”:
PDAUG Sequence Property Based Descriptors on data 2 - CTD (tabular)
(output of PDAUG Sequence Property Based Descriptors tool)- “Class Label”:
TMPs
- PDAUG Add Class Label ( Galaxy version 0.1.0) with the following parameters:
- param-file “Input file”:
PDAUG Sequence Property Based Descriptors on data 3 - CTD (tabular)
(output of PDAUG Sequence Property Based Descriptors tool)- “Class Label”:
AMPs
Merging the two tabular data files
We utilize PDAUG Merge Dataframes
to merge two tabular data files.
Hands-on: Merging two tabular data files
- PDAUG Merge Dataframes ( Galaxy version 0.1.0) with the following parameters:
- param-files “Input files”:
PDAUG Add Class Label on data 6 - (tabular)
(output of PDAUG Add Class Label tool),PDAUG Add Class Label on data 7 - (tabular)
(output of PDAUG Add Class Label tool)- “Option to merg data”:
Merge data without adding class label
Plotting CTD descriptor data as Scatter plot
Tool PDAUG Basic Plots
will be used to compare two peptide libraries based on three CTD descriptors SecondaryStrD1100
, SolventAccessibilityD2001
, and NormalizedVDWVD3050
respectively. A 3D scatter plot will be generated.
Hands-on: Generating a scatter plot to assess features
- PDAUG Basic Plots ( Galaxy version 0.1.0) with the following parameters:
- “Data plotting method”:
Scatter Plot
- param-file “Input file”:
PDAUG Merge Dataframes on data 9 and data 8 - (tabular)
(output of PDAUG Merge Dataframes tool)- “Scatter Plot type”:
3D
- “First feature”:
_SecondaryStrD1100
- “Second feature”:
_SolventAccessibilityD2001
- “Third feature”:
_NormalizedVDWVD3050
- “Class label column”:
Class_label
Figure 3 Represent 3D scattered plot generated based on the CTD descriptors. Red dots represent TMPs and blue dots represent AMPs. Based on these 3 features, we can observe that both groups do not show any clear separation or cluster in the 3D space.
In this tutorial, we learned how to utilize inbuild data, calculate features, and utilize descriptors or features to assess biological properties. We also learned how to utilize various utilities of PDAUG
to generate useful plots to include in our peptide research.
Conclusion
In this tutorial, we learned an example flexible and extensible analysis of peptide data using PDAUG tools. We generated various plots based on the quantitative properties of amino acids and peptide sequences.