Marine Omics identifying biosynthetic gene clusters
Author(s) | Marie Josse |
OverviewQuestions:Objectives:
Which biological questions are addressed by the tutorial?
Which bioinformatics techniques are important to know for this type of data?
Requirements:
Follow a marine omics analysis
Learn to conduct a secondary metabolite biosynthetic gene cluster (SMBGC) Annotation
Discover SanntiS a tool for identifying BGCs in genomic & metagenomic data
Manage fasta files
Time estimation: 3 hoursSupporting Materials:Published: Aug 19, 2024Last modification: Sep 16, 2024License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MITpurl PURL: https://gxy.io/GTN:T00450version Revision: 2
Introduction
Through this tutorial, you will learn in the first part how to produce a protein fasta file from a nucleotide fasta file using Prodigal (it is a tool that predicts protein-coding genes from DNA sequences).
Then, you’ll be using InterProscan to create a tabular. Interproscan is a batch tool to query the InterPro database. It helps identify and predict the functions of proteins by comparing them to known databases.
And finally, you will discover SanntiS both to build genbank and especially to detect and annotate biosynthetic gene clusters (BGCs).
AgendaIn this tutorial, we will cover:
Get data
The FASTA file used in this example is, intentionally, a very small fraction of a genome (spanning exactly 1 BGC), and it’s main purpose is to quickly check that all the pieces of the workflow are working.
Hands-on: Data Upload
Create a new history for this tutorial and give it a name (for example “Marine Omics: SMBGC annotation”) for you to find it again later if needed.
To create a new history simply click the new-history icon at the top of the history panel:
Import the file with this link
https://figshare.com/ndownloader/files/48574534
and name it BGC0001472.fna
- Copy the link location
Click galaxy-upload Upload Data at the top of the tool panel
- Select galaxy-wf-edit Paste/Fetch Data
Paste the link(s) into the text field
Press Start
- Close the window
Rename the datasets BGC0001472.fna
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, change the Name field
- Click the Save button
Hands-on: Choose Your Own TutorialThis is a "Choose Your Own Tutorial" section, where you can select between multiple paths. Click one of the buttons below to select how you want to follow the tutorial
Do you want to run the workflow or to discover the tools one by one ?
Import and launch the workflow
Hands-on: Import the workflow
- Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
- Option 1: use the URL - Click on galaxy-upload Import at the top-right of the screen - Paste the URL of the workflow into the box labelled “Archived Workflow URL”
https://earth-system.usegalaxy.eu/u/marie.josse/w/marine-omics-identifying-biosynthetic-gene-clusters
- Option 2: use the workflow name - Click on Public workflows at the top-right of the screen
- Search for Marine Omics identifying biosynthetic gene clusters
- In the workflow preview box click on galaxy-upload Import
- Click the Import workflow button
Hands-on: Run the workflow
- Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
- Click on the workflow-run (Run workflow) button next to your workflow
- /!\ Select Yes for Workflow semi automatic
- Configure the workflow as needed with the 2 datasets you uploaded right before (BGC0001472.fna)
- Click the Run Workflow button at the top-right of the screen
- You may have to refresh your history to see the queued jobs
Now you don’t have to do anything else. You should see all the different steps of the workflow appear in your history. When the workflow is fully completed you should have the following history.
Workflows are a powerful Galaxy feature that allows you to scale up your analysis by performing an end-to-end analysis with a single click of a button. In order to keep provenance of the workflow invocation (an invocation of a workflow means one run or execution of the workflow) it can be exported from Galaxy in the form of a Workflow Run Crate RO-Crate profile.
After the workflow has completed, we can export the RO-Crate. The crate does not appear in your history, but can be accessed from the User -> Workflow Invocations menu on the top bar.
Hands-on: Extract a RO-Crate
- In the top right of your history, go to galaxy-history-options -> Show Invocations
Our latest workflow run should be listed at the top.
- Click on it to expand it:
- Click on the Export tab in the expanded view of the workflow invocation. You should see a page that contains three download options: - Research Object Crate (RO-Crate) - BioCompute Object - File
- Click on the Generate galaxy-download option of the RO-Crate box (1st box)
Great work! You have created a Workflow Run Crate. This makes it easy to track the provenance of the executed workflow.
Prodigal Gene Predictor: generate a protein fasta file
Prodigal Gene Predictor
Prodigal is a tool that predicts protein-coding genes from DNA sequences. It takes a nucleotide FASTA file as input and identifies regions that are likely to code for proteins. The output is a protein FASTA file where each sequence represents a predicted protein. Some of these sequences end with an asterisk (*), which marks the end of a complete protein sequence identified by Prodigal. This asterisk is added when Prodigal detects a full protein-coding region that ends with a stop codon. Sequences without an asterisk either represent partial proteins or do not end in a typical stop codon.
Hands-on: Run prodigal
- Prodigal Gene Predictor ( Galaxy version 2.6.3+galaxy0) with the following parameters:
- param-file “Specify input file”:
BGC0001472.fna
(Input the nucleotide fasta file)- “Specify mode”:
Meta : Anonymous sequences, analyze using preset training files, ideal for metagenomic data or single short sequences
You don’t need to change any other parameters leave them on the default input.
- Click on Run Tool
You should have 4 new outputs appearing in your history. In these outputs you should have Prodigal Gene Predictor on data 1 : protein translations file. You can click on it and then click on the galaxy-eye (eye).
You can notice here that at each end of the sequence there’s a *. Later on we will need to remove this star. But, first we are going to use this protein file to build the Genbank that SanntiS need to make a SMBGC annotation.
SanntiS for building a Genbank file
This step combines the original nucleotide sequences with the cleaned protein sequences to create a GenBank format file. This format is widely used for storing and organising information about DNA sequences and their annotations. In this step, the DNA sequences and their corresponding coding regions are transformed into a format that is suitable for SanntiS.
Hands-on: Build Genbank
- SanntiS biosynthetic gene clusters ( Galaxy version 0.9.3.5+galaxy1) with the following parameters:
- “Do you want to build a genbank or to make a SMBGC Annotation?”:
Build genbank
- param-file “Input a nucleotide fasta file”:
BGC0001472.fna
(Input the nucleotide fasta file)- param-file “Input a protein fasta file”:
Prodigal Gene Predictor on data 1 : protein translations file
(output of Prodigal Gene Predictor tool)- Click on Run Tool
Regex Find And Replace
Remember earlier we noticed the star * in the protein fasta file ?
Now is the time to remove it !
The asterisks at the end of some protein sequences are informative but can cause issues with some analysis tools. In this step, we remove these asterisks to produce a clean protein FASTA file, making it ready for further analysis.
Hands-on: Remove *
- Regex Find And Replace ( Galaxy version 1.0.3) with the following parameters:
- param-file “Select lines from”:
Prodigal Gene Predictor on data 1 : protein translations file
(output of Prodigal Gene Predictor tool)- In “Check”:
- param-repeat “Insert Check”
- “Find Regex”:
\*$
- “Replacement”: `` (leave an empty box there)
- Click on Run Tool
Check if the * were well removed.
InterProScan
InterProScan is a tool that helps identify and predict the functions of proteins by comparing them to known databases. This tool analyses the protein sequences and produces an output file containing detailed information about the possible functions, domains, and families associated with these proteins.
Hands-on: Task description
- InterProScan ( Galaxy version 5.59-91.0+galaxy3) with the following parameters:
- param-file “Protein FASTA File”:
Regex Find And Replace on data **
(output of Regex Find And Replace tool)- “Use applications with restricted license, only for non-commercial use?”:
No
You can leave all the other parameters on the default input.- Click on Run Tool
SanntiS for annotating biosynthetic gene clusters
SanntiS is a tool specifically designed to detect and annotate biosynthetic gene clusters (BGCs). It uses neural networks trained on InterPro signatures to achieve high accuracy in identifying BGCs in both genomic and metagenomic datasets 1. A significant benefit of using SanntiS is that your results will be comparable with a large number of datasets, including over 5,000 marine metagenomic assemblies archived in the MGnify resource. This tool provides valuable insights into the biosynthetic potential of organisms or environmental samples. The final output is a GFF file containing detailed BGC annotations, which can be used for further analyses or applications.
Hands-on: Identify biosynthetic gene clusters
- SanntiS biosynthetic gene clusters ( Galaxy version 0.9.3.5+galaxy1) with the following parameters:
- “Do you want to build a genbank or to make a SMBGC Annotation?”:
Run SanntiS
- param-file “Input the tabular file from InterProScan”:
InterProScan on data **
(output of InterProScan tool)- param-file “Input a Genbank file”:
SanntiS output data genbank
(output of SanntiS biosynthetic gene clusters tool)- Click on Run Tool
Finally, you should have one gff3 file in your history under SanntiS output data
Conclusion
Here you now know how to conduct a Marine Omics analysis
Extra information
Coming up soon even more tutorials on and other Earth-System related trainings. Keep an galaxy-eye open if you are interested!