Chloroplast genome assembly
OverviewQuestions:Objectives:
How can we assemble a chloroplast genome?
Requirements:
Assemble a chloroplast genome from long reads
Polish the assembly with short reads
Annotate the assembly and view
Map reads to the assembly and view
Time estimation: 2 hoursSupporting Materials:Published: Dec 4, 2020Last modification: Sep 18, 2024License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MITpurl PURL: https://gxy.io/GTN:T00030rating Rating: 4.0 (1 recent ratings, 29 all time)version Revision: 19
Introduction
What is genome assembly?
Genome assembly is the process of joining together DNA sequencing fragments into longer pieces, ideally up to chromosome lengths.The DNA fragments are produced by DNA sequencing machines, and are called “reads”. These are in lengths of about 150 nucleotides (base pairs), to up to a million+ nucleotides, depending on the sequencing technology used. Currently, most reads are from Illumina (short), PacBio (long) or Oxford Nanopore (long and extra-long).
It is difficult to assemble plant genomes as they are often large (for example, 3,000,000,000 base pairs), have many repeat regions (such as transposons), and may be polyploid. This tutorial shows genome assembly for a smaller data set - the plant chloroplast genome - a single circular chromosome which is typically about 160,000 base pairs. It is thought that the the chloroplast evolved from a cyanobacteria that was living in plant cells.
In this tutorial, we will use a subset of a real data set from sweet potato, from the paper Zhou et al. 2018. To find out more about each of the tools used here, see the tool panel page for a summary and links to more information.
AgendaIn this tutorial we will deal with:
Upload data
Let’s start with uploading the data.
Hands-on: Import the data
Create a new history for this tutorial and give it a proper name
To create a new history simply click the new-history icon at the top of the history panel:
- Click on galaxy-pencil (Edit) next to the history name (which by default is “Unnamed history”)
- Type the new name
- Click on Save
- To cancel renaming, click the galaxy-undo “Cancel” button
If you do not have the galaxy-pencil (Edit) next to the history name (which can be the case if you are using an older version of Galaxy) do the following:
- Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel
- Type the new name
- Press Enter
Import from Zenodo or a data library (ask your instructor):
- FASTQ file with illumina reads:
sweet-potato-chloroplast-illumina-reduced.fastq
- FASTQ file with nanopore reads:
sweet-potato-chloroplast-nanopore-reduced.fastq
- Note: make sure to import the files with “reduced” in the names, not the ones with “tiny” in the names.
https://zenodo.org/record/3567224/files/sweet-potato-chloroplast-illumina-reduced.fastq https://zenodo.org/record/3567224/files/sweet-potato-chloroplast-nanopore-reduced.fastq
- Copy the link location
Click galaxy-upload Upload Data at the top of the tool panel
- Select galaxy-wf-edit Paste/Fetch Data
Paste the link(s) into the text field
Press Start
- Close the window
As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:
- Go into Data (top panel) then Data libraries
- Navigate to the correct folder as indicated by your instructor.
- On most Galaxies tutorial data will be provided in a folder named GTN - Material –> Topic Name -> Tutorial Name.
- Select the desired files
- Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu
In the pop-up window, choose
- “Select history”: the history you want to import the data to (or create a new one)
- Click on Import
Check read quality
We will look at the quality of the nanopore reads.
Hands-on: Check read quality
- Nanoplot ( Galaxy version 1.28.2+galaxy1):
- “Select multifile mode”:
batch
- “Type of file to work on”:
fastq
- “files”: select the
nanopore FASTQ file
- View output:
- There are five output files.
- Look at the
HTML report
to learn about the read quality.
QuestionWhat summary statistics would be useful to look at?
This will depend on the aim of your analysis, but usually:
- Sequencing depth (the number of reads covering each base position; also called “coverage”). Higher depth is usually better, but at very high depths it may be better to subsample the reads, as errors can swamp the assembly graph.
- Sequencing quality (the quality score indicates probability of base call being correct). You may trim or filter reads on quality. Phred quality scores are logarithmic: phred quality 10 = 90% chance of base call being correct; phred quality 20 = 99% chance of base call being correct. More detail on Wikipedia.
- Read lengths (read lengths histogram, and reads lengths vs. quality plots). Your analysis or assembly may need reads of a certain length.
Optional further steps:
- Find out the quality of your reads using other tools such as fastp or FastQC.
- To visualize base quality using emoji you can also use FASTQE.
- Run FASTQE for the illumina reads. In the output, look at the mean values (the middle row)
- Repeat FASTQE for the nanopore reads. In the tool settings, increase the maximum read length to 30000.
- To learn more, see the Quality Control tutorial
Assemble reads
We will assemble the long nanopore reads.
Hands-on: Assemble reads
- Flye ( Galaxy version 2.6+galaxy0):
- “Input reads”:
sweet-potato-chloroplast-nanopore-reduced.fastq
- “Estimated genome size”:
160000
- Leave other settings as default
Re-name the
consensus
output file toflye-assembly.fasta
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, change the Name field to
flye-assembly.fasta
- Click the Save button
- View output:
- There are five output files.
- Note: this tool is heuristic; your results may differ slightly from the results here, and if repeated.
- View the
log
file and scroll to the end to see how many contigs (fragments) were assembled and the length of the assembly.- View the
assembly_info
file to see contig names and lengths.
Hands-on: View the assembly
- Bandage Info ( Galaxy version 0.8.1+galaxy1)
- “Graphical Fragment Assembly”: the Flye output file
Graphical Fragment Assembly
(not the “assembly_graph” file)- Leave other settings as default
- Bandage Image ( Galaxy version 0.8.1+galaxy2)
- “Graphical Fragment Assembly”: the Flye output file
Graphical Fragment Assembly
(not the “assembly_graph” file)- “Node length labels”:
Yes
- Leave other settings as default
Your assembly graph may look like this:
Note: a newer version of the Flye assembly tool now resolves this assembly into a single circle.
QuestionWhat is your interpretation of this assembly graph?
One interpretation is that this represents the typical circular chloroplast structure: There is a long single-copy region (the node of around 78,000 bp), connected to the inverted repeat (a node of around 28,000 bp), connected to the short single-copy region (of around 11,000 bp). In the graph, each end loop is a single-copy region (either long or short) and the centre bar is the collapsed inverted repeat which should have about twice the sequencing depth.
Comment: Further Learning
- Repeat the Flye assembly with different parameters, and/or a filtered read set.
- You can also try repeating the Flye assembly with an earlier version of the tool, to see the difference it makes. In the tool panel for Flye, click on the ‘Versions’ button at the top to change.
- Try an alternative assembly tool, such as Canu or Unicycler.
Polish assembly
Short illumina reads are more accurate the nanopore reads. We will use them to correct errors in the nanopore assembly.
First, we will map the short reads to the assembly and create an alignment file.
Hands-on: Map reads
- Map with BWA-MEM ( Galaxy version 0.7.17.1):
- “Will you select a reference genome from your history”:
Use a genome from history
- “Use the following dataset as the reference sequence”:
flye-assembly.fasta
- “Algorithm for constructing the BWT index”:
Auto. Let BWA decide
- “Single or Paired-end reads”:
Single
- “Select fastq dataset”:
sweet-potato-illumina-reduced.fastq
- “Set read groups information?”:
Do not set
- “Select analysis mode”:
Simple Illumina mode
- Re-name output file:
- Re-name this file
illumina.bam
Next, we will compare the short reads to the assembly, and create a polished (corrected) assembly file.
Hands-on: Polish
- pilon ( Galaxy version 1.20.1):
- “Source for reference genome used for BAM alignments”:
Use a genome from history
- “Select a reference genome”:
flye-assembly.fasta
- “Type automatically determined by pilon”:
Yes
- “Input BAM file”:
illumina.bam
- “Variant calling mode”:
No
- “Create changes file”:
Yes
- View output:
- What is in the
changes
file?- Rename the fasta output to
polished-assembly.fasta
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, change the Name field to
polished-assembly.fasta
- Click the Save button
- Fasta Statistics ( Galaxy version 1.0.1)
- Find and run the tool called “Fasta statistics” on both the original flye assembly and the polished version.
QuestionHow does the polished assembly compare to the unpolished assembly?
This will depend on the settings, but as an example: your polished assembly might be about 10-15 Kbp longer. Nanopore reads can have homopolymer deletions - a run of AAAA may be interpreted as AAA - so the more accurate illumina reads may correct these parts of the long-read assembly. In the changes file, there may be a lot of cases showing a supposed deletion (represented by a dot) being corrected to a base.
Optional further steps:
- Run a second round (or more) of Pilon polishing. Keep track of file naming; you will need to generate a new bam file first before each round of Pilon.
- Run an alternative polishing tool, such as Racon. This uses the long reads themselves to correct the long-read (Flye) assembly. It would be better to run this tool on the Flye assembly before running Pilon, rather than after Pilon.
Annotate the assembly
We can now annotate our assembled genome with information about genomic features.
- A chloroplast genome annotation tool is not yet available in Galaxy; for an approximation, here we can use the tool for bacterial genome annotation, Prokka.
Hands-on: Annotate with Prokka
- Prokka ( Galaxy version 1.14.5+galaxy0) with the following parameters (leave everything else unchanged)
- param-file “contigs to annotate”:
polished-assembly.fasta
- View output:
- The GFF and GBK files contain all of the information about the features annotated (in different formats.)
- The .txt file contains a summary of the number of features annotated.
- The .faa file contains the protein sequences of the genes annotated.
- The .ffn file contains the nucleotide sequences of the genes annotated.
Alternatively, you might want to use a web-based tool designed for chloroplast genomes.
- One option is the GeSeq tool, described here. Skip this step if you have already used Prokka above.
Hands-on: Annotate with GeSeq
- Download
polished-assembly.fasta
to your computer (click on the file in your history; then click on the disk icon).- In a new browser tab, go to Chlorobox where we will use the GeSeq tool (Tillich et al. 2017) to annotate our sequence.
- Upload the
fasta
file there. Information about how to use the tool is available on the page.- Once the annotation is completed, download the required files.
- In Galaxy, import the annotation
GFF3
file.
Now make a JBrowse file to view the annotations (the GFF3 file - produced from either Prokka or GeSeq) under the assembly (the polished-assembly.fasta file).
Hands-on: View annotations
- JBrowse genome browser ( Galaxy version 1.16.4+galaxy3):
- “Reference genome to display”:
Use a genome from history
- “Select a reference genome”:
polished-assembly.fasta
- “Produce Standalone Instance”:
Yes
- “Genetic Code”:
11. The Bacterial, Archaeal and Plant Plastid Code
- “JBrowse-in-Galaxy Action”:
New JBrowse instance
- “Insert Track Group”
- “Insert Annotation Track”
- “Track Type”:
GFF/GFF3/BED Features
- “GFF/GFF3/BED Track Data”: the
GFF3
file- Leave the other track features as default
- Re-name output file:
- JBrowse may take a few minutes to run. There is one output file: re-name it
view-annotations
- View output:
- Click on the eye icon to view the annotations file.
- Select the right contig to view, in the drop down box.
- Zoom out (with the minus button) until annotations are visible.
Here is an embedded snippet showing JBrowse and the annotations:
View reads
We will look at the original sequencing reads mapped to the genome assembly. In this tutorial, we will import very cut-down read sets so that they are easier to view.
Hands-on: Import cut-down read sets
- Import from Zenodo or a data library (ask your instructor):
- FASTQ file with illumina reads:
sweet-potato-chloroplast-illumina-tiny.fastq
- FASTQ file with nanopore reads:
sweet-potato-chloroplast-nanopore-tiny.fastq
- Note: these are the “tiny” files, not the “reduced” files we imported earlier.
https://zenodo.org/record/3567224/files/sweet-potato-chloroplast-illumina-tiny.fastq https://zenodo.org/record/3567224/files/sweet-potato-chloroplast-nanopore-tiny.fastq
Hands-on: Map the reads to the assembly
- Map the Illumina reads (the new “tiny” dataset) to the
polished-assembly.fasta
, the same way we did before, using bwa mem.- This creates one output file: re-name it
illumina-tiny.bam
- Map the Nanopore reads (the new “tiny” dataset) to the
polished-assembly.fasta
. The settings will be the same, exceptSelect analysis mode
should beNanopore
- This creates one output file: re-name it
nanopore-tiny.bam
Hands-on: Visualise mapped reads
- JBrowse genome browser ( Galaxy version 1.16.4+galaxy3):
- “Reference genome to display”:
Use a genome from history
- “Select a reference genome”:
polished-assembly.fasta
- “Produce Standalone Instance”:
Yes
- “Genetic Code”:
11. The Bacterial, Archaeal and Plant Plastid Code
- “JBrowse-in-Galaxy Action”:
New JBrowse instance
- “Insert Track Group”
- “Insert Annotation Track”
- “Track Type”:
BAM pileups
- “BAM track data”:
nanopore-tiny.bam
- “Autogenerate SNP track”:
No
- Leave the other track features as default
- “Insert Annotation Track”.
- “Track Type”:
BAM pileups
- “BAM track data”:
illumina-tiny.bam
- “Autogenerate SNP track”:
No
- Leave the other track features as default
- Re-name output file:
- JBrowse may take a few minutes to run. There is one output file: re-name it
assembly-and-reads
- View output:
- Click on the eye icon to view. (For more room, collapse Galaxy side menus with corner < > signs).
- Make sure the bam files are ticked in the left hand panel.
- Choose a contig in the drop down menu. Zoom in and out with + and - buttons.
Here is an embedded snippet showing JBrowse and the mapped reads:
Question
- What are the differences between the nanopore and the illumina reads?
- What are some reasons that the read coverage may vary across the reference genome?
- Nanopore reads are longer and have a higher error rate.
- There may be lots of reasons for varying read coverage. Some possibilities: In areas of high read coverage: this region may be a collapsed repeat. In areas of low or no coverage: this region may be difficult to sequence; or, this region may be a misassembly.
- To learn more about JBrowse and its features, see the Genomic Data Visualisation with JBrowse tutorial
Repeat with new data
Optional extension exercise
We can assemble another chloroplast genome using sequence data from a different plant species: the snow gum, Eucalyptus pauciflora. This data is from Wang et al. 2018. It is a subset of the original FASTQ read files (Illumina - SRR7153063, Nanopore - SRR7153095).
Hands-on: Assembly and annotation
- Get data: at this Zenodo link, then upload to Galaxy.
- Check reads: Run Nanoplot on the nanopore reads.
- Assemble: Use Flye to assemble the nanopore reads, then get Fasta statistics Note: this may take several hours.
- Polish assembly: Use Pilon to polish the assembly with short Illumina reads. Note: Don’t forget to map these Illumina reads to the assembly first using bwa-mem, then use the resulting
bam
file as input to Pilon.- Annotate: Use the GeSeq tool at Chlorobox or the Prokka tool within Galaxy.
- View annotations:Use JBrowse to view the assembled, annotated genome.