Genome assembly using PacBio data
Author(s): Anthony Bretaudeau, Alexandre Cormier, Erwan Corre, Laura Leroi, Stéphanie Robin
Overview
Questions:
- How do you perform a genome assembly with PacBio data?
- How do you check assembly quality?
Objectives:
- Assemble a genome with PacBio data
- Assess assembly quality
Time estimation: 6 hours
Level: Intermediate
Published: Nov 29, 2021
Last modification: Oct 15, 2024
License: Tutorial content is licensed under the Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.
PURL: https://gxy.io/GTN:T00033
Rating: 4.0 (0 recent ratings, 2 all time)
Revision: 9
In this tutorial, we will assemble the genome of Mucor mucedo, a fungal species in the family Mucoraceae, from PacBio sequencing data. These data were obtained from NCBI (SRR8534473, SRR8534474 and SRR8534475). We will then analyze the quality of the resulting assembly, in particular by comparing it to a reference assembly that was obtained with the Falcon assembler and is available on the JGI website.
Get data
We will use long-read sequencing data: CLR (continuous long reads) from PacBio sequencing of the Mucor mucedo genome. This data is a subset of the data available from NCBI. Later, we will also use a reference genome assembly downloaded from the JGI website. This reference genome was assembled using the same PacBio data, so we will compare our own assembly against it.
Get data from Zenodo
Hands-on: Data upload from Zenodo
- Create a new history for this tutorial
- Import the files from Zenodo:
https://zenodo.org/records/5702408/files/SRR8534473_subreads.fastq.gz
https://zenodo.org/records/5702408/files/SRR8534474_subreads.fastq.gz
https://zenodo.org/records/5702408/files/SRR8534475_subreads.fastq.gz
- Copy the link location
- Click Upload Data at the top of the tool panel
- Select Paste/Fetch Data
- Paste the link(s) into the text field
- Press Start
- Close the window
- Rename the datasets
- Check that the datatype is fastqsanger.gz
- Click on the pencil icon for the dataset to edit its attributes
- In the central panel, click the Datatypes tab at the top
- Under Assign Datatype, select fastqsanger.gz from the “New type” dropdown
- Tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
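Outside Galaxy, a quick sanity check of a downloaded read file can be sketched in Python. This is a minimal example, assuming the file follows the common four-line FASTQ record layout; the function name `fastq_stats` is hypothetical and not part of Galaxy or any PacBio tool.

```python
import gzip


def fastq_stats(path):
    """Count reads and bases in a gzip-compressed FASTQ file.

    Assumes four lines per record: header, sequence, '+', qualities.
    """
    n_reads = 0
    n_bases = 0
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            position = line_number % 4
            if position == 0 and not line.startswith("@"):
                raise ValueError(
                    f"line {line_number + 1}: expected a header starting with '@'"
                )
            if position == 1:
                # The second line of each record holds the bases.
                n_reads += 1
                n_bases += len(line.strip())
    return n_reads, n_bases
```

For example, `fastq_stats("SRR8534473_subreads.fastq.gz")` would return the number of reads and the total number of bases in the first dataset, provided the file has been downloaded to the working directory.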
Get data from JGI website
Hands-on: Data upload from JGI website
- Create a JGI account on the JGI registration page
- Sign in to the JGI Genome Portal
- The genome assembly is available on the JGI Mucor mucedo page
- Download the fasta assembly file Mucmuc1_AssemblyScaffolds.fasta to your computer
- Upload this file to Galaxy
- Check that the datatype is fasta
- Click on the pencil icon for the dataset to edit its attributes
- In the central panel, click the Datatypes tab at the top
- Under Assign Datatype, select fasta from the “New type” dropdown
- Tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
Genome Assembly with Flye
We will use Flye, a de novo assembler for single-molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. The package represents a complete pipeline: it takes raw PacBio / ONT reads as input and outputs polished contigs. Flye also has a special mode for metagenome assembly. All information about the Flye assembler is available here: Flye.
Hands-on: Assembly
- Flye (Galaxy version 2.9+galaxy0) with the following parameters:
- “Input reads”: the three sequencing datasets
- “Mode”: PacBio raw
- “Number of polishing iterations”: 1
- “Reduced contig assembly coverage”: Disable reduced coverage for initial disjointing assembly
The tool produces four datasets: consensus, assembly graph, graphical fragment assembly, and assembly info.
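For readers who want to reproduce this step outside Galaxy, the parameters above roughly correspond to a Flye command line. The sketch below only builds the command and does not run it. It assumes Flye is installed as `flye`; the helper `flye_command`, the output directory name and the thread count are illustrative choices, and the mapping from the Galaxy form to the flags is an assumption rather than a statement of what the Galaxy wrapper executes. Leaving out `--asm-coverage` is assumed to match the "Disable reduced coverage" setting.

```python
import shlex


def flye_command(reads, out_dir, threads=4, iterations=1):
    """Build a Flye command line for raw PacBio reads (the command is not executed)."""
    return [
        "flye",
        "--pacbio-raw", *reads,            # "Mode": PacBio raw
        "--out-dir", out_dir,
        "--iterations", str(iterations),   # "Number of polishing iterations"
        "--threads", str(threads),
    ]


reads = [
    "SRR8534473_subreads.fastq.gz",
    "SRR8534474_subreads.fastq.gz",
    "SRR8534475_subreads.fastq.gz",
]
print(shlex.join(flye_command(reads, "flye_out")))
```

When run on the command line, Flye typically writes `assembly.fasta`, `assembly_graph.gv`, `assembly_graph.gfa` and `assembly_info.txt` into the output directory, which likely correspond to the four Galaxy datasets listed above.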
Question: What are the different output datasets?
- The first dataset (consensus) is a fasta file containing the final assembly (1461 contigs). The number of contigs you obtain may differ slightly from the one presented here, because the Flye assembly algorithm does not always give exactly the same results.
- The second and third datasets are assembly graph files. These graphs represent the final assembly of a genome and are based on the reads and their overlap information. Some tools, such as Bandage, can visualize the assembly graph.
- The fourth dataset is a tabular file (assembly_info) containing extra information about contigs/scaffolds.
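The assembly info table can also be inspected programmatically. The following is a minimal sketch, assuming the usual Flye layout in which the first three tab-separated columns are the sequence name, its length and its coverage, and the header line starts with `#`. The two example rows are made up for illustration and are not results from this tutorial.

```python
import csv
import io


def read_assembly_info(handle):
    """Parse name, length and coverage from a Flye assembly info table."""
    contigs = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header
        contigs.append(
            {"name": row[0], "length": int(row[1]), "coverage": float(row[2])}
        )
    return contigs


# Hypothetical content, only to show the expected shape of the file.
example = io.StringIO(
    "#seq_name\tlength\tcov.\tcirc.\trepeat\tmult.\talt_group\tgraph_path\n"
    "contig_1\t897000\t35\tN\tN\t1\t*\t1\n"
    "contig_2\t12000\t70\tN\tY\t2\t*\t2\n"
)
contigs = read_assembly_info(example)
longest = max(contigs, key=lambda c: c["length"])
print(longest["name"], longest["length"])
```

With a real file, `open("assembly_info.txt")` would replace the `io.StringIO` object.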
Quality assessment
Genome assembly metrics with Fasta Statistics
Fasta Statistics displays summary statistics for a fasta file. For a genome assembly, we need to calculate metrics such as the assembly size, the number of scaffolds and the N50 value. These metrics allow us to evaluate the quality of the assembly.
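To make these metrics concrete, here is a minimal Python sketch of how the number of sequences, the assembly size and N50 can be computed from a fasta file. It is an illustration on a toy input, not the implementation used by the Fasta Statistics tool; the function names are hypothetical.

```python
def read_fasta_lengths(lines):
    """Return the length of every sequence in fasta-formatted lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the assembly."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


toy = [">a", "ACGTACGT", ">b", "ACGTAC", ">c", "ACGT", ">d", "AC"]
lengths = read_fasta_lengths(toy)
print(len(lengths), sum(lengths), n50(lengths))  # prints: 4 20 6
```

In the toy example, the two largest sequences (8 + 6 = 14 bases) are the first to cover at least half of the 20 bases, so N50 is 6.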
Hands-on: Fasta statistics on Flye assembly
- Fasta Statistics (Galaxy version 2.0) with the following parameters:
- “fasta or multifasta file”: consensus (output of Flye tool)
Hands-on: Fasta statistics on the reference assembly
- Fasta Statistics (Galaxy version 2.0) with the following parameters:
- “fasta or multifasta file”: Mucmuc1_AssemblyScaffolds.fasta
Question
- Compare the different metrics obtained for Flye assembly and reference genome.
- What can you conclude about the quality of this new assembly?
- We compare the metrics of the two genome assemblies:
- The Flye assembly: 1461 contigs/scaffolds, N50 = 222 kb, length max = 897 kb, size = 48.6 Mb, 36.6% GC
- The reference genome: 456 contigs/scaffolds, N50 = 202 kb, length max = 776 kb, size = 46.1 Mb, 36.7% GC
- Apart from the larger number of contigs in the Flye assembly (1461 versus 456), the metrics are very similar: Flye generated an assembly of a quality comparable to that of the reference genome.
Genome assemblies comparison with Quast
Another way to calculate assembly metrics is to use QUAST (QUality ASsessment Tool). Quast evaluates genome assemblies by computing various metrics and can compare an assembly with a reference genome. The Quast manual is available here: Quast.
Hands-on: Quast on the Flye assembly
- Quast (Galaxy version 5.0.2+galaxy3) with the following parameters:
- “Use customized names for the input files?”: No, use dataset names
- “Contigs/scaffolds file”: consensus (output of Flye tool)
- “Type of assembly”: Genome
- “Use a reference genome?”: Yes
- “Reference genome”: Mucmuc1_AssemblyScaffolds.fasta
- “Type of organism”: Fungus: use of GeneMark-ES for gene finding, ...
Question: What additional information is generated by Quast, compared to the Fasta Statistics outputs?
Quast allows us to compare Flye assembly to the reference genome:
- Genome fraction (90.192 %) is the percentage of aligned bases in the reference genome.
- Duplication ratio (1.094) is the total number of aligned bases in the assembly divided by the total number of aligned bases in the reference genome.
- Largest alignment (698452 bp) is the length of the largest continuous alignment in the assembly.
- Total aligned length (45.2 Mb) is the total number of aligned bases in the assembly.
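The definitions above can be illustrated with a small Python sketch. It is a simplified model, not Quast's implementation: it assumes a single reference sequence and gapless alignments, so that each aligned block of the assembly is described only by the half-open interval `(start, end)` it covers on the reference. The function names and the toy numbers are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def alignment_metrics(ref_length, alignments):
    """Return (genome fraction in %, duplication ratio, largest alignment)."""
    # Reference bases covered by at least one alignment.
    covered = sum(end - start for start, end in merge_intervals(alignments))
    # Aligned bases in the assembly; overlapping blocks are counted each time.
    aligned_in_assembly = sum(end - start for start, end in alignments)
    genome_fraction = 100 * covered / ref_length
    duplication_ratio = aligned_in_assembly / covered
    largest = max(end - start for start, end in alignments)
    return genome_fraction, duplication_ratio, largest


# Toy case: a 1000 bp reference and three aligned blocks, two of which overlap.
print(alignment_metrics(1000, [(0, 400), (300, 700), (800, 900)]))
# prints: (80.0, 1.125, 400)
```

In the toy case, 800 of the 1000 reference bases are covered (genome fraction 80 %), while 900 assembly bases are aligned, giving a duplication ratio of 900 / 800 = 1.125. A duplication ratio above 1, such as the 1.094 reported here, therefore indicates that some reference regions are represented more than once in the assembly.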
Quast also generates some plots:
- Cumulative length plot shows how the total assembly length grows as contigs are added. On the x-axis, contigs are ordered from the largest to the smallest. The y-axis gives the total size of the x largest contigs in the assembly.
- GC content plot shows the distribution of GC content in the contigs.
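The values behind these two plots are simple to compute. The sketch below is a minimal illustration with hypothetical function names; it produces the numbers only and leaves the plotting itself aside. Counting GC over A, C, G and T bases only is an assumption made here, not necessarily how Quast handles ambiguous bases.

```python
from itertools import accumulate


def cumulative_lengths(lengths):
    """y-values of a cumulative length plot: contigs sorted from largest to smallest."""
    return list(accumulate(sorted(lengths, reverse=True)))


def gc_percent(sequence):
    """Percentage of G and C among the A, C, G and T bases of a sequence."""
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    acgt = gc + sequence.count("A") + sequence.count("T")
    return 100 * gc / acgt if acgt else 0.0


print(cumulative_lengths([4, 10, 6]))  # prints: [10, 16, 20]
print(gc_percent("ACGTNN"))            # prints: 50.0
```

The last value returned by `cumulative_lengths` is the total assembly size, which is where the curve of the cumulative length plot ends.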
Genome assembly assessment with BUSCO
BUSCO (Benchmarking Universal Single-Copy Orthologs) provides a quantitative assessment of a genome assembly based on evolutionarily informed expectations of gene content. Details for this tool are available on the BUSCO website.
Hands-on: BUSCO on Flye assembly
First, on the Flye assembly:
- Busco (Galaxy version 5.2.2+galaxy0) with the following parameters:
- “Sequences to analyse”: consensus (output of Flye tool)
- “Auto-detect or select lineage”: Select lineage
- “Lineage”: Mucorales
Then, on the reference assembly:
- Busco (Galaxy version 5.2.2+galaxy0) with the following parameters:
- “Sequences to analyse”: Mucmuc1_AssemblyScaffolds.fasta
- “Auto-detect or select lineage”: Select lineage
- “Lineage”: Mucorales
Question: Compare the number of BUSCO genes identified in the Flye assembly and the reference genome. What do you observe?
The short summary generated by BUSCO indicates that the reference genome contains:
- 2327 complete BUSCOs (of which 2302 are single-copy and 25 are duplicated),
- 13 fragmented BUSCOs,
- 109 missing BUSCOs.
The short summary generated by BUSCO indicates that the Flye assembly contains:
- 2348 complete BUSCOs (of which 2310 are single-copy and 38 are duplicated),
- 8 fragmented BUSCOs,
- 93 missing BUSCOs.
The BUSCO analysis confirms that these two assemblies are of similar quality, with similar numbers of complete, fragmented and missing BUSCO genes.
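Raw counts are easier to compare once they are expressed as percentages of the lineage dataset. The following sketch turns the counts reported above into a one-line summary in the style BUSCO commonly uses; the function `busco_summary` is a hypothetical helper, not part of BUSCO.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Format BUSCO counts as percentages of the total number of BUSCOs searched."""
    total = single + duplicated + fragmented + missing
    complete = single + duplicated

    def pct(count):
        return 100 * count / total

    return (
        f"C:{pct(complete):.1f}%[S:{pct(single):.1f}%,D:{pct(duplicated):.1f}%],"
        f"F:{pct(fragmented):.1f}%,M:{pct(missing):.1f}%,n:{total}"
    )


reference = busco_summary(single=2302, duplicated=25, fragmented=13, missing=109)
flye = busco_summary(single=2310, duplicated=38, fragmented=8, missing=93)
print("reference:", reference)
print("Flye:     ", flye)
```

Both sets of counts add up to 2449 BUSCOs, and completeness is about 95.0 % for the reference genome and about 95.9 % for the Flye assembly.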
Conclusion
This pipeline shows how to generate and evaluate a genome assembly from PacBio long-read data. Once you are satisfied with your genome sequence, you might want to annotate it: have a look at the RepeatMasker and Funannotate tutorials to learn how to do it!