De Bruijn Graph Assembly

Author(s): Simon Gladman, Helena Rasche, Saskia Hiltemann
Reviewers

Overview
Questions:

What are the factors that affect genome assembly?

How does genome assembly work?

Objectives:

Perform an optimised Velvet assembly with the Velvet Optimiser

Compare this assembly with those we did in the basic tutorial

Perform an assembly using the SPAdes assembler.

Requirements:

Introduction to Galaxy Analyses

Slides: Quality Control

Hands-on: Quality Control

Time estimation: 2 hours

Level: Introductory

Supporting Materials:

Slides

Datasets

Workflows

FAQs

Available on these Galaxies

Known Working

UseGalaxy.eu ✅ ⭐️

UseGalaxy.org (Main) ✅ ⭐️

UseGalaxy.org.au ✅ ⭐️

Galaxy@AuBi ✅

UseGalaxy.be ✅

UseGalaxy.no ✅

Possibly Working

UseGalaxy.cz

Published: May 24, 2017

Last modification: Feb 1, 2024

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

PURL: https://gxy.io/GTN:T00031

Revision: 25

Optimised de Bruijn Graph assemblies using the Velvet Optimiser and SPAdes

In this activity, we will perform de novo assemblies of a short read set using the Velvet Optimiser and the SPAdes assemblers. We are using the Velvet Optimiser for illustrative purposes. For real assembly work, a more suitable assembler, such as SPAdes, should be chosen.

The Velvet Optimiser is a script written by Simon Gladman to optimise the k-mer size and coverage cutoff parameters for Velvet. More information can be found in its repository.

SPAdes is a de novo genome assembler written by Pavel Pevzner’s group in St. Petersburg. More details can be found on the SPAdes website.

Agenda

In this tutorial, we will deal with:

Get the data

Assemble with the Velvet Optimiser

Assemble with SPAdes

Get the data

We will be using the same data that we used in the introductory tutorial, so if you have already completed that and have the data, skip this section.

Hands On: Getting the data
Create and name a new history for this tutorial.

To create a new history, simply click the new-history icon at the top of the history panel.
Import the sequence read raw data (*.fastq) from Zenodo
https://zenodo.org/record/582600/files/mutant_R1.fastq
https://zenodo.org/record/582600/files/mutant_R2.fastq
Copy the link location

Click Upload at the top of the activity panel

Select Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window
Rename the files

The names of the files are the full URLs; let’s make the names a little clearer

Change the names to just the last part, mutant_R1.fastq and mutant_R2.fastq respectively

Click on the pencil icon for the dataset to edit its attributes

In the central panel, change the Name field

Click the Save button

Question

What are four key features of a FASTQ file?

What is the main difference between a FASTQ and a FASTA file?

Assembly with the Velvet Optimiser

We will perform an assembly with the Velvet Optimiser, which automatically runs and optimises the output of the Velvet assembler (Zerbino and Birney 2008). It will automatically choose a suitable value for the k-mer size (k). It will then go on to optimise the coverage cutoff (cov_cutoff) which corrects for read errors. It will use the “n50” metric for optimising the k-mer size and the “total number of bases in contigs” for optimising the coverage cutoff.
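The “n50” metric used for choosing the k-mer size can be made concrete with a short sketch. N50 is commonly defined as the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name and the contig lengths below are illustrative only and are not taken from the Velvet Optimiser source.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (in bases)."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths, for illustration only.
example_lengths = [100, 80, 50, 30, 20, 10, 10]
print(n50(example_lengths))  # prints 80
```

A higher N50 means that more of the assembly sits in long contigs, which is why it is a convenient (though not sufficient) indicator of contiguity.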

Hands On: Assemble with the Velvet Optimiser

Velvet Optimiser tool: Optimise your assembly with the following parameters:

“Start k-mer size”: 45

“End k-mer size”: 73

“Input file type”: Fastq

“Single or paired end reads”: Paired

“Select first set of reads”: mutant_R1.fastq

“Select second set of reads”: mutant_R2.fastq

Your history will now contain a number of new files:

Velvet optimiser contigs
- A fasta file of the final assembled contigs
Velvet optimiser contig stats
- A table of the lengths (in k-mer length) and coverages (k-mer coverages) for the final contigs.

Have a look at each file.
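Because the contig stats table reports lengths and coverages in k-mer units, it can help to convert them to bases. The sketch below assumes the relationships described in the Velvet manual: a contig of n k-mers spans n + k - 1 bases, and k-mer coverage Ck relates to base coverage C by Ck = C × (L - k + 1) / L, where L is the read length. The read length of 150 used here is a placeholder, since the tutorial does not state the read length of this dataset, and the function names are hypothetical.

```python
def kmer_length_to_bases(length_in_kmers, k):
    """Convert a contig length in k-mers to a length in bases."""
    return length_in_kmers + k - 1


def kmer_coverage_to_base_coverage(kmer_coverage, read_length, k):
    """Convert k-mer coverage to nucleotide coverage: C = Ck * L / (L - k + 1)."""
    if read_length < k:
        raise ValueError("reads shorter than k contribute no k-mers")
    return kmer_coverage * read_length / (read_length - k + 1)


# Illustrative values only: k = 55 as in this tutorial, read length assumed.
print(kmer_length_to_bases(1000, 55))
print(kmer_coverage_to_base_coverage(10.0, read_length=150, k=55))
```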

Hands On: Get contig statistics for Velvet Optimiser contigs

Fasta Statistics tool: Produce a summary of the velvet optimiser contigs:

“fasta or multifasta file”: Select your velvet optimiser contigs file

View the output

Question

Compare the output we got here with the output of the simple assemblies obtained in the introductory tutorial.

What are the main differences between them?

Which has a higher “n50”? What does this mean?

[Figure: tables of results from (a) the simple assembly and (b) the optimised assembly]

Heuristic Resolution of Repeats and Scaffolding in the Velvet Short-Read de Novo Assembler (Zerbino et al. 2009)

Visualisation of the Assembly

Now that we’ve assembled the genome, let’s visualise the assembly using Bandage (Wick et al. 2015). This tool lets us see what the assembly graph really looks like, and can give us a feeling for whether the genome was well assembled.

Currently VelvetOptimiser does not include the LastGraph output, so we will manually run velveth and velvetg with the optimised parameters.

Hands On: Find the optimised k-mer value

Locate the output called “VelvetOptimiser: Contigs” in your history

Click the (i) information icon

Check the tool stderr in the information page for the optimised k-mer value

Question

What was the optimal k-mer value? (referred to as “hash” in the stderr log)

55

With this information in hand, let’s run velvet:

Hands On: Manually running velvetg/h

velveth tool: Prepare a dataset for the Velvet velvetg Assembler

“Hash length”: 55

“Insert Input Files”:

1: Input Files

“file format”: fastq

“read type”: shortPaired reads

“Dataset”: mutant_R1.fastq

“Insert Input Files”:

2: Input Files

“file format”: fastq

“read type”: shortPaired reads

“Dataset”: mutant_R2.fastq

velvetg tool: Velvet sequence assembler for very short reads

“Velvet dataset”: output from velveth tool

“Generate velvet LastGraph file”: Yes

“Coverage cutoff”: Specify Cutoff Value

“Remove nodes with coverage below”: 1.44

“Using Paired Reads”: Yes

The LastGraph contains a detailed representation of the de Bruijn graph, which can give us an idea of how Velvet has assembled the genome and potentially resolved any conflicts.
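To get an intuition for what such a graph encodes, here is a toy sketch of de Bruijn graph construction. It uses the textbook formulation in which nodes are (k-1)-mers and each k-mer in a read adds an edge. This is a simplification: it ignores reverse complements, sequencing errors and coverage, all of which Velvet handles, and the reads are made up for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The k-mer links its prefix node to its suffix node.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Three overlapping, error-free toy reads (hypothetical data).
toy_reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(toy_reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

In this toy case every node has exactly one successor, so the graph is a single unbranched path that spells out ATGGCGTGCA. Real graphs branch wherever repeats or errors occur, and that branching is what Bandage draws.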

Hands On: Bandage

Bandage Image tool: visualize de novo assembly graphs

“Graphical Fragment Assembly”: The “LastGraph” output of velvetg tool

“Produce jpg, png or svg file?”: .svg

Execute

View the output file

And now you should be able to see the graph that velvet produced:

Interpreting Bandage Graphs

k-mer size has a significant effect on the assembly. You can play around with various k-mers to see this effect in practice.

[Table: Bandage graphs of the assembly at k-mer sizes 21, 33, 53 and 77]
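One reason k-mer size matters is that a repeat stops causing a branch in the graph once the graph nodes are longer than the repeat. The sketch below counts branching nodes in a toy de Bruijn graph built from a made-up sequence that contains the 7-base repeat GATTACA twice. It is a simplified model (single strand, no errors, nodes are (k-1)-mers), not how Velvet itself is implemented.

```python
from collections import defaultdict


def count_branching_nodes(sequence, k):
    """Count (k-1)-mer nodes that have more than one distinct successor."""
    successors = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        successors[kmer[:-1]].add(kmer[1:])
    return sum(1 for nxt in successors.values() if len(nxt) > 1)


# Made-up sequence: CC + GATTACA + TG + GATTACA + GG.
toy_genome = "CCGATTACATGGATTACAGG"
for k in (5, 7, 9):
    print(k, count_branching_nodes(toy_genome, k))
```

With k = 5 and k = 7 the end of the repeat has two possible continuations, so there is one branching node. With k = 9 the 8-base nodes span the whole repeat plus a flanking base and the branch disappears. In real data larger k-mers also need more coverage and are more sensitive to read errors, which is the trade-off the optimiser is searching over.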

The next thing to be aware of is that there can be multiple valid interpretations of a graph, all equally valid in the absence of other data. The following is taken verbatim from Bandage’s wiki:

For a simple case, imagine a bacterial genome that contains a single repeated element in two separate places in the chromosome:

A researcher (who does not yet know the structure of the genome) sequences it, and the resulting 100 bp reads are assembled with a de novo assembler:

Because the repeated element is longer than the sequencing reads, the assembler was not able to reproduce the original genome as a single contig. Rather, three contigs are produced: one for the repeated sequence (even though it occurs twice) and one for each sequence between the repeated elements.

Given only the contigs, the relationship between these sequences is not clear. However, the assembly graph contains additional information which is made apparent in Bandage:

There are two principal underlying sequences compatible with this graph: two separate circular sequences that share a region in common, or a single larger circular sequence with an element that occurs twice:

Additional knowledge, such as information on the approximate size of the bacterial chromosome, can help the researcher to rule out the first alternative. In this way, Bandage has assisted in turning a fragmented assembly of three contigs into a completed genome of one sequence.
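The ambiguity described in this example can be checked mechanically. In the sketch below the contig names are hypothetical: “R” stands for the repeated element and “A” and “B” for the two sequences between its copies. The graph has an edge wherever one contig can follow another, and a candidate genome is accepted if every adjacency it implies is an edge and every edge is used.

```python
# Hypothetical contig names: "R" is the repeat, "A" and "B" are the unique regions.
edges = {("R", "A"), ("A", "R"), ("R", "B"), ("B", "R")}


def is_consistent(circular_paths, edges):
    """True if all adjacencies in the circular paths are graph edges
    and every graph edge is used at least once."""
    used = set()
    for path in circular_paths:
        for i, node in enumerate(path):
            # Wrap around at the end because each path is circular.
            step = (node, path[(i + 1) % len(path)])
            if step not in edges:
                return False
            used.add(step)
    return used == edges


one_chromosome = [["R", "A", "R", "B"]]   # one circle, repeat occurs twice
two_replicons = [["R", "A"], ["R", "B"]]  # two circles sharing the repeat
print(is_consistent(one_chromosome, edges))
print(is_consistent(two_replicons, edges))
```

Both candidates pass, so the graph alone cannot distinguish them. Outside information, such as the expected chromosome size, is needed to choose between them.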

Assemble with SPAdes

We will now perform an assembly with the much more modern SPAdes assembler (Bankevich et al. 2012). Like Velvet, it builds and simplifies de Bruijn graphs, but it uses multiple values for the k-mer size and combines the resulting graphs. This combination produces very good assemblies. When using SPAdes it is typical to choose at least 3 k-mer sizes: one low, one medium and one high. We will use 33, 55 and 91.
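Before submitting a k-mer list it can be worth checking it. The sketch below assumes the constraints given in the SPAdes manual as recalled here (k-mer sizes must be odd, smaller than 128 and listed in ascending order), so verify them against the SPAdes version on your Galaxy server. It also enforces the comma-separated, no-spaces format that the tool form expects. The function name is hypothetical.

```python
def parse_kmer_list(text):
    """Parse a comma-separated k-mer list such as '33,55,91' and validate it."""
    if any(ch.isspace() for ch in text):
        raise ValueError("k-mer list must not contain spaces")
    values = [int(part) for part in text.split(",")]
    for k in values:
        # Assumed SPAdes constraints: odd and below 128.
        if k % 2 == 0 or not 0 < k < 128:
            raise ValueError(f"k-mer size {k} must be odd and below 128")
    if values != sorted(set(values)):
        raise ValueError("k-mer sizes must be distinct and in ascending order")
    return values


print(parse_kmer_list("33,55,91"))  # prints [33, 55, 91]
```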

Hands On: Assemble with SPAdes

SPAdes tool: Assemble the reads:

“Run only assembly”: yes

“K-mers to use separated by commas”: 33,55,91 [note: no spaces!]

“Coverage cutoff”: auto

“Files -> forward reads”: mutant_R1.fastq

“Files -> reverse reads”: mutant_R2.fastq

“Output final assembly graph with scaffolds?”: Yes

You will now have 5 new files in your history:

two Fasta files, one for contigs and one for scaffolds
two statistics files, one for contigs and one for scaffolds
the SPAdes log file.

Examine each file, especially the stats files.

Question

Why would one of the contigs have much higher coverage than the others?

What could this represent?

Hands On: Visualize assembly with Bandage

Bandage tool with the following parameters:

“Graphical Fragment Assembly”: assembly graph with scaffolds output from SPAdes tool

Examine the output image

The visualized assembly should look something like this:

Question

Which assembly looks better to you? Why?

Hands On: Get contig statistics for SPAdes contigs

Fasta Statistics tool: Produce a summary of the SPAdes contigs:

“fasta or multifasta file”: Select your SPAdes contigs file

Look at the output file.

Question

Compare the output we got here with the output of the simple assemblies obtained in the introductory tutorial.

What are the main differences between them?

Did SPAdes produce a better assembly than the Velvet Optimiser?

You've Finished the Tutorial

Key points

We learned about how the choice of k-mer size will affect assembly outcomes

We learned about the strategies that assemblers use to make reference genomes

We performed a number of assemblies with Velvet and SPAdes.

For actual assemblies, you should now use SPAdes or another assembler more modern than Velvet.

Frequently Asked Questions

Have questions about this tutorial? Have a look at the available FAQ pages and support channels

References

Zerbino, D. R., and E. Birney, 2008 Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18: 821–829. 10.1101/gr.074492.107
Zerbino, D. R., G. K. McEwen, E. H. Margulies, and E. Birney, 2009 Pebble and Rock Band: Heuristic Resolution of Repeats and Scaffolding in the Velvet Short-Read de Novo Assembler (S. L. Salzberg, Ed.). PLoS ONE 4: e8407. 10.1371/journal.pone.0008407
Bankevich, A., S. Nurk, D. Antipov, A. A. Gurevich, M. Dvorkin et al., 2012 SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology 19: 455–477. 10.1089/cmb.2012.0021
Wick, R. R., M. B. Schultz, J. Zobel, and K. E. Holt, 2015 Bandage: interactive visualization of de novo genome assemblies. Bioinformatics 31: 3350–3352. 10.1093/bioinformatics/btv383

Citing this Tutorial

Simon Gladman, Helena Rasche, Saskia Hiltemann, De Bruijn Graph Assembly (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/assembly/tutorials/debruijn-graph-assembly/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{assembly-debruijn-graph-assembly,
author = "Simon Gladman and Helena Rasche and Saskia Hiltemann",
	title = "De Bruijn Graph Assembly (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/assembly/tutorials/debruijn-graph-assembly/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/assembly/tutorials/debruijn-graph-assembly/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: fasta_stats
  owner: iuc
  revisions: 9c620a950d3a
  tool_panel_section_label: FASTA/FASTQ
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: spades
  owner: nml
  revisions: b7829778729f
  tool_panel_section_label: Assembly
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: velvetoptimiser
  owner: simon-gladman
  revisions: 37d88f41c810
  tool_panel_section_label: Assembly
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
