Assembly of metagenomic sequencing data

Overview
Creative Commons License: CC-BY Questions:
  • Why metagenomic data should be assembled?

  • What is the difference between co-assembly and individual assembly?

  • What is the difference between reads, contigs and scaffolds?

  • How tools based on De Bruijn graph work?

  • How to assess the quality of metagenomic data assembly?

Objectives:
  • Describe what an assembly is

  • Describe what de-replication is

  • Explain the difference between co-assembly and individual assembly

  • Explain the difference between reads, contigs and scaffolds

  • Explain how tools based on De Bruijn graph work

  • Apply appropriate tools for analyzing the quality of metagenomic data

  • Construct and apply simple assembly pipelines on short read data

  • Apply appropriate tools for analyzing the quality of metagenomic assembly

  • Evaluate the Quality of the Assembly with Quast, Bowtie2, and CoverM-Genome

Requirements:
Time estimation: 2 hours
Level: Introductory Introductory
Supporting Materials:
Published: Apr 20, 2024
Last modification: Apr 20, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
purl PURL: https://gxy.io/GTN:T00035
version Revision: 0

Metagenomics involves the extraction, sequencing and analysis of combined genomic DNA from entire microbiome samples. It includes then DNA from many different organisms, with different taxonomic background.

Reconstructing the genomes of microorganisms in the sampled communities is critical step in analyzing metagenomic data. To do that, we can use assembly and assemblers, i.e. computational programs that stich together the small fragments of sequenced DNA produced by sequencing instruments.

Assembling seems intuitively similar to putting together a jigsaw puzzle. Essentially, it looks for reads “that work together” or more precisely, reads that overlap. Tasks like this are not straightforward, but rather complex because of the complexity of the genomics (specially the repeats), the missing pieces and the errors introduced during sequencing.

Comment

Do you want to learn more about the principles behind single genome assembly? Follow our tutorials.

Metagenomic assembly is further complicated by

  • the large volume of data produced
  • the quality of the sequence
  • the unequal representation of members of the microbial community
  • the presence of closely related microorganisms with similar genomes
  • the presence of several strains of the same microorganism
  • an insufficient amount of data for minor community members

For assembly, there are 3 main strategies:

  1. Greedy extension
  2. Overlap Layout Consensus
  3. De Bruijn graphs. The following figure illustrates these strategies in brief.
Image shows greedy extention, overlap layout consensus, and de Brujin graphs assembly algorithms. Open image in new tab

Figure 1: Assembly algorithms. Image from Metagenome Assembly – Data Processing and Visualization for Metagenomics

The nice paper Miller et al. 2010 on assemblers based on these algorithms will help you to better understand how they work.

For metagenomic assembly, several tools exist: metaSPAdes (Nurk et al. 2017), MEGAHIT (Li et al. 2015), etc. The different assemblers have different computational characteristics and their performance varies according to the microbiome as shown in by both rounds of Critical Assessment of Metagenome Interpretation initiative (Sczyrba et al. 2017, Meyer et al. 2022, Meyer et al. 2021). The preference of one assembler over another depends on the purpose at hand.

In this tutorial, we will learn how to run metagenomic assembly tool and evaluate the quality of the generated assemblies. To do that, we will use data from the study: Temporal shotgun metagenomic dissection of the coffee fermentation ecosystem. For an in-depth analysis of the structure and functions of the coffee microbiome, a temporal shotgun metagenomic study (six time points) was performed. The six samples have been sequenced with Illumina MiSeq utilizing whole genome sequencing.

Based on the 6 original dataset of the coffee fermentation system, we generated mock datasets for this tutorial.

Agenda

In this tutorial, we will cover:

  1. Prepare analysis history and data
  2. Assembly
  3. Quality control of assembly
    1. Assembly statistics
    2. Icarus contig browser
  4. Conclusion

Prepare analysis history and data

To run assembly, we first need to get the data into Galaxy. Any analysis should get its own Galaxy history. So let’s start by creating a new one:

Hands-on: Prepare the Galaxy history
  1. Create a new history for this analysis

    Click the new-history icon at the top of the history panel:

    UI for creating new history

  2. Rename the history

    1. Click on galaxy-pencil (Edit) next to the history name (which by default is “Unnamed history”)
    2. Type the new name
    3. Click on Save

    If you do not have the galaxy-pencil (Edit) next to the history name:

    1. Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel
    2. Type the new name
    3. Press Enter

We need to get the data into our history.

In case of a not very large dataset it’s more convenient to upload data directly from your computer to Galaxy.

Hands-on: Upload data into Galaxy
  1. Import the sequence read raw data (*.fastqsanger.gz) from Zenodo or a data library:

    https://zenodo.org/record/7818827/files/ERR2231567_1.fastqsanger.gz
    https://zenodo.org/record/7818827/files/ERR2231567_2.fastqsanger.gz
    https://zenodo.org/record/7818827/files/ERR2231568_1.fastqsanger.gz
    https://zenodo.org/record/7818827/files/ERR2231568_2.fastqsanger.gz
    https://zenodo.org/record/7818827/files/ERR2231569_1.fastqsanger.gz
    https://zenodo.org/record/7818827/files/ERR2231569_2.fastqsanger.gz
    https://zenodo.org/record/7818827/files/ERR2231570_1.fastqsanger.gz
    https://zenodo.org/record/7818827/files/ERR2231570_2.fastqsanger.gz
    https://zenodo.org/record/7818827/files/ERR2231571_1.fastqsanger.gz
    https://zenodo.org/record/7818827/files/ERR2231571_2.fastqsanger.gz
    https://zenodo.org/record/7818827/files/ERR2231572_1.fastqsanger.gz
    https://zenodo.org/record/7818827/files/ERR2231572_2.fastqsanger.gz
    
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    1. Go into Shared data (top panel) then Data libraries
    2. Navigate to the correct folder as indicated by your instructor.
      • On most Galaxies tutorial data will be provided in a folder named GTN - Material –> Topic Name -> Tutorial Name.
    3. Select the desired files
    4. Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu
    5. In the pop-up window, choose

      • “Select history”: the history you want to import the data to (or create a new one)
    6. Click on Import

    Comment

    In case of large dataset, we can use FTP server or the Galaxy Rule-based Uploader.

  2. Create a paired collection named Raw reads, rename your pairs with the sample name

    • Click on galaxy-selector Select Items at the top of the history panel Select Items button
    • Check all the datasets in your history you would like to include
    • Click n of N selected and choose Build List of Dataset Pairs

    • Change the text of unpaired forward to a common selector for the forward reads
    • Change the text of unpaired reverse to a common selector for the reverse reads
    • Click Pair these datasets for each valid forward and reverse pair.
    • Enter a name for your collection
    • Click Create List to build your collection
    • Click on the checkmark icon at the top of your history again

Assembly

As explained before, there are many challenges to metagenomics assembly, including:

  1. differences in coverage between samples, resulting from differences in abundance,
  2. the fact that different species often share conserved regions (Kececioglu and Ju 2001), and
  3. the presence of multiple strains of a single species (Miller et al. 2010).

To reduce the differences in coverage between samples, we can use a co-assembly approach, where reads from all samples are aligned together.:

Image show one pile of sample1 reads and another pile of sample2 reads, then, green arrow leads to assembled reads from both piles. Open image in new tab

Figure 2: Co-assembly
Pros of co-assembly Cons of co-assembly
More data Higher computational overhead
Better/longer assemblies Risk of shattering your assembly
Access to lower abundant organisms Risk of increased contamination

Co-assembly is then not always beneficial:

  • Changes in strain can cause the assembly graph to collapse
  • Binned contigs are likely to be misclassified: MAGs must be treated as a population genome.
Image shows the process of assembled reads from 2 samples followed by binning and there is detailed information about broken contigs, dominant coverage, chimeric contigs found during assembly process. Open image in new tab

Figure 3: Co-assembly process

In these cases, co-assembly is reasonable if:

  • Same samples
  • Same sampling event
  • Longitudinal sampling of the same site
  • Related samples

If it is not the case, individual assembly should be prefered. In this case, an extra step of de-replication should be used:

Image shows the process of individual assembly on two strains and five samples, after individual assembly of samples two samples are chosen for de-replication process. In parallel, co-assembly on all five samples is performed. Open image in new tab

Figure 4: Individual assembly followed by de-replication vs co-assembly. Source: dRep documentation

Co-assembly is more commonly used than individual assembly and then de-replication after binning. But in this tutorial, to show all steps, we will run an individual assembly.

Comment

Sometimes it is important to run assembly tools both on individual samples and on all pooled samples, and use both outputs to get the better outputs for the certain dataset.

As mentioned in the introduction, several tools are available for metagenomic assembly. But 2 are the most used ones:

  • MetaSPAdes (Nurk et al. 2017): an short-read assembler designed specifically for large and complex metagenomics datasets

    MetaSPAdes is part of the SPAdes toolkit, which has several assembly pipelines. Since SPAdes handles non-uniform coverage, it is useful for assembling simple communities, but metaSPAdes also handles other problems, allowing it to assemble complex communities’ metagenomes.

    As input for metaSPAdes it can accept short reads. However, there is an option to use additionally long reads besides short reads to produce hybrid input.

  • MEGAHIT (Li et al. 2015): a single node assembler for large and complex metagenomics NGS reads, such as soil

    It makes use of succinct de Bruijn graph (SdBG) to achieve low memory assembly.

Both tools are available in Galaxy. But currently, only MEGAHIT can be used in individual mode for several samples.

Hands-on: Individual assembly of short-reads with MEGAHIT
  1. MEGAHIT ( Galaxy version 1.2.9+galaxy0) with parameters:
    • “Select your input option”: Paired-end collection
      • “Run in batch mode?”: Run individually

        Comment

        To run as co-assembly, select Merge all fastq pair-end, instead of Run individually

      • param-collection “Select a paired collection”: Raw reads

    • In Basic assembly options
      • “K-mer specification method”: Specify min, max, and step values
        • “Minimum kmer size”: 21
        • “Maximum kmer size”: 91
        • “Increment of kmer size of each iteration”: 12

MEGAHIT produced a collection of output assemblies - one per sample - that can be proceeded further in binning step and then de-replication. The output contains contigs, contiguous lengths of genomic sequences in which bases are known to a high degree of certainty.

Contrary to MetaSPAdes, MEGAHIT does not output scaffolds, i.e. segments of genome sequence reconstructed fron contigs and gaps. The gaps occur when reads from the two sequenced ends of at least one fragment overlap with other reads from two different contigs (as long as the arrangement is otherwise consistent with the contigs being adjacent). It is possible to estimate the number of bases between contigs based on fragment lengths.

Comment

Since the assembly process would take ~1h we are just going to import the results of the assembly previously run.

Hands-on: Import generated assembly files
  1. Import the six contig files from Zenodo or the Shared Data library:

    https://zenodo.org/record/7818827/files/contigs_ERR2231567.fasta
    https://zenodo.org/record/7818827/files/contigs_ERR2231568.fasta
    https://zenodo.org/record/7818827/files/contigs_ERR2231569.fasta
    https://zenodo.org/record/7818827/files/contigs_ERR2231570.fasta
    https://zenodo.org/record/7818827/files/contigs_ERR2231571.fasta
    https://zenodo.org/record/7818827/files/contigs_ERR2231572.fasta
    
  2. Create a collection named MEGAHIT Contig, rename your pairs with the sample name

Question
  1. How many contigs has been for ERR2231568 sample?
  2. And for ERR2231572?
  3. What is the minimum length of the contigs?
  1. There are 228,719 sequences in the file so 228,719 contigs
  2. 122,526 contigs
  3. Sequences seems bigger than 200 bp (len attribute of the sequence information in Fasta files). It is the default value set up in MEGAHIT.
Hands-on: Assembly with MetaSPAdes
  1. MetaSPAdes ( Galaxy version 3.15.4+galaxy2) with following parameters
    • “Pair-end reads input format”: Paired-end: list of dataset pairs
      • param-collection “FASTQ file(s): collection”: Raw reads
    • “Select k-mer detection option”: User specific
      • “K-mer size values”: 21,33,55,77

Quality control of assembly

Once assembly is done, it is important to check its quality.

Assemblies can be evaluated with metaQUAST (Mikheenko et al. 2016), the metagenomics mode of QUAST (Gurevich et al. 2013).

Hands-on: Evaluation assembly quality with metaQUAST
  1. Quast ( Galaxy version 5.2.0+galaxy1) with parameters:
    • “Assembly mode?”: Individual assembly (1 contig file per samples)
      • “Use customized names for the input files?”: No, use dataset names
        • param-collection “Contigs/scaffolds file”: output MEGAHIT
      • “Reads options”: Illumina paired-end reads in paired collection

        Comment

        To make the job quicker, you can select Disabled here. The raw reads will then not been mapped to the assembly to compute metrics, like the coverage.

      • param-collection “FASTQ/FASTA files”: Raw reads
    • “Type of assembly”: Metagenome
    • “Output files”: HTML report, PDF report, Tabular reports, Log file, Key metric summary (metagenome mode), Krona charts (metagenome mode without reference genomes)
  2. Inspect the HTML reports
Comment

Since the Quast process would take times we are just going to import the results:

Hands-on: Import generated metaQuast results
  1. Import the metaQuast report file from Zenodo or the Shared Data library:

    https://zenodo.org/record/7818827/files/quast_ERR2231567.html
    https://zenodo.org/record/7818827/files/quast_ERR2231568.html
    https://zenodo.org/record/7818827/files/quast_ERR2231569.html
    https://zenodo.org/record/7818827/files/quast_ERR2231570.html
    https://zenodo.org/record/7818827/files/quast_ERR2231571.html
    https://zenodo.org/record/7818827/files/quast_ERR2231572.html
    

Quast main output are HTML reports which aggregate different metrics.

Assembly statistics

On the top of each report is a table with in rows statistics for contigs larger than 500 bp for the different sample assemblies (columns). Let’s now look at the table and go from top to bottom:

  1. Genome statistics

    • Genome fraction (%): percentage of aligned bases in the reference genome

      A base in the reference genome is counted as aligned if at least one contig has at least one alignment to this base.

      We did not provide any reference there, but metaQuast try to identify genome content of the metagenome by aligning contigs to SILVA 16S rRNA database. For each assembly, 50 reference genomes with top scores are chosen. The full reference genomes of the identified organisms are afterwards downloaded from NCBI to map the assemblies on them and compute the genome fractions.

      For each identified genomes, the genome fraction is given when clicking on Genome fraction (%)

      Question
      1. What is the genome fraction for ERR2231568? And for ERR2231572?
      2. Which reference genome has the highest genome fraction for ERR2231568? And for ERR2231572?
      1. The genome fraction is 30.22% for ERR2231568 and 58.73% for ERR2231572
      2. The highest genome fraction was found for Leuconostoc pseudomesenteroides for ERR2231568 (844%) and for Lactobacillus for ERR2231572 (91%). The genomes of Leuconostoc pseudomesenteroides and Lactobacillus could be then almost completely recovered from the assemblies of ERR2231568 and ERR2231572 respectively.
    • Duplication ratio: total number of aligned bases / genome fraction * reference length

      If an assembly contains many contigs that cover the same regions of the reference, the duplication ratio may be much larger than 1.

      Question
      1. What is the duplication ratio for ERR2231568? And for ERR2231572?
      2. Which reference genome has the highest duplication ratio for ERR2231568? And for ERR2231572?
      1. The duplication ratio is 1.068 for ERR2231568 and 1.1 for ERR2231572 (column ERR2231572 in ERR2231572 report)
      2. The highest duplication ratio was found for Gluconobacter kondonii for ERR2231568 (1.156) and for Lactobacillus brevis KB290 for ERR2231572 (1.178).
  2. Read mapping: results of the mapping of the raw reads on the different assemblies (only if the “Reads options” is not disabled)

    Question
    1. What is the % of read mapped for ERR2231568 assembly to ERR2231568 raw reads? And for ERR2231572 assembly to ERR2231572 raw reads?
    2. What is the percentage of reads used to build the assemblies for ERR2231568? and ERR2231572?
    1. 79.47% of ERR2231568 raw reads were mapped to ERR2231568 assembly and 86.98% of ERR2231572 raw reads to ERR2231572 assembly.
    2. 79.47% of reads were used to the assemblies for ERR2231568 and 86.97% for ERR2231572.

    2 alternative ways to compute coverage are to

    1. Use CoverM, which is available in Galaxy

      Hands-on: Calculate coverage using CoverM
      1. CoverM-CONTIG ( Galaxy version 0.2.1) with parameters:
        • “Read type”: Paired collection
          • param-collection “One or more pairs of forward and reverse possibly gzipped FASTA/Q files for mapping in order”: Raw reads
        • param-collection “FASTA file(s) of contigs”: output of MEGAGIT
      2. Inspect the HTML report for ERR2231568
    2. Map the original reads onto contigs and extract the percentage of mapped reads:

      Hands-on: Computation of the % reads used in assemblies
      1. Bowtie2 ( Galaxy version 2.5.0+galaxy0) with the following parameters:
        • “Is this single or paired library”: Paired-end Dataset Collection
          • param-collection “FASTQ Paired Dataset”: Raw reads
        • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from the history and build index
          • param-collection “Select reference genome”: MEGAHIT output
        • “Save the Bowtie2 mapping statistics to the history”: Yes
      2. Inspect the mapping statistics output
      Question
      1. What is the overall alignment rate for ERR2231567? and ERR2231571?
      2. What is the percentage of reads used in assemblies for ERR2231567? and ERR2231571?
      1. The overall alignment rate for ERR2231567 is 65.97% and 73.67% for ERR2231571
      2. 65.97% of the reads were used in assemblies for ERR2231567 and 73.67% for ERR2231571.
  3. Misassemblies: joining sequences that should not be adjacent.

    Quast identifies missassemblies by mapping the contigs to the reference genomes of the identified organisms. 3 types of misassemblies can be identified:

    Image shows on the top a contig with a blue and a gren parts with white arrows (pointing on the right) on them and below a reference with 2 chromosomes. The 3 types of misassemblies are after schematized. Relocation: the blue and gren parts of the contig are on chr 1 but separated. Inversion: the blue and gren parts of the contig are on chr 1 but separated and with the arrows facing each other. Translocation: the blue part is on chr 1 and gren part on chr 2.Open image in new tab

    Figure 5: Source: QUAST manual
    1. Relocation occur based on signal from two mappings of the same contig against the same chromosome, with 2 cases:
      1. the 2 mappings are separated by an unmapped region of at least 1 kbp
      2. they map on the same chromosome with a shared mapping area of at least 1 kbp
      Image shows on the top a contig with a blue and a gren parts with white arrows (pointing on the right) on them and below a reference with 2 chromosomes. The 3 types of misassemblies are after schematized. Relocation: the blue and gren parts of the contig are on chr 1 but separated. Inversion: the blue and gren parts of the contig are on chr 1 but separated and with the arrows facing each other. Translocation: the blue part is on chr 1 and gren part on chr 2.
      Open image in new tab

      Figure 6: Source: Yet Another Bioinformatic blog by Pierre Marijon
      Question
      1. How many relocations has been found for ERR2231568? And for ERR2231572?
      2. For which reference genomes are there the most relocation found for ERR2231568? And for ERR2231572?
      1. 78 for ERR2231568 and 151 for ERR2231572
      2. Leuconostoc pseudomesenteroides and Tatumella morbirosei for ERR2231568 and Lactobacillus plantarum argentoratensis for ERR2231572
    2. Translocation occur when a contig has mapped on more than one reference chromosomes

      Question
      1. How many translocations has been found for ERR2231568? And for ERR2231572?
      2. For which reference genomes are there the most translocations found for ERR2231568? And for ERR2231572?
      3. What are the interspecies translocations?
      4. How many interspecies translocations has been found for ERR2231568? And for ERR2231572?
      1. 25 for ERR2231568 and 55 for ERR2231572.
      2. Leuconostoc pseudomesenteroides for ERR2231568 and Lactobacillus vaccinostercus for ERR2231572.
      3. Interspecies translocations are translocations where a contig has mapped on different reference genomes.
      4. 80 for ERR2231568 and 144 for ERR2231572.
    3. Inversion occurs when a contig has two consecutive mappings on the same chromosome but in different strands

      Question
      1. How many inversion has been found for ERR2231568? And for ERR2231572?
      2. For which reference genomes are there the most inversions found for ERR2231568? And for ERR2231572?
      1. 4 for ERR2231568 and 6 for ERR2231572.
      2. Tatumella morbirosei for ERR2231568 and Lactobacillus sp for ERR2231572.
  4. Mismatches or mismatched bases in the contig-reference alignment

    Question
    1. How many mismatches have been identified for ERR2231568? And for ERR2231572?
    2. For which reference genomes are there the most mismatches for ERR2231568? And for ERR2231572?
    1. 503,352 for ERR2231568 and 287,270 for ERR2231572.
    2. Pantoea rwandensis for ERR2231568 and Leuconostoc brevis KB290 for ERR2231572.
  5. Statistics without reference

    • # contigs: total number of contigs

      Question
      1. How many contigs are for ERR2231568? And for ERR2231572?
      2. How many sequences are in the output of MEGAHIT for ERR2231568? And for ERR2231572?
      3. Why are these numbers different from the number of sequences in the output of MEGAHIT?
      4. Which statistics in the metaQUAST report corresponds to number of sequences in the output of MEGAHIT?
      5. Which reference genomes have the most contigs (\(\geq\) 500 bp) in ERR2231568? And in ERR2231572?
      1. 66,434 contigs for ERR2231568 and 36,112 for ERR2231572.
      2. In the outputs of MEGAHIT, there are 228,719 contigs for ERR2231568 and 122,526 contigs.
      3. The numbers are lower in the metaQUAST results because metaQUAST reports there only the contigs longer than 500bp.
      4. The # contigs (>= 0 bp)
      5. Except the non aligned contigs, Tatumella morbirosei for ERR2231568 and Leuconostoc brevis KB290 for ERR2231572.
    • Largest contig: length of the longest contig in the assembly

      Question
      1. What is the length of the longest contig in ERR2231568? And in ERR2231572?
      2. Is the longest contig assigned to a reference genome in ERR2231568? And in ERR2231572?
      1. 63,871 bp in ERR2231568 and 65,608 for ERR2231572.
      2. It is assigned to Leuconostoc pseudomesenteroides KCTC 3652 in ERR2231568 and not assigned in ERR2231572.
    • N50: length for which the collection of all contigs of that length or longer covers at least half an assembly

      N50 statistic defines assembly quality in terms of contiguity. If all contigs in an assembly are ordered by length, the N50 is the minimum length of contigs that contains 50% of the assembled bases. For example, an N50 of 10,000 bp means that 50% of the assembled bases are contained in contigs of at least 10,000 bp.

      Another example. Let’s consider 9 contigs with the lengths 2, 3, 4, 5, 6, 7, 8, 9, and 10:

      • The sum of the length is 54
      • Half of the sum is 27
      • 10 + 9 + 8 = 27 (half the length of the sequence)
      • N50 = 8, i.e. the size of the contig which, along with the larger contigs, contain half of sequence of a particular genome
      Question
      1. What is N50 for ERR2231568? And for ERR2231572?
      2. What is N90?
      1. 921 for ERR2231568 and 1,233 for ERR2231572.
      2. N90 is similar to the N50 metric but with 90% of of the sum of the lengths of all contigs

      When comparing N50 values from different assemblies, the assembly sizes must be the same size in order for N50 to be meaningful.

      Also the N50 alone is not a useful measure to assess the quality of an assembly. For example, the assemblies with the following contig lengths:

      • 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 25, 25, 150, 1500
      • 50, 500, 530, 650

      Both assemblies have the same N50 although one is more contiguous than the other.

    • L50: number of contigs equal to or longer than N50

      In other words, L50, for example, is the minimal number of contigs that cover half the assembly.

      If we take the previous example in N50, L50 = 3.

      Question
      1. What is the L50 for ERR2231568? And for ERR2231572?
      1. 17,280 for ERR2231568 and 7,496 for ERR2231572.

Icarus contig browser

Icarus generates contig size viewer and one or more contig alignment viewers (if reference genome/genomes are provided) that are accessible from the HTML report, by clicking on View on Icarus contig browser.

Contig size viewer

This viewer draws contigs ordered from longest to shortest. Let’s inspect this viewer for ERR2231568.

Question

Open the Contig size viewer for ERR2231568 and define start as 0 and end as 500000

Image shows on the Icarus Contig size viewer for ERR2231568, with a zoom between 0 500000. Below the menu, the contigs are drawn from the longest on the left to the shortest on the right. Each contig is filled with a different color: green for correct, red for misassembled, etc. Below the contigs is dispayed a bar to navigate through the contigs.

  1. What is the color of the first contig? Why?
  2. What is the red contig?
  1. The first contig is white because >50% of the contigs is unaligned. By clicking on the contig, we see that only a small block is aligned: 223.41 – 223.65 kbp to Leuconostoc_pseudomesenteroides_KCTC_3652_NZ_BMBP01000002.1.
  2. The red contig is a missamblied contig: it contains 2 blocks, with a translocation between them.

Click Main menu on the top left to go back to the main Icarus page.

Contig alignment viewer

If a reference genome is provided, there should be a table on the main Icarus page that looks like:

Genome # fragments Length, bp Mean genome fraction, % # misassembled blocks
Gluconobacter oxydans H24 2 3 816 232 10.989 38
Kosakonia cowanii 5 4 806 998 28.224 60
Lactococcus lactis subsp. lactis CV56 6 2 518 737 18.940 23  

When clicking on the genome name, the contigs are displayed according to their mapping to the reference genome. The viewer can additionally visualize genes, operons, and read coverage distribution along the genome, if any of those were fed to QUAST.

Question

Open the Contig alignment viewer for ERR2231568 and Leuconostoc pseudomesenteroides KCTC 3652, the most covered by contigs

Image shows on the Icarus alignment size viewer of ERR2231568 contigs for Leuconostoc pseudomesenteroides KCTC 3652. On the top are displayed the contigs aligend on the reference genome, with a zoom. Each contig is filled with a different color: green for correct, red for misassembled, etc. Below the contigs is dispayed a bar to navigate through the contigs. Below is a graph representing the GC percentage and coverage along the reference genome.

  1. How are the organized the contigs on the top?
  2. What the different colors for the contigs?
  3. Why is there a big red block on the right?
  4. What is the graph on the bottom?
  1. The contigs are displayed based on their mapping on the reference genome of Leuconostoc pseudomesenteroides KCTC 3652
  2. The different colors represent the different status for the contig: correct contigs, correct contigs but with >50% of the contig unaligned, misassembled blocks. unchecked misassembled blocks, ambiguously mapped contigs, alternative blocks of misassembled contigs, etc.
  3. The big red block on the right is contig k91_88833 with a misassembly on the left side, and overlap with 2 other contigs
  4. The graph on the bottom represents the GC percentage and coverage by contigs along the reference genome

Current metagenome assemblers like MEGAHIT and MetaSPAdes use graphs, most typically a de Bruijn graph to stich reads together. In an ideal case, the graph would contain one distinct path for each genome of each micro-organisms, but complexities such as repeated sequences usually prevent this.

Assembly graphs contain then branching structures: one node may lead into multiple others. Contigs correspond to the longest sequences in the graph that can be determined unambiguously. They are the final results of most assembler. But the assembly graph contains more information. It can be useful for finding sections of the graph, such as rRNA, or to try to find parts of a genome.

Bandage (Wick et al. 2015) is a tool creating interactive visualisations of assembly graphs.

Hands-on: Visualization the assembly graph
  1. megahit contig2fastg ( Galaxy version 1.1.3+galaxy10) with parameters:
    • param-collection “Contig file”: Output of MEGAHIT
    • “K-mer length”: 91

      Comment

      To get the value, you need to

      1. Go into the MEGAHIT output collection
      2. Expand one of the contig file by clicking on it in the history
      3. Check in the dataset peek the name of the contig
      4. Extract the value after the first k in the contig names
  2. Bandage Image ( Galaxy version 0.8.1+galaxy4) with parameters:
    • param-collection “Graphical Fragment Assembly”: Output of megahit contig2fastg
  3. Inspect the generated image for ERR2231571
Image shows the assembly graphs, with longer stretch on the top and many small contigs on the bottom. Open image in new tab

Figure 7: Assembly graph for ERR2231571 sample

The graph is quite disconnected. On the top, we can see the longer stretches, that includes multiples contigs (each contig having a different color). On the bottom are the shortest stretches or single contigs.

But it is really hard to read or extract any information from the graph. Let’s inspect the information about the assembly graph

Hands-on: Visualization the *de novo* assembly graph
  1. Bandage Info ( Galaxy version 0.8.1+galaxy2) with parameters:
    • param-collection “Graphical Fragment Assembly”: Output of megahit contig2fastg
  2. Column join ( Galaxy version 0.0.3) with parameters:
    • param-collection “Tabular files”: Output of Bandage Info
  3. Inspect the generated output
Question
  1. How many nodes are in the graph for ERR2231568? And for ERR2231572? What does they correspond to?
  2. How many edges are in the graph for ERR2231568? And for ERR2231572? What is the impact of these numbers in relation to the number of nodes on the graph?
  3. How many connected components are there for ERR2231568? And for ERR2231572? What does they correspond to?
  4. What is the percentage of dead ends are there for ERR2231568? And for ERR2231572?
  5. What are the smallest and larges edge overlaps?
  6. What is the largest component? For which sample?
  7. What is the shortest node? What does they correspond to?
  1. There are 228,719 nodes for ERR2231568 and 122,526 for ERR2231572. They correspond to the number of contigs
  2. There are 16,580 edges for ERR2231568 and 13,993 for ERR2231572. There are less edges than nodes in the graph. It means that many nodes/contigs are disconnected
  3. There are 212,598 connected components, i.e. number of regions of the graph which are disconnected from each other, for ERR2231568 and 109,044 for ERR2231572
  4. There are 94.0702% dead ends, i.e. the end of a node not connected to any other nodes, for ERR2231568 and 90.7032% for ERR2231572. It confirms the previous observation
  5. The smallest and larges edge overlaps are 91bp, i.e. the k-mer length
  6. The largest component is 340,003 bp for ERR2231567
  7. The shortest node is 200 bp, i.e. the minimal size for a contig

Conclusion

Metagenomic data can be assembled to, ideally, obtain the genomes of the species that are represented within the input data. But metagenomic assembly is complex and there are

  • different approaches like de Bruijn graphs methods
  • different strategies, such as co-assembly, when we assembly all samples together, and individual assembly, when we assembly samples one by one
  • different tools like MetaSPAdes and MEGAHIT

Once the choices made, metagenomic assembly can start:

  1. Input data are assembled to obtain contigs and sometimes scaffolds
  2. Assembly quality is evaluated with various metrics
  3. The assembly graph can be visualized.

Once all these steps done, we can move to the next phase to build Metagenomics Assembled Genomes (MAGs): binning