
An introduction to get started in genome assembly and annotation

Last modification: Nov 29, 2022

The truth about bioinformatics

.image-100[ Bioinformatics is not just about pushing a button and getting your result ]


Let’s start with some important definitions

.image-40[ Importance to speak the same language ]


.left[ Contig: a contiguous sequence in an assembly. A contig does not contain long stretches of unknown sequences (aka assembly gaps).

Scaffold: a sequence consisting of one or more contigs connected by assembly gaps, typically of inexact size. A scaffold is also called a supercontig, though this term is rarely used nowadays.

Assembly: a set of contigs or scaffolds. ]
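The relationship between scaffolds and contigs can be illustrated with a minimal sketch. It assumes that assembly gaps are written as runs of `N` characters and that a run of at least `min_gap` bases counts as a gap; the function name `split_scaffold` and the threshold of 10 are illustrative choices, not a standard.

```python
import re


def split_scaffold(scaffold_seq, min_gap=10):
    """Split a scaffold into contigs at runs of at least `min_gap` N characters.

    Shorter runs of N are treated as ambiguous bases inside a contig
    (an assumption made for this example).
    """
    gap_pattern = re.compile(r"[Nn]{%d,}" % min_gap)
    return [piece for piece in gap_pattern.split(scaffold_seq) if piece]


# A toy scaffold: three contigs joined by two assembly gaps.
scaffold = "ACGTACGT" + "N" * 100 + "GGCCTTAA" + "N" * 20 + "TTGACA"
contigs = split_scaffold(scaffold)
print(len(contigs), [len(c) for c in contigs])  # 3 [8, 8, 6]
```

In this picture an assembly is simply a collection of such sequences, for example a dictionary mapping scaffold names to sequences.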

.image-60[ Illustration of the working principle of scaffolding ]


.left[ Haplotig: a contig whose sequence comes from a single haplotype. In an unphased assembly, by contrast, a contig may join alleles from different parental haplotypes in a diploid or polyploid genome.

Primary assembly: a complete assembly with long stretches of phased blocks.

Alternate assembly: an incomplete assembly consisting of haplotigs in heterozygous regions. An alternate assembly always accompanies a primary assembly. It is not useful by itself as it is fragmented and incomplete.

Haplotype-resolved assembly: sets of complete assemblies consisting of haplotigs, representing an entire diploid/polyploid genome. ]

.image-60[ Illustration of the assembly types ]


.left[ Telomere to telomere: An assembly where each chromosome is fully phased and assembled without gaps.

Linkage group: a set of contigs or scaffolds ordered and oriented using a collection of genes that are inferred to be located together on a single chromosome because of the pattern of their inheritance. ]

.image-20[ Example of linkage map from JCVI ]


.left[ Coverage in terms of redundancy (A): the number of reads that align to, or “cover,” a known reference. It describes how often, on average, a reference sequence is covered by bases from the reads.

Coverage in terms of the percentage of a reference covered by reads (B): for example, if 90% of a reference is covered by reads (and 10% is not), the coverage is 90%.

Sequencing depth (C): total number of usable reads from the sequencing machine. ]
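The three quantities above can be computed from a set of aligned reads. The following is a minimal sketch, assuming each read is described only by a hypothetical `(start, end)` interval on the reference (0-based, end exclusive) and ignoring mismatches, clipping, and filtering of unusable reads.

```python
def coverage_metrics(reads, reference_length):
    """Return (redundancy, percent_covered, depth) for aligned reads.

    reads: list of (start, end) intervals, 0-based, end exclusive.
    """
    per_base = [0] * reference_length
    for start, end in reads:
        for pos in range(max(start, 0), min(end, reference_length)):
            per_base[pos] += 1

    # (A) average number of read bases covering each reference position
    redundancy = sum(per_base) / reference_length
    # (B) percentage of reference positions covered by at least one read
    percent_covered = 100 * sum(1 for c in per_base if c > 0) / reference_length
    # (C) total number of reads, following the definition used on this slide
    depth = len(reads)
    return redundancy, percent_covered, depth


reads = [(0, 5), (3, 8), (4, 9)]
redundancy, percent_covered, depth = coverage_metrics(reads, reference_length=10)
print(redundancy, percent_covered, depth)  # 1.5 90.0 3
```

Real alignments are usually summarised with dedicated tools; the nested loop here is only meant to make the definitions concrete.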

.image-80[ Illustration of coverage and depth ]


Assembly and annotation in an ideal world

.image-100[ The perfect assembly dream ]


Key concepts for assembly and annotation

.image-100[ Assembly and annotation overview ]


Steps before starting a genome project

.left[


Build a wide community for the project (if it’s possible)

.left[ The aim of a genome project is to sequence the entire target genome for a wide range of genomics applications. ]

.left[ Analyses, reanalyses and integration of genomic and other phenotype information are required: ]

.left[ Warning: the costs of data storage, maintenance, transfer, and analysis are likely to be significant and will represent an increasing proportion of overall sequencing costs in the future. ]


Genome information: Genome size

.pull-left[ How to collect information?

.pull-right[ .image-100[ variation in estimated genome sizes in base pairs ]]

.footnote[https://commons.wikimedia.org/w/index.php?curid=19537795]


Genome information: GC content

.pull-left[ Why?

.left[ Sequencing is not random! GC-rich and AT-rich regions are under-represented. ]

How to solve?

.pull-right[ .image-100[ Sequencing coverage by GC content ]]

.footnote[Chaisson et al. Genetic variation and the de novo assembly of human genomes. Nat Rev Genet 16, 627–640 (2015).]
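To check whether a sequence contains GC-rich or AT-rich regions, GC content can be computed over fixed-size windows. This is a minimal sketch; the function names and the non-overlapping window scheme are assumptions made for illustration, and real analyses typically use much larger windows.

```python
def gc_content(seq):
    """Fraction of G and C among the called bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0  # no called bases (for example, a window made only of N)
    return (seq.count("G") + seq.count("C")) / called


def windowed_gc(seq, window=4):
    """GC content of consecutive, non-overlapping windows of length `window`."""
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, window)
    ]


print(windowed_gc("ATATGCGCATGC", window=4))  # [0.0, 1.0, 0.5]
```

Windows with extreme values are the regions where uneven sequencing coverage is most likely to show up.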


Genome information: Ploidy level

.pull-right[ .image-55[ Heterozygous genotype ]]

.pull-left[

Ploidy (N):

Number of sets of chromosomes in a cell

| Organism | Ploidy |
|---|---|
| Bacteria | 1N |
| Human, mouse, rat | 2N |
| Amphibians (Xenopus) | 2N to 12N |
| Plants (wheat) | 6N |
| Autopolyploid | . |
| Hybrids | . |

]

Higher ploidy -> harder to assemble => increase the sequencing depth

.footnote[Daniel Hartl. Essential Genetics: A Genomics Perspective. Jones & Bartlett Learning. p. 177. ISBN 978-0-7637-7364-9. (2011).]


Genome information: Heterozygosity level

.pull-left[ .left[ Heterozygous: a locus-specific property. A diploid (2N) organism is heterozygous at a locus when it carries two different alleles of a particular gene at that locus ]]

.pull-right[ .image-100[ Heterozygous genotype ]]

.left[ Heterozygosity is a metric that indicates the probability that an individual is heterozygous at a particular locus ]

Higher heterozygosity -> harder to assemble => increase the sequencing depth

.footnote[https://www.genome.gov/genetics-glossary/heterozygous]


Genome information: Heterozygosity level

.image-125[ Concepts in phased assemblies ]

.footnote[Heng Li’s blog: lh3.github.io/2021/04/17/concepts-in-phased-assemblies]


Genome information: Complexity, aka repeat elements

.left[ It is impossible to resolve repeats of length L unless you have reads longer than L ]

Most common source of assembly errors:

.pull-left[ .image-65[ Collapsed consensus from repeat copies ]]

.pull-right[ .image-65[ Collapsed, excision and rearrangement consensus ]]
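The rule of thumb above (a repeat of length L can only be resolved by reads longer than L) can be written as a simple check. The repeat names and lengths below are hypothetical values chosen for illustration, and the check ignores the extra flanking sequence a read needs in practice to anchor uniquely on both sides.

```python
def resolvable_repeats(repeat_lengths, read_length):
    """Map each repeat name to True if a read of `read_length` can span it."""
    return {
        name: read_length > length
        for name, length in repeat_lengths.items()
    }


# Hypothetical repeat families (lengths in bp), for illustration only.
repeats = {"repeat_A": 300, "repeat_B": 6_000, "repeat_C": 15_000}

short_reads = resolvable_repeats(repeats, read_length=150)
long_reads = resolvable_repeats(repeats, read_length=10_000)
print(short_reads)
print(long_reads)
```

With these toy numbers, 150 bp reads span none of the repeats, while 10 kb reads span the two shorter ones but still not the 15 kb repeat.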


Genome information: Others


Genome information: Tips

.pull-left[


The best possible DNA

.left[ Select the best possible DNA source and extraction method. The extraction of high-quality DNA is the most important aspect of a successful genome project

The lack of a good starting material will limit the choice of sequencing technology and affect the quality of data obtained ]


The best possible DNA: Chemical purity of DNA

.left[ Sample-related contaminants:

All these contaminants can affect the efficiency of library preparation, regardless of the technology, and this is especially true for PCR-free libraries (PacBio and ONT) ]


The best possible DNA: Quantity of DNA

.left[ Different technologies require different amounts of DNA:


The best possible DNA: Structural integrity of DNA

.left[ High Molecular Weight (HMW) for Nanopore/PacBio (obtained mainly from fresh material) ]


The best possible DNA: Tips

.left[


Appropriate sequencing technology

.left[ This mainly depends on the quantity and quality of DNA as well as the cost of the experiment but many parameters need to be considered before performing an NGS experiment:


Appropriate sequencing technology: Assembly

.left[


Appropriate sequencing technology: Scaffolding

.left[


Appropriate sequencing technology

.image-100[ Several sequencing technologies ]

.footnote[Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: the teenage years. Nat Rev Genetics 20, 631–656 (2019).]


Appropriate sequencing technology: Short vs long reads

.pull-left[ Short-read platforms: highest sequencing depth but shorter reads ]

.pull-rigth[ .image-40[ Reads accuracy distribution ]]

.pull-left[ Long-read platforms: longer reads but lower sequencing depth ]

.pull-rigth[ .image-40[ Reads accuracy distribution ]]

.footnote[Kanzi, A. M. et al. Next Generation Sequencing and Bioinformatics Analysis of Family Genetic Inheritance. Frontiers Genetics 11, 544162 (2020).]


Appropriate sequencing technology: Short vs long reads

.pull-left[ Read accuracy differs depending on the sequencing technology:

.pull-rigth[ .image-40[ Reads accuracy distribution ]]


Appropriate sequencing technology: Coverage versus depth

.left[ Coverage in terms of redundancy

Coverage in terms of the percentage coverage of a reference by reads

Intuitively, increasing the sequencing depth should increase both types of coverage. ]
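One way to make this intuition concrete is the Lander-Waterman idealisation, which is not part of the original slide. Assuming reads of equal length are sampled uniformly at random from the genome (an assumption that real data violate, as the GC bias figure shows), the redundancy is `N * L / G` and the expected fraction of the genome covered is `1 - exp(-redundancy)`. The function name and the example numbers are illustrative.

```python
import math


def expected_coverage(num_reads, read_length, genome_size):
    """Expected redundancy and covered fraction under uniform random sampling.

    This is an idealised model: it ignores GC bias, repeats and read errors.
    """
    redundancy = num_reads * read_length / genome_size
    fraction_covered = 1 - math.exp(-redundancy)
    return redundancy, fraction_covered


# Hypothetical 5 Mb genome sequenced with 150 bp reads.
for num_reads in (10_000, 100_000, 1_000_000):
    redundancy, fraction = expected_coverage(num_reads, 150, 5_000_000)
    print(f"{num_reads:>9} reads: {redundancy:5.1f}x, {100 * fraction:6.2f}% covered")
```

Under this model both quantities grow with the number of reads, but the covered fraction saturates quickly while the redundancy keeps increasing linearly.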

.image-40[ Sequencing coverage by GC content ]

.footnote[Chaisson et al. Genetic variation and the de novo assembly of human genomes. Nat Rev Genet 16, 627–640 (2015).]


Computational resources and requirements

.left[ To be successful, you must have sufficient computing resources (CPUs, RAM, walltime and storage).


Typical sequencing strategies: Bacterial genomes

.left[


Typical sequencing strategies: Larger genomes

.left[


Bioinformatics data formats

.left[ FASTA: a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes. ]

.image-100[ Fasta format description ] Image licensed CC-BY 4.0 Hosseini et al. 2016

.footnote[Hosseini, M., Pratas, D. & Pinho, A. J. A Survey on Data Compression Methods for Biological Sequences. Information 7, 56 (2016).]
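The structure of a FASTA file (a `>` header line followed by one or more sequence lines) can be illustrated with a minimal reader. This is a sketch for small in-memory examples, with a hypothetical function name; established libraries are a better choice for real data.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record name: sequence}.

    The record name is the first word after '>'; sequence lines that are
    wrapped over several lines are joined together.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {rec_name: "".join(parts) for rec_name, parts in records.items()}


example = """>seq1 a nucleotide sequence
ACGTACGT
TTGA
>seq2 a protein sequence
MKV
"""
sequences = parse_fasta(example)
print(sequences)
```

The same format holds both nucleotide and protein sequences; nothing in the file itself states which alphabet is used.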


Bioinformatics data formats

.left[ FASTQ: a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores (Phred). For brevity, each sequence letter and each quality score is encoded with a single ASCII character. It’s the standard sequencing output for Illumina and MGI sequencers. ]

.image-100[ Fastq format description ] Image licensed CC-BY 4.0 Hosseini et al. 2016

.footnote[Hosseini, M., Pratas, D. & Pinho, A. J. A Survey on Data Compression Methods for Biological Sequences. Information 7, 56 (2016).]
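The single-character quality encoding can be decoded in a few lines. This sketch assumes the common Phred+33 offset (the ASCII code of the character minus 33); older files may use a different offset, so the `offset` parameter is an assumption to verify for your data. The record shown is a made-up example.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(char) - offset for char in quality_string]


def error_probability(phred_score):
    """Probability that a base call is wrong: P = 10^(-Q / 10)."""
    return 10 ** (-phred_score / 10)


# A made-up FASTQ record: header, sequence, separator, quality string.
record = "@read1\nGATTACA\n+\nIIIIII#\n"
header, sequence, separator, quality = record.strip().split("\n")

scores = phred_scores(quality)
print(scores)  # [40, 40, 40, 40, 40, 40, 2]
print([round(error_probability(q), 4) for q in scores])
```

Here `I` decodes to Q40 (an expected error rate of 1 in 10,000), whereas `#` decodes to Q2, which marks a very unreliable base call.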


Bioinformatics data formats

.left[ FAST5: the standard sequencing output for Oxford Nanopore sequencers. It is based on the Hierarchical Data Format (HDF5), which enables storage of large and complex data. In contrast to FASTA and FASTQ files, a FAST5 file is binary and cannot be opened with a normal text editor. Data stored in nanopore FAST5 files can contain the sequence of a read in FASTQ format (after basecalling), the raw signal of the pore, as well as several log files and other information. ]

.image-100[ Interactive view of a Fast5 with HDFview ]


Bioinformatics data formats

SAM (Sequence Alignment Map): a text-based format, originally developed by Heng Li, Bob Handsaker et al., for storing biological sequences aligned to a reference sequence.

BAM (Binary Alignment Map): the lossless, compressed binary representation of the SAM format, often used to hold the comprehensive raw data of genome sequencing. It’s the standard sequencing output for PacBio sequencers.

CRAM (Compressed Reference-oriented Alignment Map): a compressed columnar file format for storing biological sequences aligned to a reference sequence.

.pull-left[ .image[ SAM format description ]]

.pull-right[ Image licensed CC-BY 4.0 Hosseini et al. 2016 ]

.footnote[Hosseini, M., Pratas, D. & Pinho, A. J. A Survey on Data Compression Methods for Biological Sequences. Information 7, 56 (2016).]
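Each alignment line in a SAM file is tab-separated and starts with 11 mandatory fields, optionally followed by tags. The following minimal sketch splits one line into those fields and decodes two bits of the FLAG field; the function name and the example line are made up for illustration, and header lines starting with `@` are not handled.

```python
SAM_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Parse one SAM alignment line into a dict of its mandatory fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INTEGER_FIELDS else value
    record["TAGS"] = columns[len(SAM_FIELDS):]        # optional fields
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)
    record["is_reverse"] = bool(record["FLAG"] & 0x10)
    return record


line = "read1\t16\tchr1\t100\t60\t7M\t*\t0\t0\tGATTACA\tIIIIIII\tNM:i:0"
alignment = parse_sam_line(line)
print(alignment["RNAME"], alignment["POS"], alignment["is_reverse"])
```

BAM and CRAM store the same records in binary form, so they need a dedicated library or tool rather than plain text parsing.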


Key Points

Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors! This material is licensed under the Creative Commons Attribution 4.0 International License.

References

  1. Hosseini, M., D. Pratas, and A. Pinho, 2016 A Survey on Data Compression Methods for Biological Sequences. Information 7: 56. 10.3390/info7040056

Funding

These individuals or organisations provided funding support for the development of this resource

This project (2020-1-NL01-KA203-064717) is funded with the support of the Erasmus+ programme of the European Union. Their funding has supported a large number of tutorials within the GTN across a wide array of topics.