Decontamination of a genome assembly
Author(s) | Delphine Lariviere |
Reviewers |
OverviewQuestions:Objectives:
How to remove contaminant sequences from your assembly?
Requirements:
Remove contaminant sequences from an assembly
Identify contaminant species
Remove Mitochondrial DNA from an assembly
Time estimation: 1 hour 30 minutesSupporting Materials:Published: Sep 4, 2024Last modification: Nov 7, 2024License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MITpurl PURL: https://gxy.io/GTN:T00452rating Rating: 3.5 (4 recent ratings, 5 all time)version Revision: 3
When sequencing a genome, it is common that contamination from a foreign organism get mixed with the genomic material of our species of interest. For instance, if you are processing a whole body sample of an insect then you will sequence not only the insect, but everything on and inside of it. When building a reference genome, it is important to separate these contaminants from the genome of our species. Foreign DNA sequences could cause false positive identification when running BLAST analyses, the misidentification of genes that don’t actually belong to the species, or they can be incorporated as ‘reference’ sequence for that species in the public archives when those sequences did not actually belong to that species.
In this tutorial, we will learn how to separate these contaminants from a genome assembly of a vertebrate. It will also include the removal of mitogenome contigs so that the final genome assembly contains only nuclear DNA. (A separate tutorial exists for assembling the mitogenome.)
AgendaIn this tutorial, we will cover:
Get data
For this tutorial, we are using the genome assembly of the zebra finch (Taeniopygia guttata) generated by the Vertebrate Genome Project. We downsized the assembly to 50 of the largest scaffolds to keep the runtime reasonable, and added some contaminants for the sake of this tutorial.
Comment: Run this analysis on a "real" assemblyIf you want to run this analysis on a real assembly generated by the Vertebrate Genome Project, you can find the scaffolded assembly on Genome Ark as a remote repository and upload it to Galaxy (available on the three main Public Galaxy instances: .org, .eu, .org.au).
As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a Choose remote files:
- Click on Upload Data on the top of the left panel
Click on Choose remote files and scroll down to find your data folder or type the folder name in the
search box
on the top.
- Look for your data under:
Genome Ark -> species -> Taeniopygia_guttata -> bTaeGut2 -> assembly_vgp_hic_2.0 -> bTaeGut2.hic.hap1.s2.fasta
- click on OK
- Click on Start
- Click on Close
- You can find the dataset has begun loading in you history.
The analysis described in this tutorial will work with any assembly in FASTA format. If your initial dataset is compressed, use the tool Convert compressed file to uncompressed.
first.
Hands-on: Data Upload
- Create a new history for this tutorial
Import the files from Zenodo:
https://zenodo.org/records/13367433/files/Contaminated_assembly.fasta
- Copy the link location
Click galaxy-upload Upload Data at the top of the tool panel
- Select galaxy-wf-edit Paste/Fetch Data
Paste the link(s) into the text field
Press Start
- Close the window
Rename the dataset :
Contaminated Assembly
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, change the Name field
- Click the Save button
Check that the datatype is
fasta
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, click galaxy-chart-select-data Datatypes tab on the top
- In the galaxy-chart-select-data Assign Datatype, select
fasta
from “New type” dropdown
- Tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
Genome Masking
Genome masking refers to hiding regions of low complexity in the genome, such as repeats. Masking a genome reduces the noise and improves the efficacy of the decontamination by emphasizing the regions that contribute the most to the classification (Saini et al. 2016, Caetano-Anolles).
The first step of our analysis is therefore to mask our genome assembly. There are two types of masking, depending on the requirements of the tools we are using:
- Soft Masking The masked regions are replaced by lower case letters. This is the prefered method of masking when the tools understand it, because it does not cause any loss of information.
- Hard Masking The masked regions are replaced by a wildcard letter:
N
s for nucleotide sequences,X
s for proteins. Use this technique if you are unsure whether the tools you will be using knows to ignore lower case letters.
Repeat masking is important for a lot of analyses like sequence alignments, classification, gene annotation. Repetitive and low complexity regions cause problems for search and clustering algorithms based on patterns. For gene annotation, repeats can cause non-specific gene hits.
In this tutorial we will use Kraken2
to classify our contaminant scaffolds, Blastn
to identify the mitochondrial sequences, and DustMasker
to mask our genome. Dustmasker
generates soft masked sequences, while Kraken2
and Blast
needs hard masked sequences, so we will need to do some file manipulations along the way.
Before running the analysis, ensure the sequences in your assembly are uppercase. This will ensure the soft masking can be done properly. If your sequences are not uppercase, then run the optional step “Convert the assembly file to upper case”. We will then convert the lower case letter to N
s to hard mask the sequences for Kraken2
and Blast
.
Because soft masking uses lowercase to denote masked base pairs, the input sequence should not have lowercase sequences because then you cannot tell afterwards whether a sequence is lowercase because it started that way or because it was soft masked.
- Text transformation ( Galaxy version 1.1.1) with the following parameters:
- param-file “File to process”:
Contaminated assembly
- “SED Program”:
s/^[^>].+/\U&/g
(Convert letters in the sequence to uppercase)
Hands-on: Mask the Genome with Dustmasker
- NCBI BLAST+ dustmasker ( Galaxy version 2.14.1+galaxy1) with the following parameters:
- “Subject database/sequences”:
FASTA file from your history
- param-file “Nucleotide FASTA subject file to use instead of a database”:
Contaminated assembly
(or the output of Text transformation tool if you converted your sequences to upper case)- “DUST level”:
40
- “Output format”:
FASTA
- Rename your file
Soft Masked assembly
Take a peek at your genome assembly. You can see that the beginning of the first scaffold has been converted to lower case, which masks repeats corresponding to the telomeres in the sequences.
Hands-on: Soft Masking to Hard Masking
- Text transformation ( Galaxy version 1.1.1) with the following parameters:
- param-file “File to process”:
Soft Masked assembly
(output of NCBI BLAST+ dustmasker tool)- “SED Program”:
/^>/!y/atcgn/NNNNN/
: In sequence lines (not starting with “>”), replace each instance of lower case bases (a t c or g), including undefined ones (n), toN
s).Rename your file
Hard Masked assembly
Regular expressions are a standardized way of describing patterns in textual data. They can be extremely useful for tasks such as finding and replacing data. They can be a bit tricky to master, but learning even just a few of the basics can help you get the most out of Galaxy.
Finding
Below are just a few examples of basic expressions:
Regular expression Matches abc
an occurrence of abc
within your data(abc|def)
abc
ordef
[abc]
a single character which is either a
,b
, orc
[^abc]
a character that is NOT a
,b
, norc
[a-z]
any lowercase letter [a-zA-Z]
any letter (upper or lower case) [0-9]
numbers 0-9 \d
any digit (same as [0-9]
)\D
any non-digit character \w
any alphanumeric character \W
any non-alphanumeric character \s
any whitespace \S
any non-whitespace character .
any character \.
{x,y}
between x and y repetitions ^
the beginning of the line $
the end of the line Note: you see that characters such as
*
,?
,.
,+
etc have a special meaning in a regular expression. If you want to match on those characters, you can escape them with a backslash. So\?
matches the question mark character exactly.Examples
Regular expression matches \d{4}
4 digits (e.g. a year) chr\d{1,2}
chr
followed by 1 or 2 digits.*abc$
anything with abc
at the end of the line^$
empty line ^>.*
Line starting with >
(e.g. Fasta header)^[^>].*
Line not starting with >
(e.g. Fasta sequence)Replacing
Sometimes you need to capture the exact value you matched on, in order to use it in your replacement, we do this using capture groups
(...)
, which we can refer to using\1
,\2
etc for the first and second captured values. If you want to refer to the whole match, use&
.
Regular expression Input Captures chr(\d{1,2})
chr14
\1 = 14
(\d{2}) July (\d{4})
24 July 1984 \1 = 24
,\2 = 1984
An expression like
s/find/replacement/g
indicates a replacement expression, this will search (s
) for any occurrence offind
, and replace it withreplacement
. It will do this globally (g
) which means it doesn’t stop after the first match.Example:
s/chr(\d{1,2})/CHR\1/g
will replacechr14
withCHR14
etc.You can also use replacement modifier such as convert to lower case
\L
or upper case\U
. Example:s/.*/\U&/g
will convert the whole text to upper case.Note: In Galaxy, you are often asked to provide the find and replacement expressions separately, so you don’t have to use the
s/../../g
structure.There is a lot more you can do with regular expressions, and there are a few different flavours in different tools/programming languages, but these are the most important basics that will already allow you to do many of the tasks you might need in your analysis.
Tip: RegexOne is a nice interactive tutorial to learn the basics of regular expressions.
Tip: Regex101.com is a great resource for interactively testing and constructing your regular expressions, it even provides an explanation of a regular expression if you provide one.
Tip: Cyrilex is a visual regular expression tester.
Comment: Unsure about your regex?Ask ChatGPT about it. E.g. What is the sed command to convert a text to upper case? or test with a regex tester website (Warning, some of these testers do not understand the lowercase and uppercase transformations)
Take a look at your dataset and see that the lower case bases have been converted to N
s.
Identify Non-Target Contaminants
Kraken2 is a tool for taxonomic classification. It matches the kmers found in our assemblies to the ones stored in its databases. For the purpose of decontamination, we only need to identify possible contaminants, not the sequences belonging to our sampled organisms. Therefore, we only need to use a database containing likely contaminants such as bacteria, virus, and human (always wear gloves, folks!). The best option at the time of this tutorial’s writing is the PlusPF database which contains the RefSeq Standard (archaea, bacteria, viral, plasmid, human, UniVec_Core) plus protozoa and fungi data.
Hands-on: Identify the taxonomy of our sequences
- Kraken2 ( Galaxy version 2.1.1+galaxy1) with the following parameters:
- “Single or paired reads”:
Single
- param-file “Input sequences”:
Hard Masked assembly
(output of Text transformation tool)- “Print scientific names instead of just taxids”:
Yes
- “Confidence”:
0.3
- “Split classified and unclassified outputs?”:
Yes
- “Select a Kraken2 database”:
PlusPF (2021-05-17)
Kraken provides us with three outputs:
- A table containing the classification information for each sequence
- A fasta file containing the classified sequences (our contaminants)
- A fasta file containing the unclassified sequences (our precious zebra finch)
We could decide to use the fasta file of unclassified sequences as our assembly, but remember that our genome is hard masked. That means that we lost information. Because we don’t use it now doesn’t mean we never will. So instead of using the masked genome, we will remove the classified sequences from our original assembly.
But first, we are naturally curious people, and we want to know what type of contaminants we had in our sample.
Investigate our contaminants
Take a look at the classification output of Kraken2. You can see five columns containing:
C
/U
: a one letter code indicating that the sequence was either classified or unclassified.- The sequence ID.
- The taxonomy ID Kraken2 used to label the sequence; this is 0 if the sequence is unclassified.
- The length of the sequence in bp.
- A space-delimited list indicating the LCA mapping of each k-mer in the sequence(s). For example, “562:13 561:4 A:31 0:1” would indicate that:
- the first 13 k-mers mapped to taxonomy ID #562
- the next 4 k-mers mapped to taxonomy ID #561
- the next 31 k-mers contained an ambiguous nucleotide
- the next k-mer was not in the database
To identify our contaminants, we only need the information from the first three columns, and only for the sequences that are classified. So we will start by extracting our three columns and then filtering out unclassified sequences.
Hands-on: Extract classification information
- Cut with the following parameters:
- “Cut columns”:
c1,c2,c3
- “Delimited by”:
Tab
- param-file “From”:
Kraken2 on data X: Classification
(output of Kraken2 tool)Comment: Columns definitionsThe three columns that we extracted contain:
- Whether the sequence is classified
C
or unclassifiedU
- The sequence name
- The sequence classification
Comment: What if my organism is in the database?The filtering of contaminants is based on their presence in the database. Any sequence classified by Kraken2 is considered as a contaminant. If you are working on an organism that is present in the database, this filtering will not work.
If you can’t find a database that include contaminants but exclude your organism, you will need to process the outputs of Kraken differently. Instead of filtering the sequences on classified/unclassified, select the sequences that do not match the taxonomic ID of your species. That is true for the rest of the section.
Hands-on: Remove unclassified sequences
- Filter with the following parameters:
- param-file “Filter”:
Cut on data X
(output of Cut tool)- “With following condition”:
c1!='U'
- Rename the output
Contaminants informations
QuestionHow many different contaminants are in our sample?
We found two contaminants in our sample: Salmonella and Yersinia phage phiR1-37.
Get the list of contaminants to remove from our assembly
In order to remove the contaminant sequences from our assembly, we need a list of the contaminants names, that we will get by extracting the contaminant IDs from the Kraken output.
Hands-on: Get contaminant sequence names
- Cut with the following parameters:
- “Cut columns”:
c2
- “Delimited by”:
Tab
- param-file “From”:
Contaminants informations
(output of Filter tool)- Rename the output
List of Contaminants
Now that we have our list of contaminants, we will go through a similar process to extract mitochondrial sequences.
Identify mitochondrial DNA
We start with running Blastn against a database containing mitochondrial sequences.
Hands-on: Blastn against RefSeq Mitochondrion
- NCBI BLAST+ blastn ( Galaxy version 2.14.1+galaxy1) with the following parameters:
- param-file “Nucleotide query sequence(s)”:
Hard Masked assembly
(output of Text transformation tool)- “Subject database/sequences”:
Locally installed BLAST database
- “Nucleotide BLAST database”:
RefSeq Mitochondrion
- “Type of BLAST”:
blastn - Traditional BLASTN requiring an exact match of 11, for somewhat similar sequences
- “Output format”:
Tabular (select which columns)
- “Standard columns”:
qseqid, sseqid, length, qstart, qend, evalue
- “Extended columns”:
qlen
- “Miscellaneous columns”:
qcovs, qcovhsp
- “Advanced Options”:
Hide Advanced Options
This step may take a little while, so you have time to make yourself a cup of coffee.
Next, we parse the result of the Blastn to extract a report on the alignements between our assemblies and the database, and a list of the scaffolds aligning with mitochondrial sequences.
Hands-on: Extract alignement informations from the Blast output
- Parse mitochondrial blast ( Galaxy version 1.0.2+galaxy0) with the following parameters:
- param-file “Tabular file generated by mito-blast”:
blastn Text transformation on data X vs 'refseq_mitochondrion'
(output of NCBI BLAST+ blastn tool)- Rename the output
List of Mitochondrial sequences
QuestionHow many scaffolds belonging to the mitochondria are in our sample?
We found two mitochondrial scaffolds in our assembly.
Remove contaminants and mitochondrial DNA from the assembly
From the previous sections, we obtain two files containing a list of sequences belonging to contaminants and one belonging to the mitochondria. We will merge these two lists to clean up our original assembly.
Hands-on: Concatenate lists of scaffolds to remove
- Concatenate datasets ( Galaxy version 0.1.1) with the following parameters:
- param-files “Datasets to concatenate”:
List of Contaminants
,List of Mitochondrial sequences
- Rename the output
Sequences to remove
Now we will use gfastats, a tool box for genome assemblies with capability for manipulations and statistics generation.
Hands-on: Remove scaffolds from the original genome assembly
- gfastats ( Galaxy version 1.3.6+galaxy0) with the following parameters:
- param-file “Input file”:
output
(Input dataset)- “Specify target sequences”:
Enabled
- param-file “Exclude specific intervals”:
Sequences to remove
(output of Concatenate datasets tool). This input can be a bed file with sequence coordinates, or, in our case, a simple list of sequence IDs to remove from the fasta file.- “Tool mode”:
Genome assembly manipulation
- “Output format”:
FASTA.gz
Conclusion
In this tutorial we used taxonomic classification to identified contaminants and Blast to identify mitochondrial sequences in a genome assembly. We then removed these sequences to keep only the nuclear DNA of our species of interest.
Now that we have removed the contaminants and mitochondrial DNA from our assembly, it is ready for manual curation!