Screening assembled genomes for contamination using NCBI FCS
Author(s) | Eric Tvedte |
Editor(s) | Helena Rasche |
Reviewers |
Overview
Questions:
- Are the sequences in a genome assembly contaminant-free?
Objectives:
- Learn how to screen a genome assembly for adaptor and vector contamination.
- Learn how to screen a genome assembly for non-host organism contamination.
Time estimation: 90 minutes
Published: Apr 16, 2024
Last modification: Apr 16, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00439
Revision: 2
The National Center for Biotechnology Information (NCBI) performs contamination screening of genome assemblies submitted to the archival GenBank. Advances in genome sequencing have accelerated the production of genome assemblies and their submission to public databases, necessitating high-performance screening tools. Since contaminants can lead to misleading conclusions about the biology of the organism in question (e.g. gene content, gene evolution, etc.), ideally contamination screening should be performed after the initial contig assembly and prior to downstream genome analyses.
NCBI has released a publicly-available Foreign Contamination Screen (FCS) tool suite to detect contaminants from various sources and produce a cleaned sequence set. This tutorial provides a quick example of two current FCS tools: FCS-adaptor identifies synthetic sequences used in library preparation, and FCS-GX (Astashyn et al. 2024) identifies sequences from foreign organisms assigned to discordant taxonomies compared to the user-declared source organism.
Retrieving the data
FCS operates on assembled genome sequences and is not intended for use on raw reads. The following tutorial uses an assembled genome from yeast (Saccharomyces cerevisiae) with contaminants artificially inserted into the genome. The first step is to retrieve the genome FASTA.
Upload the genome FASTA from Zenodo
The following steps provide instructions to upload the test dataset into your Galaxy instance.
Hands-on: Data Upload
Create a new history for this tutorial
To create a new history, click the new-history icon at the top of the history panel.
- Copy the dataset URL to the clipboard by clicking the copy button in the upper-right corner of the box below.
https://zenodo.org/records/10932013/files/FCS_combo_test.fa
Upload the fasta.gz dataset into Galaxy
- Copy the link location
Click Upload Data at the top of the tool panel
- Select Paste/Fetch Data
Paste the link(s) into the text field
Press Start
- Close the window
As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:
- Go into Data (top panel) then Data libraries
- Navigate to the correct folder as indicated by your instructor.
- On most Galaxy servers, tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.
- Select the desired files
- Click on Add to History near the top and select as Datasets from the dropdown menu
In the pop-up window, choose
- “Select history”: the history you want to import the data to (or create a new one)
- Click on Import
Confirm dataset upload
Your data should look like this:
Importing Workflows
Next we will import a Galaxy workflow: a chain of tools that performs a set of operations on a user-supplied input. Specifically, in this workflow we will:
- Screen the genome for foreign organism sequences using FCS-GX screen mode.
- Produce a cleaned set of contigs using FCS-GX clean mode.
- Screen the genome for synthetic sequences using FCS-adaptor and remove identified contaminants.
- Produce a final set of cleaned contigs using a second iteration of FCS-GX clean mode.
Hands-on: Importing Galaxy Workflows
Ensure you are logged in to Galaxy
Import the workflow into Galaxy
Hands-on: Importing and launching a GTN workflow
- Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
- Click on Import at the top-right of the screen
- Paste the following URL into the box labelled “Archived Workflow URL”:
https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/ncbi-fcs/workflows/NCBI-Foreign-Contamination-Screen.ga
- Click the Import workflow button
Warning: Log in to Galaxy
If the workflow failed to import, it is usually because you are not logged in.
Once the workflow is loaded, your workflow page should look like this:
Contamination screening
Running the NCBI FCS Workflow
Next you will run the Galaxy workflow. Here you will configure the parameters for the FCS run according to the target organism.
Hands-on: Running the NCBI FCS Workflow
- Collect inputs
- Assembled genome in fasta/fasta.gz format
- Taxonomic information for source organism
- Launch the Workflow
- Click the Workflow menu located in the top menu bar
- Click the workflow-run button located in the NCBI Foreign Contamination Screen Workflow box
- Click “Expand to full workflow form.”
- In the 1: Input genome menu:
- “Input file (Fasta file)”: 1:FCS.combo.test.fa
- In the 2: FCS-GX screen menu:
- Select the appropriate taxonomic division for Taxonomy entry: div (fung:budding yeasts in this example).
- Set the Advanced Options: Database location to /cvmfs/data.galaxyproject.org/byhand/ncbi_fcs_gx/all
- In the 4: FCS-adaptor screen menu:
- Select the appropriate taxonomy for Choose the taxonomy (Eukaryotes in this example).
- Run the workflow
- Take a coffee break!
Interpreting FCS Output
Running NCBI FCS on Galaxy depends on loading a large reference database on a host with adequate resources. Currently the database is not persistent, so run times may vary. Most runs should complete in around one hour.
After the workflow has completed, you can view tables of identified contaminants and access FASTA files of clean sequences separated from contaminants.
Hands-on: Reviewing FCS Contamination Reports
Look at the FCS-GX contamination report
- In your Galaxy history, click the galaxy-eye icon for 3: NCBI FCS GX on data 1: Action report
Confirm that the taxonomic division you specified for the workflow run appears in the metadata in the first row of the file. It should appear as the “asserted-div” and in the list of “inferred-primary-divs”.
Warning: My organism's division is not in inferred-primary-divs
If your target organism division is not in the set of inferred-primary-divs, it means that the assembly is likely heavily contaminated. FCS-GX will still call contaminants with respect to the taxonomy you define as the target organism (i.e. the primary division), but you should review the sequencing and assembly data for errors.
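The same check can be scripted on a downloaded copy of the action report. The following is a minimal sketch, assuming the first line of the report is a `#`-prefixed JSON structure that contains the keys asserted-div and inferred-primary-divs somewhere inside it. The exact nesting is not described in this tutorial, so the helper searches for the keys recursively, and the sample header line is made up for illustration.

```python
import json


def find_key(obj, key):
    """Return the first value stored under `key` in nested dicts/lists, else None."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = list(obj.values())
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def division_is_inferred(header_line):
    """True if the asserted division appears among the inferred primary divisions."""
    # Assumption: the metadata row is JSON once the leading '#' characters are removed.
    metadata = json.loads(header_line.strip().lstrip("#"))
    asserted = find_key(metadata, "asserted-div")
    inferred = find_key(metadata, "inferred-primary-divs") or []
    return asserted is not None and asserted in inferred


# Hypothetical header line; a real report carries more metadata fields.
sample_header = (
    '##[["FCS genome report", 2, 1], {"run-info": {'
    '"asserted-div": "fung:budding yeasts", '
    '"inferred-primary-divs": ["fung:budding yeasts"]}}]'
)

print(division_is_inferred(sample_header))
```

If the function returns False for your own report, treat it the same way as the warning above: review the sequencing and assembly data before trusting the cleanup actions.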
- Review the contaminated sequence list, the suggested contamination cleanup actions, and the taxonomic divisions assigned to contaminants
Question
- What are the major contaminants in this genome assembly?
- When was this contamination likely introduced in the genome assembly process?
Solution
- Homo sapiens (human) and Pseudomonas aeruginosa. Note that for various reasons the actual contaminating genus/species isn’t reported as the top_tax_name.
- Contamination was likely introduced at the sample/library preparation stage. Pseudomonas aeruginosa can be found in multiple human tissues, including skin flora.
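For a report with many rows, tallying actions and divisions can be quicker than reading the table by eye. This is a sketch only: it assumes the report is tab-delimited, that lines starting with `##` are metadata, and that a single `#`-prefixed line names the columns, including ones called action and div. Those column names and every row in the sample are assumptions for illustration, not taken from the tutorial dataset.

```python
from collections import Counter


def summarize_report(lines):
    """Count cleanup actions and taxonomic divisions in a tab-delimited report."""
    header = None
    actions = Counter()
    divisions = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank line or metadata row
        fields = line.split("\t")
        if line.startswith("#"):
            # Assumed column-name row, e.g. "#seq_id<TAB>start_pos<TAB>..."
            header = [name.lstrip("#") for name in fields]
            continue
        if header is None:
            raise ValueError("no column header line found before the data rows")
        record = dict(zip(header, fields))
        actions[record["action"]] += 1
        divisions[record["div"]] += 1
    return actions, divisions


# Made-up report content for illustration only.
sample_report = [
    '##[["FCS genome report", 2, 1], {}]\n',
    "#seq_id\tstart_pos\tend_pos\tseq_len\taction\tdiv\tagg_cont_cov\ttop_tax_name\n",
    "contig_a\t1\t5000\t5000\tEXCLUDE\tprok:g-proteobacteria\t98\tPseudomonas\n",
    "contig_b\t1\t8000\t8000\tEXCLUDE\tanml:primates\t95\tHomo sapiens\n",
    "contig_c\t1\t3000\t3000\tEXCLUDE\tanml:primates\t99\tHomo sapiens\n",
]

actions, divisions = summarize_report(sample_report)
for name, count in divisions.most_common():
    print(name, count)
```

To use it on a real report, pass an open file handle instead of the sample list, for example `summarize_report(open("report.txt"))`, where the file name is whatever you chose when downloading the dataset.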
Look at the FCS-GX cleaned sequences
- Click the box in your workflow history - 1: FCS.combo.test.fa. Note the total number of sequences.
- Click the box in your workflow history - 4: NCBI FCS GX on data 3 and data 1: Fasta for EXCLUDE entries. Note the total number of sequences.
- Click the box in your workflow history - 5: NCBI FCS GX on data 3 and data 1: Cleaned Fasta. Note the total number of sequences as well as Applied actions.
- Confirm that Cleaned Fasta sequences (Step 5) = Total sequences (Step 1) - Contaminant sequences (Step 4)
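The same arithmetic can be checked outside Galaxy on downloaded copies of the three datasets. Below is a minimal sketch that counts FASTA header lines in plain or gzipped files and tests the equality. The file names and tiny sequences are placeholders. Note that the equality only holds when every applied action removes a whole sequence, since splitting a contig would change the count.

```python
import gzip
import tempfile
from pathlib import Path


def count_fasta_records(path):
    """Count header lines (starting with '>') in a plain or gzipped FASTA file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def counts_are_consistent(total, excluded, cleaned):
    """Cleaned sequences should equal total sequences minus excluded sequences."""
    return cleaned == total - excluded


# Tiny made-up files standing in for the datasets from steps 1, 4 and 5.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "input.fa").write_text(">a\nACGT\n>b\nGGCC\n>c\nTTAA\n")
    (tmp / "excluded.fa").write_text(">b\nGGCC\n")
    (tmp / "cleaned.fa").write_text(">a\nACGT\n>c\nTTAA\n")

    total = count_fasta_records(tmp / "input.fa")
    excluded = count_fasta_records(tmp / "excluded.fa")
    cleaned = count_fasta_records(tmp / "cleaned.fa")
    print(total, excluded, cleaned, counts_are_consistent(total, excluded, cleaned))
```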
Look at the FCS-adaptor contamination report
- In your Galaxy history, click the galaxy-eye icon for 6: NCBI FCS Adaptor on data 5: Adaptor report
- Review the contaminated sequence list, the suggested contamination cleanup actions, and the identity of adaptor/vector contaminants. Adaptor/vector sequences can be found on the NCBI UniVec FTP page.
Question
- What are the adaptor/vector contaminants in this genome assembly?
- When was this contamination likely introduced in the genome assembly process?
Solution
- Illumina PCR Primer sequence.
- Contamination was likely introduced during contig assembly or polishing. Often the initial contig assembly is generated using long reads alone, so the introduction of contamination via polishing with untrimmed reads is the more likely explanation.
Look at the FCS-adaptor cleaned sequences
- Click the box in your workflow history - 5: NCBI FCS GX on data 3 and data 1: Cleaned Fasta. Note the total number of sequences.
- Click the box in your workflow history - 8: NCBI FCS GX on data 6 and data 5: Fasta for EXCLUDE entries. Note the total number of sequences as well as Applied actions.
- Click the box in your workflow history - 9: NCBI FCS GX on data 6 and data 5: Cleaned Fasta. Note the total number of sequences.
Warning: Two different cleaned FASTA files from FCS-adaptor step
FCS-adaptor only removes sequences that are completely/mostly adaptor (assigned ACTION_EXCLUDE) and adaptors near contig ends (assigned ACTION_TRIM). FCS-adaptor does not remove small internal trims in large sequences. In this example, the FASTA file in workflow history - 7: NCBI FCS Adaptor on data 5: Cleaned Fasta is the same as the uncleaned input FASTA. FCS-GX clean is required to handle these internal spans. The FASTA file in workflow history - 9: NCBI FCS GX on data 6 and data 5: Cleaned Fasta has the contigs seq_1 through seq_16 each split into two separate contigs at the ACTION_TRIM sites. If you want to hardmask these regions instead, you must download the tabular adaptor report, convert ACTION_TRIM values to FIX, and run the FCS-GX clean tool separately.
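The conversion of ACTION_TRIM values to FIX can be done in a spreadsheet or text editor, or with a few lines of Python. The sketch below assumes only that the downloaded adaptor report is tab-delimited and that comment lines start with `#`. It replaces any field that is exactly ACTION_TRIM, so it does not depend on which column holds the action. The column names and rows in the sample are hypothetical, and whether FCS-GX clean accepts the edited report without further changes is not verified here.

```python
def trim_to_fix(lines):
    """Yield report lines with every ACTION_TRIM field replaced by FIX."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header/comment lines untouched
            continue
        fields = line.rstrip("\n").split("\t")
        fields = ["FIX" if field == "ACTION_TRIM" else field for field in fields]
        yield "\t".join(fields) + "\n"


# Made-up adaptor report for illustration only.
sample_adaptor_report = [
    "#accession\tlength\taction\trange\tname\n",
    "seq_1\t10000\tACTION_TRIM\t4000..4057\texample_adaptor\n",
    "seq_x\t300\tACTION_EXCLUDE\t\texample_adaptor\n",
]

converted = list(trim_to_fix(sample_adaptor_report))
print("".join(converted), end="")
```

To convert a real file, read from one handle and write the yielded lines to another, keeping the original report so the change can be reviewed. ACTION_EXCLUDE rows are left as they are.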
Conclusion
In this example, we removed 200 contaminant sequences from non-target organisms using FCS-GX and removed 16 internal adaptor sequences using FCS-adaptor, producing a cleaned yeast genome. Following contamination detection, we can perform separate validation checks to support the removal of these sequences from our assembly. One example is to perform BLAST searches or NUCmer alignments against reference genomes of the same or closely related species.