Screening assembled genomes for contamination using NCBI FCS

Author(s)	Eric Tvedte
Editor(s)	Helena Rasche

Overview
Questions:

Are the sequences in a genome assembly contaminant-free?

Objectives:

Learn how to screen a genome assembly for adaptor and vector contamination.

Learn how to screen a genome assembly for non-host organism contamination.

Requirements:

Introduction to Galaxy Analyses

Time estimation: 1 hour 30 minutes

Supporting Materials:

Datasets

Workflows

FAQs

Available on these Galaxies

Known Working

UseGalaxy.org (Main) ✅ ⭐️

GalaxyTrakr ✅

Published: Apr 16, 2024

Last modification: Apr 16, 2024

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

PURL: https://gxy.io/GTN:T00439

Rating: 4.0 (1 recent ratings, 1 all time)

Revision: 2

The National Center for Biotechnology Information (NCBI) performs contamination screening of genome assemblies submitted to the archival GenBank. Advances in genome sequencing have accelerated the production of genome assemblies and their submission to public databases, necessitating high-performance screening tools. Since contaminants can lead to misleading conclusions about the biology of the organism in question (e.g. gene content, gene evolution, etc.), ideally contamination screening should be performed after the initial contig assembly and prior to downstream genome analyses.

NCBI has released a publicly-available Foreign Contamination Screen (FCS) tool suite to detect contaminants from various sources and produce a cleaned sequence set. This tutorial provides a quick example of two current FCS tools: FCS-adaptor identifies synthetic sequences used in library preparation, and FCS-GX (Astashyn et al. 2024) identifies sequences from foreign organisms assigned to discordant taxonomies compared to the user-declared source organism.

Agenda

In this tutorial, we will cover:

Retrieving the data

Upload the genome FASTA from Zenodo

Confirm dataset upload

Importing Workflows

Contamination screening

Running the NCBI FCS Workflow

Interpreting FCS Output

Conclusion

Retrieving the data

FCS operates on assembled genome sequences and is not intended for use on raw reads. The following tutorial uses an assembled genome from yeast (Saccharomyces cerevisiae) with contaminants artificially inserted into the genome. The first step is to retrieve the genome FASTA.

Upload the genome FASTA from Zenodo

The following steps provide instructions to upload the test dataset into your Galaxy instance.

Hands On: Data Upload
Create a new history for this tutorial

To create a new history, click the new-history icon at the top of the history panel.
Copy the dataset URL to your clipboard:
https://zenodo.org/records/10932013/files/FCS_combo_test.fa
Upload the FASTA dataset into Galaxy

Copy the link location

Click Upload Data at the top of the tool panel

Select Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window

As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

Go into Libraries (left panel)

Navigate to the correct folder as indicated by your instructor.

On most Galaxies, tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.

Select the desired files

Click on Add to History near the top and select as Datasets from the dropdown menu

In the pop-up window, choose

“Select history”: the history you want to import the data to (or create a new one)

Click on Import

Confirm dataset upload

Your data should look like this:

Figure 1: Galaxy History with Uploaded Data.
Image description: Galaxy History with FCS.combo.test.fa showing in green, successfully uploaded.

Importing Workflows

Next we will import a Galaxy workflow: a chain of tools that performs a set of operations on a user-supplied input. Specifically, in this workflow we will:

Screen the genome for foreign organism sequences using FCS-GX screen mode.
Produce a cleaned set of contigs using FCS-GX clean mode.
Screen the genome for synthetic sequences using FCS-adaptor and remove identified contaminants.
Produce a final set of cleaned contigs using a second iteration of FCS-GX clean mode.

Figure 2: NCBI Foreign Contamination Screen Galaxy Workflow.
Image description: The workflow pictured has five steps. The input data is provided to FCS-GX screen, and both of these are inputs to FCS-GX clean. This output is fed to FCS-adaptor screen, and both of these outputs are fed to FCS-adaptor clean.
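The chain of steps listed above can also be inspected programmatically once the workflow file has been downloaded. The following is a minimal sketch, assuming the usual Galaxy `.ga` export layout (a JSON object with a `steps` mapping whose entries carry `name` and `input_connections`). The `SAMPLE_WORKFLOW` dictionary is a hypothetical, heavily trimmed stand-in that mirrors the four listed operations; its step names and input labels are illustrative and are not copied from the real workflow file.

```python
import json


def summarize_workflow(ga):
    """Return one line per step: step index, step name, and upstream step indices."""
    lines = []
    for key in sorted(ga["steps"], key=int):
        step = ga["steps"][key]
        upstream = set()
        for conn in step.get("input_connections", {}).values():
            # A connection is a dict, or a list of dicts for multi-input ports.
            conns = conn if isinstance(conn, list) else [conn]
            upstream.update(c["id"] for c in conns)
        sources = ", ".join(str(i) for i in sorted(upstream)) or "none"
        lines.append(f"{key}: {step.get('name', '?')} <- {sources}")
    return lines


def load_workflow(path):
    """Read a downloaded .ga file (plain JSON) from disk."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical example data: names and input labels are assumptions.
SAMPLE_WORKFLOW = {
    "name": "NCBI Foreign Contamination Screen (illustrative)",
    "steps": {
        "0": {"id": 0, "name": "Input genome", "input_connections": {}},
        "1": {"id": 1, "name": "FCS-GX screen",
              "input_connections": {"fasta": {"id": 0, "output_name": "output"}}},
        "2": {"id": 2, "name": "FCS-GX clean",
              "input_connections": {"fasta": {"id": 0, "output_name": "output"},
                                    "report": {"id": 1, "output_name": "report"}}},
        "3": {"id": 3, "name": "FCS-adaptor screen",
              "input_connections": {"fasta": {"id": 2, "output_name": "clean"}}},
        "4": {"id": 4, "name": "FCS-GX clean (second pass)",
              "input_connections": {"fasta": {"id": 2, "output_name": "clean"},
                                    "report": {"id": 3, "output_name": "report"}}},
    },
}

if __name__ == "__main__":
    for line in summarize_workflow(SAMPLE_WORKFLOW):
        print(line)
```

To apply this to the real workflow, download the `.ga` file and pass the result of `load_workflow("path/to/file.ga")` to `summarize_workflow`; the actual step names and connections may differ from this sketch.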

Hands On: Importing Galaxy Workflows

Ensure you are logged in to Galaxy

Import the workflow into Galaxy

Hands On: Importing and launching a GTN workflow

Launch the NCBI Foreign Contamination Screen Workflow (View on GitHub, Download workflow).

Click on Workflows in the Galaxy activity bar (on the left side of the screen, or in the top menu bar of older Galaxy instances). You will see a list of all your workflows.

Click on Import at the top-right of the screen

Paste the following URL into the box labelled “Archived Workflow URL”: https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/ncbi-fcs/workflows/NCBI-Foreign-Contamination-Screen.ga

Click the Import workflow button

Below is a short video demonstrating how to import a workflow from GitHub using this procedure:

Video: Importing a workflow from URL

Warning: Log in to Galaxy

If the workflow failed to import, it is usually because you are not logged in.

Once the workflow is loaded, your workflow page should look like this:

Figure 3: Galaxy Workflows page with NCBI Foreign Contamination Screen.
Image description: Galaxy Workflows page with NCBI Foreign Contamination Screen workflow loaded and ready to run.

Contamination screening

Running the NCBI FCS Workflow

Next you will run the Galaxy workflow. Here you will configure the parameters for the FCS run according to the target organism.

Hands On: Running the NCBI FCS Workflow

Collect inputs

Assembled genome in fasta/fasta.gz format

Taxonomic information for source organism

Launch the Workflow

Click the Workflow menu located in the top menu bar

Click the workflow-run button located in the NCBI Foreign Contamination Screen Workflow box

Click “Expand to full workflow form.”

In the 1: Input genome menu:

“Input file (Fasta file)”: 1: FCS.combo.test.fa

In the 2: FCS-GX screen menu:

Select the appropriate taxonomic division for Taxonomy entry: div (fung:budding yeasts in this example).

Set the Advanced Options: Database location to /cvmfs/data.galaxyproject.org/byhand/ncbi_fcs_gx/all

In the 4: FCS-adaptor screen menu:

Select the appropriate taxonomy for Choose the taxonomy (Eukaryotes in this example).

Run the workflow

Take a coffee break!

Interpreting FCS Output

Running NCBI FCS on Galaxy depends on loading a large reference database on an adequate host. The database is currently not persistent, so run times may vary. Most runs should complete in around one hour.

After the workflow is completed, you will be able to visualize tables of identified contaminants and can access FASTA files of clean sequences separated from contaminants.

Hands On: Reviewing FCS Contamination Reports

Look at the FCS-GX contamination report

In your Galaxy history, click the eye icon for 3: NCBI FCS GX on data 1: Action report

Confirm that the taxonomic division you specified for the workflow run appears in the metadata in the first row of the file. It should appear as the “asserted-div” and in the list of “inferred-primary-divs”.

Warning: My organism's division is not in inferred-primary-divs

If your target organism division is not in the set of inferred-primary-divs, it means that the assembly is likely heavily contaminated. FCS-GX will still call contaminants with respect to the taxonomy you define as the target organism (i.e. the primary division), but you should review the sequencing and assembly data for errors.

Review the contaminated sequence list, the suggested contamination cleanup actions, and the taxonomic divisions assigned to contaminants

Question

What are the major contaminants in this genome assembly?

When was this contamination likely introduced in the genome assembly process?

Solution

Homo sapiens (human) and Pseudomonas aeruginosa. Note that for various reasons the actual contaminating genus/species isn’t reported as the top_tax_name.

Contamination was likely introduced at the sample/library preparation stage. Pseudomonas aeruginosa can be found in multiple human tissues, including skin flora.
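The review steps above (checking the asserted division against the inferred primary divisions, then tallying the suggested actions per taxonomic division) can also be scripted on a downloaded copy of the action report. The following is a rough sketch under stated assumptions: the first line is taken to be a `##`-prefixed JSON metadata record containing `asserted-div` and `inferred-primary-divs` somewhere in its structure, and the second line a `#`-prefixed tab-separated header that includes `action` and `div` columns. The `SAMPLE_REPORT` rows are invented for illustration and are not output from the tutorial dataset.

```python
import json
from collections import Counter


def find_key(obj, key):
    """Depth-first search for the first occurrence of `key` in nested JSON data."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = list(obj.values())
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def summarize_action_report(text):
    """Return (asserted division, whether it was inferred, Counter of (action, div))."""
    lines = text.splitlines()
    meta = json.loads(lines[0].lstrip("#"))      # assumed: metadata on row one
    asserted = find_key(meta, "asserted-div")
    inferred = find_key(meta, "inferred-primary-divs") or []
    header = lines[1].lstrip("#").split("\t")    # assumed: column names on row two
    tally = Counter()
    for line in lines[2:]:
        if not line.strip():
            continue
        row = dict(zip(header, line.split("\t")))
        tally[(row["action"], row["div"])] += 1
    return asserted, asserted in inferred, tally


# Hypothetical report content: column names, divisions, and values are assumptions.
SAMPLE_REPORT = (
    '##[["FCS genome report", 2, 1], {"run-info": {"asserted-div": '
    '"fung:budding yeasts", "inferred-primary-divs": ["fung:budding yeasts"]}}]\n'
    "#seq_id\tstart_pos\tend_pos\tseq_len\taction\tdiv\tagg_cont_cov\ttop_tax_name\n"
    "contig_a\t1\t5000\t5000\tEXCLUDE\tanml:primates\t99\texample taxon\n"
    "contig_b\t1\t4200\t4200\tEXCLUDE\tanml:primates\t97\texample taxon\n"
    "contig_c\t1\t3100\t3100\tEXCLUDE\tprok:g-proteobacteria\t98\texample taxon\n"
)

if __name__ == "__main__":
    asserted, is_inferred, tally = summarize_action_report(SAMPLE_REPORT)
    print("asserted division:", asserted, "| in inferred-primary-divs:", is_inferred)
    for (action, div), count in sorted(tally.items()):
        print(action, div, count)
```

If the asserted division is reported as not being among the inferred primary divisions, the warning above applies and the sequencing and assembly data should be reviewed before trusting the cleaned output.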

Look at the FCS-GX cleaned sequences

Click the box in your workflow history - 1: FCS.combo.test.fa. Note the total number of sequences.

Click the box in your workflow history - 4: NCBI FCS GX on data 3 and data 1: Fasta for EXCLUDE entries. Note the total number of sequences.

Click the box in your workflow history - 5: NCBI FCS GX on data 3 and data 1: Cleaned Fasta. Note the total number of sequences as well as Applied actions.

Confirm that Cleaned Fasta sequences (Step 5) = Total sequences (Step 1) - Contaminant sequences (Step 4)
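The same bookkeeping can be checked outside Galaxy on downloaded copies of the three FASTA files. The sketch below counts records by counting header lines that start with `>` and then tests the identity stated above. The file names in the usage comment are hypothetical local names, not names produced by Galaxy.

```python
import gzip


def count_records(lines):
    """Count FASTA records in an iterable of text lines (one '>' header per record)."""
    return sum(1 for line in lines if line.startswith(">"))


def count_fasta_file(path):
    """Count records in a FASTA file, transparently handling .gz compression."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_records(handle)


def counts_are_consistent(total, excluded, cleaned):
    """True when cleaned == total - excluded, as expected for whole-sequence removal."""
    return cleaned == total - excluded


# Example usage with hypothetical local file names:
#   total = count_fasta_file("input.fa")
#   excluded = count_fasta_file("fcs_gx_exclude.fa")
#   cleaned = count_fasta_file("fcs_gx_clean.fa")
#   print(counts_are_consistent(total, excluded, cleaned))
```

This identity is only expected to hold when contaminants are removed as whole sequences. When a cleaning step splits or trims contigs, the record counts can change for other reasons.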

Look at the FCS-adaptor contamination report

In your Galaxy history, click the eye icon for 6: NCBI FCS Adaptor on data 5: Adaptor report

Review the contaminated sequence list, the suggested contamination cleanup actions, and the identity of adaptor/vector contaminants. Adaptor/vector sequences can be found on the NCBI UniVec FTP page.

Question

What are the adaptor/vector contaminants in this genome assembly?

When was this contamination likely introduced in the genome assembly process?

Solution

Illumina PCR Primer sequence.

Contamination was likely introduced during contig assembly or polishing. Often the initial contig assembly is generated using long reads alone, so the introduction of contamination via polishing with untrimmed reads is the more likely explanation.

Look at the FCS-adaptor cleaned sequences

Click the box in your workflow history - 5: NCBI FCS GX on data 3 and data 1: Cleaned Fasta. Note the total number of sequences.

Click the box in your workflow history - 8: NCBI FCS GX on data 6 and data 5: Fasta for EXCLUDE entries. Note the total number of sequences as well as Applied actions.

Click the box in your workflow history - 9: NCBI FCS GX on data 6 and data 5: Cleaned Fasta. Note the total number of sequences.

Warning: Two different cleaned FASTA files from FCS-adaptor step

FCS-adaptor removes only sequences that are completely or mostly adaptor (assigned ACTION_EXCLUDE) and adaptors near contig ends (assigned ACTION_TRIM). It does not remove short internal adaptor spans within large sequences. In this example, the FASTA file in workflow history - 7: NCBI FCS Adaptor on data 5: Cleaned Fasta is therefore the same as the uncleaned input FASTA, and FCS-GX clean is required to handle these internal spans. The FASTA file in workflow history - 9: NCBI FCS GX on data 6 and data 5: Cleaned Fasta has each of the contigs seq_1 through seq_16 split into two separate contigs at the ACTION_TRIM sites. If you want to hardmask these regions instead, you must download the tabular adaptor report, convert ACTION_TRIM values to FIX, and run the FCS-GX clean tool separately.
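The conversion of ACTION_TRIM values to FIX described above can be done in a text editor, or with a few lines of code on the downloaded report. The following is a minimal sketch, assuming the adaptor report is tab-separated with `#`-prefixed header lines. It replaces any field that is exactly `ACTION_TRIM`, so it does not depend on which column holds the action. The column names and rows in `SAMPLE_REPORT` are illustrative assumptions, not actual tool output.

```python
def trim_to_fix(report_text):
    """Rewrite ACTION_TRIM fields as FIX, leaving headers and other rows untouched."""
    out = []
    for line in report_text.splitlines():
        if line.startswith("#"):
            out.append(line)  # header/comment lines are copied unchanged
            continue
        fields = line.split("\t")
        fields = ["FIX" if field == "ACTION_TRIM" else field for field in fields]
        out.append("\t".join(fields))
    return "\n".join(out) + "\n"


# Hypothetical report content: column layout and values are assumptions.
SAMPLE_REPORT = (
    "#accession\tlength\taction\trange\tname\n"
    "seq_1\t12000\tACTION_TRIM\t6001..6034\texample adaptor\n"
    "seq_x\t40\tACTION_EXCLUDE\t\texample adaptor\n"
)

if __name__ == "__main__":
    print(trim_to_fix(SAMPLE_REPORT), end="")
```

The edited report would then be uploaded back to Galaxy and supplied to a separate run of the FCS-GX clean tool, as the warning describes.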

Conclusion

In this example, we removed 200 contaminant sequences from non-target organisms using FCS-GX and removed 16 internal adaptor sequences using FCS-adaptor, producing a cleaned yeast genome. Following contamination detection, we can perform separate validation checks to support the removal of these sequences from our assembly. One example is to perform BLAST searches or NUCmer alignments against reference genomes of the same or closely related species.

You've Finished the Tutorial

Key points

Contamination can arise from multiple sources and occur at multiple stages of a genome assembly project.

Contamination cleanup is essential for reliable comparative genomics analyses.

Frequently Asked Questions

Have questions about this tutorial? Have a look at the available FAQ pages and support channels

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

References

Astashyn, A., E. S. Tvedte, D. Sweeney, V. Sapojnikov, N. Bouk et al., 2024 Rapid and sensitive detection of genome contamination at scale with FCS-GX. Genome Biology 25: 10.1186/s13059-024-03198-7

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Eric Tvedte, Screening assembled genomes for contamination using NCBI FCS (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/ncbi-fcs/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{sequence-analysis-ncbi-fcs,
	author = "Eric Tvedte",
	title = "Screening assembled genomes for contamination using NCBI FCS (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/ncbi-fcs/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Funding

These individuals or organisations provided funding support for the development of this resource

NIH

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/sequence-analysis/tutorials/ncbi-fcs/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: ncbi_fcs_gx
  owner: iuc
  revisions: 49f8eae39606
  tool_panel_section_label: FASTA/FASTQ
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: ncbi_fcs_adaptor
  owner: richard-burhans
  revisions: d55ad8533d58
  tool_panel_section_label: Sequence Contamination Filtering
  tool_shed_url: https://toolshed.g2.bx.psu.edu/


February 2025

4 stars: Liked: The tutorial is easy to follow and 100% understandable. Disliked: Dear Galaxy Team, First and foremost, thanks for this wonderful instrument in general and for the implementation of NCBI FCS pipeline, which turned out to be tricky to install locally for me for some reason. I have a suggestion for the NCBI FCS Adaptor tool. I've been getting this error and finally figured out what's wrong: I had to shorten the names to below 50 symbols, otherwise it crashed with a pretty vague error. I might have missed something, but in fact I could only guessed about 50 symbols because I gave up and tried to submit my assembly to NCBI as is, and there I read about this 50-symbol name length threshold. Some, my suggestion would be to mention this pre-requisite when talking about input file format (ideally both in the tutorial and in the tool dialog), so that people do not get this frustrating experience when everything works with example data and nothing works with real data. Thanks again!