Screening assembled genomes for contamination using NCBI FCS

Author(s): Eric Tvedte
Editor(s): Helena Rasche
Overview
Creative Commons License: CC-BY
Questions:
  • Are the sequences in a genome assembly contaminant-free?

Objectives:
  • Learn how to screen a genome assembly for adaptor and vector contamination.

  • Learn how to screen a genome assembly for non-host organism contamination.

Requirements:
Time estimation: 90 minutes
Supporting Materials:
Published: Apr 16, 2024
Last modification: Apr 16, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00439
Revision: 2

The National Center for Biotechnology Information (NCBI) performs contamination screening of genome assemblies submitted to the GenBank archive. Advances in genome sequencing have accelerated the production of genome assemblies and their submission to public databases, necessitating high-performance screening tools. Since contaminants can lead to misleading conclusions about the biology of the organism in question (e.g. its gene content or gene evolution), contamination screening should ideally be performed after the initial contig assembly and before downstream genome analyses.

NCBI has released a publicly-available Foreign Contamination Screen (FCS) tool suite to detect contaminants from various sources and produce a cleaned sequence set. This tutorial provides a quick example of two current FCS tools: FCS-adaptor identifies synthetic sequences used in library preparation, and FCS-GX (Astashyn et al. 2024) identifies sequences from foreign organisms assigned to discordant taxonomies compared to the user-declared source organism.

Agenda

In this tutorial, we will cover:

  1. Retrieving the data
    1. Upload the genome FASTA from Zenodo
    2. Confirm dataset upload
  2. Importing Workflows
  3. Contamination screening
    1. Running the NCBI FCS Workflow
    2. Interpreting FCS Output
  4. Conclusion

Retrieving the data

FCS operates on assembled genome sequences and is not intended for use on raw reads. The following tutorial uses an assembled genome from yeast (Saccharomyces cerevisiae) with contaminants artificially inserted into the genome. The first step is to retrieve the genome FASTA.

Upload the genome FASTA from Zenodo

The following steps provide instructions to upload the test dataset into your Galaxy instance.

Hands-on: Data Upload
  1. Create a new history for this tutorial

    To create a new history simply click the new-history icon at the top of the history panel:

    UI for creating new history

  2. Copy the dataset URL to the clipboard by clicking the copy button in the upper-right corner of the box below.
    https://zenodo.org/records/10932013/files/FCS_combo_test.fa
    
  3. Upload the FASTA dataset into Galaxy

    • Copy the link location
    • Click Upload Data at the top of the tool panel

    • Select Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    1. Go into Data (top panel) then Data libraries
    2. Navigate to the correct folder as indicated by your instructor.
      • On most Galaxies tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.
    3. Select the desired files
    4. Click on Add to History near the top and select as Datasets from the dropdown menu
    5. In the pop-up window, choose

      • “Select history”: the history you want to import the data to (or create a new one)
    6. Click on Import

Confirm dataset upload

Your data should look like this:

Galaxy History with FCS.combo.test.fa showing in green, successfully uploaded.

Figure 1: Galaxy History with Uploaded Data.
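
If you also keep a local copy of the assembly, a quick check outside Galaxy can confirm that it is an assembled FASTA (plain or gzipped) rather than raw reads. The following is a minimal sketch, not part of the FCS tools; the function names `open_text` and `summarize_fasta` are illustrative, and the only format assumption is that FASTA headers start with `>`.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def summarize_fasta(path):
    """Return (number of records, total bases) for a FASTA file.

    Raises ValueError when the first non-empty line does not start
    with '>', which is the case for a FASTQ file of raw reads.
    """
    n_records = 0
    n_bases = 0
    seen_first = False
    with open_text(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if not seen_first:
                seen_first = True
                if not line.startswith(">"):
                    raise ValueError(f"{path} does not look like FASTA")
            if line.startswith(">"):
                n_records += 1
            else:
                n_bases += len(line)
    return n_records, n_bases
```

For example, `summarize_fasta("FCS_combo_test.fa")` would return the record count and total sequence length of a downloaded copy of the test file.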

Importing Workflows

Next we will import a Galaxy workflow, a chain of tools that performs a set of operations on a user-supplied input. Specifically, this workflow will:

  1. Screen the genome for foreign organism sequences using FCS-GX screen mode.
  2. Produce a cleaned set of contigs using FCS-GX clean mode.
  3. Screen the genome for synthetic sequences using FCS-adaptor and remove identified contaminants.
  4. Produce a final set of cleaned contigs using a second iteration of FCS-GX clean mode.
The workflow pictured has five steps: input data is provided to FCS-GX screen, and both of these are inputs to FCS-GX clean. This output is fed to FCS-adaptor screen, and both of these outputs are fed to a second FCS-GX clean step.

Figure 2: NCBI Foreign Contamination Screen Galaxy Workflow.
Hands-on: Importing Galaxy Workflows
  1. Ensure you are logged in to Galaxy

  2. Import the workflow into Galaxy

    Hands-on: Importing and launching a GTN workflow
    Import the NCBI Foreign Contamination Screen Workflow as follows:
    • Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
    • Click on Import at the top-right of the screen
    • Paste the following URL into the box labelled “Archived Workflow URL”: https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/ncbi-fcs/workflows/NCBI-Foreign-Contamination-Screen.ga
    • Click the Import workflow button

    Below is a short video demonstrating how to import a workflow from GitHub using this procedure:

    Video: Importing a workflow from URL

    Warning: Log in to Galaxy

    If the workflow failed to import, it is usually because you are not logged in.

Once the workflow is loaded, your workflow page should look like this:

Galaxy Workflows page with NCBI Foreign Contamination Screen workflow loaded and ready to run.

Figure 3: Galaxy Workflows page with NCBI Foreign Contamination Screen.

Contamination screening

Running the NCBI FCS Workflow

Next you will run the Galaxy workflow. Here you will configure the parameters for the FCS run according to the target organism.

Hands-on: Running the NCBI FCS Workflow
  1. Collect inputs
    1. Assembled genome in fasta/fasta.gz format
    2. Taxonomic information for source organism
  2. Launch the Workflow
    1. Click the Workflow menu located in the top menu bar
    2. Click the workflow-run button located in the NCBI Foreign Contamination Screen Workflow box
    3. Click “Expand to full workflow form.”
    4. In the 1: Input genome menu:
      • “Input file (Fasta file)”: 1:FCS.combo.test.fa
    5. In the 2: FCS-GX screen menu:
      • Select the appropriate taxonomic division for “Taxonomy entry: div” (fung:budding yeasts in this example).
      • Set “Advanced Options: Database location” to /cvmfs/data.galaxyproject.org/byhand/ncbi_fcs_gx/all
    6. In the 4: FCS-adaptor screen menu:
      • Select the appropriate taxonomy for “Choose the taxonomy” (Eukaryotes in this example).
    7. Run the workflow
    8. Take a coffee break!

Interpreting FCS Output

Running NCBI FCS on Galaxy depends on loading a large reference database on an adequate host. Currently the database is not kept loaded between runs, so run times may vary. Most runs should complete in around one hour.

After the workflow is completed, you will be able to visualize tables of identified contaminants and can access FASTA files of clean sequences separated from contaminants.
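
If you download the FCS-GX action report from your history, its contents can also be tallied outside Galaxy. The sketch below is illustrative only and rests on assumptions about the file layout that you should check against your own report: the file is tab-separated, the first row begins with `##` followed by JSON metadata, and a header row beginning with `#` names columns that include `action` and `div`. The function names and the toy rows are made up for this example.

```python
import json
from collections import Counter


def find_key(obj, key):
    """Depth-first search for `key` in nested dicts/lists; None if absent."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = list(obj.values())
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def summarize_action_report(text):
    """Summarize an action report given as one string.

    Assumed layout (verify on your own file): '##' + JSON metadata row,
    '#'-prefixed header row, then tab-separated data rows.
    """
    meta, header, rows = None, None, []
    for line in text.splitlines():
        if line.startswith("##"):
            if meta is None:
                meta = json.loads(line[2:])
        elif line.startswith("#"):
            header = line[1:].split("\t")
        elif line.strip():
            if header is None:
                raise ValueError("data row found before the column header")
            rows.append(dict(zip(header, line.split("\t"))))
    return {
        "asserted_div": find_key(meta, "asserted-div"),
        "inferred_primary_divs": find_key(meta, "inferred-primary-divs"),
        "actions": Counter(r.get("action") for r in rows),
        "divs": Counter(r.get("div") for r in rows),
    }


# Toy report with invented rows, only to show the call.
toy_report = (
    '##[["toy report"], {"run-info": {"asserted-div": "fung:budding yeasts", '
    '"inferred-primary-divs": ["fung:budding yeasts"]}}]\n'
    "#seq_id\tseq_len\taction\tdiv\n"
    "c1\t100\tEXCLUDE\tdiv_a\n"
    "c2\t200\tEXCLUDE\tdiv_b\n"
)
summary = summarize_action_report(toy_report)
print(summary["asserted_div"], dict(summary["actions"]))
```

Searching the metadata recursively for `asserted-div` avoids depending on exactly where that key is nested in the JSON.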

Hands-on: Reviewing FCS Contamination Reports
  1. Look at the FCS-GX contamination report

    1. In your Galaxy history, click the eye icon for 3: NCBI FCS GX on data 1: Action report
    2. Confirm that the taxonomic division you specified for the workflow run appears in the metadata in the first row of the file. It should appear as the “asserted-div” and in the list of “inferred-primary-divs”.

      Warning: My organism's division is not in inferred-primary-divs

      If your target organism division is not in the set of inferred-primary-divs, it means that the assembly is likely heavily contaminated. FCS-GX will still call contaminants with respect to the taxonomy you define as the target organism (i.e. the primary division), but you should review the sequencing and assembly data for errors.

    3. Review the contaminated sequence list, the suggested contamination cleanup actions, and the taxonomic divisions assigned to contaminants
    Question
    1. What are the major contaminants in this genome assembly?
    2. When was this contamination likely introduced in the genome assembly process?
    Solution
    1. Homo sapiens (human) and Pseudomonas aeruginosa. Note that for various reasons the actual contaminating genus/species isn’t reported as the top_tax_name.
    2. Contamination was likely introduced at the sample/library preparation stage. Pseudomonas aeruginosa can be found in multiple human tissues, including skin flora.
  2. Look at the FCS-GX cleaned sequences

    1. Click the box in your workflow history - 1: FCS.combo.test.fa. Note the total number of sequences.
    2. Click the box in your workflow history - 4: NCBI FCS GX on data 3 and data 1: Fasta for EXCLUDE entries. Note the total number of sequences.
    3. Click the box in your workflow history - 5: NCBI FCS GX on data 3 and data 1: Cleaned Fasta. Note the total number of sequences as well as Applied actions.
    4. Confirm that Cleaned Fasta sequences (history item 5) = Total sequences (history item 1) - Contaminant sequences (history item 4)
  3. Look at the FCS-adaptor contamination report

    1. In your Galaxy history, click the eye icon for 6: NCBI FCS Adaptor on data 5: Adaptor report
    2. Review the contaminated sequence list, the suggested contamination cleanup actions, and the identity of adaptor/vector contaminants. Adaptor/vector sequences can be found on the NCBI UniVec FTP page.
    Question
    1. What are the adaptor/vector contaminants in this genome assembly?
    2. When was this contamination likely introduced in the genome assembly process?
    Solution
    1. Illumina PCR Primer sequence.
    2. Contamination was likely introduced during contig assembly or polishing. Often the initial contig assembly is generated using long reads alone, so the introduction of contamination via polishing with untrimmed reads is the more likely explanation.
  4. Look at the FCS-adaptor cleaned sequences

    1. Click the box in your workflow history - 5: NCBI FCS GX on data 3 and data 1: Cleaned Fasta. Note the total number of sequences.
    2. Click the box in your workflow history - 8: NCBI FCS GX on data 6 and data 5: Fasta for EXCLUDE entries. Note the total number of sequences as well as Applied actions.
    3. Click the box in your workflow history - 9: NCBI FCS GX on data 6 and data 5: Cleaned Fasta. Note the total number of sequences.
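
The sequence-count checks in the steps above can be sketched as a small calculation. This is a minimal illustration on toy data, not the tutorial assembly; `count_records` and `expected_clean_count` are hypothetical helper names. It assumes that every excluded sequence is dropped whole and that each contig split at an internal site yields exactly one extra record.

```python
def count_records(fasta_text):
    """Count FASTA records: one per line that starts with '>'."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def expected_clean_count(total, excluded, split_contigs=0):
    """Expected number of records in a cleaned FASTA.

    Assumes each excluded sequence is removed whole and each contig in
    `split_contigs` is cut once, producing one additional record.
    """
    return total - excluded + split_contigs


# Toy data: three input contigs, one of them excluded.
input_fa = ">c1\nACGT\n>c2\nGGCC\n>c3\nTTAA\n"
exclude_fa = ">c2\nGGCC\n"
cleaned_fa = ">c1\nACGT\n>c3\nTTAA\n"

total = count_records(input_fa)
excluded = count_records(exclude_fa)
assert count_records(cleaned_fa) == expected_clean_count(total, excluded)
```

With `split_contigs` left at 0 this is the identity from the FCS-GX cleaned sequences step. For a clean step that splits contigs at internal sites, pass the number of split contigs so the expected count rises accordingly.
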
Warning: Two different cleaned FASTA files from FCS-adaptor step

FCS-adaptor only removes sequences that are completely or mostly adaptor (assigned ACTION_EXCLUDE) and adaptors near contig ends (assigned ACTION_TRIM). FCS-adaptor does not remove small internal adaptor spans in large sequences. In this example, the FASTA file in workflow history - 7: NCBI FCS Adaptor on data 5: Cleaned Fasta is therefore the same as the uncleaned input FASTA. FCS-GX clean is required to handle these internal spans. The FASTA file in workflow history - 9: NCBI FCS GX on data 6 and data 5: Cleaned Fasta has each of the contigs seq_1 through seq_16 split into two separate contigs at the ACTION_TRIM sites. If you want to hardmask these regions instead, you must download the tabular adaptor report, convert ACTION_TRIM values to FIX, and run the FCS-GX clean tool separately.
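
The conversion of ACTION_TRIM values to FIX in a downloaded adaptor report can be done in a text editor or with a few lines of code. The sketch below is one possible way to do it, assuming only that the report is tab-separated and that comment or header lines start with `#`. It replaces any field that is exactly `ACTION_TRIM`, so it does not depend on which column holds the action. The function name and the toy rows (including their column names) are invented for illustration.

```python
def trim_to_fix(report_text):
    """Return report text with every ACTION_TRIM field replaced by FIX.

    Lines starting with '#' are copied unchanged. Only fields that are
    exactly 'ACTION_TRIM' are rewritten; other actions are left as-is.
    """
    out = []
    for line in report_text.splitlines():
        if line.startswith("#"):
            out.append(line)
            continue
        fields = line.split("\t")
        out.append("\t".join("FIX" if f == "ACTION_TRIM" else f for f in fields))
    return "\n".join(out) + ("\n" if out else "")


# Toy report with invented values, only to show the call.
toy_adaptor_report = (
    "#accession\tlength\taction\trange\tname\n"
    "seq_1\t5000\tACTION_TRIM\t2001..2034\ttoy adaptor\n"
    "seq_x\t40\tACTION_EXCLUDE\t\ttoy adaptor\n"
)
print(trim_to_fix(toy_adaptor_report))
```

The edited text can then be saved and uploaded to Galaxy as the report input for a separate FCS-GX clean run.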

Conclusion

In this example, we removed 200 contaminant sequences from non-target organisms using FCS-GX and removed 16 internal adaptor sequences using FCS-adaptor, producing a cleaned yeast genome. Following contamination detection, we can perform separate validation checks to support the removal of these sequences from our assembly. One example is to perform BLAST searches or NUCmer alignments against reference genomes of the same or closely related species.