Data submission using ENA upload Tool

Overview
Creative Commons License: CC-BY
Questions:
  • How to prepare sequences for submission to ENA?

  • How to upload raw sequences to ENA?

Objectives:
  • Manage sequencing files (ab1, FASTQ, FASTA, FASTQ.GZ)

  • Clean sequences in an automated and reproducible manner

  • Perform alignments for each sequence

  • Have the necessary sequence format to submit to ENA

  • Submit raw reads to ENA using the ENA upload Tool

Requirements:
Time estimation: 2 hours
Supporting Materials:
Published: Dec 13, 2024
Last modification: Dec 13, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
Revision: 1

This tutorial guides you through the steps needed to manage and prepare sequencing files (ab1, FASTQ, FASTA) for submission to the European Nucleotide Archive (ENA). The workflow starts from raw sequences in AB1 format and ends with files ready to be integrated into the ENA database. We will convert the files into FASTQ format, run quality control and cleaning, and then convert the cleaned sequences into FASTA format. We will also align the sequences against the NCBI database to check their accuracy. You will then need to fill in a metadata Excel template to use the ENA upload Tool. The workflow is made of 17 Galaxy tools; we will present them and explain what they do. The goal is to provide an accessible and reproducible workflow for data submission.

Agenda

In this tutorial, we will cover:

  1. Prepare raw data
    1. Tools used in the “Prepare Data submission” Workflow
  2. Cleaning the Data
    1. Cutadapt
    2. Quality Control with FastQC and MultiQC
    3. Filtering the collection
    4. Alignments on NCBI database
    5. Workflow Outputs
  3. How to use ENA upload Tool
    1. Adding ENA “Webin” credentials to your Galaxy user information
    2. Submitting using a metadata template file
  4. Conclusion

Prepare raw data

Hands-on: Data Upload
  1. Create a new history for this tutorial

    To create a new history simply click the new-history icon at the top of the history panel:

    UI for creating new history

  2. Import the raw sequences files.

    https://data.indores.fr/api/access/datafile/3673
    https://data.indores.fr/api/access/datafile/3609
    
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

  3. Rename galaxy-pencil your datafiles
    • 3673 becomes A2_RC_8F2_B.pl_HCOI.ab1
    • 3609 becomes A12_RC_9G4_B.md_HCOI.ab1
    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field
    • Click the Save button

  4. Check the datatype
    • Make sure it is ab1, and change it if not.
    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click galaxy-chart-select-data Datatypes tab on the top
    • In the galaxy-chart-select-data Assign Datatype, select your desired datatype from “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

  5. Build a Collection containing these two files. You can name it “ab1”, for example.

    • Click on galaxy-selector Select Items at the top of the history panel
    • Check all the datasets in your history you would like to include
    • Click n of N selected and choose Build Dataset List

      build list collection menu item

    • Enter a name for your collection
    • Click Create collection to build your collection
    • Click on the checkmark icon at the top of your history again
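The import and rename steps above can also be scripted outside Galaxy. The following is a minimal sketch under stated assumptions, not part of the tutorial's workflow: it assumes the two data.indores.fr links are directly reachable over HTTP and uses only Python's standard library. The `DATASETS` table and the `fetch_all` helper are illustrative names.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Datafile URL -> file name used in this tutorial (taken from the steps above).
DATASETS = {
    "https://data.indores.fr/api/access/datafile/3673": "A2_RC_8F2_B.pl_HCOI.ab1",
    "https://data.indores.fr/api/access/datafile/3609": "A12_RC_9G4_B.md_HCOI.ab1",
}


def fetch_all(datasets, out_dir="ab1"):
    """Download each URL and save it under its tutorial name (hypothetical helper)."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for url, name in datasets.items():
        destination = target / name
        urlretrieve(url, str(destination))  # network access happens here
        saved.append(destination)
    return saved


# Nothing is downloaded on import; call fetch_all(DATASETS) explicitly.
```

The resulting files would still need to be uploaded to Galaxy with the ab1 datatype and grouped into a collection as described above.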

Tools used in the “Prepare Data submission” Workflow

The following steps take ab1 sequence files as input and produce filtered FASTQ and FASTA files. Sequences that pass the quality checks are then compared to the NCBI nucleotide database using blastn.

Converting Ab1 files to FASTQ

Hands-on: ab1 to FASTQ converter
  1. ab1 to FASTQ converter ( Galaxy version 1.20.0) with the following parameters:
    • param-collection “Input ab1 file”: ab1 data collection created at the previous step

Quality Control

We run a first quality control on the raw files using FastQC and MultiQC.

Hands-on: FastQC
  1. FastQC ( Galaxy version 0.74+galaxy0) with the following parameters:
    • param-file “Raw read data from your current history”: ab1.fastq data collection created at the previous step
  2. MultiQC ( Galaxy version 1.11+galaxy1) with the following parameters:
    • In “Results”:
      • param-repeat “Insert Results”
        • “Which tool was used generate logs?”: FastQC
    • In “FastQC output”:
      • param-file “RawData FastQC output”: FastQC on collection X: data collection created at the previous step
  3. Check the general quality statistics of your sequences in the HTML reports
Question: Question
  1. What is the quality of your sequences?
  2. Do you have adapters?
Solution:
  1. Judging by the “status checks” section of MultiQC, the quality is quite good. As expected, because each file contains a single sequence, the “Per base sequence content” and “Overrepresented sequences” sections are flagged as “bad” for both files. The “Adapter content” section also shows a “bad” result for the A2_RC_8F2_B.pl_HCOI.ab1 file.

  2. The A2_RC_8F2_B.pl_HCOI.ab1 file seems to contain adapters.
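To make the quality values reported by FastQC more concrete, here is a small sketch of how a FASTQ quality string maps to Phred scores. It assumes the Sanger encoding (ASCII offset 33). The function names and the example string are made up for illustration and are not part of FastQC.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality_string]


def error_probability(score):
    """Phred definition: Q = -10 * log10(p), hence p = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


def mean_quality(quality_string, offset=33):
    """Average Phred score of one read (0.0 for an empty read)."""
    scores = phred_scores(quality_string, offset)
    return sum(scores) / len(scores) if scores else 0.0


example = "IIII5+"  # hypothetical quality string
print(phred_scores(example))
print(mean_quality(example))
```

A Phred score of 30 corresponds to an estimated error probability of 1 in 1000 for that base.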

Cleaning the Data

Cutadapt

Cutadapt enables the removal of adapters, polyA tails, and other artifacts from sequences. The tool also filters reads based on quality.

Hands-on: Cutadapt
  1. Cutadapt ( Galaxy version 4.8+galaxy0) with the following parameters:
    • param-collection “FASTQ/A file”: the collection with your data (output of tool ab1 to FastQ converter)
    • “Single-end or Paired-end reads?”: Single-end
    • In “Other Read Trimming Options”:
      • “Quality cutoff(s) (R1)”: 30
      • “Shortening reads to a fixed length”: Disabled
Comment: Suggestions

You may consider changing these parameters depending on the quality of your dataset.
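To illustrate what a quality cutoff does, the sketch below re-implements the 3' quality-trimming rule described in the Cutadapt documentation: subtract the cutoff from every quality score, sum the results from each position to the end of the read, and cut where that sum is smallest. It is a simplified illustration assuming Phred+33 qualities, not Cutadapt's actual code, and the function names are ours.

```python
def quality_trim_index(scores, cutoff):
    """Return the position at which the 3' end of a read would be cut."""
    running_sum = 0
    lowest_sum = 0
    cut_at = len(scores)
    # Walk from the 3' end towards the 5' end.
    for position in reversed(range(len(scores))):
        running_sum += scores[position] - cutoff
        if running_sum > 0:
            # Early exit: this sketch stops once the sum turns positive.
            break
        if running_sum < lowest_sum:
            lowest_sum = running_sum
            cut_at = position
    return cut_at


def trim_read(sequence, quality_string, cutoff=30, offset=33):
    """Trim the low-quality 3' end of one read (sequence and qualities)."""
    scores = [ord(char) - offset for char in quality_string]
    cut_at = quality_trim_index(scores, cutoff)
    return sequence[:cut_at], quality_string[:cut_at]


# Worked example with a cutoff of 10: only the first four bases are kept.
print(quality_trim_index([42, 40, 26, 27, 8, 7, 11, 4, 2, 3], 10))
```

With the cutoff of 30 used in this tutorial, a read whose 3' end stays below Q30 is shortened, which is why some reads can end up empty or shorter than expected after this step.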

Comment: Quality Control

We do a second quality control similar to the first one to check the quality of the sequences after cleaning them.

Quality Control with FastQC and MultiQC

Hands-on: FastQC
  1. FastQC ( Galaxy version 0.74+galaxy0) with the following parameters:
    • param-collection “Raw read data from your current history”: output from tool Cutadapt
  2. MultiQC ( Galaxy version 1.11+galaxy1) with the following parameters:
    • In “Results”:
      • “Which tool was used generate logs?”: FastQC
      • param-repeat “Insert FastQC output”
        • param-collection “FastQC output”: the raw output from tool FastQC
    Comment: Comment

    You should notice an improvement in the quality of your sequences.

Filtering the collection

Hands-on: Filter empty datasets
  1. Filter empty datasets with the following parameters
    • param-collection “Input Collection”: output collection from Cutadapt step
  2. FASTQ Groomer ( Galaxy version 1.1.5+galaxy2) with the following parameters:
    • param-collection “File to groom” : output collection from the tool Filter empty datasets

    This step produces “standardized” fastqsanger sequence files, so that we can then use tools that only accept this format.

  3. Filter FASTQ ( Galaxy version 1.1.5) with the following parameters:
    • “FASTQ File”: output collection from tool FastQ Groomer
    • “Minimum size”: 300
    Comment: Comment

    Here we decide to keep only sequences of 300 bp or longer. You may change this parameter depending on your dataset.
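The minimum-size filter can be pictured with the short sketch below. It assumes plain four-line FASTQ records (no wrapped sequences) and is only an illustration of the idea, not the code of the Filter FASTQ tool. The helper names are hypothetical.

```python
def read_fastq(lines):
    """Yield (header, sequence, separator, quality) tuples from FASTQ lines."""
    cleaned = [line.rstrip("\n") for line in lines]
    for start in range(0, len(cleaned) - 3, 4):
        yield tuple(cleaned[start:start + 4])


def filter_by_length(records, minimum_size=300):
    """Keep only records whose sequence has at least `minimum_size` bases."""
    return [record for record in records if len(record[1]) >= minimum_size]


def write_fastq(records):
    """Serialize records back to FASTQ text."""
    return "".join("\n".join(record) + "\n" for record in records)
```

A typical use would be `write_fastq(filter_by_length(read_fastq(open(path))))`, where `path` points to one FASTQ file of the collection.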

Changing file names

Hands-on: Extract element identifiers and remove extensions
  1. Extract element identifiers ( Galaxy version 0.0.2)
    • param-collection “Dataset collection”: output from the previous step
  2. Regex Find And Replace ( Galaxy version 1.0.3) with the following parameters:
    • “Select lines from”: output of the previous step
    • In “Check”:
      • param-repeat “Insert Check”
        • “Find Regex”: .ab1
        • “Replacement”: ``
    Comment: Comment

    This removes the .ab1 extension from the identifiers, to ensure that all your file names will end with .fastq.gz

  3. Paste with the following parameters:
    • param-file “Paste”: the file from tool Extract element identifiers
    • param-file “and”: the file from tool Regex Find And Replace
    • param-select “Delimited by”: Tab
  4. Check the datatype
    • It should be ‘tabular’. If not, change it now.
    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click galaxy-chart-select-data Datatypes tab on the top
    • In the galaxy-chart-select-data Assign Datatype, select your desired datatype from “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button
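The renaming table built by the steps above can be sketched in a few lines of Python. This is an illustration under assumptions, not the Galaxy tools themselves: it applies the same find pattern (`.ab1`) and empty replacement, then pairs each original identifier with its new one, tab-separated, as the Paste step does. The function names are hypothetical.

```python
import re


def strip_ab1(identifier, pattern=r".ab1", replacement=""):
    """Apply the tutorial's find/replace to one element identifier.

    Note: the dot in ".ab1" is unescaped, so it matches any character.
    A stricter pattern would be r"\.ab1$".
    """
    return re.sub(pattern, replacement, identifier)


def build_mapping_table(identifiers):
    """Return 'original<TAB>new' lines, one per identifier."""
    return "\n".join(f"{name}\t{strip_ab1(name)}" for name in identifiers)


names = ["A2_RC_8F2_B.pl_HCOI.ab1", "A12_RC_9G4_B.md_HCOI.ab1"]
print(build_mapping_table(names))
```

The two-column output has the shape expected by a relabelling step that maps original identifiers to new ones.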

Hands-on: Relabel identifiers
  1. Relabel identifiers with the following parameters:
    • param-collection “Input Collection”: output from tool Filter FastQ
    • “How should the new labels be specified?”: Map original identifiers to new ones using a two column table.

Alignments on NCBI database

Hands-on: NCBI BLAST alignment
  1. FASTQ to FASTA ( Galaxy version 1.1.5) with the following parameters:
    • param-collection “Input FASTQ File”: output collection from tool Relabel Identifiers
  2. NCBI BLAST+ blastn ( Galaxy version 2.14.1+galaxy2) with the following parameters:
    • param-collection “Nucleotide query sequence(s)”: output from the previous step
    • “Subject database/sequences”: Locally installed BLAST database
      • “Nucleotide BLAST database”: NCBI NT (01 Sep 2023)
    • “Output format”: Tabular (extended 25 columns)
    • “Advanced Options”: Hide Advanced Options
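The FASTQ to FASTA conversion in step 1 above simply drops the quality lines and changes the header marker. A minimal sketch, assuming plain four-line FASTQ records and a hypothetical function name, could look like this:

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line FASTQ records into FASTA text."""
    lines = fastq_text.splitlines()
    fasta = []
    for start in range(0, len(lines) - 3, 4):
        header, sequence = lines[start], lines[start + 1]
        fasta.append(">" + header[1:])  # replace the leading "@" with ">"
        fasta.append(sequence)
    return "\n".join(fasta) + "\n" if fasta else ""


print(fastq_to_fasta("@read1\nACGT\n+\nIIII\n"))
```

The resulting FASTA text is the kind of input a nucleotide BLAST search expects as query sequences.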
Hands-on: Extracting best hits
  1. Unique ( Galaxy version 0.3) with the following parameters:
    • param-collection “File to scan for unique values”: output from the previous step
    • “Advanced Options”: Show Advanced Options
      • “Column start”: c1
      • “Column end”: c1
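The idea behind this step is to keep one line per query sequence (column 1). The sketch below shows that idea in Python under two assumptions that we have not verified against the Galaxy tool: that the first line seen for each query is the one kept, and that BLAST lists the hits of each query from best to worst. The function name is hypothetical.

```python
def first_hit_per_query(tabular_lines, key_column=0):
    """Keep the first line for each distinct value in `key_column`."""
    seen = set()
    kept = []
    for line in tabular_lines:
        if not line.strip():
            continue  # skip blank lines
        row = line.rstrip("\n")
        key = row.split("\t")[key_column]
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept
```

Under those assumptions, the kept line for each query is its top-ranked BLAST hit.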

Workflow Outputs

  1. Collection of raw FASTQ files: Input AB1 files converted into FASTQ files.

  2. Collection of FASTQ files (after quality control): Renamed Fastq files ready for submission after quality control and filtering.

  3. Collection of FASTA files: FASTQ files converted into FASTA format. Used for conducting BLAST alignments.

  4. FastQC Quality Control Results before and after cleaning: Both raw FastQC results and HTML reports are created

  5. MultiQC Quality Control Results before and after cleaning: Both raw MultiQC statistics and HTML report are created

  6. Raw Blast Results: Results of the BLAST alignments conducted on our sequences. Column names are:

    Column  NCBI name   Description
    1       qaccver     Query accession dot version
    2       saccver     Subject accession dot version (database hit)
    3       pident      Percentage of identical matches
    4       length      Alignment length
    5       mismatch    Number of mismatches
    6       gapopen     Number of gap openings
    7       qstart      Start of alignment in query
    8       qend        End of alignment in query
    9       sstart      Start of alignment in subject (database hit)
    10      send        End of alignment in subject (database hit)
    11      evalue      Expectation value (E-value)
    12      bitscore    Bit score
    13      sallseqid   All subject Seq-id(s), separated by a ‘;’
    14      score       Raw score
    15      nident      Number of identical matches
    16      positive    Number of positive-scoring matches
    17      gaps        Total number of gaps
    18      ppos        Percentage of positive-scoring matches
    19      qframe      Query frame
    20      sframe      Subject frame
    21      qseq        Aligned part of query sequence
    22      sseq        Aligned part of subject sequence
    23      qlen        Query sequence length
    24      slen        Subject sequence length
    25      salltitles  All subject title(s), separated by a ‘<>’
  7. Filtered Blast Results: Files containing the closest homologous sequences.

  8. Collection of Fastq files: Contains filtered sequences.
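The 25 column names listed under output 6 can be used to read the tabular BLAST results programmatically. The following sketch assumes tab-separated lines without a header row. The `parse_blast_line` helper is a hypothetical name, not part of BLAST or Galaxy.

```python
BLAST_COLUMNS = [
    "qaccver", "saccver", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore", "sallseqid",
    "score", "nident", "positive", "gaps", "ppos", "qframe", "sframe",
    "qseq", "sseq", "qlen", "slen", "salltitles",
]


def parse_blast_line(line):
    """Turn one 25-column tabular line into a {column name: value} dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(BLAST_COLUMNS):
        raise ValueError(
            f"expected {len(BLAST_COLUMNS)} columns, got {len(fields)}"
        )
    return dict(zip(BLAST_COLUMNS, fields))
```

All values are kept as strings; numeric columns such as `pident` or `evalue` can be converted with `float()` when needed.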

How to use ENA upload Tool

Adding ENA “Webin” credentials to your Galaxy user information

Comment: Having an ENA Submission Account

Make sure you have a submission account with the European Nucleotide Archive (ENA). You will need the identifier and the password, available through https://www.ebi.ac.uk/ena/submit/webin/login.

Hands-on: Add your "WEBIN" credentials to your Galaxy account

Instructions:
  1. From the Menu, click on “User” > “Preferences”.
  2. Click on “Manage Information”.
  3. Scroll down to “Your ENA Webin account details” and enter your ENA “Webin” identifier and password.

Adding ENA Webin credentials.

Submitting using a metadata template file

For this tutorial we will use the ENA default sample checklist.

Excel Metadata template.

Note: It is crucial to fill in all the fields marked “Mandatory” and ensure that the sequence names match exactly those indicated in the Excel file.
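Because the names must match exactly, it can help to compare them before submitting. The sketch below is an illustration only: it assumes you have already copied the file names from the Excel template into a plain Python list (reading the spreadsheet itself is not shown), and that the dataset names will receive a `.fastq.gz` extension. The function names are hypothetical.

```python
def expected_file_name(dataset_name, extension=".fastq.gz"):
    """Name a dataset is expected to have once the extension is added."""
    if dataset_name.endswith(extension):
        return dataset_name
    return dataset_name + extension


def compare_names(dataset_names, template_file_names):
    """Report names present on one side only (both lists empty means a match)."""
    expected = {expected_file_name(name) for name in dataset_names}
    listed = set(template_file_names)
    return {
        "missing_from_template": sorted(expected - listed),
        "missing_from_history": sorted(listed - expected),
    }
```

If both lists in the result are empty, every dataset has a matching entry in the template and vice versa.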

Comment: ENA Metadata Templates

You can find metadata templates for each checklist in the ELIXIR-Belgium GitHub repository

  1. Direct download link of the ENA default sample checklist

  2. Direct download link of the ENA default sample checklist filled with elements for the training

You will need to import this file into your Galaxy history. Then, use the ENA Upload Tool to proceed with the submission.

Hands-on: Excel Metadata Template
  1. Import the ENA default sample checklist file.

    https://github.com/galaxyproject/training-material/raw/24776cf161e38ac0449755749d23e851400020aa/topics/ecology/tutorials/ENA_Biodiv_submission/metadata_GdBqCOI_ERC000011_Test.xlsx
    
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

  2. ENA Upload tool ( Galaxy version 1.11+galaxy1) with the following parameters:

    • “Action to execute”: Add new (meta)data
    • “Select the metadata input method”: Excel file
    • “Select the ENA sample checklist”: ENA default sample checklist (ERC000011)
    • “Select Excel file based on template”: metadata_GdBqCOI_ERC000011_Test.xlsx
    • “Select input data”: Dataset or dataset collection
    • “Add .fastq (.gz, .bz2) extension to the Galaxy dataset names to match the ones described in the input tables?”: Yes
    Comment: Datatype

    The ENA upload tool will then automatically compress the FASTQ sequence files into .fastq.gz format before submission.

    Warning: Danger: Submit to ENA test server!

    We suggest you first submit to the ENA test server before making a public submission! Submission can be seen in Dashboard/Study Report

ENA Upload tool.

Conclusion

This tutorial guided you through quality checking and preparing raw data files for ENA submission. You can verify that your sequences have been successfully submitted by logging into the ENA test portal (https://wwwdev.ebi.ac.uk/ena/submit/webin/login) and navigating to the Study Report section.