Clean and manage Sanger sequences from raw files to aligned consensus

Authors: Coline Royaux
Overview
Creative Commons License: CC-BY
Questions:
  • How to clean Sanger sequencing files?

Objectives:
  • Learn how to manage sequencing files (AB1, FASTQ, FASTA)

  • Learn how to clean your Sanger sequences in an automated and reproducible way

Requirements:
Time estimation: 1 hour
Supporting Materials:
Published: Jan 8, 2024
Last modification: Mar 5, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00383
Rating: 2.0 (2 recent ratings, 2 all time)
Revision: 2

The objective of this tutorial is to learn how to clean and manage AB1 data files freshly obtained from Sanger sequencing. This kind of sequencing targets a specific sequence using short single-stranded DNA molecules called primers, which delimit the ends of the targeted marker. Usually, one gets two .ab1 files for each sample, representing the sense (forward) and the antisense (reverse) strands of DNA.

Here, we’ll be using raw data from “AOPEP variants as a novel cause of recessive dystonia: Generalized dystonia and dystonia-parkinsonism” 2022. In this article, two DNA markers are investigated: CHD8 (Chromodomain-helicase-DNA-binding protein 8) and AOPEP (Aminopeptidase O Putative). We’ll focus on the CHD8 sequences, but you can try to apply the same steps to the AOPEP sequences to practice after the tutorial!

In the first section of the tutorial, we’ll prepare the primer data by:

  • selecting the right primer sequences using their identifiers;
  • removing any gaps included in the sequences;
  • computing the reverse-complement sequence for the antisense primer only.

In the second section of the tutorial, we’ll prepare the Sanger sequence data by:

  • extracting the ab1 files of the sequence of interest (CHD8) and separating sense and antisense sequences into two distinct data collections;
  • converting the ab1 files to FASTQ so that they can be used by the following tools;
  • trimming the low-quality ends of the sequences;
  • computing the reverse complement for the antisense sequences only;
  • aligning sense and antisense sequences;
  • obtaining a consensus sequence (the result of matching the nucleotides of the sense and antisense sequences) for each of the three samples.

In the third section of the tutorial, primers and all consensus sequences are finally merged into a single file to be aligned and verified.

Consider a double-strand DNA molecule with the following sequences:

Figure 1: Double-strand DNA

When sequencing, each strand of DNA is read separately in the 5’-3’ orientation. Hence, in the sequence files, the strands are provided as:

Figure 2: Single-strand DNA sequences in output file

To get the antisense sequence in its original orientation, the reverse sequence is computed:

Figure 3: Reversed antisense sequence

To align the sense and antisense sequences, the complement of the reversed antisense sequence is computed:

Figure 4: Reverse-complement antisense sequence

The two sequences can be aligned now:

Figure 5: Aligned sense and antisense sequences
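
The reverse-complement operation described above can be sketched in a few lines of Python. This is only an illustration of the principle, not the code of the Galaxy tool; the function name and the example read are hypothetical, and IUPAC ambiguity codes are not handled.

```python
# Complement of each unambiguous DNA base (ambiguity codes are left unchanged).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    # Complement every base, then reverse the whole string.
    return seq.translate(COMPLEMENT)[::-1]


antisense_read = "TTGACC"  # hypothetical antisense read, given 5'-3'
print(reverse_complement(antisense_read))
```

Applying the function twice returns the original sequence, which is a quick way to check that nothing was lost.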
Agenda

In this tutorial, we will cover:

  1. Get data
  2. Prepare primer data
    1. Separate and format primers files
  3. Prepare sequence data
    1. Unzip data files
    2. Filter collection to separate sense and antisense sequence files
    3. Convert AB1 sequence files to FASTQ and trim low-quality ends
    4. Compute reverse complement sequence for antisense (reverse) sequences only
    5. Merge corresponding sense and antisense sequences single files
    6. Convert FASTQ files to FASTA
    7. Align sequences and retrieve consensus for each sequence
  4. Manage primers and sequences
    1. Merge and align consensus sequences file and primer files
    2. Check your sequences belongs to the right taxonomic group by computing a BLAST on the NCBI database
  5. Conclusion
  6. AOPEP Sanger files

Get data

The authors of “AOPEP variants as a novel cause of recessive dystonia: Generalized dystonia and dystonia-parkinsonism” 2022 have openly shared their raw AB1 files on Zenodo.

Hands-on: Data Upload
  1. Create a new history for this tutorial
  2. Import the files from Zenodo:

    https://zenodo.org/records/7104640/files/AOPEP_and_CHD8_sequences_20220907.zip

    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel
    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field
    • Change Type (set all): from “Auto-detect” to zip
    • Press Start
    • Close the window

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    1. Go into Data (top panel) then Data libraries
    2. Navigate to the correct folder as indicated by your instructor.
      • On most Galaxies, tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.
    3. Select the desired files
    4. Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu
    5. In the pop-up window, choose

      • “Select history”: the history you want to import the data to (or create a new one)
    6. Click on Import

  3. Create the primer FASTA file by copying the following sequences:
    >Forward_CHD8
    GAGGTGAAAGAATCATAAATTGG
    >Reverse_CHD8
    CCCTGTGTACAAATAGCTTTTGT
    >Forward_AOPEP
    TCATGGTTCCAGGCAGAGTTATT
    >Reverse_AOPEP
    TGCTGTGACAAGCCAACCAATGG
    
    • Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)
    • Select Paste/Fetch Data
    • Paste into the text field
    • Change Type (set all): from “Auto-detect” to fasta
    • Change the name from “New File” to “Primer file”
    • Click Start

    Note that these primer sequences were invented for the purpose of this tutorial; they are not the sequences used in the publication.

Prepare primer data

Separate and format primers files

Primers must be separated into distinct files because sense (forward) and antisense (reverse) primers won’t be subjected to the same formatting.

Hands-on: Create separate files for each primer
  1. Filter FASTA ( Galaxy version 2.3) with the following parameters:
    • param-file “FASTA sequences”: Primer file
    • “Criteria for filtering on the headers”: Regular expression on the headers
      • “Regular expression pattern the header should match”: Reverse_CHD8
    • Add tags “#Primer” and “#Reverse”

    Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

    To tag a dataset:

    1. Click on the dataset to expand it
    2. Click on Add Tags galaxy-tags
    3. Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
    4. Press Enter
    5. Check that the tag appears below the dataset name

    Tags beginning with # are special!

    They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parentheses correspond to dataset numbers in the figure below):

    1. a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
    2. dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
    3. datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
    4. datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

    A history without name tags versus history with name tags

    Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

    The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

    More information is in a dedicated #nametag tutorial.

    1. Expand one of the output datasets of the tool (by clicking on it)
    2. Click re-run galaxy-refresh the tool

    This is useful if you want to run the tool again but with slightly different parameters, or if you just want to check which parameter setting you used.

  2. Filter FASTA ( Galaxy version 2.3) with the following parameters:
    • param-file “FASTA sequences”: Primer file
    • “Criteria for filtering on the headers”: Regular expression on the headers
      • “Regular expression pattern the header should match”: Forward_CHD8
    • Add tags “#Primer” and “#Forward”

  3. Remove any gaps from the primers with Degap.seqs ( Galaxy version 1.39.5.0) using the following parameters:
    • param-files “fasta - Dataset”: Two Filter FASTA outputs (outputs of Filter FASTA tool)

    1. Click on param-files Multiple datasets
    2. Select several files by keeping the Ctrl (or COMMAND) key pressed and clicking on the files of interest

In the previous hands-on, removing any gaps (- characters in the FASTA files) is a precaution: there are no gaps in our primer file. However, it is important to remove gaps at this point in case you are using different data; otherwise some steps of the tutorial could fail (e.g. alignment).
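
For readers who want to see what selecting a primer by its header and removing gaps amounts to, here is a minimal Python sketch. It is not the implementation of Filter FASTA or Degap.seqs; the function names are hypothetical, and only the - character is treated as a gap.

```python
import re

PRIMER_FASTA = """>Forward_CHD8
GAGGTGAAAGAATCATAAATTGG
>Reverse_CHD8
CCCTGTGTACAAATAGCTTTTGT
>Forward_AOPEP
TCATGGTTCCAGGCAGAGTTATT
>Reverse_AOPEP
TGCTGTGACAAGCCAACCAATGG
"""


def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif line and header is not None:
            records[header] += line
    return records


def select_and_degap(text, header_regex):
    """Keep records whose header matches the regex, then drop '-' gap characters."""
    return {
        header: seq.replace("-", "")
        for header, seq in parse_fasta(text).items()
        if re.search(header_regex, header)
    }


reverse_primer = select_and_degap(PRIMER_FASTA, r"Reverse_CHD8")
print(reverse_primer)
```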

The following hands-on is to be applied only to the sequence of the antisense (reverse) primer.

Hands-on: Compute Reverse-Complement of the antisense (reverse) primer
  1. Reverse-Complement ( Galaxy version 1.0.2+galaxy0) the antisense (reverse) primer sequence with the following parameters:
    • param-file “Input file in FASTA or FASTQ format”: Degap.seqs #Reverse FASTA output (output of Degap.seqs tool)

See the introduction for an explanation of the reverse complement.

Prepare sequence data

Unzip data files

Hands-on: Unzip
  1. Unzip ( Galaxy version 6.0+galaxy0) with the following parameters:
    • param-file “input_file”: AOPEP_and_CHD8_sequences_20220907.zip?download=1
    • “Extract single file”: All files
Question

How many files are there in the ZIP archive?

12 (if you have a different number of files, something likely went wrong)

From now on, we’ll be working a lot with data collections. To supply a collection to a tool:

  1. Click on param-collection Dataset collection in front of the input parameter you want to supply the collection to.
  2. Select the collection you want to use from the list

Filter collection to separate sense and antisense sequence files

As with the primers, sense and antisense sequences will be subjected to slightly different procedures, so they must be separated into distinct data collections.

Hands-on: Filter
  1. Extract element identifiers ( Galaxy version 0.0.2) with the following parameters:
    • param-collection “Dataset collection”: output collection (output of Unzip tool)
  2. Regex Find And Replace ( Galaxy version 1.0.3) with the following parameters:
    • param-file “Select lines from”: output (output of Extract element identifiers tool)
    • In “Check”:
      • param-repeat “Insert Check”
        • “Find Regex”: ^[A-Za-z0-9_-]+F$
        • “Replacement”: (leave empty)
      • param-repeat “Insert Check”
        • “Find Regex”: ^[A-Za-z0-9_-]+AOPEP[A-Za-z0-9_-]+$
        • “Replacement”: (leave empty)
    • Tag output with “#Reverse”

  3. Regex Find And Replace ( Galaxy version 1.0.3) with the following parameters:
    • param-file “Select lines from”: output (output of Extract element identifiers tool)
    • In “Check”:
      • param-repeat “Insert Check”
        • “Find Regex”: ^[A-Za-z0-9_-]+R$
        • “Replacement”: (leave empty)
      • param-repeat “Insert Check”
        • “Find Regex”: ^[A-Za-z0-9_-]+AOPEP[A-Za-z0-9_-]+$
        • “Replacement”: (leave empty)
    • Tag output with “#Forward”

  4. Filter collection with the following parameters:
    • param-collection “Input Collection”: output collection (output of Unzip tool)
    • “How should the elements to remove be determined?”: Remove if identifiers are ABSENT from file
      • param-files “Filter out identifiers absent from”: #Forward files list & #Reverse files list (output of Regex Find And Replace tool)
    • Tag (filtered) outputs with “#Forward” and “#Reverse”

    Comment: What's happening in this section?

    • First step: extracting the list of file names in the data collection
    • Second step: removing file names ending with “F” or containing “AOPEP” -> creating a list of antisense (reverse) sequence files of the marker CHD8
    • Third step: removing file names ending with “R” or containing “AOPEP” -> creating a list of sense (forward) sequence files of the marker CHD8
    • Fourth step: selecting files in the collection -> creating two distinct collections, with sense (forward) sequence files on one hand and antisense (reverse) sequence files on the other

    For the second and third step, we used regular expressions (Regex):

    Regular expressions are a standardized way of describing patterns in textual data. They can be extremely useful for tasks such as finding and replacing data. They can be a bit tricky to master, but learning even just a few of the basics can help you get the most out of Galaxy.

    Finding

    Below are just a few examples of basic expressions:

    Regular expression   Matches
    abc                  an occurrence of abc within your data
    (abc|def)            abc or def
    [abc]                a single character which is either a, b, or c
    [^abc]               a character that is NOT a, b, nor c
    [a-z]                any lowercase letter
    [a-zA-Z]             any letter (upper or lower case)
    [0-9]                numbers 0-9
    \d                   any digit (same as [0-9])
    \D                   any non-digit character
    \w                   any alphanumeric character
    \W                   any non-alphanumeric character
    \s                   any whitespace
    \S                   any non-whitespace character
    .                    any character
    \.                   a literal period (.)
    {x,y}                between x and y repetitions
    ^                    the beginning of the line
    $                    the end of the line

    Note: you see that characters such as *, ?, ., + etc have a special meaning in a regular expression. If you want to match on those characters, you can escape them with a backslash. So \? matches the question mark character exactly.

    Examples

    Regular expression   Matches
    \d{4}                4 digits (e.g. a year)
    chr\d{1,2}           chr followed by 1 or 2 digits
    .*abc$               anything with abc at the end of the line
    ^$                   empty line
    ^>.*                 line starting with > (e.g. FASTA header)
    ^[^>].*              line not starting with > (e.g. FASTA sequence)

    Replacing

    Sometimes you need to capture the exact value you matched on in order to use it in your replacement. We do this using capture groups (...), which we can refer to using \1, \2 etc. for the first and second captured values. If you want to refer to the whole match, use &.

    Regular expression     Input          Captures
    chr(\d{1,2})           chr14          \1 = 14
    (\d{2}) July (\d{4})   24 July 1984   \1 = 24, \2 = 1984

    An expression like s/find/replacement/g indicates a replacement expression: it will substitute (s) any occurrence of find with replacement. It does this globally (g), which means it doesn’t stop after the first match.

    Example: s/chr(\d{1,2})/CHR\1/g will replace chr14 with CHR14 etc.

    You can also use replacement modifiers, such as convert to lower case \L or upper case \U. Example: s/.*/\U&/g will convert the whole text to upper case.

    Note: In Galaxy, you are often asked to provide the find and replacement expressions separately, so you don’t have to use the s/../../g structure.

    There is a lot more you can do with regular expressions, and there are a few different flavours in different tools/programming languages, but these are the most important basics that will already allow you to do many of the tasks you might need in your analysis.

    Tip: RegexOne is a nice interactive tutorial to learn the basics of regular expressions.

    Tip: Regex101.com is a great resource for interactively testing and constructing your regular expressions, it even provides an explanation of a regular expression if you provide one.

    Tip: Cyrilex is a visual regular expression tester.

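    In the expressions used in this section, [A-Za-z0-9_-] matches any single character from A to Z, a to z, 0 to 9, _ or -, and the following + means that such a character occurs one or more times. The ^ and $ anchors force the pattern to cover the whole file name.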
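
To see how these two find-and-replace checks turn a list of identifiers into a list of CHD8 reverse (or forward) files, here is a small Python sketch. The identifiers are hypothetical (the real names in the archive may differ), and the helper only mimics the effect of blanking matching lines; it is not the code of the Galaxy tool.

```python
import re

# Hypothetical element identifiers; the real names in the archive may differ.
identifiers = [
    "sample1_CHD8_F",
    "sample1_CHD8_R",
    "sample1_AOPEP_F",
    "sample1_AOPEP_R",
]

AOPEP = r"^[A-Za-z0-9_-]+AOPEP[A-Za-z0-9_-]+$"


def keep_identifiers(names, patterns):
    """Blank out every name fully matched by one of the patterns, keep the rest."""
    kept = []
    for name in names:
        for pattern in patterns:
            name = re.sub(pattern, "", name)
        if name:  # an emptied line means the identifier was removed
            kept.append(name)
    return kept


# Removing names ending in F (and all AOPEP names) leaves the CHD8 reverse files.
reverse_list = keep_identifiers(identifiers, [r"^[A-Za-z0-9_-]+F$", AOPEP])
# Removing names ending in R (and all AOPEP names) leaves the CHD8 forward files.
forward_list = keep_identifiers(identifiers, [r"^[A-Za-z0-9_-]+R$", AOPEP])
print(reverse_list)
print(forward_list)
```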

Convert AB1 sequence files to FASTQ and trim low-quality ends

In Sanger sequencing, the ends of the reads tend to have a low trust level (each nucleotide has a quality score reflecting this trust level). It is important to delete these sections of the sequences to ensure that wrong nucleotides aren’t introduced into the sequences.

Hands-on: AB1 to FASTQ files and trim low quality ends

Do these steps twice! We have sense (forward) and antisense (reverse) sequence data collections; run these steps starting from each of the “(filtered)” data collections. Re-running the tool could help:

  1. Expand one of the output datasets of the tool (by clicking on it)
  2. Click re-run galaxy-refresh the tool

This is useful if you want to run the tool again but with slightly different parameters, or if you just want to check which parameter setting you used.

  1. ab1 to FASTQ converter ( Galaxy version 1.20.0) with the following parameters:
    • param-collection “Input ab1 file”: (filtered) output collection (output of Filter collection tool)
    • “Do you want trim ends according to quality scores ?”: No, use full sequences.

In this tool, it is possible to trim low-quality ends while converting the file, but the parametrization is less precise.

  2. seqtk_trimfq ( Galaxy version 1.3.1) with the following parameters:
    • param-collection “Input FASTA/Q file”: output collection (output of ab1 to FASTQ converter tool)
    • “Mode for trimming FASTQ File”: Quality
      • “Maximally trim down to INT bp”: 0
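
To give an intuition of quality-based end trimming, here is a minimal Python sketch of a Mott-style approach: each Phred score Q is converted to an error probability 10^(-Q/10), and the contiguous segment that maximizes the sum of (cutoff - error probability) is kept. This is an illustration under stated assumptions, not seqtk's actual implementation; the function name, the 0.05 cutoff and the example scores are all chosen for the example.

```python
def mott_trim(qualities, cutoff=0.05):
    """Return (start, end) of the slice to keep, using a Mott-style running sum.

    Each base scores `cutoff - p`, where p = 10 ** (-Q / 10) is its error
    probability; the highest-scoring contiguous segment is kept.
    """
    best_score, best_span = 0.0, (0, 0)
    running, start = 0.0, 0
    for i, q in enumerate(qualities):
        running += cutoff - 10 ** (-q / 10)
        if running <= 0:
            # The segment is no longer worth keeping: restart after this base.
            running, start = 0.0, i + 1
        elif running > best_score:
            best_score, best_span = running, (start, i + 1)
    return best_span


sequence = "NNACGTNN"                     # hypothetical read
phred = [2, 2, 30, 30, 30, 30, 2, 2]      # hypothetical Phred scores
start, end = mott_trim(phred)
print(sequence[start:end])
```

A read whose bases are all of low quality yields an empty slice, which is why trimmed collections are worth inspecting before moving on.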

Compute reverse complement sequence for antisense (reverse) sequences only

See the introduction for an explanation of the reverse complement.

Hands-on: Reverse complement
  1. FASTQ Groomer ( Galaxy version 1.1.5) with the following parameters:
    • param-collection “File to groom”: #Reverse output collection (output of seqtk_trimfq tool)
    • “Advanced Options”: Show Advanced Options
      • “Summarize input data”: Do not Summarize Input (faster)
    Comment: What is this step?

    This step is necessary to get the right input format for the following Reverse-Complement tool.

  2. Reverse-Complement ( Galaxy version 1.0.2+galaxy0) with the following parameters:
    • param-collection “Input file in FASTA or FASTQ format”: #Reverse output collection (output of FASTQ Groomer tool)

Merge corresponding sense and antisense sequences single files

Hands-on: Sort collections

Do this step twice! One has to make sure that the sense (forward) and antisense (reverse) sequence collections are in the same order, so that the right sense and antisense sequences are merged together.

  1. Expand one of the output datasets of the tool (by clicking on it)
  2. Click re-run galaxy-refresh the tool

This is useful if you want to run the tool again but with slightly different parameters, or if you just want to check which parameter setting you used.

  1. Sort collection with the following parameters:
    • param-collection “Input Collection”: Collection (output of seqtk_trimfq tool & output of Reverse-Complement tool)
    • “Sort type”: alphabetical
Hands-on: Merge sense (forward) and antisense (reverse) sequence files
  1. seqtk_mergepe ( Galaxy version 1.3.1) with the following parameters:
    • param-collection “Input FASTA/Q file #1”: output (output of Sort collection tool)
    • param-collection “Input FASTA/Q file #2”: output (output of Sort collection tool)

Check that there are two sequences in each of the three files of the newly created collection.
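
The reason for sorting both collections before merging can be illustrated with a short Python sketch. The element identifiers and FASTQ records below are hypothetical, and the helper simply concatenates the matching forward and reverse records; it is a sketch of the pairing logic, not of seqtk_mergepe itself.

```python
# Hypothetical collections: element identifier -> FASTQ record (4 lines each).
forward = {
    "s2_CHD8_F": "@s2_F\nACGT\n+\nIIII\n",
    "s1_CHD8_F": "@s1_F\nGGCA\n+\nIIII\n",
}
reverse = {
    "s1_CHD8_R": "@s1_R\nGGCT\n+\nIIII\n",
    "s2_CHD8_R": "@s2_R\nACGA\n+\nIIII\n",
}


def merge_pairs(fwd, rev):
    """Sort both collections by identifier, then concatenate matching elements."""
    merged = []
    for f_id, r_id in zip(sorted(fwd), sorted(rev)):
        merged.append(fwd[f_id] + rev[r_id])
    return merged


merged = merge_pairs(forward, reverse)
for content in merged:
    # Every 4th line of a FASTQ file is a header: expect two per merged file.
    print(content.splitlines()[0::4])
```

Without the sorting step, the first forward element (s2) would have been paired with the first reverse element (s1).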

Convert FASTQ files to FASTA

Hands-on: FASTQ to FASTA
  1. FASTQ Groomer ( Galaxy version 1.1.5) with the following parameters:
    • param-collection “File to groom”: default (output of seqtk_mergepe tool)
    • “Advanced Options”: Show Advanced Options
      • “Summarize input data”: Do not Summarize Input (faster)
    Comment: What is this step?

    This step is necessary to get the right input format for the following FASTQ to FASTA tool.

  2. FASTQ to FASTA ( Galaxy version 1.0.2+galaxy2) with the following parameters:
    • param-collection “FASTQ file to convert”: output collection (output of FASTQ Groomer tool)
    • “Discard sequences with unknown (N) bases”: no
    • “Rename sequence names in output file (reduces file size)”: no
    • “Compress output FASTA”: No
    Comment: information

    If this step doesn’t work, one can try tools FASTQ to tabular tool and tabular to FASTA tool instead
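
Conceptually, converting FASTQ to FASTA keeps the name and the sequence of each record and drops the quality lines. The following Python sketch assumes simple four-line FASTQ records (no wrapped sequences); the function name is hypothetical and this is not the code of the Galaxy tool.

```python
def fastq_to_fasta(fastq_text):
    """Convert 4-line FASTQ records to FASTA, keeping names and N bases unchanged."""
    lines = fastq_text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    out = []
    for i in range(0, len(lines), 4):
        header, sequence = lines[i], lines[i + 1]
        if not header.startswith("@"):
            raise ValueError(f"malformed FASTQ header: {header!r}")
        out.append(">" + header[1:])  # '@name' becomes '>name'
        out.append(sequence)          # quality lines (i + 2, i + 3) are dropped
    return "\n".join(out) + "\n"


print(fastq_to_fasta("@read1\nACGN\n+\nII#I\n"), end="")
```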

Align sequences and retrieve consensus for each sequence

Hands-on: Align and consensus
  1. Align sequences ( Galaxy version 1.9.1.0) with the following parameters:
    • param-collection “Input fasta file”: output collection (output of FASTQ-to-FASTA tool)
    • “Method for aligning sequences”: clustalw
    • “Minimum percent sequence identity to closest blast hit to include sequence in alignment”: 0.1
  2. Consensus sequence from aligned FASTA ( Galaxy version 1.0.0) with the following parameters:
    • param-collection “Input fasta file with at least two sequences”: aligned_sequences (output of Align sequences tool)
    • Add tag “#Consensus”
  3. Merge.files ( Galaxy version 1.39.5.0) with the following parameters:
    • “Merge”: fasta files
      • param-collection “inputs - fasta”: output collection (output of Consensus sequence from aligned FASTA tool)

Manage primers and sequences

Merge and align consensus sequences file and primer files

Hands-on: Merge and format consensus sequences + primers file
  1. Merge.files ( Galaxy version 1.39.5.0) with the following parameters:
    • “Merge”: fasta files
      • param-files “inputs - fasta”: consensus sequences (output of Merge.files tool), Reverse primer (output of Reverse-Complement tool), Forward primer (output of Degap.seqs tool)
    1. Click on param-files Multiple datasets
    2. Select several files by keeping the Ctrl (or COMMAND) key pressed and clicking on the files of interest

    • Remove tags “#Forward” and “#Reverse”
  2. Regex Find And Replace ( Galaxy version 1.0.3) with the following parameters:
    • param-file “Select lines from”: output (output of Merge.files tool)
    • In “Check”:
      • param-repeat “Insert Check”
        • “Find Regex”: ([A-Z-])>
        • “Replacement”: \1\n>
    Comment: What's going on in this second step?

    Sometimes the Merge.files tool doesn’t keep the line feed between files. This step corrects that and yields a properly formatted FASTA file.

    For the second step, we used regular expressions (Regex):

    Regular expressions are a standardized way of describing patterns in textual data. They can be extremely useful for tasks such as finding and replacing data. They can be a bit tricky to master, but learning even just a few of the basics can help you get the most out of Galaxy.

    Finding

    Below are just a few examples of basic expressions:

    Regular expression Matches
    abc an occurrence of abc within your data
    (abc|def) abc or def
    [abc] a single character which is either a, b, or c
    [^abc] a character that is NOT a, b, nor c
    [a-z] any lowercase letter
    [a-zA-Z] any letter (upper or lower case)
    [0-9] numbers 0-9
    \d any digit (same as [0-9])
    \D any non-digit character
    \w any alphanumeric character
    \W any non-alphanumeric character
    \s any whitespace
    \S any non-whitespace character
    . any character
    \. a literal period (dot)
    {x,y} between x and y repetitions
    ^ the beginning of the line
    $ the end of the line

    Note: characters such as *, ?, . and + have a special meaning in a regular expression. If you want to match those characters literally, escape them with a backslash. So \? matches the question mark character exactly.

    Examples

    Regular expression matches
    \d{4} 4 digits (e.g. a year)
    chr\d{1,2} chr followed by 1 or 2 digits
    .*abc$ anything with abc at the end of the line
    ^$ empty line
    ^>.* Line starting with > (e.g. Fasta header)
    ^[^>].* Line not starting with > (e.g. Fasta sequence)

    Replacing

    Sometimes you need to capture the exact value you matched in order to use it in your replacement. We do this using capture groups (...), which we can refer to using \1, \2, etc. for the first and second captured values. If you want to refer to the whole match, use &.

    Regular expression Input Captures
    chr(\d{1,2}) chr14 \1 = 14
    (\d{2}) July (\d{4}) 24 July 1984 \1 = 24, \2 = 1984

    An expression like s/find/replacement/g is a replacement expression: it searches (s) for any occurrence of find and replaces it with replacement. It does this globally (g), which means it doesn’t stop after the first match.

    Example: s/chr(\d{1,2})/CHR\1/g will replace chr14 with CHR14 etc.

    You can also use replacement modifiers, such as \L to convert to lower case or \U to convert to upper case. Example: s/.*/\U&/g will convert the whole text to upper case.

    Note: In Galaxy, you are often asked to provide the find and replacement expressions separately, so you don’t have to use the s/../../g structure.

    There is a lot more you can do with regular expressions, and there are a few different flavours in different tools/programming languages, but these are the most important basics that will already allow you to do many of the tasks you might need in your analysis.

    Tip: RegexOne is a nice interactive tutorial to learn the basics of regular expressions.

    Tip: Regex101.com is a great resource for interactively testing and constructing your regular expressions, it even provides an explanation of a regular expression if you provide one.

    Tip: Cyrilex is a visual regular expression tester.

    Here, [A-Z-] matches any single character from A to Z or a hyphen (-), \1 reinserts the character captured between the parentheses in the “Find Regex” field, and \n is a line feed.
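The same find and replace can be tried outside Galaxy. The sketch below applies the pattern from this step with Python's `re` module; the function name and the merged text are made up for illustration, and Galaxy's regex engine may differ from Python's in small details.

```python
import re

def restore_header_linebreaks(merged_fasta):
    """Put each '>' header back on its own line when it is glued to a sequence."""
    # ([A-Z-]) captures the last sequence character before '>',
    # \1 writes it back and \n adds the missing line feed
    return re.sub(r"([A-Z-])>", r"\1\n>", merged_fasta)

# Hypothetical merged file where the second header lost its line feed
broken = ">consensus_1\nACGT-ACGT>primer_forward\nACGTAC\n"
fixed = restore_header_linebreaks(broken)
```

A header that already starts a line is left untouched, because the character before it is a line feed and not a letter or a hyphen.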

Once you have the consensus sequences, check whether they contain any ambiguous nucleotides. An ambiguous nucleotide means that the sense and antisense sequences disagree at that position, so some checks are needed.

  • Y = C or T
  • R = A or G
  • W = A or T
  • S = G or C
  • K = T or G
  • M = C or A
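As an illustration of how these codes arise, the following Python sketch builds a consensus from two aligned sequences of equal length and lists the ambiguous positions. It is not the algorithm of the Galaxy consensus tool; returning N for any other disagreement (for example a gap against a base) is an assumption of this sketch, and the function names are hypothetical.

```python
# Two-base IUPAC ambiguity codes, as listed above
IUPAC = {
    frozenset("CT"): "Y",
    frozenset("AG"): "R",
    frozenset("AT"): "W",
    frozenset("GC"): "S",
    frozenset("TG"): "K",
    frozenset("CA"): "M",
}

def consensus(seq_a, seq_b):
    """Consensus of two aligned sequences; disagreements become IUPAC codes."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    out = []
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b:
            out.append(a)
        else:
            # Assumption: anything not covered by the table (e.g. a gap) becomes N
            out.append(IUPAC.get(frozenset((a, b)), "N"))
    return "".join(out)

def ambiguous_positions(sequence):
    """Return (1-based position, code) for each two-base ambiguity code found."""
    return [(i + 1, c) for i, c in enumerate(sequence) if c in "YRWSKM"]
```

For example, `consensus("GAATC", "GAATT")` gives `GAATY`, and `ambiguous_positions` then reports the Y at position 5.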
Hands-on: Look for ambiguous nucleotides
  1. Click on output of Regex Find and Replace tool in the history to expand it

  2. Click on galaxy-barchart Visualize

  3. Select Multiple Sequence Alignment

  4. Set color scheme to Clustal, ambiguous nucleotides are highlighted in dark blue

  5. There are two nucleotide positions to check, Y at 121 in sequence consensus_B05_CHD8-III6brother-18 and W at 286 in sequence consensus_05_CHD8-III6mother-18

  6. You need to go back to your FASTQ sequences to understand the origin of the ambiguity

  7. Regex Find And Replace ( Galaxy version 1.0.3) with the following parameters:
    • param-file “Select lines from”: #Consensus #Primer output (output of Regex Find and Replace tool)
    • In “Check”:
      • param-repeat “Insert Check”
        • “Find Regex”: ^[ACTG]+([ACTG]{20}Y)[ACTG]+$
        • “Replacement”: \1
      • param-repeat “Insert Check”
        • “Find Regex”: ^[ACTG]+([ACTG]{20}W)[ACTG]+$
        • “Replacement”: \1
    Comment: What's going on in this step?

    We want to retrieve the 20 nucleotides before the ambiguities.

    We use regular expressions (Regex):

    Here, [ACTG] matches any one of the four unambiguous nucleotides. It is followed either by +, meaning “at least once”, or by {20}, meaning “exactly 20 times”.

    In the output of this tool we get:

    • the 20 nucleotides before the Y at position 121 in sequence consensus_B05_CHD8-III6brother-18: CAGGCACGATGTCATCGAAT
    • the 20 nucleotides before the W at position 286 in sequence consensus_05_CHD8-III6mother-18: AGTCCTCTTAGTTTATAGAT
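The pattern used in this step can be tested on a short sequence with Python's `re` module. This is only a sketch: the function name `flank_before` and the toy sequence are hypothetical, and the flanking fragment is reused from the step above purely as test input.

```python
import re

def flank_before(sequence, code, size=20):
    """Return the `size` nucleotides preceding an ambiguity code, plus the code itself."""
    # Same shape as the "Find Regex" above: ^[ACTG]+([ACTG]{20}Y)[ACTG]+$
    pattern = rf"^[ACTG]+([ACTG]{{{size}}}{code})[ACTG]+$"
    match = re.match(pattern, sequence)
    return match.group(1) if match else None

# Toy sequence: 2 leading bases, the 20-nucleotide fragment, a Y, 3 trailing bases
toy = "TT" + "CAGGCACGATGTCATCGAAT" + "Y" + "GGA"
fragment = flank_before(toy, "Y")
```

Note that the capture group includes the ambiguity code, and that the pattern only matches lines containing a single ambiguous nucleotide with at least one base on either side of the captured fragment.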

  8. FASTQ masker ( Galaxy version 1.1.5) with the following parameters:
    • param-collection “File to mask”: #Forward #Reverse collection (output of FASTQ groomer tool)
    • “Mask input with”: Lowercase
    • “Quality score”: 10

This tool displays low-quality bases in lowercase, which makes potential errors easier to spot.
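The masking idea can be sketched in a few lines of Python. This is not the FASTQ masker implementation: it assumes Sanger-style quality encoding (ASCII offset 33) and assumes that bases with a score at or below the threshold are masked, which may not match the comparison selected in the tool.

```python
def mask_low_quality(sequence, quality, threshold=10, offset=33):
    """Lowercase every base whose Phred score is at or below `threshold`."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings must have the same length")
    return "".join(
        # Phred score = ASCII code of the quality character minus the offset
        base.lower() if ord(q) - offset <= threshold else base.upper()
        for base, q in zip(sequence, quality)
    )
```

With the quality string `I+5!` (scores 40, 10, 20 and 0), `mask_low_quality("ACGT", "I+5!")` returns `AcGt`.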

  1. Open galaxy-eye the B05_CHD8-III6brother-18 output of FASTQ masker tool and search (Ctrl+F) for CAGGCACGATGTCATCGAAT. In the sense sequence (ID ending with 18F), this fragment is followed by a low-quality c, whereas in the antisense sequence it is followed by a T of decent quality. Additionally, in the galaxy-eye #Consensus #Primer output of Regex Find and Replace tool, the two other consensus sequences (consensus_05_CHD8-III6mother-18 and consensus_07_CHD8-III6-18) have a T at this position. The nucleotide at position 121 in sequence consensus_B05_CHD8-III6brother-18 is therefore most likely a T.

  2. Open galaxy-eye the 05_CHD8-III6mother-18 output of FASTQ masker tool and search (Ctrl+F) for AGTCCTCTTAGTTTATAGAT. In the antisense sequence (ID ending with 18R), this fragment is followed by a low-quality t, whereas in the sense sequence it is followed by an A of decent quality. Additionally, in the galaxy-eye #Consensus #Primer output of Regex Find and Replace tool, the two other consensus sequences (consensus_B05_CHD8-III6brother-18 and consensus_07_CHD8-III6-18) have an A at this position. The nucleotide at position 286 in sequence consensus_05_CHD8-III6mother-18 is therefore most likely an A.

  3. You can now correct them by clicking on output of Regex Find and Replace tool in the history to expand it

  4. Click on galaxy-barchart Visualize

  5. Select Editor and:

    • replace manually the Y with T in consensus_B05_CHD8-III6brother-18
    • replace manually the W with A in consensus_05_CHD8-III6mother-18 and click on export

Now you can align the sequences with the primers. Ultimately, it is common to cut the sequences between the primers to get the right fragment for each sequence.

Hands-on: Align sequences and primers
  1. Align sequences ( Galaxy version 1.9.1.0) with the following parameters:
    • param-file “Input fasta file”: out_file1 Regex Find And Replace (modified)
    • “Method for aligning sequences”: mafft
    • “Minimum percent sequence identity to closest blast hit to include sequence in alignment”: 0.1
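Cutting a sequence between its primers, as mentioned before this hands-on, can be sketched as follows. The sketch assumes exact primer matches, assumes the reverse primer has already been reverse-complemented (as done earlier in the tutorial), and excludes the primers themselves from the result; the function name and the toy sequences are hypothetical.

```python
def cut_between_primers(sequence, forward_primer, reverse_primer_rc):
    """Return the fragment located between the two primers, or None if either is missing."""
    start = sequence.find(forward_primer)
    if start == -1:
        return None
    fragment_start = start + len(forward_primer)
    # Look for the reverse-complemented reverse primer downstream of the forward primer
    end = sequence.find(reverse_primer_rc, fragment_start)
    if end == -1:
        return None
    return sequence[fragment_start:end]

# Toy example: forward primer ACG, reverse-complemented reverse primer CCA
fragment = cut_between_primers("GGACGTTTTTCCAGG", "ACG", "CCA")
```

Real primers rarely match perfectly along their whole length, which is one reason the tutorial relies on an alignment to locate them rather than on exact string search.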

Check that your sequences belong to the right taxonomic group by running a BLAST search against the NCBI database

Hands-on: NCBI BLAST
  1. NCBI BLAST+ blastn ( Galaxy version 2.10.1+galaxy2) with the following parameters:
    • param-file “Nucleotide query sequence(s)”: out_file1 (output of Regex Find And Replace tool)
    • “Subject database/sequences”: Locally installed BLAST database
      • “Nucleotide BLAST database”: select most recent nt_ database
    • “Output format”: Tabular (select which columns)
      • “Standard columns”: qseqid, pident, mismatch and gapopen
      • “Extended columns”: gaps and salltitles
      • “Other identifier columns”: saccver
    • “Advanced Options”: Show Advanced Options
      • “Maximum hits to consider/show”: 10
      • “Restrict search of database to a given set of ID’s”: No restriction, search the entire database
Question

Which species do the sequences we cleaned belong to?

Homo sapiens

Running such checks is good practice: it helps make sure the sequencing went as planned and that your samples haven’t been contaminated.
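If you download the tabular BLAST output, the best hit per query can be picked out with a short script. The sketch below assumes the columns appear in the order they were selected above (qseqid, pident, mismatch, gapopen, gaps, salltitles, saccver), which should be verified against your own output; the function name is hypothetical and the two rows are invented placeholders, not real BLAST results.

```python
import csv
import io

# Assumed column order; check it against your own tabular output
COLUMNS = ["qseqid", "pident", "mismatch", "gapopen", "gaps", "salltitles", "saccver"]

def best_hits(tabular_text):
    """Return, for each query, the row with the highest percent identity."""
    reader = csv.DictReader(
        io.StringIO(tabular_text),
        fieldnames=COLUMNS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # titles may contain quote characters
    )
    best = {}
    for row in reader:
        row["pident"] = float(row["pident"])
        current = best.get(row["qseqid"])
        if current is None or row["pident"] > current["pident"]:
            best[row["qseqid"]] = row
    return best

# Invented placeholder rows, for illustration only
example = (
    "consensus_demo\t98.5\t3\t0\t0\tplaceholder title A\tDEMO0001\n"
    "consensus_demo\t99.7\t1\t0\t0\tplaceholder title B\tDEMO0002\n"
)
hits = best_hits(example)
```

Reading the `salltitles` value of each best hit is then enough to see which species the closest database sequences come from.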

Conclusion

We successfully cleaned AB1 sequence files!

AOPEP Sanger files

A history following the same steps for the AOPEP marker files is available: Clean AOPEP sequences