The objective of this tutorial is to learn how to clean and manage AB1 data files freshly obtained from Sanger sequencing.
This kind of sequencing targets a specific sequence using short single-stranded DNA fragments called primers. These primers delimit the ends of the targeted marker.
Usually, one gets two .ab1 files for each sample, representing the sense (forward) and the antisense (reverse) strands of DNA.
Open the Galaxy Upload Manager (upload icon at the top-right of the tool panel)
Select Paste/Fetch Data
Paste into the text field
Change Type (set all): from “Auto-detect” to fasta
Change the name from “New File” to “Primer file”
Click Start
Note that these primer sequences were invented for the purpose of this tutorial; they are not the sequences used in the publication.
Prepare primer data
Separate and format primers files
Primers must be separated into distinct files because the sense (forward) and antisense (reverse) primers will not be subjected to the same formatting.
Hands On: Create separate files for each primer
Filter FASTA (Galaxy version 2.3) with the following parameters:
“FASTA sequences”: Primer file
“Criteria for filtering on the headers”: Regular expression on the headers
“Regular expression pattern the header should match”: Reverse_CHD8
Add tags “#Primer” and “#Reverse”
Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.
To tag a dataset:
Click on the dataset to expand it
Click on Add Tags
Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
Press Enter
Check that the tag appears below the dataset name
Tags beginning with # are special!
They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parentheses correspond to dataset numbers in the figure below):
a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
datasets 4 and 5 are used as inputs to Macs2 broadCall, generating datasets 6 and 8;
datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.
Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.
The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted, datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2, the resulting datasets 6 and 8 automatically inherited these tags, and so on. As a result, it is straightforward to trace both branches (plus and minus) of this analysis.
Expand one of the output datasets of the tool (by clicking on it)
Click re-run to run the tool again
This is useful if you want to run the tool again but with slightly different parameters, or if you just want to check which parameter settings you used.
Filter FASTA (Galaxy version 2.3) with the following parameters:
“FASTA sequences”: Primer file
“Criteria for filtering on the headers”: Regular expression on the headers
“Regular expression pattern the header should match”: Forward_CHD8
Add tags “#Primer” and “#Forward”
Remove any gaps from the primers with Degap.seqs (Galaxy version 1.39.5.0), using the following parameters:
Click on Multiple datasets
Select several files by keeping the Ctrl (or COMMAND) key pressed and clicking on the files of interest
“fasta - Dataset”: the two Filter FASTA outputs (outputs of the Filter FASTA tool)
In the previous hands-on, removing gaps (- characters in the FASTA files) is only a precaution: there are no gaps in our primer file. However, it is important to remove gaps at this point if you are using different data; otherwise some later steps of the tutorial (e.g. alignment) could fail.
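For readers who want to see the logic outside Galaxy, the two steps above (splitting the primer file by header, then removing gaps) can be sketched in plain Python. This is only an illustration under stated assumptions: the primer sequences are made up, the function names are hypothetical, and only the - gap character mentioned above is removed. It is not the implementation used by the Filter FASTA or Degap.seqs tools.

```python
import re


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def filter_by_header(records, pattern):
    """Keep only the records whose header matches the regular expression."""
    return [(h, s) for h, s in records if re.search(pattern, h)]


def degap(records):
    """Remove '-' gap characters from every sequence."""
    return [(h, s.replace("-", "")) for h, s in records]


# Made-up primer file: these sequences are placeholders, not real primers.
primer_file = """>Forward_CHD8
ACGT-ACGTTA
>Reverse_CHD8
TTGACC--GTA
"""

records = parse_fasta(primer_file)
forward = degap(filter_by_header(records, r"Forward_CHD8"))
reverse = degap(filter_by_header(records, r"Reverse_CHD8"))
print(forward)
print(reverse)
```

Each primer ends up in its own list, which mirrors the two separate Galaxy datasets tagged #Forward and #Reverse.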
The following hands-on applies only to the sequence of the antisense (reverse) primer.
Hands On: Compute Reverse-Complement of the antisense (reverse) primer
Reverse-Complement (Galaxy version 1.0.2+galaxy0) the sequence of the antisense (reverse) primer with the following parameters:
“Input file in FASTA or FASTQ format”: Degap.seqs #Reverse FASTA output (output of the Degap.seqs tool)
See the introduction for an explanation of the reverse complement.
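As a quick reminder of what this operation does, here is a minimal Python sketch of a reverse complement. The function name and the example sequence are illustrative assumptions; the Galaxy tool itself is not reproduced here. For FASTQ input, the quality string would also have to be reversed so that each score stays attached to its base.

```python
# Translation table mapping each nucleotide to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the resulting string."""
    return seq.translate(COMPLEMENT)[::-1]


# Made-up example sequence (not one of the tutorial's primers).
primer = "TTGACCGTA"
rc_primer = reverse_complement(primer)
print(rc_primer)
```

Applying the function twice gives back the original sequence, which is a handy sanity check.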
Prepare sequence data
Unzip data files
Hands On: Unzip
Unzip (Galaxy version 6.0+galaxy0) with the following parameters:
12 (if you have a different number of files something likely went wrong)
From now on, we’ll be working a lot with data collections:
Click on Dataset collection in front of the input parameter you want to supply the collection to.
Select the collection you want to use from the list
Filter collection to separate sense and antisense sequence files
As with the primers, sense and antisense sequences will be subjected to slightly different procedures, so they must be separated into distinct data collections.
Hands On: Filter
Extract element identifiers (Galaxy version 0.0.2) with the following parameters:
“Dataset collection”: output collection (output of the Unzip tool)
Regex Find And Replace (Galaxy version 1.0.3) with the following parameters:
“Select lines from”: output (output of the Extract element identifiers tool)
In “Check”:
“Insert Check”
“Find Regex”: ^[A-Za-z0-9_-]+F$
“Replacement”: (leave empty)
“Insert Check”
“Find Regex”: ^[A-Za-z0-9_-]+AOPEP[A-Za-z0-9_-]+$
“Replacement”: (leave empty)
Tag output with “#Reverse”
Regex Find And Replace (Galaxy version 1.0.3) with the following parameters:
“Select lines from”: output (output of the Extract element identifiers tool)
In “Check”:
“Insert Check”
“Find Regex”: ^[A-Za-z0-9_-]+R$
“Replacement”: (leave empty)
“Insert Check”
“Find Regex”: ^[A-Za-z0-9_-]+AOPEP[A-Za-z0-9_-]+$
“Replacement”: (leave empty)
Tag output with “#Forward”
Filter collection with the following parameters:
“Input Collection”: output collection (output of the Unzip tool)
“How should the elements to remove be determined?”: Remove if identifiers are ABSENT from file
“Filter out identifiers absent from”: #Forward files list & #Reverse files list (outputs of the Regex Find And Replace tool)
Tag (filtered) outputs with “#Forward” and “#Reverse”
First step: extracting the list of file names in the data collection
Second step: removing file names that end with “F” or contain “AOPEP”, which creates a list of the antisense (reverse) sequence files of the marker CHD8
Third step: removing file names that end with “R” or contain “AOPEP”, which creates a list of the sense (forward) sequence files of the marker CHD8
Fourth step: selecting files in the collection, which creates two distinct collections, one with the sense (forward) sequence files and one with the antisense (reverse) sequence files
For the second and third steps, we used regular expressions (Regex):
Regular expressions are a standardized way of describing patterns in textual data. They can be extremely useful for tasks such as finding and replacing data. They can be a bit tricky to master, but learning even just a few of the basics can help you get the most out of Galaxy.
Finding
Below are just a few examples of basic expressions:
Regular expression    Matches
abc                   an occurrence of abc within your data
(abc|def)             abc or def
[abc]                 a single character which is either a, b, or c
[^abc]                a character that is NOT a, b, nor c
[a-z]                 any lowercase letter
[a-zA-Z]              any letter (upper or lower case)
[0-9]                 numbers 0-9
\d                    any digit (same as [0-9])
\D                    any non-digit character
\w                    any alphanumeric character
\W                    any non-alphanumeric character
\s                    any whitespace
\S                    any non-whitespace character
.                     any character
\.                    a literal dot (.)
{x,y}                 between x and y repetitions
^                     the beginning of the line
$                     the end of the line
Note: you see that characters such as *, ?, ., + etc have a special meaning in a regular expression. If you want to match on those characters, you can escape them with a backslash. So \? matches the question mark character exactly.
Examples
Regular expression    Matches
\d{4}                 4 digits (e.g. a year)
chr\d{1,2}            chr followed by 1 or 2 digits
.*abc$                anything with abc at the end of the line
^$                    empty line
^>.*                  line starting with > (e.g. FASTA header)
^[^>].*               line not starting with > (e.g. FASTA sequence)
Replacing
Sometimes you need to capture the exact value you matched on in order to use it in your replacement. We do this using capture groups (...), which we can refer to using \1, \2, etc. for the first and second captured values. If you want to refer to the whole match, use &.
Regular expression      Input           Captures
chr(\d{1,2})            chr14           \1 = 14
(\d{2}) July (\d{4})    24 July 1984    \1 = 24, \2 = 1984
An expression like s/find/replacement/g indicates a replacement expression: it will search (s) for any occurrence of find and replace it with replacement. It will do this globally (g), which means it doesn’t stop after the first match.
Example: s/chr(\d{1,2})/CHR\1/g will replace chr14 with CHR14 etc.
You can also use replacement modifiers such as \L to convert to lower case or \U to convert to upper case. Example: s/.*/\U&/g will convert the whole text to upper case.
Note: In Galaxy, you are often asked to provide the find and replacement expressions separately, so you don’t have to use the s/../../g structure.
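To try these find/replace ideas outside Galaxy, the following sketch uses Python's standard re module, where the find and replacement expressions are also given separately. The input strings are made up for illustration. Note that Python is one of the "flavours" that differs slightly: it has no \U modifier and does not use & for the whole match, so a small function is used for the upper-case example instead.

```python
import re

# Equivalent of s/chr(\d{1,2})/CHR\1/g: re.sub replaces every match by default.
line = "chr14\t1000\tchr2"
renamed = re.sub(r"chr(\d{1,2})", r"CHR\1", line)

# Capture groups: group 1 is the day, group 2 is the year.
match = re.search(r"(\d{2}) July (\d{4})", "24 July 1984")
day, year = match.groups()

# Python's re has no \U modifier; a replacement function does the same job.
upper = re.sub(r".+", lambda m: m.group(0).upper(), "fasta header")

print(renamed)
print(day, year)
print(upper)
```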
There is a lot more you can do with regular expressions, and there are a few different flavours in different tools/programming languages, but these are the most important basics that will already allow you to do many of the tasks you might need in your analysis.
Tip: RegexOne is a nice interactive tutorial to learn the basics of regular expressions.
Tip: Regex101.com is a great resource for interactively testing and constructing your regular expressions; it even provides an explanation of a regular expression if you provide one.
Tip: Cyrilex is a visual regular expression tester.
In the expressions used above, [A-Za-z0-9_-] matches any single character from A to Z, a to z, 0 to 9, _ or -, and the following + means that such a character occurs one or more times.
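The filtering logic of the four steps can be sketched in Python as follows. The element identifiers below are hypothetical (the real file names in the collection are not listed in this section), and the helper name is an assumption. The idea is the same as in Galaxy: identifiers matching a pattern are replaced with nothing, and only the identifiers that survive are kept in the filtered collection.

```python
import re

# Hypothetical element identifiers; the real ones come from the Unzip output.
identifiers = [
    "A01_CHD8-sample1_F",
    "A02_CHD8-sample1_R",
    "B01_AOPEP-sample1_F",
    "B02_AOPEP-sample1_R",
]

ENDS_WITH_F = r"^[A-Za-z0-9_-]+F$"
ENDS_WITH_R = r"^[A-Za-z0-9_-]+R$"
CONTAINS_AOPEP = r"^[A-Za-z0-9_-]+AOPEP[A-Za-z0-9_-]+$"


def remaining_identifiers(names, patterns):
    """Blank out every name matching one of the patterns, keep the rest."""
    kept = []
    for name in names:
        for pattern in patterns:
            name = re.sub(pattern, "", name)
        if name:  # an empty string means the identifier was removed
            kept.append(name)
    return kept


# Removing "...F" and AOPEP names leaves the reverse CHD8 files, and vice versa.
reverse_list = remaining_identifiers(identifiers, [ENDS_WITH_F, CONTAINS_AOPEP])
forward_list = remaining_identifiers(identifiers, [ENDS_WITH_R, CONTAINS_AOPEP])
print(reverse_list)
print(forward_list)
```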
Convert AB1 sequence files to FASTQ and trim low-quality ends
In Sanger sequencing, the ends of the reads tend to have low trust levels (each nucleotide has a quality score reflecting this trust level). It is important to delete these sections of the sequences so that wrong nucleotides are not introduced into them.
Hands On: AB1 to FASTQ files and trim low quality ends
Do these steps twice! We have sense (forward) and antisense (reverse) sequence data collections, so run these steps once for each “(filtered)” data collection. Re-running the tool can help: expand one of its output datasets (by clicking on it), then click re-run. This lets you run the tool again with slightly different parameters, or simply check which parameter settings you used.
ab1 to FASTQ converter (Galaxy version 1.20.0) with the following parameters:
“Input ab1 file”: (filtered) output collection (output of the Filter collection tool)
“Do you want trim ends according to quality scores ?”: No, use full sequences.
This tool can also trim low-quality ends while converting the file, but its parameters are less precise.
seqtk_trimfq (Galaxy version 1.3.1) with the following parameters:
“Input FASTA/Q file”: output collection (output of the ab1 to FASTQ converter tool)
“Mode for trimming FASTQ File”: Quality
“Maximally trim down to INT bp”: 0
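To make the idea of quality trimming concrete, here is a deliberately simple Python sketch. It assumes Phred+33 encoded quality strings and a hypothetical threshold of Q20, and it just removes bases from each end until a base at or above the threshold is found. This is not the algorithm or the defaults used by seqtk_trimfq; it only illustrates the principle. A Phred score Q corresponds to an error probability of 10^(-Q/10).

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(q):
    """Probability that a base with Phred score q is wrong."""
    return 10 ** (-q / 10)


def trim_low_quality_ends(seq, quality, min_q=20, offset=33):
    """Strip bases below min_q from both ends (simplified, threshold-based)."""
    scores = phred_scores(quality, offset)
    start = 0
    while start < len(scores) and scores[start] < min_q:
        start += 1
    end = len(scores)
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], quality[start:end]


# Made-up read: '!' is Q0, '#' is Q2 and 'I' is Q40 in Phred+33.
trimmed_seq, trimmed_qual = trim_low_quality_ends("NNACGTACGTNN", "!!IIIIIIII##")
print(trimmed_seq, trimmed_qual)
```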
Compute reverse complement sequence for antisense (reverse) sequences only
See the introduction for an explanation of the reverse complement.
Hands On: Reverse complement
FASTQ Groomer (Galaxy version 1.1.5) with the following parameters:
“File to groom”: #Reverse output collection (output of the seqtk_trimfq tool)
“Advanced Options”: Show Advanced Options
“Summarize input data”: Do not Summarize Input (faster)
Comment: What is this step?
This step is necessary to get the right input format for the next tool, Reverse-Complement.
Reverse-Complement (Galaxy version 1.0.2+galaxy0) with the following parameters:
“Input file in FASTA or FASTQ format”: #Reverse output collection (output of the FASTQ Groomer tool)
Merge corresponding sense and antisense sequences into single files
Hands On: Sort collections
Do this step twice! The sense (forward) and antisense (reverse) sequence collections must be in the same order so that each sense sequence is merged with its matching antisense sequence. Re-running the tool can help: expand one of its output datasets (by clicking on it), then click re-run. This lets you run the tool again with slightly different parameters, or simply check which parameter settings you used.
Sort collection with the following parameters:
“Input Collection”: Collection (output of the seqtk_trimfq tool & output of the Reverse-Complement tool)
“Sort type”: alphabetical
Hands On: Merge sense (forward) and antisense (reverse) sequence files
seqtk_mergepe (Galaxy version 1.3.1) with the following parameters:
“Input FASTA/Q file #1”: output (output of the Sort collection tool)
“Input FASTA/Q file #2”: output (output of the Sort collection tool)
Check that there are two sequences in each of the three files of the newly created collection.
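The reason for sorting both collections before merging can be illustrated with a short Python sketch. The identifiers are hypothetical and are assumed to differ only by a final F or R; the function name is also an assumption. Elements are paired by position, so both lists must be in the same order for each forward read to meet its own reverse read.

```python
def pair_collections(forward_ids, reverse_ids):
    """Sort both lists alphabetically and pair their elements by position."""
    pairs = list(zip(sorted(forward_ids), sorted(reverse_ids)))
    for fwd, rev in pairs:
        # Assumption: matching files share the same name up to the last letter.
        if fwd[:-1] != rev[:-1]:
            raise ValueError(f"mismatched pair: {fwd} / {rev}")
    return pairs


# Hypothetical identifiers, deliberately given in different orders.
forward_ids = ["sample2_F", "sample1_F", "sample3_F"]
reverse_ids = ["sample3_R", "sample1_R", "sample2_R"]

pairs = pair_collections(forward_ids, reverse_ids)
print(pairs)
```

Each pair then becomes one merged file containing two sequences, which is what the check above verifies.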
Convert FASTQ files to FASTA
Hands On: FASTQ to FASTA
FASTQ Groomer (Galaxy version 1.1.5) with the following parameters:
“File to groom”: default (output of the seqtk_mergepe tool)
“Advanced Options”: Show Advanced Options
“Summarize input data”: Do not Summarize Input (faster)
Comment: What is this step?
This step is necessary to get the right input format for the next tool, FASTQ to FASTA.
FASTQ to FASTA (Galaxy version 1.0.2+galaxy2) with the following parameters:
“FASTQ file to convert”: output collection (output of the FASTQ Groomer tool)
“Discard sequences with unknown (N) bases”: no
“Rename sequence names in output file (reduces file size)”: no
“Compress output FASTA”: No
Comment: information
If this step doesn’t work, you can try the FASTQ to tabular and tabular to FASTA tools instead.
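Conceptually, converting FASTQ to FASTA just keeps the identifier and the sequence of each record and drops the quality line. The sketch below assumes simple four-line FASTQ records and uses made-up reads; it is not the implementation of the Galaxy tool.

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line FASTQ records into FASTA text."""
    lines = fastq_text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    fasta = []
    for i in range(0, len(lines), 4):
        header, seq, plus = lines[i], lines[i + 1], lines[i + 2]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        fasta.append(">" + header[1:])  # '@id' becomes '>id'
        fasta.append(seq)               # the quality line is discarded
    return "\n".join(fasta) + "\n"


# Made-up file holding one forward and one reverse read.
fastq = "@read_F\nACGT\n+\nIIII\n@read_R\nTTGA\n+\n!!!!\n"
fasta = fastq_to_fasta(fastq)
print(fasta)
```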
Align sequences and retrieve consensus for each sequence
Hands On: Align and consensus
Align sequences (Galaxy version 1.9.1.0) with the following parameters:
“Input fasta file”: output collection (output of the FASTQ to FASTA tool)
“Method for aligning sequences”: clustalw
“Minimum percent sequence identity to closest blast hit to include sequence in alignment”: 0.1
Consensus sequence from aligned FASTA (Galaxy version 1.0.0) with the following parameters:
“Input fasta file with at least two sequences”: aligned_sequences (output of the Align sequences tool)
Add tag “#Consensus”
Merge.files (Galaxy version 1.39.5.0) with the following parameters:
“Merge”: fasta files
“inputs - fasta”: output collection (output of the Consensus sequence from aligned FASTA tool)
Manage primers and sequences
Merge and align consensus sequences file and primer files
Hands On: Merge and format consensus sequences + primers file
Merge.files (Galaxy version 1.39.5.0) with the following parameters:
“Merge”: fasta files
“inputs - fasta”: consensus sequences (output of the Merge.files tool), Reverse primer (output of the Reverse-Complement tool), Forward primer (output of the Degap.seqs tool)
Click on Multiple datasets
Select several files by keeping the Ctrl (or COMMAND) key pressed and clicking on the files of interest
Remove tags “#Forward” and “#Reverse”
Regex Find And Replace (Galaxy version 1.0.3) with the following parameters:
“Select lines from”: output (output of the Merge.files tool)
In “Check”:
“Insert Check”
“Find Regex”: ([A-Z-])>
“Replacement”: \1\n>
Comment: What's going on in this second step?
Sometimes the Merge.files tool doesn’t keep the line feed between the files. This step corrects that and yields a properly formatted FASTA file.
For the second step, we used regular expressions (Regex):
Here, [A-Z-] matches any single character from A to Z or -, \1 inserts the text captured by the parentheses in the “Find Regex” field, and \n is a line feed.
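The effect of this find/replace pair can be reproduced with Python's re module. The merged text below is a made-up example in which the line feed between two FASTA records has been lost, and the record names are placeholders.

```python
import re

# Made-up merged file: the second header is glued to the end of a sequence.
merged = ">consensus_1\nACGT-ACGT>primer_F\nACGTAC\n"

# Same expressions as in the hands-on: capture the last sequence character,
# then write it back followed by a line feed and the '>' of the next header.
fixed = re.sub(r"([A-Z-])>", r"\1\n>", merged)
print(fixed)
```

The first > is untouched because no sequence character precedes it, and a > that already follows a line feed does not match either.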
Once you have the consensus sequences, you can check whether they contain any ambiguous nucleotides. An ambiguous nucleotide means that different nucleotides were found at the same position in the sense and antisense sequences, so some checks are needed.
Y = C or T
R = A or G
W = A or T
S = G or C
K = T or G
M = C or A
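To show how these ambiguity codes arise, here is a minimal Python sketch that builds a consensus from two aligned sequences of equal length using the six two-base codes listed above. It is an illustration under assumptions: the function names and sequences are made up, any other combination (for example a gap against a base) is arbitrarily written as N, and the Galaxy consensus tool may treat such cases differently.

```python
# Two-base ambiguity codes from the list above, plus the unambiguous bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("CT"): "Y", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("GC"): "S", frozenset("TG"): "K", frozenset("CA"): "M",
}


def consensus(seq_a, seq_b):
    """Column-wise consensus of two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    columns = zip(seq_a.upper(), seq_b.upper())
    # Assumption of this sketch: anything not in the table becomes 'N'.
    return "".join(IUPAC.get(frozenset(col), "N") for col in columns)


def ambiguous_positions(seq):
    """Return (1-based position, code) for every non-ACGT character."""
    return [(i, base) for i, base in enumerate(seq, start=1)
            if base not in "ACGT"]


# Made-up aligned sense and antisense sequences.
cons = consensus("ACGTA", "ACGCT")
print(cons, ambiguous_positions(cons))
```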
Hands On: Look for ambiguous nucleotides
Click on the output of the Regex Find And Replace tool in the history to expand it
Click on Visualize
Select Multiple Sequence Alignment
Set the color scheme to Clustal; ambiguous nucleotides are highlighted in dark blue
There are two nucleotide positions to check: Y at 121 in sequence consensus_B05_CHD8-III6brother-18 and W at 286 in sequence consensus_05_CHD8-III6mother-18
You need to go back to your FASTQ sequences to understand the origin of the ambiguity
Regex Find And Replace (Galaxy version 1.0.3) with the following parameters:
“Select lines from”: #Consensus #Primer output (output of the Regex Find And Replace tool)
In “Check”:
“Insert Check”
“Find Regex”: ^[ACTG]+([ACTG]{20}Y)[ACTG]+$
“Replacement”: \1
“Insert Check”
“Find Regex”: ^[ACTG]+([ACTG]{20}W)[ACTG]+$
“Replacement”: \1
Comment: What's going on in this step?
We want to retrieve the 20 nucleotides before the ambiguities.
We use regular expressions (Regex):
Regular expressions are a standardized way of describing patterns in textual data. They can be extremely useful for tasks such as finding and replacing data. They can be a bit tricky to master, but learning even just a few of the basics can help you get the most out of Galaxy.
Finding
Below are just a few examples of basic expressions:
Regular expression
Matches
abc
an occurrence of abc within your data
(abc|def)
abcordef
[abc]
a single character which is either a, b, or c
[^abc]
a character that is NOT a, b, nor c
[a-z]
any lowercase letter
[a-zA-Z]
any letter (upper or lower case)
[0-9]
numbers 0-9
\d
any digit (same as [0-9])
\D
any non-digit character
\w
any alphanumeric character
\W
any non-alphanumeric character
\s
any whitespace
\S
any non-whitespace character
.
any character
\.
{x,y}
between x and y repetitions
^
the beginning of the line
$
the end of the line
Note: you see that characters such as *, ?, ., + etc have a special meaning in a regular expression. If you want to match on those characters, you can escape them with a backslash. So \? matches the question mark character exactly.
Examples
Regular expression
matches
\d{4}
4 digits (e.g. a year)
chr\d{1,2}
chr followed by 1 or 2 digits
.*abc$
anything with abc at the end of the line
^$
empty line
^>.*
Line starting with > (e.g. Fasta header)
^[^>].*
Line not starting with > (e.g. Fasta sequence)
Replacing
Sometimes you need to capture the exact value you matched on, in order to use it in your replacement, we do this using capture groups (...), which we can refer to using \1, \2 etc for the first and second captured values. If you want to refer to the whole match, use &.
Regular expression      Input           Captures
chr(\d{1,2})            chr14           \1 = 14
(\d{2}) July (\d{4})    24 July 1984    \1 = 24, \2 = 1984
An expression like s/find/replacement/g is a substitution expression: it substitutes (s) every occurrence of find with replacement. It does this globally (g), which means it doesn’t stop after the first match.
Example: s/chr(\d{1,2})/CHR\1/g will replace chr14 with CHR14 etc.
You can also use replacement modifiers such as \L (convert to lower case) or \U (convert to upper case). Example: s/.*/\U&/g will convert the whole text to upper case.
Note: In Galaxy, you are often asked to provide the find and replacement expressions separately, so you don’t have to use the s/../../g structure.
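As a rough illustration of capture groups outside Galaxy, the sketch below uses Python's standard re module, where the find and replacement expressions are also passed separately. Using Python here is an assumption made for the example, and its flavour differs slightly from the s/../../g syntax above: it has no \U modifier, so upper-casing is done with a replacement function instead. The helper names are hypothetical.

```python
import re

def rename_chromosomes(text):
    # Same idea as s/chr(\d{1,2})/CHR\1/g: \1 re-inserts the captured digits.
    return re.sub(r"chr(\d{1,2})", r"CHR\1", text)

def capture_date(text):
    # Returns the captured (day, year) pair, i.e. what \1 and \2 would hold.
    match = re.search(r"(\d{2}) July (\d{4})", text)
    return match.groups() if match else None

def uppercase_all(text):
    # Python's re has no \U modifier, so a function builds the replacement.
    return re.sub(r".+", lambda m: m.group(0).upper(), text)

print(rename_chromosomes("chr14 and chr7"))
print(capture_date("24 July 1984"))
print(uppercase_all("acgt"))
```

re.sub replaces every match by default, which corresponds to the g (global) flag of the substitution expression.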
There is a lot more you can do with regular expressions, and there are a few different flavours in different tools/programming languages, but these are the most important basics that will already allow you to do many of the tasks you might need in your analysis.
Tip: RegexOne is a nice interactive tutorial to learn the basics of regular expressions.
Tip: Regex101.com is a great resource for interactively testing and constructing your regular expressions; it even provides an explanation of a regular expression if you provide one.
Tip: Cyrilex is a visual regular expression tester.
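In the expressions used in this step, [ACTG] matches any one of the four unambiguous nucleotides. It is followed either by +, meaning “one or more times”, or by {20}, meaning “exactly 20 times”. The parentheses capture the 20 nucleotides immediately preceding the ambiguity code (W in the last check) along with the code itself, and the replacement \1 reduces the whole line to that captured fragment.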
In the output of this tool we get:
- the 20 nucleotides before the Y at position 121 in sequence consensus_B05_CHD8-III6brother-18: CAGGCACGATGTCATCGAAT
- and the 20 nucleotides before the W at position 286 in sequence consensus_05_CHD8-III6mother-18: AGTCCTCTTAGTTTATAGAT
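To see what this find-and-replace does to a single sequence line, here is a minimal Python sketch that applies the same pattern with the standard re module. The input sequence is a made-up toy example (not data from the tutorial) and the function name is hypothetical; the Galaxy tool may differ in details.

```python
import re

def flank_with_ambiguity(sequence, code="W", flank=20):
    # Same pattern as the "Find Regex" field; the ambiguity code is a parameter.
    pattern = rf"^[ACTG]+([ACTG]{{{flank}}}{code})[ACTG]+$"
    # \1 keeps only the captured group: the flank plus the ambiguity code.
    return re.sub(pattern, r"\1", sequence)

# Hypothetical sequence: 30 bases, a 20-base flank, one W, then 15 bases.
toy = "ACGT" * 7 + "AC" + "AGTCCTCTTAGTTTATAGAT" + "W" + "GATTACAGATTACAG"
print(flank_with_ambiguity(toy))
```

A line that contains no W does not match the pattern and is returned unchanged.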
FASTQ masker ( Galaxy version 1.1.5) with the following parameters:
param-collection“File to mask”: #Forward #Reverse collection (output of the FASTQ groomer tool)
“Mask input with”: Lowercase
“Quality score”: 10
This tool displays low-quality bases in lowercase to permit better detection of potential errors.
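The masking idea can be sketched in a few lines of Python. This is not the tool's implementation: it assumes Sanger-encoded qualities (Phred+33) and assumes that bases with a score less than or equal to the threshold are the ones masked. The read used below is invented.

```python
def mask_low_quality(bases, qualities, threshold=10, offset=33):
    masked = []
    for base, qual_char in zip(bases, qualities):
        score = ord(qual_char) - offset  # decode the Phred score (assumed Phred+33)
        # Assumption: scores <= threshold count as low quality.
        masked.append(base.lower() if score <= threshold else base.upper())
    return "".join(masked)

# Toy read with scores 40, 10, 20 and 0.
print(mask_low_quality("ACGT", "I+5!"))
```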
Open the B05_CHD8-III6brother-18 output of the FASTQ masker tool (galaxy-eye icon) and search (Ctrl+F) for CAGGCACGATGTCATCGAAT.
In the sense sequence (ID ending with 18F), this fragment is followed by a low-quality c (lowercase), whereas in the antisense sequence it is followed by a T of decent quality.
Additionally, when looking into the #Consensus #Primer output of the Regex Find and Replace tool (galaxy-eye icon), we can see that the two other consensus sequences (consensus_05_CHD8-III6mother-18 and consensus_07_CHD8-III6-18) have a T at this same position.
Since Y stands for C or T, it seems more likely that the nucleotide at position 121 in sequence consensus_B05_CHD8-III6brother-18 is a T.
Open the 05_CHD8-III6mother-18 output of the FASTQ masker tool (galaxy-eye icon) and search (Ctrl+F) for AGTCCTCTTAGTTTATAGAT.
In the antisense sequence (ID ending with 18R), this fragment is followed by a low-quality t (lowercase), whereas in the sense sequence it is followed by an A of decent quality.
Additionally, when looking into the #Consensus #Primer output of the Regex Find and Replace tool (galaxy-eye icon), we can see that the two other consensus sequences (consensus_B05_CHD8-III6brother-18 and consensus_07_CHD8-III6-18) have an A at this same position.
Since W stands for A or T, it seems more likely that the nucleotide at position 286 in sequence consensus_05_CHD8-III6mother-18 is an A.
You can now correct them: click on the output of the Regex Find and Replace tool in the history to expand it
Click on Visualize (galaxy-barchart icon)
Select Editor and:
manually replace the Y with T in consensus_B05_CHD8-III6brother-18
manually replace the W with A in consensus_05_CHD8-III6mother-18
then click on Export
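The reasoning behind these two corrections can be written down as a small Python sketch: keep only the strand calls that are compatible with the IUPAC ambiguity code, then prefer the one with the higher quality. The function names and the quality scores in the example are hypothetical, and this simple rule is an illustration, not the method of any Galaxy tool.

```python
# IUPAC two-base ambiguity codes.
IUPAC = {
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
}

def resolve_ambiguity(code, strand_calls):
    # strand_calls: (base, phred_score) pairs, one per strand.
    compatible = [(score, base.upper()) for base, score in strand_calls
                  if base.upper() in IUPAC[code]]
    if not compatible:
        raise ValueError(f"no strand call is compatible with {code}")
    return max(compatible)[1]  # base carried by the highest score

def replace_at(sequence, position, new_base):
    # position is 1-based, as in the tutorial text.
    index = position - 1
    return sequence[:index] + new_base + sequence[index + 1:]

# Hypothetical scores: a low-quality c on one strand, a good T on the other.
best = resolve_ambiguity("Y", [("c", 8), ("T", 35)])
print(best)
print(replace_at("GAATYGC", 5, best))
```

In practice the other consensus sequences provide additional evidence, as described above, so a quality-only rule like this one should be treated as a starting point.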
You can now align your sequences with the primers. Afterwards, it is common to cut each sequence between the primers to keep only the targeted fragment.
Hands On: Align sequences and primers
Align sequences ( Galaxy version 1.9.1.0) with the following parameters:
param-file“Input fasta file”: out_file1 Regex Find And Replace (modified)
“Method for aligning sequences”: mafft
“Minimum percent sequence identity to closest blast hit to include sequence in alignment”: 0.1
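For the cutting step mentioned just before this Hands On, a minimal Python sketch is given below. It assumes exact, ungapped primer matches, that the reverse primer is written 5'→3' on the antisense strand (so its reverse complement is what appears in the sense sequence), and that the primers themselves are excluded from the result. All names and sequences are hypothetical; real data usually calls for a more tolerant search guided by the alignment.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(sequence):
    # Complement each base, then reverse the strand.
    return sequence.translate(_COMPLEMENT)[::-1]

def trim_between_primers(sequence, forward_primer, reverse_primer):
    # Locate the forward primer, then the reverse primer's site downstream of it.
    start = sequence.find(forward_primer)
    if start == -1:
        return None
    inner_start = start + len(forward_primer)
    end = sequence.find(reverse_complement(reverse_primer), inner_start)
    if end == -1:
        return None
    return sequence[inner_start:end]  # fragment without the primers

# Toy example: GG + forward primer + target + reverse primer site + AA.
toy = "GG" + "ACGTAC" + "TTTTGGGG" + "CCATGA" + "AA"
print(trim_between_primers(toy, "ACGTAC", "TCATGG"))
```

The function returns None when either primer is not found, which makes missing primer sites easy to spot.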
Check that your sequences belong to the expected taxonomic group by running a BLAST search against the NCBI database
Hands On: NCBI BLAST
NCBI BLAST+ blastn ( Galaxy version 2.10.1+galaxy2) with the following parameters:
param-file“Nucleotide query sequence(s)”: out_file1 (output of the Regex Find And Replace tool)
Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.
References
AOPEP variants as a novel cause of recessive dystonia: Generalized dystonia and dystonia-parkinsonism, 2022 Parkinsonism and related disorders 97: 52–56. 10.1016/j.parkreldis.2022.03.007
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012
@misc{sequence-analysis-Manage_AB1_Sanger,
author = "Coline Royaux",
title = "Clean and manage Sanger sequences from raw files to aligned consensus (Galaxy Training Materials)",
year = "",
month = "",
day = "",
url = "\url{https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/Manage_AB1_Sanger/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
doi = {10.1371/journal.pcbi.1010752},
url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
year = 2023,
month = {jan},
publisher = {Public Library of Science ({PLoS})},
volume = {19},
number = {1},
pages = {e1010752},
author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut},
editor = {Francis Ouellette},
title = {Galaxy Training: A powerful framework for teaching!},
journal = {PLoS Comput Biol}
}
Congratulations on successfully completing this tutorial!
You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.