Comparative gene analysis in unannotated genomes

Authors: AvatarAnton Nekrutenko
  • I have several genomes assemblies that are not annoatated (or I do not trust annotations)

  • I am interested to compare structure of a particular gene across these genome assemblies

  • How do I do that?

  • Provide a quick method for identifying genes of interest in unannotated or newly assembled genomes

Time estimation: 30 minutes
Supporting Materials:
Last modification: Oct 27, 2022
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License The GTN Framework is licensed under MIT


Despite the rapidly increasing number of fully assembled genomes few genomes are well annotated. This is especially true for large eukaryotic genomes with their complex gene structure and abundance of pseudogenes. And of course do not forget about the Murthy’s law: if you are interested in a particular gene the chances are that it will not be annotated in your genome of interest. In this tutorial we will demonstrate how to compare gene structures across a set of vertebrate genomes. So …

Input: What I have
  • I know the gene’s name
  • I know which species I’m interested in
  • I know where to find genomes of these species
Output: What I want
  • I work with a gene X
  • I would like to compare the structure of gene X across N genomes
  • Interactive graphs showing location of the gene across your species of choice. These will allow you to see the absence/presence of the genes across genomes, to detect potential duplications, predogenization events, re-arrangements etc.
  • Phylogenetic trees for individual exons of the gene. The trees will give you an idea of potential unusual evolutionary dynamics for the gene.

In this tutorial, we will deal with:

  1. Introduction
  2. Logic
  3. Example history
  4. Practice
    1. Step 1: Pick a gene and select genomes to analyze
    2. Step 2: Get amino acid translations for all exons of my gene of interest
    3. Step 3: Identify and upload genomes of interest
    4. Step 4: Extract amino acid sequences and genome coordinates for all ORFs
    5. Steps 5, 6, and 7: Finding matches and building trees
    6. Step 7: Looking at the trees
    7. Step 8: Generating and interpreting the comparative genome graph
  5. About the gene


The analysis follow the following logic (also see the following figure):

  1. Pick a gene and select genomes to analyze
  2. Get amino acid translations for all exons of my gene of interest
  3. Identify genomes of interest
  4. Extract amino acid sequences and genome coordinates for all possible open reading frames (ORFs) from all genomes in my set

    The following steps 5, 6, and 7 are performed using a single Galaxy workflow (grey area in the following figure).

  5. Align - Find matches between exons of the gene of interest and ORFs
  6. Intersect - Compute genome coordinates of matches
  7. Build phylogenetic trees for individual exons

    The final step is performed using Jupyter notebook

  8. Create a comparative plot showing distribution of matches across genomes of interest

A schematic diagram of the above workflow.

Example history

This example history contains results of the analysis described in this tutorial.

  1. Open the link to the shared history
  2. Click on the new-history Import history button on the top right
  3. Enter a title for the new history
  4. Click on Import

You can use this history to understand the input datasets as well as outputs of the entire analysis. The key items in the history are labelled with tags:

Input: dataset in the example history
  • EXONS - amino acid translation of exons of the gene of interest (XBP-1)
  • ORF_BED - coordinates of predicted ORFs in the genomes of interest
  • DiamondDB - database and amino acid translations of predicted ORFs in the genomes of interest
Output: in the sample history
  • PlottingData - summary necessary for plotting comparative genome graphs
  • Trees - phylogenetic trees for each exon
Warning: A suggestion!

Importing and looking around this history is very helpful for understanding how this analysis works!


Step 1: Pick a gene and select genomes to analyze

In this example we will compare structure of X-box protein 1 gene (XBP-1) across a set of five vertebrate genomes sequenced by VGP consortium (Rhie et al. 2021).

Step 2: Get amino acid translations for all exons of my gene of interest

This step is manual and is performed outside Galaxy.

The best annotated vertebrate is … human. To obtain amino acid translation of individual exons of XBP-1 you can use UCSC Genome browser. The “UCSC annotations of RefSeq RNAs (NM_* and NR_*)” track shows two alternative transcripts produced by XBP-1 : spliced and unspliced (here ‘spliced’ and ‘unspliced’ do not refer to the “normal’ splicing process. To learn more about this fascinating gene read here).


Clicking on the names of the transcripts open a UCSC page specific to that transcript. From there you can obtain genomic sequences corresponding to all exons. You can then translate these sequences using any available online tool (e.g., NCBI ORFinder).

In this particular case we did not create translation of all exons. Instead, we created translation of exon 1 and two terminal exons. The first exon is shared between the two transcripts. It will allow us to anchor the beginning of the gene. The two terminal exons are different between the two transcripts. Because of this we create two alternative translation per exon: s-p2 and s-p12 for the spliced version and u-p1 and u-p2 for unspliced transcript. A FASTA file containing all translation is shown below:


Step 3: Identify and upload genomes of interest

This tutorial has been initially designed for the analysis of data produced by the Vertebrate Genome Consortium VGP. However, it is equally suitable for any genomic sequences (including prokaryotic ones).

In this section we first show how to upload sample datasets. These datasets were intentionally made small. All subsequent steps of this tutorial are performed using these sample data. We then demonstrate how to upload full size genomes from the VGP data repository called GenomeArk.

Uploading sample data

Hands-on: Sample Data upload
  • Create a new history for this tutorial
  • Import the files from Zenodo using the following URLs:
    • Copy the link location
    • Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)

    • Select Paste/Fetch Data
    • Paste the link into the text field

    • Press Start

    • Close the window

These correspond to fragments of genomes from:

  • aGasCar1 - Gastrophryne carolinensis (Eastern narrow-mouth toad)
  • bTaeGut - Taeniopygia guttata (Zebra finch)
  • fScoJap - Scomber japonicus (Chub mackerel)
  • mCynVol - Cynocephalus volans (Philippine flying lemur)
  • mHomSap - Homo sapiens (Human)

Uploading VGP data from GenomeArk

Galaxy provides a direct connection to GenomeArk from its “Upload Data” tool. To access GenomeArk data you need to:

Hands-on: Upload VGP data from GenomeArk
  • Create a new history for this tutorial
  • Import genome assembly FASTA files from GenomeArk:

    • Open the file galaxy-upload Upload Data menu
    • Click on Choose remote files button at the bottom of the Upload interface
    • Choose Genome Ark
    • Pick assembles you would like to include in your analysis (use Search box on top)

Generally for a given species you want the assembly with the highest version number. For example, Human (Homo sapiens) had three assemblies at the time of writing: mHomSap1, mHomSap2, and mHomSap3. Pick mHomSap3 in this situation.

Inside each assembly folder you will see typically see different sub-folders. The one you should generally be interested in will be called assembly_curated. Inside that folder you will typically see separate assemblies for maternal (indicated with mat) and paternal (indicated with pat) haplotypes. In the analyses here we are choosing as assembly representing heterogametic haplotype (pat in the case of Human).

  • Repeat this process for all assemblies you are interested in.

Combine uploaded sequences into a dataset collection

After uploading data from GenomeArk you will end up with a number of FASTA datasets in your history. To proceed with the analysis you will need to combine these into a single dataset collection:

Hands-on: Combine genome assemblies into a dataset collection

To create a collection from genome assemblies you just uploaded into GenomeArk follow the steps in the video below. Obviously, you only want to click check-boxes next to FASTA files (assemblies) you want to include and leave everything else out.

  • Click on Operations on multiple datasets (check box icon) at the top of the history panel Operations on multiple datasets button
  • Check all the datasets in your history you would like to include
  • Click For all selected.. and choose Build dataset list

    build list collection menu item

  • Enter a name for your collection
  • Click Create List to build your collection
  • Click on the checkmark icon at the top of your history again

Step 4: Extract amino acid sequences and genome coordinates for all ORFs

Extract ORFs from genomic sequences

Now we can go ahead and find all ORFs in all genome assemblies we have bundled into a collection. For this we will use tool called ORFiPy:

Hands-on: Find ORFs witrh ORFiPy

Run ORFiPy Tool: with the following parameters:

  • param-collection “Find ORFs in”: The collection containing genome assemblies we just created (click “Dataset collection” param-collection button on left-side of this input field)
  • “Select options”: check the box next to BED and Peptides
  • “Minimum length of ORFs”: type the number 99. We do not want ORFs to be too short as their number will be too high
  • “Output ORFs bound by Stop codons?”: set this option to Yes.

Because this operation is performed on a dataset collection. it will produce a dataset collection as an output. Actually, it will produce two because we selected BED and Peptides options. One collection will contain amino acid sequences for all ORFs (it will typically be called ORFs on collection ... (FASTA format)) and another their coordinates (called ORFs on collection ... (BED format)) )

This will produce two new dataset collections in your history: one containing coordinates of ORFs and the other containing their amino acid translations. Because genomes are large these will likely be tens of millions of ORFs identified this way.

Creating Diamond database

Because we will be using the Diamond tool to find matches between our gene of interest and ORF translations we need to convert FASTA files into Diamond database using Diamond makedb tool:

Hands-on: Create Diamond database

Run Diamond makedb Tool: on the collection containing amino acid translations of ORFs generated using ORFiPy:

  • param-collection “Input reference file in FASTA format”*: Select collections containing amino acid (FASTA Protein) output of ORFiPy

At this point we have three input datasets that would allow us to find and visualize matches between the gene of interest and genome asseblies:

  1. Diamond database of ORF translations from genome assemblies
  2. Coordinates and frame information about ORFs in BED format
  3. Amino acid translation of exons from the gene of interest

Steps 5, 6, and 7: Finding matches and building trees

To find location of genes, we will use the following workflow that is available as a part of this tutorial. To use this workflow you need to import it into your Galaxy instance.


The workflow takes three inputs:

Input: Workflow inputs
  1. EXONS - Amino acid translation of exons from the gene of interest (Step 2 of this tutorial)
  2. DiamondDB - Diamond database of ORF translations from genome assemblies (Step 4 of this tutorial)
  3. ORF BED - Coordinates and frame information about ORFs in BED format (Step 4 of this tutorial)

It produces two primary outputs:

Output: Results
  1. Trees - Phylogenetic trees for each input exon as Newick file
  2. PlottingData - A summary table of exon matches for each genome

The overall logic of the workflow is as follows:

  1. Perform Diamond search to identify matches between the gene of interest (EXONS) and ORF sets from each genomes (DiamondDB).
  2. Intersect information about the matches with BED file containing ORF coordinates (ORF BED). This allows us to know genomic position of the ORFs and their frames (1, 2, 3 or -1, -2, -3).
  3. Extract matching parts of the sequences and generated multiple alignments using MAFFT. This is done for each amino acid sequence in EXONS. Thus in the case of XBP-1 there will be five sets of alignments.
  4. Build phylogenetic tree for each alignment produced in the previous step using the Neighbor Joining method - a simple and quick way to obtain a rough phylogeny.
  5. Use information from step 1 to compute genome coordinates of matches. This is done by combining the information about genomic positions of ORFs from the ORF BED file and local alignment information generated during step 1.

Step 7: Looking at the trees

After running the workflow phylogenetic trees will be saved into a collection named Join neighbors on .... This collection will also be labelled with tag Trees. To visualize the trees:

Hands-on: Visualize the trees
  • Expand the collection by clicking on it
  • Click on any dataset
  • Click the galaxy-barchart icon: Graph icon.
  • A selection of tree visualization tools will appear in the center pane of the interface
  • Select Phylogenetic Tre Visualization and play with it: The tree.

Step 8: Generating and interpreting the comparative genome graph

Another workflow output will represent a single file summarizing genomic location of matches between each of the genomes in our dataset and amino acid translation of exons from the gene of interest. It will be called Mapping report and will have tag PlottingData associated with it. To plot the data contained in this file we will use external Jupyter notebook (note that Jupyter can be run directly from Galaxy, but to make this tutorial runnable on any Galaxy instance we will use an internal notebook server).

Starting notebook

Hands-on: Choose Your Own Tutorial

This is a "Choose Your Own Tutorial" section, where you can select between multiple paths. Click one of the buttons below to select how you want to follow the tutorial

If you are on a server with Interactive Tool support (e.g. Galaxy Europe, Australia, and many others), you can choose that option for a more integrated Jupyter Notebook experience.

The notebook can be accessed from here. You do need to have a Google Account to be able to use the notebook.

Hands-on: Launch JupyterLab

tip Tip: Launch JupyterLab in Galaxy

Currently JupyterLab in Galaxy is available on, and

hands_on Hands-on: Run JupyterLab

  1. Interactive Jupyter Notebook Tool: interactive_tool_jupyter_notebook :
  2. Click Execute
  3. The tool will start running and will stay running permanently
  4. Click on the User menu at the top and go to Active Interactive Tools and locate the JupyterLab instance you started.
  5. Click on your JupyterLab instance

tip Tip: Launch Try JupyterLab if not available on Galaxy

If JupyterLab is not available on the Galaxy instance:

  1. Start Try JupyterLab
  1. Once the notebook has started, open a Terminal in Jupyter File → New → Terminal

  2. Run the following command:

  3. Switch to the Notebook that appears in the list of files on the left

Proving data

The data is provided to the notebook by setting the dataset_url parameter in this cell:

# Paste link to the dataset here
dataset_url = ''

This is a URL pointing to one of the workflow outputs: Mapping report with the PlottingData tag. To copy the URL click on the galaxy-link icon adjacent to the dataset:

Copying the link.

Running the notebook will generate two graphs explained in the next section.

Interpreting the graphs

Analysis of sample data associated with this tutorial will produce the genome graph shown below. In this graph the Y-axis represents ORFs on positive (1, 2, 3 in red color) and negative (-1, -2, -3 in blue color) strands. The X-axis is genomic coordinates. Boxes represent matches between amino acid sequences of exons and ORFs they are superimposed to. The color of boxes reflect the extent of amino acid identity. The color key is shown in the left upper corner of the plot. The image is interactive so you can zoom in and out.

One of the interesting things you can immediate see in this graph is that the Philippine flying lemur (mCynVol) contains two copies of XBP-1 located on scaffold_2 and scaffold_3. Furthermore, the copy located on scaffold_2 looks like a retransposed duplicate of the gene because it is shorter and likely missing some of the introns – a hallmark of retrotransposition.

Another plot currently produced by the notebook is a summary plot showing the distribution of matches across species and their amino acid identity:

Copying the link.

About the gene

XBP-1 is one of a few vertebrate gene containing overlapping reading frames encoding different proteins within a single gene (Chung et al. 2007). In this case the switch between two reading frames occurs when a highly specialized RNA endonuclease, IRE1, excises a 26 nucleotide spacer from XBP-1 mRNA (Nekrutenko and He 2006). This converts the so-called “unspliced” form of the transcript (XBP-1U) to the “spliced” form XBP-1S (Note that the term “spliced” is misleading here. The XBP-1U is already processed by splicing machinery and contains no canonical introns. The “spliced” is simply used to indicate the removal of the 26 nucleotide spacer). Because the 26 is not divisible by three, the XBP-1S transcript is translated in a different frame from the point of cleavage. The activation of IRE1 and resulting removal of the spacer is triggered by presence of unfolded proteins in the endoplasmic reticulum and the IRE1-XBP1 pathway is one of the three major unfolded protein response systems in higher eukaryotes (Chen and Cubillos-Ruiz 2021) that is conserved all the way to yeast (with XBP-1’s homologue HAC-1; Calfon et al. 2002 ) .

For questions about using Galaxy, you can ask in the Galaxy help forum.

Key points
  • You can locate gene of interest in unannotated genomes and compare its structure across multiple species

Frequently Asked Questions

Have questions about this tutorial? Check out the FAQ page for the Genome Annotation topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum


  1. Calfon, M., H. Zeng, F. Urano, and J. H. Till, 2002 IRE1 couples endoplasmic reticulum load to secretory capacity by processing the XBP-1 mRNA. ….
  2. Nekrutenko, A., and J. He, 2006 Functionality of unspliced XBP1 is required to explain evolution of overlapping reading frames. Trends in Genetics 22: 645–648.
  3. Chung, W.-Y., S. Wadhawan, R. Szklarczyk, S. K. Pond, and A. Nekrutenko, 2007 A first look at ARFome: dual-coding genes in mammalian genomes. PLoS Comput. Biol. 3: e91.
  4. Chen, X., and J. R. Cubillos-Ruiz, 2021 Endoplasmic reticulum stress signals in the tumour and its microenvironment. Nat. Rev. Cancer 21: 71–88.
  5. Rhie, A., S. A. McCarthy, O. Fedrigo, J. Damas, G. Formenti et al., 2021 Towards complete and error-free genome assemblies of all vertebrate species. Nature 592: 737–746.


Open Reading Frame


Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Click here to load Google feedback frame

Citing this Tutorial

  1. Anton Nekrutenko, Comparative gene analysis in unannotated genomes (Galaxy Training Materials). Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

author = "Anton Nekrutenko",
title = "Comparative gene analysis in unannotated genomes (Galaxy Training Materials)",
year = "",
month = "",
day = ""
url = "\url{}",
note = "[Online; accessed TODAY]"
    doi = {10.1016/j.cels.2018.05.012},
    url = {},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}

Congratulations on successfully completing this tutorial!