Genome annotation with Helixer

Overview
Questions:
  • How to annotate a eukaryotic genome with Helixer?

  • How to evaluate and visualize annotated genomic features?

Objectives:
  • Load genome into Galaxy

  • Annotate genome with Helixer

  • Evaluate annotation quality with BUSCO

  • View annotations in JBrowse

Time estimation: 4 hours
Level: Intermediate
Published: Aug 21, 2024
Last modification: Dec 5, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00451
Revision: 6

Annotating a eukaryotic genome is a somewhat more complex challenge than annotating a prokaryotic one, mainly because eukaryotic genomes are generally larger and contain more genes, but also because of the complexity of eukaryotic gene structure (e.g. exons and untranslated regions (UTRs)). This annotation can be carried out at different levels of precision, ranging from simple identification of coding and non-coding parts to detailed structural labeling, including for example the precise location of exons, introns and regulatory elements.

In this tutorial we will use a software tool called Helixer to annotate the genome sequence of a small eukaryote: Mucor mucedo (a fungal plant pathogen).

Helixer is an annotation tool with a new and different approach: it performs evidence-free predictions (no need for RNA-Seq data or sequence alignments) and runs on a Graphics Processing Unit (GPU), which gives a much faster execution time. The annotation relies on cross-species deep learning models. The software can also be used to configure and train such models for ab initio prediction of gene structure. In other words, it identifies which base pairs in a genome belong to the UTRs, CDS or introns of genes.

In this tutorial, you’ll learn how to perform a structural annotation of the genome and how to assess its quality.

Agenda

In this tutorial, we will cover:

  1. Data upload
  2. Structural annotation
  3. Quality evaluation
    1. General statistics
    2. Evaluation with BUSCO
    3. Evaluation with OMArk
  4. Visualisation with a genome browser
  5. Conclusion

Data upload

To annotate our genome using Helixer, we will use the following files:

  • The genome sequence in FASTA format. For this tutorial, we will try to annotate the genome assembled in the Flye assembly tutorial. (Note: Helixer will ignore soft-masking. Hard-masking is not recommended for Helixer either: it does not skip hard-masked regions, but gets less information from them, which could negatively affect your predictions.)
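Before running an annotation, it can be useful to know how much of the input is masked. The following is a minimal sketch, assuming the usual convention that soft-masked bases are written in lowercase and hard-masked bases as `N`. The function name `masking_fractions` and the toy sequence are made up for illustration.

```python
from collections import Counter


def masking_fractions(fasta_text):
    """Return (soft_masked_fraction, hard_masked_fraction) for FASTA text.

    Assumes soft-masked bases are lowercase and hard-masked bases are 'N'.
    """
    counts = Counter()
    for line in fasta_text.splitlines():
        if line.startswith(">") or not line.strip():
            continue  # skip header lines and blank lines
        counts.update(line.strip())
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0
    soft = sum(n for base, n in counts.items() if base.islower())
    hard = counts["N"]
    return soft / total, hard / total


# Toy sequence (invented): 4 lowercase bases and 4 N out of 20 bases.
example = ">contig_1\nACGTacgtNNNN\nACGTACGT\n"
soft, hard = masking_fractions(example)
print(f"soft-masked: {soft:.1%}, hard-masked: {hard:.1%}")
```

For a real genome you would read the file line by line rather than loading it into a single string.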
Hands-on: Data upload
  1. Create a new history for this tutorial

    To create a new history simply click the new-history icon at the top of the history panel:

    UI for creating new history

  2. Import the files from Zenodo or from the shared data library (GTN - Material -> genome-annotation -> Genome annotation with Helixer):

    https://zenodo.org/record/7867921/files/genome_masked.fasta
    
    • Copy the link location
    • Click Upload Data at the top of the tool panel

    • Select Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    1. Go into Data (top panel) then Data libraries
    2. Navigate to the correct folder as indicated by your instructor.
      • On most Galaxies, tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.
    3. Select the desired files
    4. Click on Add to History near the top and select as Datasets from the dropdown menu
    5. In the pop-up window, choose

      • “Select history”: the history you want to import the data to (or create a new one)
    6. Click on Import

Structural annotation

We can run Helixer to perform the structural annotation of the genome.

We need to input the genome sequence we want to annotate.

We also need to choose between 4 different lineages: invertebrate, vertebrate, land plant or fungi. Select the one that best fits the species you’re studying: fungi in our case. Helixer ships with these 4 models, each trained specifically to annotate genes from one of these lineages. Advanced users can upload their own lineage model in .h5 format with the “Lineage model” option.

As an option, we can also enter a species name.

Hands-on
  1. Helixer (Galaxy version 0.3.3+galaxy1) with the following parameters:
    • “Genomic sequence”: genome_masked.fasta (Input dataset)
    • In “Available lineages”: select fungi
    • In “Species name”: Mucor mucedo
Comment: Advanced parameters

Depending on the lineage, the parameters “Subsequence length”, “Overlap offset” and “Overlap corelength” are adjusted to corresponding default values (listed in the help of each option).

This is due in particular to the size of the genomes. Indeed, it is recommended to increase the value of “Subsequence length” for genomes containing large genes. This is particularly important for vertebrates and invertebrates.

The default values used by Galaxy are the ones recommended by Helixer authors. If you wish to modify these default values, you can do so by entering your values in the “Subsequence length”, “Overlap offset” and “Overlap corelength” parameters.
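To build an intuition for what a subsequence length and an overlap offset mean, here is a generic illustration of cutting a sequence into fixed-length windows that start at regular offsets. This is only a sketch of the general sliding-window idea, not Helixer's actual implementation. The function name `window_starts` and the small numbers are invented for the example.

```python
def window_starts(seq_len, subsequence_length, overlap_offset):
    """Return 0-based start positions of fixed-length, overlapping windows.

    Generic sliding-window sketch (an assumption for illustration only):
    a new window starts every `overlap_offset` bases, and a final window is
    added if needed so that the end of the sequence is covered.
    """
    if not 0 < overlap_offset <= subsequence_length:
        raise ValueError("overlap_offset must be in 1..subsequence_length")
    last_start = max(seq_len - subsequence_length, 0)
    starts = list(range(0, last_start + 1, overlap_offset))
    if starts[-1] != last_start:
        starts.append(last_start)  # cover the tail of the sequence
    return starts


# Hypothetical toy values, far smaller than any real setting.
for start in window_starts(seq_len=23, subsequence_length=10, overlap_offset=5):
    print(f"window covers bases {start + 1}-{min(start + 10, 23)}")
```

With overlapping windows, every base away from the sequence ends is seen in more than one window, which is why a longer subsequence length helps with large genes.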

Comment: Don't wait

This step can take a bit of time to run: although Helixer runs much faster than many other annotation tools (typically <20 min for this tutorial), it requires specific hardware (GPU) that is often available in limited quantity on computing systems. This means your job may wait in the queue longer than a more standard Galaxy job.

While it runs, we can already schedule the following steps. Galaxy will run them automatically as soon as the Helixer annotation is ready.

Helixer produces a single output dataset: a GFF3 file. The GFF3 format is a standard bioinformatics format for storing genome annotations. Each row describes a genomic entity, with columns detailing its identifier, location, score and other attributes.
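As a rough illustration of the format, the sketch below splits one GFF3 row into its nine tab-separated columns and decodes the `key=value` attributes. The example row is invented for this illustration and is not taken from the actual Helixer output.

```python
from urllib.parse import unquote

GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict (coordinates are 1-based)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = unquote(value)  # GFF3 attributes are URL-encoded
    record["attributes"] = attributes
    return record


# Hypothetical row, for illustration only.
row = "contig_1\tHelixer\tgene\t1200\t3400\t.\t+\t.\tID=gene_000001"
feature = parse_gff3_line(row)
print(feature["type"], feature["start"], feature["end"], feature["attributes"]["ID"])
```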

Quality evaluation

General statistics

Genome Annotation Statistics is a program designed to analyze and provide statistics on genomic annotations. This software performs its analyses from a GFF3 file.

Hands-on: Genome Annotation Statistics
  1. Genome Annotation Statistics (Galaxy version 0.8.4) with the following parameters:
    • “Annotation to analyse”: GFF3 file (Output of Helixer)
    • In “Reference genome”: select Use a genome from history
    • In “Corresponding genome sequence”: genome_masked.fasta (Input dataset)

Two output files are generated:

  • a file containing graphs in pdf format
  • a summary in txt format
Comment: What can we deduce from these results?
  • The summary file provides statistics on the genome annotation and gives a complete overview of the genomic structure and characteristics of the genes, exons and introns in the analysed genome.
  • We can see that there are 19,299 genes, 77% of which are multi-exons (i.e. 14,860) and 23% single-exons (i.e. 4,439).
  • We can obtain other information such as the average size of exons, the percentage in GC or the average size of transcripts.

These statistics are interesting on their own: you often have a rough idea of the expected number of genes or mean length when annotating a new genome, by comparing with published similar species. You can also use them to compare the quality of annotations produced by different tools.
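The single-exon versus multi-exon split can be sketched directly from a GFF3 file. The example below is a simplified illustration, not the method used by the Genome Annotation Statistics tool: it counts `exon` features per parent transcript, which matches a per-gene count only when each gene has a single transcript. The toy annotation is invented. The last lines simply re-derive the percentages quoted above from the reported counts.

```python
from collections import Counter


def exon_count_summary(gff3_text):
    """Return (single_exon, multi_exon) transcript counts from GFF3 text."""
    exons_per_transcript = Counter()
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        attrs = dict(p.split("=", 1) for p in fields[8].split(";") if "=" in p)
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons_per_transcript[parent] += 1
    single = sum(1 for n in exons_per_transcript.values() if n == 1)
    multi = len(exons_per_transcript) - single
    return single, multi


# Invented toy annotation: transcript t1 has two exons, t2 has one.
toy = "\n".join([
    "##gff-version 3",
    "c1\ttoy\texon\t1\t100\t.\t+\t.\tID=e1;Parent=t1",
    "c1\ttoy\texon\t200\t300\t.\t+\t.\tID=e2;Parent=t1",
    "c1\ttoy\texon\t500\t900\t.\t-\t.\tID=e3;Parent=t2",
])
print(exon_count_summary(toy))

# Re-deriving the percentages from the counts reported in the summary.
total, multi_exon, single_exon = 19299, 14860, 4439
print(f"multi-exon: {multi_exon / total:.0%}, single-exon: {single_exon / total:.0%}")
```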

Evaluation with BUSCO

BUSCO (Benchmarking Universal Single-Copy Orthologs) is a tool for evaluating the quality of a genome assembly or of a genome annotation. By comparing genomes from various more or less related species, the authors determined sets of orthologous genes that are present in single copy in (almost) all the species of a clade (Bacteria, Fungi, Plants, Insects, Mammals, …). Most of these genes are essential for the organism to live, and are expected to be found in any newly sequenced and annotated genome from the corresponding clade. Using this data, BUSCO is able to evaluate the proportion of these essential genes (also named BUSCOs) found in a set of (predicted) transcript or protein sequences. This is a good evaluation of the “completeness” of the annotation.

For genome sequences only, an alternative is compleasm, which uses the same BUSCO gene sets; it is a bit more sensitive and thus finds slightly more conserved genes.

We want to run BUSCO on the protein sequences predicted from gene sequences of the Helixer annotation. So first generate these sequences:

Hands-on: Extract protein sequences
  1. GFFread (Galaxy version 2.2.1.4+galaxy0) with the following parameters:
    • “Input GFF3 or GTF feature file”: output of Helixer (Galaxy version 0.3.3+galaxy1)
    • In “Reference Genome” select: From your history (Input dataset)
    • “Genome Reference Fasta”: genome_masked.fasta (Input dataset)
    • In “Select fasta outputs” select: protein fasta file with the translation of CDS for each record (-y)
    • “full GFF attribute preservation (all attributes are shown)”: Yes
    • “decode url encoded characters within attributes”: Yes
    • “warn about duplicate transcript IDs and other potential problems with the given GFF/GTF records”: Yes

To run BUSCO on these protein sequences:

Hands-on: BUSCO in proteome mode
  1. Busco (Galaxy version 5.5.0+galaxy0) with the following parameters:
    • “Sequences to analyse”: gffread: pep.fa
    • “Mode”: annotated gene sets (protein)
    • “Auto-detect or select lineage?”: Select lineage
      • “Lineage”: Mucorales
    • In “Advanced Options”:
      • “Which outputs should be generated”: select all outputs

Several output files are generated:

  • short summary: statistical summary of the quality of the genome assembly or annotation, including total number of genes evaluated, percentage of complete genes, percentage of partial genes, etc.
  • full table: list of universal orthologs found in the assembled or annotated genome, with information on their completeness, location in the genome, quality score, etc.
  • missing buscos: list of orthologs not found in the genome, which may indicate gaps in assembly or annotation.
  • summary image: graphics and visualizations of the evaluation results, such as bar charts showing the proportion of complete, partial and missing genes.
  • GFF: contains information on gene locations, exons, introns, etc.

This gives information about the completeness of the Helixer annotation. A good idea is to compare this first result with the one you get on the initial genome sequence, and see if the annotation tool found all the genes that BUSCO finds in the raw genome sequence. So run BUSCO in genome mode:

Hands-on: BUSCO in genome mode
  1. Busco (Galaxy version 5.5.0+galaxy0) with the following parameters:
    • “Sequences to analyse”: genome_masked.fasta (Input dataset)
    • “Mode”: Genome assemblies (DNA)
    • “Auto-detect or select lineage?”: Select lineage
      • “Lineage”: Mucorales
    • In “Advanced Options”:
      • “Which outputs should be generated”: select all outputs
Comment: What can we deduce from these results?
  • 94.6% of the BUSCO genes are found complete, so the annotation is of high quality in terms of genomic completeness.
  • It is a little bit lower than what BUSCO is able to find in genome mode (95.7%), but the difference is quite small so Helixer seems to have generated a quite good result.
  • The duplication rate is low, with 1.3% and 1.5% of genes duplicated.
  • So the Helixer annotation looks like a good one, with high completeness and low duplication.
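If you want to compare several BUSCO runs programmatically, the one-line notation of the short summary can be parsed with a regular expression. This is a minimal sketch, assuming the summary line follows the `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` notation. In the example string, only C and D (and S, derived as C minus D) come from the figures quoted above; the F, M and n values are invented placeholders.

```python
import re

BUSCO_LINE = re.compile(
    r"C:(?P<complete>[\d.]+)%\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Extract percentages and the number of BUSCOs from a one-line summary."""
    match = BUSCO_LINE.search(line)
    if match is None:
        raise ValueError("no BUSCO summary notation found")
    result = {key: float(value) for key, value in match.groupdict().items()}
    result["n"] = int(match.group("n"))
    return result


# F, M and n below are placeholders, NOT the real values for this tutorial.
protein_mode = parse_busco_summary("C:94.6%[S:93.3%,D:1.3%],F:2.0%,M:3.4%,n:1000")
genome_mode_complete = 95.7  # completeness reported in genome mode

gap = round(genome_mode_complete - protein_mode["complete"], 1)
print(f"annotation finds {gap} percentage points fewer complete BUSCOs")
```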

Evaluation with OMArk

OMArk is proteome quality assessment software. It provides measures of proteome completeness, characterises the consistency of all protein-coding genes with their homologues and identifies the presence of contamination by other species. OMArk is based on the OMA orthology database, from which it exploits orthology relationships, and on the OMAmer software for rapid placement of all proteins in gene families.

OMArk’s analysis is based on HOGs (Hierarchical Orthologous Groups), which play a central role in its assessment of the completeness and coherence of gene sets. HOGs make it possible to compare the genes of a given species with groups of orthologous genes conserved across a taxonomic clade.

Hands-on: OMArk on extracted protein sequences
  1. OMArk (Galaxy version 0.3.0+galaxy2) with the following parameters:
    • “Protein sequences”: gffread: pep.fa
    • “OMAmer database”: select LUCA-v2.0.0
    • In “Which outputs should be generated”: select Detailed summary

The OMArk tool generated an output file in .txt format containing detailed information on the assessment of the completeness, consistency and species composition of the proteome analysed. This report includes statistics on conserved genes, the proportion of duplications, missing genes and the identification of reference lineages.

Comment: What can we deduce from these results?
  • Number of conserved HOGs: OMArk has identified a set of 5622 HOGs which are thought to be conserved in the majority of species in the Mucorineae clade.
  • 85.52% of genes are complete, so the annotation is of good quality in terms of genomic completeness.
  • Number of proteins in the whole proteome: 19,299, of which 62.83% are present, while 30.94% do not share sufficient similarity with known gene families.
  • No contamination detected.
  • The OMArk analysis is based on the Mucorineae lineage, a more recent and specific clade than that used in the BUSCO assessment, which selected the Mucorales as the reference group.
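To make the percentages above more concrete, they can be converted back into approximate protein counts. This is simple arithmetic on the figures quoted in the comment; the helper name `percent_to_count` is made up, and the counts are rounded estimates rather than values read from the OMArk report.

```python
def percent_to_count(total, percent):
    """Convert a percentage of `total` into a rounded count."""
    return round(total * percent / 100)


total_proteins = 19299
present = percent_to_count(total_proteins, 62.83)
unknown = percent_to_count(total_proteins, 30.94)
# Share of the proteome not covered by the two figures quoted above.
remainder_percent = round(100 - 62.83 - 30.94, 2)

print(f"present: ~{present} proteins")
print(f"no sufficient similarity to known families: ~{unknown} proteins")
print(f"not covered by these two figures: {remainder_percent}%")
```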

Visualisation with a genome browser

You can visualize the annotation generated using a genomic browser like JBrowse. This browser enables you to navigate along the chromosomes of the genome and view the structure of each predicted gene.

Hands-on: JBrowse visualisation
  1. JBrowse (Galaxy version 1.16.11+galaxy1) with the following parameters:
    • “Reference genome to display”: Use a genome from history
      • “Select the reference genome”: genome_masked.fasta (Input dataset)
    • In “Track Group”:
      • “Insert Track Group”
        • “Track Category”: Annotation
        • In “Annotation Track”:
          • “Insert Annotation Track”
            • “Track Type”: GFF/GFF3/BED Features
              • “GFF/GFF3/BED Track Data”: gff3 (output of Helixer tool)

Click on the newly created dataset’s eye to display it. You will see a JBrowse genome browser. You can have a look at the JBrowse tutorial for a more in-depth description of JBrowse.

Conclusion

Congratulations on reaching the end of this tutorial! You now know how to perform a structural annotation of a new eukaryotic genome, using Helixer. And you’ve learned how to evaluate its quality and how to visualize it using JBrowse.

If you’d like to complete this annotation, we recommend following the tutorial on functional annotation with EggNOG Mapper and InterProScan. You can follow it using the protein sequences we generated earlier with gffread.