Genome annotation with Maker (short)

Author(s): Anthony Bretaudeau
Reviewers: Björn Grüning, Bérénice Batut, Anthony Bretaudeau, Helena Rasche
Overview
Questions:
  • How to annotate a eukaryotic genome?

  • How to evaluate and visualize annotated genomic features?

Objectives:
  • Load genome into Galaxy

  • Annotate genome with Maker

  • Evaluate annotation quality with BUSCO

  • View annotations in JBrowse

Requirements:
Time estimation: 2 hours
Level: Intermediate
Supporting Materials:
Published: Jan 12, 2021
Last modification: Feb 29, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00167
Rating: 4.3 (0 recent ratings, 3 all time)
Revision: 3

Genome annotation of eukaryotes is a little more complicated than for prokaryotes: eukaryotic genomes are usually larger than prokaryotic ones, with more genes. The sequences marking the beginning and the end of a gene are generally less conserved than in prokaryotes. Many genes also contain introns, and the limits of these introns (acceptor and donor sites) are not highly conserved.

In this tutorial we will use a software tool called Maker (Campbell et al. 2014) to annotate the genome sequence of a small eukaryote: Schizosaccharomyces pombe (a yeast).

Maker is able to annotate both prokaryotes and eukaryotes. It works by aligning as much evidence as possible along the genome sequence, and then reconciling all these signals to determine probable gene structures.

The evidence can be transcript or protein sequences from the same (or a closely related) organism. These sequences can come from public databases (like NR or GenBank) or from your own experimental data (a transcriptome assembly from an RNASeq experiment, for example). Maker is also able to take repeated elements into account.

Maker uses ab-initio predictors (like Augustus or SNAP) to improve its predictions: these software tools are able to make gene structure predictions by analysing only the genome sequence with a statistical model.

In this tutorial you will learn how to perform a genome annotation, and how to evaluate its quality. Finally, you will learn how to use the JBrowse genome browser to visualise the results.

More information about Maker can be found on its website.

This tutorial was inspired by the MAKER Tutorial for WGS Assembly and Annotation Winter School 2018. Don’t hesitate to consult it for more information on Maker, and on how to run it from the command line.

Comment: Two versions of this tutorial

Because this tutorial consists of many steps, we have made two versions of it, one long and one short.

This is the shortened version. We will skip the training of ab-initio predictors and use pre-trained data instead. We will also annotate only the third chromosome of the genome. If you would like to learn how to perform the training steps, please see the longer version of this tutorial.

Agenda

In this tutorial, we will cover:

  1. Data upload
  2. Genome quality evaluation
  3. Maker
  4. Annotation statistics
  5. Busco
  6. Improving gene naming
  7. Visualising the results
  8. Conclusion
  9. What’s next?

Data upload

To annotate a genome using Maker, you need the following files:

  • The genome sequence in fasta format
  • A set of transcripts or EST sequences, preferably from the same organism.
  • A set of protein sequences, usually from closely related species or from a curated sequence database like UniProt/SwissProt.

Maker will align the transcript and protein sequences on the genome sequence to determine gene positions.

Hands-on: Data upload
  1. Create and name a new history for this tutorial.

    To create a new history simply click the new-history icon at the top of the history panel:

  2. Import the following files from Zenodo or from the shared data library

    https://zenodo.org/records/4406623/files/S_pombe_chrIII.fasta?download=1
    https://zenodo.org/records/4406623/files/S_pombe_trinity_assembly.fasta?download=1
    https://zenodo.org/records/4406623/files/Swissprot_no_S_pombe.fasta?download=1
    https://zenodo.org/records/4406623/files/augustus_training_2.tar.gz?download=1
    https://zenodo.org/records/4406623/files/snap_training_2.snaphmm?download=1
    
    • Copy the link location
    • Click Upload Data at the top of the tool panel
    • Select Paste/Fetch Data
    • Paste the link(s) into the text field
    • Press Start
    • Close the window

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    1. Go into Data (top panel) then Data libraries
    2. Navigate to the correct folder as indicated by your instructor.
      • On most Galaxies tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.
    3. Select the desired files
    4. Click on Add to History near the top and select as Datasets from the dropdown menu
    5. In the pop-up window, choose
      • “Select history”: the history you want to import the data to (or create a new one)
    6. Click on Import

  3. Rename the datasets
  4. Check that the datatype for augustus_training_2.tar.gz is set to augustus

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, click the Datatypes tab on the top
    • In Assign Datatype, select augustus from the “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

You have the following main datasets:

  • S_pombe_trinity_assembly.fasta contains EST sequences from S. pombe, assembled from RNASeq data with Trinity
  • Swissprot_no_S_pombe.fasta contains a subset of the SwissProt protein sequence database (public sequences from S. pombe were removed to stay as close as possible to real-life analysis)
  • S_pombe_chrIII.fasta contains only the third chromosome from the full genome of S. pombe

The other datasets will be used later in the tutorial.
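
Step 3 of the hands-on asks us to rename the datasets. As a minimal sketch (not part of the original tutorial), the dataset names used in the rest of this tutorial can be derived from the Zenodo links by keeping only the file name and dropping the ?download=1 suffix. The helper name dataset_name is ours, for illustration only.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# The five Zenodo links listed in the hands-on above.
ZENODO_URLS = [
    "https://zenodo.org/records/4406623/files/S_pombe_chrIII.fasta?download=1",
    "https://zenodo.org/records/4406623/files/S_pombe_trinity_assembly.fasta?download=1",
    "https://zenodo.org/records/4406623/files/Swissprot_no_S_pombe.fasta?download=1",
    "https://zenodo.org/records/4406623/files/augustus_training_2.tar.gz?download=1",
    "https://zenodo.org/records/4406623/files/snap_training_2.snaphmm?download=1",
]


def dataset_name(url: str) -> str:
    """Return the file name part of a URL, ignoring the query string."""
    return PurePosixPath(urlparse(url).path).name


if __name__ == "__main__":
    for url in ZENODO_URLS:
        print(dataset_name(url))
```

Running the script prints one clean dataset name per link, which you can copy into the Galaxy rename field.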

Genome quality evaluation

The quality of a genome annotation is highly dependent on the quality of the genome sequences. It is impossible to obtain a good quality annotation with a poorly assembled genome sequence. Annotation tools will have trouble finding genes if the genome sequence is highly fragmented, if it contains chimeric sequences, or if there are a lot of sequencing errors.

Before running the full annotation process, you first need to evaluate the quality of the sequence. It will give you a good idea of what you can expect from it at the end of the annotation.

Hands-on: Get genome sequence statistics
  1. Fasta Statistics (Galaxy version 1.0.1) with the following parameters:
    • “fasta or multifasta file”: select S_pombe_chrIII.fasta from your history

Have a look at the statistics:

  • num_seq: the number of contigs (or scaffolds or chromosomes); compare it to the expected number of chromosomes
  • len_min, len_max, len_N50, len_mean, len_median: the distribution of contig sizes
  • num_bp_not_N: the number of bases that are not N; it should be as close as possible to the total number of bases (num_bp)
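
To make these numbers concrete, here is a minimal Python sketch that computes a few of them from FASTA lines. It is not the code of the Fasta Statistics tool: the function names are ours, and the N50 definition used here (the length of the sequence at which the running sum of lengths, sorted from longest to shortest, reaches half of the total) is an assumption about what len_N50 reports.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n50(lengths):
    """Length at which the cumulative sum (longest first) reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def fasta_stats(lines):
    """Compute a subset of the statistics listed above."""
    seqs = [seq for _, seq in read_fasta(lines)]
    lengths = [len(seq) for seq in seqs]
    num_bp = sum(lengths)
    num_n = sum(seq.upper().count("N") for seq in seqs)
    return {
        "num_seq": len(seqs),
        "num_bp": num_bp,
        "num_bp_not_N": num_bp - num_n,
        "len_min": min(lengths, default=0),
        "len_max": max(lengths, default=0),
        "len_N50": n50(lengths),
        "len_mean": num_bp / len(seqs) if seqs else 0.0,
    }


if __name__ == "__main__":
    toy = [">a", "ACGTNN", "AC", ">b", "GGGG", ">c", "TT"]
    print(fasta_stats(toy))
```

On a real file you would pass an open file handle instead of the toy list, for example fasta_stats(open("S_pombe_chrIII.fasta")).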

These statistics are useful to detect obvious problems in the genome assembly, but they give no information about the quality of the sequence content. We want to evaluate whether the genome sequence contains all the genes we expect to find in the considered species, and whether their sequences are correct.

Comment

Keep in mind that we are running this tutorial only on chromosome III instead of the whole genome.

BUSCO (Benchmarking Universal Single-Copy Orthologs) is a tool that helps answer this question: by comparing genomes from various more or less related species, the authors determined sets of orthologous genes that are present in single copy in (almost) all the species of a clade (Bacteria, Fungi, Plants, Insects, Mammals, …). Most of these genes are essential for the organism to live, and are expected to be found in any newly sequenced genome from the corresponding clade. Using this data, BUSCO is able to evaluate the proportion of these essential genes (also named BUSCOs) found in a genome sequence or in a set of (predicted) transcript or protein sequences. This is a good evaluation of the “completeness” of the genome or annotation.

We will first run this tool on the genome sequence to evaluate its quality.

Hands-on: Run Busco on the genome
  1. Busco (Galaxy version 4.1.4) with the following parameters:
    • “Sequences to analyse”: select S_pombe_chrIII.fasta from your history
    • “Mode”: Genome
    • “Lineage”: Fungi
    Comment

    We select Fungi as we will annotate the genome of Schizosaccharomyces pombe, which belongs to the Fungi kingdom. It is usually better to select the most specific lineage for the species you study. Large lineages (like Metazoa) will consist of fewer genes, but with strong support. More specific lineages (like Hymenoptera) will have more genes, but with weaker support (as they are found in fewer genomes).

BUSCO produces three output datasets:

  • A short summary: summarizes the results of BUSCO (see below)
  • A full table: lists all the BUSCOs that were searched for, with the corresponding status (was it found in the genome? how many times? where?)
  • A table of missing BUSCOs: this is the list of all genes that were not found in the genome

Figure: BUSCO genome summary

Question

Do you think the genome quality is good enough for performing the annotation?

The genome consists of the expected number of chromosome sequences (1), with very few Ns, which is the ideal case. As we only analysed chromosome III, many BUSCO genes are missing, but ~100 are still found as complete single-copy genes, and very few are fragmented, which means that our genome is of good quality, at least on this single chromosome. This is very good material for an annotation.

Comment

Keep in mind that we are running this tutorial only on chromosome III instead of the whole genome. The BUSCO result will also show a lot of missing genes: this is expected, as the BUSCO genes that are not on chromosome III cannot be found by the tool.

Maker

Let’s run Maker to predict gene models! Maker will align ESTs and proteins to the genome, and it will run ab initio predictors (SNAP and Augustus) using pre-trained models for this organism (have a look at the longer version of this tutorial to understand how they were trained).

Hands-on: Annotation with Maker
  1. Maker (Galaxy version 2.31.11) with the following parameters:
    • “Genome to annotate”: select S_pombe_chrIII.fasta from your history
    • “Organism type”: Eukaryotic
    • “Re-annotate using an existing Maker annotation”: No
    • In “EST evidences (for best results provide at least one of these)”:
      • “ESTs or assembled cDNA”: S_pombe_trinity_assembly.fasta
    • In “Protein evidences (for best results provide at least one of these)”:
      • “Protein sequences”: Swissprot_no_S_pombe.fasta
    • In “Ab-initio gene prediction”:
      • “SNAP model”: snap_training_2.snaphmm
      • “Prediction with Augustus”: Run Augustus with a custom prediction model
        • “Augustus model”: augustus_training_2.tar.gz
    • In “Repeat masking”:
      • “Repeat library source”: Disable repeat masking (not recommended)
    Comment

    For this tutorial repeat masking is disabled, which is not the recommended setting. When doing a real-life annotation, you should either use Dfam or provide your own repeats library.

Maker produces three GFF3 datasets:

  • The final annotation: the final consensus gene models produced by Maker
  • The evidences: the alignments of all the data Maker used to construct the final annotation (ESTs and proteins that we used)
  • A GFF3 file containing both the final annotation and the evidences
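
To illustrate how the combined GFF3 file relates to the two others, here is a minimal sketch that splits GFF3 lines by their second column (the source). It assumes that Maker writes the final gene models with the source maker and that every other source is evidence; this assumption and the function name are ours, so check them against your own output before relying on it.

```python
def split_by_source(gff_lines, annotation_source="maker"):
    """Split GFF3 feature lines into (annotation, evidence) lists.

    Assumption: final gene models carry `annotation_source` in column 2;
    all other 9-column feature lines are treated as evidence.
    Comment lines and any embedded FASTA section are skipped.
    """
    annotation, evidence = [], []
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            continue
        if columns[1] == annotation_source:
            annotation.append(line)
        else:
            evidence.append(line)
    return annotation, evidence


if __name__ == "__main__":
    # Toy lines for illustration only, not real Maker output.
    toy = [
        "##gff-version 3",
        "chrIII\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1",
        "chrIII\test2genome\texpressed_sequence_match\t120\t880\t.\t+\t.\tID=e1",
        "chrIII\tprotein2genome\tprotein_match\t150\t850\t.\t+\t.\tID=p1",
    ]
    final, support = split_by_source(toy)
    print(len(final), "annotation line(s),", len(support), "evidence line(s)")
```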

Annotation statistics

We now need to evaluate the annotation produced by Maker.

First, use the Genome annotation statistics tool, which computes some general statistics on the annotation.

Hands-on: Get annotation statistics
  1. Genome annotation statistics (Galaxy version 0.8.4) with the following parameters:
    • “Annotation to analyse”: final annotation (output of Maker (Galaxy version 2.31.11))
    • “Reference genome”: Use a genome from history
      • “Corresponding genome sequence”: select S_pombe_chrIII.fasta from your history
Question
  1. How many genes were predicted by Maker?
  2. What is the mean gene locus size of these genes?

  1. 864 genes
  2. 1793 bp
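
The two numbers asked for above can be approximated directly from a GFF3 file. The following sketch counts the gene features and averages their lengths (end - start + 1). It is not the code of the Genome annotation statistics tool, which may define the gene locus size slightly differently; the function name is ours.

```python
def gene_stats(gff_lines):
    """Return (number of genes, mean gene length in bp) from GFF3 lines."""
    lengths = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) == 9 and columns[2] == "gene":
            start, end = int(columns[3]), int(columns[4])
            # GFF3 coordinates are 1-based and inclusive.
            lengths.append(end - start + 1)
    count = len(lengths)
    mean = sum(lengths) / count if count else 0.0
    return count, mean


if __name__ == "__main__":
    # Toy lines for illustration only.
    toy = [
        "##gff-version 3",
        "chrIII\tmaker\tgene\t100\t199\t.\t+\t.\tID=g1",
        "chrIII\tmaker\tmRNA\t100\t199\t.\t+\t.\tID=t1;Parent=g1",
        "chrIII\tmaker\tgene\t1000\t1299\t.\t-\t.\tID=g2",
    ]
    print(gene_stats(toy))
```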

Busco

Just as we did for the genome at the beginning, we can use BUSCO to check the quality of this Maker annotation. Instead of looking for known genes in the genome sequence, BUSCO will inspect the transcript sequences of the genes predicted by Maker. This will allow us to see if Maker was able to properly identify the set of genes that BUSCO found in the genome sequence at the beginning of this tutorial.

First we need to compute all the transcript sequences from the Maker annotation, using GFFread (Galaxy version 2.2.1.1). This tool will compute the sequence of each transcript that was predicted by Maker (Galaxy version 2.31.11) and write them all to a FASTA file.

Hands-on: Extract transcript sequences
  1. GFFread (Galaxy version 2.2.1.1) with the following parameters:
    • “Input GFF3 or GTF feature file”: final annotation (output of Maker (Galaxy version 2.31.11))
    • “Reference Genome”: select S_pombe_chrIII.fasta from your history
      • “Select fasta outputs”:
        • fasta file with spliced exons for each GFF transcript (-w exons.fa)
    • “full GFF attribute preservation (all attributes are shown)”: Yes
    • “decode url encoded characters within attributes”: Yes
    • “warn about duplicate transcript IDs and other potential problems with the given GFF/GTF records”: Yes
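
To show what "spliced exons for each transcript" means, here is a simplified sketch of the idea: the exons of each transcript are sorted by position, their sequences are concatenated, and the result is reverse-complemented for transcripts on the minus strand. This is an illustration under our own assumptions (exon features linked to transcripts through the Parent attribute, genome held in a dictionary), not the implementation of GFFread.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def parent_ids(attributes):
    """Return the Parent IDs listed in a GFF3 attribute column."""
    for field in attributes.split(";"):
        key, _, value = field.partition("=")
        if key.strip() == "Parent":
            return value.split(",")
    return []


def spliced_transcripts(genome, gff_lines):
    """Map each transcript ID to its spliced sequence.

    genome: dict mapping sequence names to sequences.
    gff_lines: iterable of GFF3 lines containing exon features.
    """
    exons = {}  # transcript ID -> list of (seqid, start, end, strand)
    for line in gff_lines:
        columns = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(columns) != 9 or columns[2] != "exon":
            continue
        record = (columns[0], int(columns[3]), int(columns[4]), columns[6])
        for parent in parent_ids(columns[8]):
            exons.setdefault(parent, []).append(record)

    transcripts = {}
    for transcript_id, parts in exons.items():
        parts.sort(key=lambda part: part[1])  # genomic order
        seqid, strand = parts[0][0], parts[0][3]
        # GFF3 is 1-based inclusive, Python slices are 0-based exclusive.
        seq = "".join(genome[seqid][start - 1:end] for _, start, end, _ in parts)
        transcripts[transcript_id] = reverse_complement(seq) if strand == "-" else seq
    return transcripts


if __name__ == "__main__":
    # Toy data for illustration only.
    genome = {"chr": "AACCGGTTACGT"}
    gff = [
        "chr\tmaker\texon\t1\t3\t.\t+\t.\tID=e1;Parent=t1",
        "chr\tmaker\texon\t7\t8\t.\t+\t.\tID=e2;Parent=t1",
        "chr\tmaker\texon\t9\t10\t.\t-\t.\tID=e3;Parent=t2",
        "chr\tmaker\texon\t3\t5\t.\t-\t.\tID=e4;Parent=t2",
    ]
    for name, seq in spliced_transcripts(genome, gff).items():
        print(f">{name}\n{seq}")
```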

Now run BUSCO with the predicted transcript sequences:

Hands-on: Run BUSCO
  1. Busco (Galaxy version 4.1.4) with the following parameters:
    • “Sequences to analyse”: exons (output of GFFread (Galaxy version 2.2.1.1))
    • “Mode”: Transcriptome
    • “Lineage”: Fungi
Question

How do the BUSCO statistics compare to the ones at the genome level?

128 complete single-copy, 0 duplicated, 10 fragmented, 620 missing. This is in fact better than what BUSCO found in the genome sequence, which means the quality of this annotation is very good (by default, BUSCO in genome mode can miss some genes; the advanced options can improve this at the cost of computing time). Results can differ very slightly in your own history; this is normal.
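
To compare runs more easily, the raw counts can be turned into percentages. The sketch below formats them in the compact one-line notation that BUSCO prints in its short summary (C for complete, S for single-copy, D for duplicated, F for fragmented, M for missing, n for the total number of BUSCOs searched). The function name is ours.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Format BUSCO counts as a one-line percentage summary."""
    total = single + duplicated + fragmented + missing

    def pct(count):
        return 100.0 * count / total if total else 0.0

    complete = single + duplicated
    return (
        f"C:{pct(complete):.1f}%[S:{pct(single):.1f}%,D:{pct(duplicated):.1f}%],"
        f"F:{pct(fragmented):.1f}%,M:{pct(missing):.1f}%,n:{total}"
    )


if __name__ == "__main__":
    # Counts reported above for the Maker annotation of chromosome III.
    print(busco_summary(single=128, duplicated=0, fragmented=10, missing=620))
```

With the counts above, this gives C:16.9%[S:16.9%,D:0.0%],F:1.3%,M:81.8%,n:758. The low completeness is expected here because only chromosome III was annotated.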

Improving gene naming

If you look at the content of the final annotation dataset, you will notice that the gene names are long, complicated, and not very readable. That’s because Maker assigns them automatic names based on the way it computed each gene model. We are now going to automatically assign more readable names.

Hands-on: Change gene names
  1. Map annotation ids (Galaxy version 2.31.11) with the following parameters:
    • “Maker annotation where to change ids”: final annotation (output of Maker (Galaxy version 2.31.11))
    • “Prefix for ids”: TEST_
    • “Justify numeric ids to this length”: 6
    Comment

    Genes will be renamed to look like: TEST_001234. You can replace TEST_ with anything you like, usually a short uppercase prefix.
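
The naming scheme itself (a prefix followed by a number padded with zeros to a fixed length) can be sketched in a few lines of Python. This only illustrates the two parameters above; how Map annotation ids orders the genes and names transcripts is not covered here. The function names and the long gene names in the example are made up for illustration.

```python
def make_id(prefix, number, width):
    """Build an ID such as TEST_001234 from a prefix and a zero-padded number."""
    return f"{prefix}{number:0{width}d}"


def build_id_map(old_ids, prefix="TEST_", width=6, start=1):
    """Map each old ID to a new sequential ID, in the order given."""
    return {
        old: make_id(prefix, number, width)
        for number, old in enumerate(old_ids, start=start)
    }


if __name__ == "__main__":
    # Hypothetical long gene names, for illustration only.
    old_names = ["maker-chrIII-snap-gene-0.1", "maker-chrIII-augustus-gene-0.2"]
    for old, new in build_id_map(old_names).items():
        print(old, "->", new)
```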

Look at the generated dataset: it should be much more readable, and ready for an official release.

Visualising the results

With Galaxy, you can visualize the annotation you have generated using JBrowse. This allows you to navigate along the chromosomes of the genome and see the structure of each predicted gene.

Hands-on: Visualize annotations in JBrowse
  1. JBrowse (Galaxy version 1.16.10+galaxy0) with the following parameters:
    • “Reference genome to display”: Use a genome from history
      • “Select the reference genome”: select S_pombe_chrIII.fasta from your history
    • “JBrowse-in-Galaxy Action”: New JBrowse Instance
    • In “Track Group”:
      • Click on “Insert Track Group”:
      • In “1: Track Group”:
        • “Track Category”: Maker annotation
        • In “Annotation Track”:
          • Click on “Insert Annotation Track”:
          • In “1: Annotation Track”:
            • “Track Type”: GFF/GFF3/BED Features
            • “GFF/GFF3/BED Track Data”: select the output of Map annotation ids (Galaxy version 2.31.11)

Enable the track on the left side of JBrowse, then navigate along the genome and look at the genes that were predicted by Maker.

Conclusion

Congratulations, you finished this tutorial! You learned how to annotate a eukaryotic genome using Maker, how to evaluate the quality of the annotation, and how to visualize it using the JBrowse genome browser.

What’s next?

After generating your annotation, you will probably want to automatically assign functional annotation to each predicted gene model. You can do it by using Blast, InterProScan, or Blast2GO for example.

An automatic annotation of a eukaryotic genome is rarely perfect. If you inspect some predicted genes, you will probably find some mistakes made by Maker, e.g. wrong exon/intron limits, split genes, or merged genes. Setting up a manual curation project using Apollo helps a lot to manually fix these errors. Check out the Apollo tutorial for more details.