Genome annotation with Maker (short)

Author(s): Anthony Bretaudeau

Overview
Questions:

How to annotate a eukaryotic genome?

How to evaluate and visualize annotated genomic features?

Objectives:

Load genome into Galaxy

Annotate genome with Maker

Evaluate annotation quality with BUSCO

View annotations in JBrowse

Requirements:

Introduction to Galaxy Analyses

Time estimation: 2 hours

Level: Intermediate

Published: Jan 12, 2021

Last modification: Feb 29, 2024

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

PURL: https://gxy.io/GTN:T00167

Revision: 3

Genome annotation of eukaryotes is a little more complicated than for prokaryotes: eukaryotic genomes are usually larger than prokaryotic ones, with more genes. The sequences determining the beginning and the end of a gene are generally less conserved than in prokaryotes. Many genes also contain introns, and the boundaries of these introns (acceptor and donor sites) are not highly conserved.

In this tutorial we will use a software tool called Maker (Campbell et al. 2014) to annotate the genome sequence of a small eukaryote: Schizosaccharomyces pombe (a yeast).

Maker is able to annotate both prokaryotes and eukaryotes. It works by aligning as much evidence as possible along the genome sequence, and then reconciling all these signals to determine probable gene structures.

The evidence can be transcript or protein sequences from the same (or a closely related) organism. These sequences can come from public databases (like NR or GenBank) or from your own experimental data (a transcriptome assembly from an RNASeq experiment, for example). Maker is also able to take repeated elements into account.

Maker uses ab-initio predictors (like Augustus or SNAP) to improve its predictions: these software tools are able to make gene structure predictions by analysing only the genome sequence with a statistical model.

In this tutorial you will learn how to perform a genome annotation, and how to evaluate its quality. Finally, you will learn how to use the JBrowse genome browser to visualise the results.

More information about Maker can be found on their website.

This tutorial was inspired by the MAKER Tutorial for WGS Assembly and Annotation Winter School 2018. Don’t hesitate to consult it for more information on Maker, and on how to run it from the command line.

Comment: Two versions of this tutorial

Because this tutorial consists of many steps, we have made two versions of it, one long and one short.

This is the shortened version. We will skip the training of ab-initio predictors and use pre-trained data instead. We will also annotate only the third chromosome of the genome. If you would like to learn how to perform the training steps, please see the longer version of this tutorial.

Agenda

In this tutorial, we will cover:

Data upload

Genome quality evaluation

Maker

Annotation statistics

Busco

Improving gene naming

Visualising the results

Conclusion

What’s next?

Data upload

To annotate a genome using Maker, you need the following files:

The genome sequence in fasta format
A set of transcripts or EST sequences, preferably from the same organism.
A set of protein sequences, usually from closely related species or from a curated sequence database like UniProt/SwissProt.

Maker will align the transcript and protein sequences on the genome sequence to determine gene positions.

Hands On: Data upload
Create and name a new history for this tutorial.

To create a new history, simply click the new history icon at the top of the history panel.
Import the following files from Zenodo or from the shared data library
https://zenodo.org/records/4406623/files/S_pombe_chrIII.fasta?download=1
https://zenodo.org/records/4406623/files/S_pombe_trinity_assembly.fasta?download=1
https://zenodo.org/records/4406623/files/Swissprot_no_S_pombe.fasta?download=1
https://zenodo.org/records/4406623/files/augustus_training_2.tar.gz?download=1
https://zenodo.org/records/4406623/files/snap_training_2.snaphmm?download=1
Copy the link location

Click Upload Data at the top of the tool panel

Select Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window

As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

Go into Libraries (left panel)

Navigate to the correct folder as indicated by your instructor.

On most Galaxies, tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.

Select the desired files

Click on Add to History near the top and select as Datasets from the dropdown menu

In the pop-up window, choose

“Select history”: the history you want to import the data to (or create a new one)

Click on Import
Rename the datasets

Check that the datatype for augustus_training_2.tar.gz is set to augustus

Click on the pencil icon for the dataset to edit its attributes

In the central panel, click the Datatypes tab at the top

Under Assign Datatype, select augustus from the “New Type” dropdown

Tip: you can start typing the datatype into the field to filter the dropdown menu

Click the Save button

You have the following main datasets:

S_pombe_trinity_assembly.fasta contains EST sequences from S. pombe, assembled from RNASeq data with Trinity
Swissprot_no_S_pombe.fasta contains a subset of the SwissProt protein sequence database (public sequences from S. pombe were removed to stay as close as possible to real-life analysis)
S_pombe_chrIII.fasta contains only the third chromosome from the full genome of S. pombe

The other datasets will be used later in the tutorial.
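If you prefer to fetch the same files outside Galaxy (for example to inspect them locally), the sketch below shows one possible way to do it with the Python standard library. The helper names `local_name` and `download_all` and the `maker_data` folder are illustrative assumptions, not part of the tutorial; only the URLs come from the list above.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlretrieve

# URLs copied from the Zenodo record used in this tutorial.
ZENODO_URLS = [
    "https://zenodo.org/records/4406623/files/S_pombe_chrIII.fasta?download=1",
    "https://zenodo.org/records/4406623/files/S_pombe_trinity_assembly.fasta?download=1",
    "https://zenodo.org/records/4406623/files/Swissprot_no_S_pombe.fasta?download=1",
    "https://zenodo.org/records/4406623/files/augustus_training_2.tar.gz?download=1",
    "https://zenodo.org/records/4406623/files/snap_training_2.snaphmm?download=1",
]


def local_name(url: str) -> str:
    """Return the file name found in the URL path, ignoring the query string."""
    return PurePosixPath(urlparse(url).path).name


def download_all(urls, dest_dir="maker_data"):
    """Download every URL into dest_dir (hypothetical folder; needs network access)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for url in urls:
        target = dest / local_name(url)
        if not target.exists():  # skip files that were already fetched
            urlretrieve(url, str(target))
    return sorted(p.name for p in dest.iterdir())


if __name__ == "__main__":
    # Only print the expected file names; call download_all(ZENODO_URLS) to fetch.
    for url in ZENODO_URLS:
        print(local_name(url))
```

The download itself is left as an explicit call so that running the script does not start five transfers by accident.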

Genome quality evaluation

The quality of a genome annotation is highly dependent on the quality of the genome sequences. It is impossible to obtain a good quality annotation with a poorly assembled genome sequence. Annotation tools will have trouble finding genes if the genome sequence is highly fragmented, if it contains chimeric sequences, or if there are a lot of sequencing errors.

Before running the full annotation process, you first need to evaluate the quality of the sequence. This will give you a good idea of what you can expect at the end of the annotation.

Hands On: Get genome sequence statistics

Fasta Statistics (Galaxy version 1.0.1) with the following parameters:

“fasta or multifasta file”: select S_pombe_chrIII.fasta from your history

Have a look at the statistics:

num_seq: the number of contigs (or scaffolds or chromosomes); compare it to the expected number of chromosomes
len_min, len_max, len_N50, len_mean, len_median: the distribution of contig sizes
num_bp_not_N: the number of bases that are not N; it should be as close as possible to the total number of bases (num_bp)

These statistics are useful to detect obvious problems in the genome assembly, but they give no information about the quality of the sequence content. We want to evaluate whether the genome sequence contains all the genes we expect to find in the considered species, and whether their sequences are correct.
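To make the meaning of these numbers concrete, here is a minimal Python sketch that computes a few of them from FASTA text. It is not the code of the Fasta Statistics tool: the function names are made up for illustration, and the N50 definition used (length of the sequence at which the cumulative length, sorted from longest to shortest, first reaches half of the total) is the usual one but is an assumption about what the tool reports.

```python
from io import StringIO


def read_lengths(handle):
    """Yield (total_length, non_N_length) for each FASTA record in handle."""
    total = non_n = 0
    seen_header = False
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if seen_header:
                yield total, non_n
            seen_header, total, non_n = True, 0, 0
        elif line:
            total += len(line)
            non_n += len(line) - line.upper().count("N")
    if seen_header:
        yield total, non_n


def n50(lengths):
    """Length at which the cumulative sum (longest first) reaches half the total."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def fasta_stats(handle):
    """Compute a small subset of the statistics discussed above."""
    records = list(read_lengths(handle))
    lengths = [total for total, _ in records]
    return {
        "num_seq": len(lengths),
        "num_bp": sum(lengths),
        "num_bp_not_N": sum(non_n for _, non_n in records),
        "len_min": min(lengths),
        "len_max": max(lengths),
        "len_N50": n50(lengths),
    }


# Tiny made-up assembly with three contigs, one of them containing two N.
example = StringIO(">contig1\nACGTNNAC\nGT\n>contig2\nACG\n>contig3\nAC\n")
stats = fasta_stats(example)
print(stats)
```

For a real file, pass an open file handle instead of the `StringIO` example, for instance `fasta_stats(open("S_pombe_chrIII.fasta"))`.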

Comment

Keep in mind that we are running this tutorial only on chromosome III instead of the whole genome.

BUSCO (Benchmarking Universal Single-Copy Orthologs) is a tool that can answer this question: by comparing genomes from various more or less related species, the authors determined sets of orthologous genes that are present in single copy in (almost) all the species of a clade (Bacteria, Fungi, Plants, Insects, Mammals, …). Most of these genes are essential for the organism to live, and are expected to be found in any newly sequenced genome from the corresponding clade. Using this data, BUSCO is able to evaluate the proportion of these essential genes (also named BUSCOs) found in a genome sequence or a set of (predicted) transcript or protein sequences. This is a good evaluation of the “completeness” of the genome or annotation.

We will first run this tool on the genome sequence to evaluate its quality.

Hands On: Run Busco on the genome

Busco (Galaxy version 4.1.4) with the following parameters:

“Sequences to analyse”: select S_pombe_chrIII.fasta from your history

“Mode”: Genome

“Lineage”: Fungi

Comment

We select Fungi as we will annotate the genome of Schizosaccharomyces pombe, which belongs to the Fungi kingdom. It is usually better to select the most specific lineage for the species you study. Large lineages (like Metazoa) will consist of fewer genes, but with strong support. More specific lineages (like Hymenoptera) will have more genes, but with weaker support (as they are found in fewer genomes).

BUSCO produces three output datasets

A short summary: summarizes the results of BUSCO (see below)
A full table: lists all the BUSCOs that were searched for, with the corresponding status (was it found in the genome? how many times? where?)
A table of missing BUSCOs: this is the list of all genes that were not found in the genome
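If you download the full table, you can recompute the summary counts yourself. The sketch below assumes a tab-separated layout in which comment lines start with `#`, the first column is the BUSCO identifier, and the second column is the status (Complete, Duplicated, Fragmented or Missing); check the header of your own file before relying on it. The identifiers in the example are invented.

```python
from collections import Counter


def count_busco_statuses(lines):
    """Count BUSCOs per status, counting each BUSCO id only once.

    Assumes tab-separated lines with the id in column 1 and the status in
    column 2; duplicated BUSCOs appear on several lines with the same id.
    """
    status_by_id = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        busco_id, status = fields[0], fields[1]
        status_by_id[busco_id] = status
    return Counter(status_by_id.values())


# Hypothetical excerpt of a full table (ids and coordinates are made up).
example_table = [
    "# Busco id\tStatus\tSequence\n",
    "1001at4751\tComplete\tchrIII\n",
    "1002at4751\tMissing\n",
    "1003at4751\tDuplicated\tchrIII\n",
    "1003at4751\tDuplicated\tchrIII\n",
    "1004at4751\tFragmented\tchrIII\n",
]
counts = count_busco_statuses(example_table)
print(dict(counts))
```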

Figure: BUSCO genome summary

Question

Do you think the genome quality is good enough for performing the annotation?

The genome consists of the expected number of chromosome sequences (1), with very few N, which is the ideal case. As we only analysed chromosome III, many BUSCO genes are missing, but still ~100 are found as complete single copy, and very few are found fragmented, which means that our genome has a good quality, at least on this single chromosome. This is very good material on which to perform an annotation.

Comment

Keep in mind that we are running this tutorial only on chromosome III instead of the whole genome. The BUSCO result will also show a lot of missing genes: this is expected, as all the BUSCO genes that are not on chromosome III cannot be found by the tool.

Maker

Let’s run Maker to predict gene models! Maker will align ESTs and proteins to the genome, and it will run ab initio predictors (SNAP and Augustus) using pre-trained models for this organism (have a look at the longer version of this tutorial to understand how they were trained).

Hands On: Annotation with Maker

Maker (Galaxy version 2.31.11) with the following parameters:

“Genome to annotate”: select S_pombe_chrIII.fasta from your history

“Organism type”: Eukaryotic

“Re-annotate using an existing Maker annotation”: No

In “EST evidences (for best results provide at least one of these)”:

“ESTs or assembled cDNA”: S_pombe_trinity_assembly.fasta

In “Protein evidences (for best results provide at least one of these)”:

“Protein sequences”: Swissprot_no_S_pombe.fasta

In “Ab-initio gene prediction”:

“SNAP model”: snap_training_2.snaphmm

“Prediction with Augustus”: Run Augustus with a custom prediction model

“Augustus model”: augustus_training_2.tar.gz

In “Repeat masking”:

“Repeat library source”: Disable repeat masking (not recommended)

Comment

For this tutorial repeat masking is disabled, which is not the recommended setting. When doing a real-life annotation, you should either use Dfam or provide your own repeats library.

Maker produces three GFF3 datasets:

The final annotation: the final consensus gene models produced by Maker
The evidence: the alignments of all the data Maker used to construct the final annotation (the ESTs and proteins that we provided)
A GFF3 file containing both the final annotation and the evidences
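A quick way to see what each of these GFF3 files contains is to count features by source (column 2) and type (column 3). The following sketch relies only on the standard nine-column, tab-separated GFF3 layout; the example lines, including the source names `maker` and `est2genome`, are illustrative and may not match your output exactly.

```python
from collections import Counter


def count_feature_types(lines):
    """Count GFF3 features by (source, type), i.e. columns 2 and 3."""
    counts = Counter()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # an embedded FASTA section may follow the features
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9:
            counts[(fields[1], fields[2])] += 1
    return counts


# Made-up GFF3 excerpt mixing a gene model and one piece of evidence.
example_gff = [
    "##gff-version 3\n",
    "chrIII\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1\n",
    "chrIII\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n",
    "chrIII\tmaker\texon\t100\t400\t.\t+\t.\tID=exon1;Parent=mrna1\n",
    "chrIII\tmaker\texon\t500\t900\t.\t+\t.\tID=exon2;Parent=mrna1\n",
    "chrIII\test2genome\texpressed_sequence_match\t90\t910\t.\t+\t.\tID=est1\n",
]
feature_counts = count_feature_types(example_gff)
for (source, feature_type), n in sorted(feature_counts.items()):
    print(source, feature_type, n)
```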

Annotation statistics

We now need to evaluate the annotation produced by Maker.

First, use the Genome annotation statistics tool, which computes some general statistics on the annotation.

Hands On: Get annotation statistics

Genome annotation statistics (Galaxy version 0.8.4) with the following parameters:

“Annotation to analyse”: final annotation (output of Maker (Galaxy version 2.31.11))

“Reference genome”: Use a genome from history

“Corresponding genome sequence”: select S_pombe_chrIII.fasta from your history

Question

How many genes were predicted by Maker?

What is the mean gene locus size of these genes?

864 genes

1793 bp
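As a cross-check, both numbers can be approximated directly from a GFF3 file. The sketch below counts `gene` features and averages their lengths, using `end - start + 1` because GFF3 coordinates are 1-based and inclusive. This is an assumption about how the gene locus size is defined; the Galaxy tool may compute it slightly differently. The example lines are made up.

```python
def gene_stats(lines):
    """Return (number_of_genes, mean_gene_length) from GFF3 lines."""
    lengths = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9 and fields[2] == "gene":
            start, end = int(fields[3]), int(fields[4])
            lengths.append(end - start + 1)  # 1-based, inclusive coordinates
    mean = sum(lengths) / len(lengths) if lengths else 0.0
    return len(lengths), mean


# Two made-up genes of 801 bp and 1000 bp.
example_genes = [
    "##gff-version 3\n",
    "chrIII\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1\n",
    "chrIII\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n",
    "chrIII\tmaker\tgene\t2000\t2999\t.\t-\t.\tID=gene2\n",
]
n_genes, mean_length = gene_stats(example_genes)
print(n_genes, mean_length)
```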

Busco

Just as we did for the genome at the beginning, we can use BUSCO to check the quality of this Maker annotation. Instead of looking for known genes in the genome sequence, BUSCO will inspect the transcript sequences of the genes predicted by Maker. This will allow us to see if Maker was able to properly identify the set of genes that BUSCO found in the genome sequence at the beginning of this tutorial.

First we need to compute all the transcript sequences from the Maker annotation, using GFFread (Galaxy version 2.2.1.1). This tool will compute the sequence of each transcript that was predicted by Maker (Galaxy version 2.31.11) and write them all to a FASTA file.
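Conceptually, computing a transcript sequence means cutting each exon out of the chromosome, joining the pieces in genomic order, and reverse-complementing the result for genes on the minus strand. The sketch below illustrates that idea on a toy sequence. It is not GFFread's implementation, and the function names are chosen for this example only.

```python
# Translation table used to complement DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def spliced_sequence(chrom_seq, exons, strand):
    """Join exon sequences into a transcript sequence.

    exons is a list of (start, end) pairs in GFF3 convention
    (1-based, inclusive); strand is "+" or "-".
    """
    parts = [chrom_seq[start - 1:end] for start, end in sorted(exons)]
    spliced = "".join(parts)
    return reverse_complement(spliced) if strand == "-" else spliced


# Toy chromosome with two exons separated by a 3 bp intron.
chrom = "AAACCCGGGTTT"
plus_transcript = spliced_sequence(chrom, [(1, 3), (7, 9)], "+")
minus_transcript = spliced_sequence(chrom, [(1, 3), (7, 9)], "-")
print(plus_transcript, minus_transcript)
```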

Hands On: Extract transcript sequences

GFFread (Galaxy version 2.2.1.1) with the following parameters:

“Input GFF3 or GTF feature file”: final annotation (output of Maker (Galaxy version 2.31.11))

“Reference Genome”: select S_pombe_chrIII.fasta from your history

“Select fasta outputs”:

fasta file with spliced exons for each GFF transcript (-w exons.fa)

“full GFF attribute preservation (all attributes are shown)”: Yes

“decode url encoded characters within attributes”: Yes

“warn about duplicate transcript IDs and other potential problems with the given GFF/GTF records”: Yes

Now run BUSCO with the predicted transcript sequences:

Hands On: Run BUSCO

Busco (Galaxy version 4.1.4) with the following parameters:

“Sequences to analyse”: exons (output of GFFread (Galaxy version 2.2.1.1))

“Mode”: Transcriptome

“Lineage”: Fungi

Question

How do the BUSCO statistics compare to the ones at the genome level?

128 complete single-copy, 0 duplicated, 10 fragmented, 620 missing. This is in fact better than what BUSCO found in the genome sequence. That means the quality of this annotation is very good (by default, BUSCO in genome mode can miss some genes; the advanced options can improve this at the cost of computing time). Results can be slightly different in your own history; this is normal.
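BUSCO results are often quoted as percentages of the lineage set rather than raw counts. Using the four counts given above, the short sketch below derives the size of the searched set and the corresponding percentages. The dictionary keys are arbitrary labels chosen for this example.

```python
# Counts reported in the answer above (transcriptome mode, Fungi lineage).
busco_counts = {
    "complete_single": 128,
    "duplicated": 0,
    "fragmented": 10,
    "missing": 620,
}

# The four categories are mutually exclusive, so their sum is the set size.
total = sum(busco_counts.values())

# Percentage of the searched BUSCOs in each category, rounded to one decimal.
percentages = {
    name: round(100 * count / total, 1) for name, count in busco_counts.items()
}

print("BUSCOs searched:", total)
for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```

The low percentage of complete BUSCOs is expected here, because only chromosome III was annotated.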

Improving gene naming

If you look at the content of the final annotation dataset, you will notice that the gene names are long, complicated, and not very readable. That’s because Maker assigns them automatic names based on the way it computed each gene model. We are now going to automatically assign more readable names.

Hands On: Change gene names

Map annotation ids (Galaxy version 2.31.11) with the following parameters:

“Maker annotation where to change ids”: final annotation (output of Maker (Galaxy version 2.31.11))

“Prefix for ids”: TEST_

“Justify numeric ids to this length”: 6

Comment

Genes will be renamed to look like: TEST_001234. You can replace TEST_ by anything you like, usually an uppercase short prefix.
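The naming scheme combines the prefix with a number left-padded with zeros to the chosen length. The sketch below reproduces that pattern in Python; it only illustrates the format, not how the Map annotation ids tool orders or numbers the genes, and the long gene names in the example are invented to imitate automatically generated ones.

```python
def make_id(prefix, number, width=6):
    """Build an id such as TEST_001234 from a prefix and a zero-padded number."""
    return f"{prefix}{number:0{width}d}"


def build_id_map(old_names, prefix="TEST_", width=6):
    """Map each old name to a new readable id, numbering from 1 in input order."""
    return {
        old: make_id(prefix, i, width) for i, old in enumerate(old_names, start=1)
    }


# Hypothetical long names in the style of automatically generated ids.
old_names = [
    "maker-chrIII-snap-gene-0.1",
    "maker-chrIII-augustus-gene-0.2",
]
id_map = build_id_map(old_names)
print(make_id("TEST_", 1234))
print(id_map)
```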

Look at the generated dataset: it should be much more readable, and ready for an official release.

Visualising the results

With Galaxy, you can visualize the annotation you have generated using JBrowse. This allows you to navigate along the chromosomes of the genome and see the structure of each predicted gene.

Hands On: Visualize annotations in JBrowse

JBrowse (Galaxy version 1.16.10+galaxy0) with the following parameters:

“Reference genome to display”: Use a genome from history

“Select the reference genome”: select S_pombe_chrIII.fasta from your history

“JBrowse-in-Galaxy Action”: New JBrowse Instance

In “Track Group”:

Click on “Insert Track Group”:

In “1: Track Group”:

“Track Category”: Maker annotation

In “Annotation Track”:

Click on “Insert Annotation Track”:

In “1: Annotation Track”:

“Track Type”: GFF/GFF3/BED Features

“GFF/GFF3/BED Track Data”: select the output of Map annotation ids (Galaxy version 2.31.11)

Enable the track on the left side of JBrowse, then navigate along the genome and look at the genes that were predicted by Maker.

Conclusion

Congratulations, you finished this tutorial! You learned how to annotate a eukaryotic genome using Maker, how to evaluate the quality of the annotation, and how to visualize it using the JBrowse genome browser.

What’s next?

After generating your annotation, you will probably want to automatically assign functional annotation to each predicted gene model. You can do it by using Blast, InterProScan, or Blast2GO for example.

An automatic annotation of a eukaryotic genome is rarely perfect. If you inspect some predicted genes, you will probably find some mistakes made by Maker, e.g. wrong exon/intron boundaries, split genes, or merged genes. Setting up a manual curation project using Apollo helps a lot to fix these errors manually. Check out the Apollo tutorial for more details.

Key points

Maker can annotate a eukaryotic genome.

BUSCO and JBrowse can be used to inspect the quality of an annotation.

References

Campbell, M. S., C. Holt, B. Moore, and M. Yandell, 2014 Genome annotation and curation using MAKER and MAKER-P. Current Protocols in Bioinformatics 48: 4–11. 10.1002/0471250953.bi0411s48

Citing this Tutorial

Anthony Bretaudeau, Genome annotation with Maker (short) (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/annotation-with-maker-short/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{genome-annotation-annotation-with-maker-short,
	author = "Anthony Bretaudeau",
	title = "Genome annotation with Maker (short) (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/annotation-with-maker-short/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/genome-annotation/tutorials/annotation-with-maker-short/tutorial.json | jq .admin_install_yaml -r)

Alternatively, you can copy and paste the following YAML:

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: gffread
  owner: devteam
  revisions: 0232f19d300f
  tool_panel_section_label: Convert Formats
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: busco
  owner: iuc
  revisions: 1440ae06552f
  tool_panel_section_label: Annotation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: jbrowse
  owner: iuc
  revisions: 2bb2e07a7a21
  tool_panel_section_label: Graph/Display Data
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: jcvi_gff_stats
  owner: iuc
  revisions: 8cffbd184762
  tool_panel_section_label: Annotation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: maker
  owner: iuc
  revisions: 96ac930d84fa
  tool_panel_section_label: Annotation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: maker_map_ids
  owner: iuc
  revisions: 326f9c294b09
  tool_panel_section_label: Annotation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: fasta_stats
  owner: simon-gladman
  revisions: 20ca2574216a
  tool_panel_section_label: Annotation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
