How to use Custom Reference Genomes?

A reference genome contains the nucleotide sequence of the chromosomes, scaffolds, transcripts, or contigs for a single species. It is representative of a specific genome assembly build or release.

There are two options for reference genomes in Galaxy.

  • Native
    • Index provided by the server administrators.
    • Found on tool forms in a drop down menu.
    • A database key is automatically assigned. See tip 1.
    • The database key is what links your data to a FASTA index. Example: it is used with BAM data.
  • Custom
    • FASTA file uploaded by users.
    • Input on tool forms then indexed at runtime by the tool.
    • An optional custom database key can be created and assigned by the user.

There are five basic steps to use a Custom Reference Genome, plus one optional step.

  1. Obtain a FASTA copy of the target genome. See tip 2.
  2. Upload the genome to Galaxy to add it as a dataset in your history.
  3. Clean up the format with the tool NormalizeFasta using the options to wrap sequence lines at 80 bases and to trim the title line at the first whitespace.
  4. Make sure the chromosome identifiers match those used in your other inputs.
  5. Set a tool form’s options to use a custom reference genome from the history and select the loaded genome FASTA.
  6. (Optional) Create a database key for the custom genome build that you can assign to datasets.
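
To make the cleanup in step 3 concrete, here is a minimal Python sketch of the two formatting changes described there: wrapping sequence lines at 80 bases and trimming each title line at the first whitespace. It is an illustration only, not the NormalizeFasta tool's own code, and the function name normalize_fasta is hypothetical.

```python
def normalize_fasta(lines, width=80):
    """Return FASTA lines with trimmed titles and re-wrapped sequence.

    Hypothetical helper for illustration; not part of Galaxy or NormalizeFasta.
    """
    out = []   # normalized output lines
    seq = []   # sequence chunks collected for the current record

    def flush():
        # Join the collected chunks and re-wrap them at `width` bases.
        joined = "".join(seq)
        seq.clear()
        out.extend(joined[i:i + width] for i in range(0, len(joined), width))

    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # drop blank lines
        if line.startswith(">"):
            flush()  # finish the previous record before starting a new one
            fields = line[1:].split()
            # Keep only the identifier: the text before the first whitespace.
            out.append(">" + (fields[0] if fields else ""))
        else:
            seq.append(line)
    flush()  # write out the last record
    return out


if __name__ == "__main__":
    example = [">chr1 Homo sapiens chromosome 1", "ACGT" * 30]
    for line in normalize_fasta(example):
        print(line)
```

After this kind of cleanup, a title such as ">chr1 Homo sapiens chromosome 1" becomes ">chr1", which is the identifier that has to match your other inputs in step 4.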

TIP 1: Avoid assigning a native database key to uploaded data unless you have confirmed that the data are based on exactly the same genome assembly, or you have first adjusted the data to match!

TIP 2: When choosing your reference genome, consider choosing your reference annotation at the same time. Standardize the format of both as a preparation step. Put the files in a dedicated “reference data” history for easy reuse.
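
One way to check that a genome and its annotation agree on chromosome identifiers is to compare the names in the FASTA title lines with the first column of the annotation. The Python sketch below assumes a tab-separated annotation in GTF or GFF style, where the first column holds the sequence name and comment lines start with "#". All function names are hypothetical and the check is only a rough illustration.

```python
def fasta_ids(lines):
    """Collect identifiers (text before the first whitespace) from FASTA title lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def annotation_ids(lines):
    """Collect sequence names from the first column of a tab-separated annotation."""
    ids = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        ids.add(line.split("\t", 1)[0])
    return ids


def missing_from_genome(fasta_lines, annotation_lines):
    """Return annotation sequence names that have no matching FASTA identifier."""
    return sorted(annotation_ids(annotation_lines) - fasta_ids(fasta_lines))


if __name__ == "__main__":
    genome = [">chr1 description", "ACGT", ">chr2", "GG"]
    annotation = [
        "#!example header",
        'chr1\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "a";',
        '2\tsrc\tgene\t1\t2\t.\t+\t.\tgene_id "b";',
    ]
    print(missing_from_genome(genome, annotation))
```

An empty result means every sequence name in the annotation was found in the genome. Any names that are listed, such as "2" where the genome uses "chr2", point to an identifier mismatch to fix before running tools.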

