  • Have an understanding of the way in which Galaxy stores and uses reference data

  • Be able to download and use data managers to add a reference genome and its pre-calculated indices into the Galaxy reference data system

  • Use an Ansible playbook for all of the above

The problem

Before Data Managers, the Galaxy server administrator needed to know how to update each type of reference data, how to run the indexers, where to get the data from, and how to update Galaxy’s Tool Data Table and location configuration files by hand.

Data managers to the rescue

Data Managers are a special class of Galaxy tool which allows for the download and/or creation of data that is stored within Tool Data Tables and their underlying flat location (e.g. .loc) files. These tools handle the creation of indices and the addition of entries/lines to the data table / .loc file via the Galaxy admin interface.

Data Managers can be defined locally in the data manager and tool data table configuration files or installed through the Tool Shed. When Data Managers are installed from the Tool Shed, their configuration is added to the shed versions of the data manager configuration and tool data table configuration files.

They are a flexible framework for adding reference data to Galaxy (not just genomic data). They are workflow compatible and can run via the Galaxy API. Data managers automatically update the appropriate location files when new data are installed.

For more background details on data managers, check out:

Comment: Pre-built data are available

The usegalaxy.* servers and Galaxy Community have a large amount of reference data online and available for use by your Galaxy server. For instructions on how to access and use these data, see the Reference Data with CVMFS tutorial.

If your data are not available as part of the CVMFS repository, Galaxy Data Managers can be used to locally install and build reference data.

Galaxy Reference Data Components

As discussed in the overview, Galaxy Data Managers handle the population of numerous Galaxy components. From the highest level to the lowest:

  • Addition of tool data table entries into the shed version of the tool data table configuration file, shed_tool_data_table_conf.xml. Multiple data tables with the same name attribute are automatically merged by Galaxy.
  • Population of the corresponding location (.loc) file when new reference data are installed. The .loc file used corresponds to the Data Manager and its specific version that was executed to install the data. The path to the location file can be found in the Data Manager’s entry in the tool data table configuration file (above).
  • Generation of the reference data on disk, underneath the directory specified by tool_data_path in galaxy.yml.
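
The merging of same-named data tables mentioned in the first bullet can be pictured with a minimal sketch. It is only a model: the rows and paths are hypothetical, rows are simply concatenated, and how Galaxy itself treats duplicate rows is not modeled here.

```python
from collections import defaultdict


def merge_tables(definitions):
    """Combine the rows of every definition that shares a table name."""
    merged = defaultdict(list)
    for name, rows in definitions:
        merged[name].extend(rows)
    return dict(merged)


# Hypothetical input: all_fasta is defined twice (for example once by hand
# and once by a data manager), __dbkeys__ only once.
definitions = [
    ("all_fasta", [("hg19", "hg19", "Human hg19", "/data/hg19.fa")]),
    ("all_fasta", [("sacCer2", "sacCer2", "S. cerevisiae sacCer2", "/data/sacCer2.fa")]),
    ("__dbkeys__", [("sacCer2", "S. cerevisiae sacCer2", "/data/sacCer2.len")]),
]

tables = merge_tables(definitions)
print(len(tables["all_fasta"]))  # both all_fasta definitions end up in one table
```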

Installing and Using Data Managers with the Galaxy UI

Install the Fetch Genome Data Manager from the Tool Shed

Comment: Galaxy Configuration

No special configuration for Galaxy beyond creating an admin user is necessary in order to use Data Managers. However, you may want to be aware of the enable_data_manager_user_view option, which enables unprivileged users to browse data installed via Data Managers. This option is not required, and users will be able to use data installed via Data Managers regardless.

Data are installed in the path specified by tool_data_path in galaxy.yml. If you prefer to keep hand-managed reference data separate from DM-managed reference data, you can set galaxy_data_manager_data_path instead.

This hands-on exercise installs data managers and reference data through the Galaxy UI, but you are encouraged to install tools in a deterministic, recordable way through the use of Ephemeris, which is described in the Installing and Using Data Managers from the command line with Ephemeris section below.

We will install a data manager that can fetch the various genome sequences from multiple sources.

Hands On: Install the Fetch Genome Data Manager
  1. Access the Admin menu from the top bar (you need to be logged-in with an email specified in the admin_users setting)
  2. Click Install and Uninstall, which can be found on the left, under Tool Management
  3. Enter fetch_genome in the search interface
  4. Click on the first hit, having devteam as owner
  5. Click the Install button for the latest revision

Look at the file system to see where the various elements landed. In particular, have a look at the configuration files located in the config directory.


What did this tool installation change?

  • The data manager and its data tables are added to the Galaxy-managed “shed” versions of the data manager config (/srv/galaxy/var/config/shed_data_manager_conf.xml) and data table config (/srv/galaxy/var/config/shed_tool_data_table_conf.xml)
  • The data manager tool is installed alongside other Galaxy tools in the shed tools directory
Code In: Bash

Let’s investigate the data manager config file.

cat /srv/galaxy/var/config/shed_data_manager_conf.xml
Code Out: Bash
<data_managers><data_manager id="fetch_genome_all_fasta_dbkeys" guid="" shed_conf_file="/srv/galaxy/var/config/shed_tool_conf.xml">
        <tool file="" guid=""><tool_shed></tool_shed><repository_name>data_manager_fetch_genome_dbkeys_all_fasta</repository_name><repository_owner>devteam</repository_owner><installed_changeset_revision>4d3eff1bc421</installed_changeset_revision><id></id><version>0.0.4</version></tool><data_table name="all_fasta">
                <column name="value"/>
                <column name="dbkey"/>
                <column name="name"/>
                <column name="path" output_ref="out_file">
                    <move type="file">
                        <target base="${GALAXY_DATA_MANAGER_DATA_PATH}">${dbkey}/seq/${path}</target>
                    <value_translation type="function">abspath</value_translation>
        <data_table name="__dbkeys__">
                <column name="value"/>
                <column name="name"/>
                <column name="len_path" output_ref="out_file">
                    <move type="file">
                        <target base="${GALAXY_DATA_MANAGER_DATA_PATH}">${value}/len/${len_path}</target>
                    <value_translation type="function">abspath</value_translation>
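
The `<target>` elements in the output above describe where a downloaded file is moved: a base directory plus a relative path built from the row's column values. A minimal sketch of that substitution follows. It uses Python's `string.Template` as a stand-in (Galaxy's own templating code is not reproduced here), and it assumes the base resolves to /srv/galaxy/var/tool-data, as the paths elsewhere in this tutorial suggest.

```python
import posixpath
from string import Template


def resolve_target(base, target_template, row):
    """Fill ${column} placeholders from a table row and prepend the base directory."""
    relative = Template(target_template).substitute(row)
    return posixpath.join(base, relative)


# Assumed values: base directory and column values for the sacCer2 genome.
row = {"dbkey": "sacCer2", "path": "sacCer2.fa"}
final_path = resolve_target("/srv/galaxy/var/tool-data", "${dbkey}/seq/${path}", row)
print(final_path)  # /srv/galaxy/var/tool-data/sacCer2/seq/sacCer2.fa
```
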
Code In: Bash

Let’s also investigate the tool data table config file.

cat /srv/galaxy/var/config/shed_tool_data_table_conf.xml
Code Out: Bash
<?xml version="1.0" ?>
    <table name="all_fasta" comment_char="#">
        <columns>value, dbkey, name, path</columns>
        <file path="/srv/galaxy/var/tool-data/"/>
    <table name="__dbkeys__" comment_char="#">
        <columns>value, name, len_path</columns>
        <file path="/srv/galaxy/var/tool-data/"/>
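
To make the link between a table's column list and its location file concrete, here is a small sketch that reads a data table config and reports the columns and files of each table. The sample XML is an assumption: it mirrors the output above but adds a `<tables>` root, closing tags, and hypothetical .loc paths, since the real paths are not shown in full.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for shed_tool_data_table_conf.xml.
SAMPLE = """<tables>
    <table name="all_fasta" comment_char="#">
        <columns>value, dbkey, name, path</columns>
        <file path="/hypothetical/tool-data/all_fasta.loc"/>
    </table>
    <table name="__dbkeys__" comment_char="#">
        <columns>value, name, len_path</columns>
        <file path="/hypothetical/tool-data/dbkeys.loc"/>
    </table>
</tables>"""


def describe_tables(xml_text):
    """Map each table name to its column names and location file paths."""
    root = ET.fromstring(xml_text)
    result = {}
    for table in root.iter("table"):
        columns = [c.strip() for c in table.findtext("columns").split(",")]
        files = [f.get("path") for f in table.findall("file")]
        result[table.get("name")] = {"columns": columns, "files": files}
    return result


info = describe_tables(SAMPLE)
print(info["all_fasta"]["columns"])  # ['value', 'dbkey', 'name', 'path']
```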

Download and install a reference genome sequence

Next, we will install some reference data. Specifically, we will grab sacCer2 (version 2 of the Saccharomyces cerevisiae genome).

Hands On: Download and install sacCer2
  1. Access the Admin menu from the top bar
  2. Click Local Data, which can be found on the left, under Server

    You should see something like this:

    Figure: nearly empty data manager tool list in Galaxy.

    If you instead see the message “You do not currently have any Data Managers installed,” you may need to restart your Galaxy server.

  3. Click all_fasta under View Tool Data Table Entries

    You should see the current contents of tool-data/all_fasta.loc, which will be empty.

  4. Click Local Data to return to the previous page

  5. Click Create DBKey and Reference Genome - fetching (Galaxy version 0.0.4) under Installed Data Managers.
    • param-select “Use existing dbkey or create a new one”: Existing
    • param-select “DBKEY to assign to data”: sacCer2
    • param-text “Name of sequence”: S. cerevisiae June 2008 (SGD/sacCer2)
    • param-select “Choose the source for the reference genome”: UCSC

    The data manager tool form fields correspond to the columns in the tool data table, as referenced in shed_tool_data_table_conf.xml.

    | Tool Field | Tool Data Table Entry | Purpose |
    |---|---|---|
    | ID for sequence | value column in all_fasta | Identifier of specific build of the genome, if it differs from the DBKEY. Used for variant builds such as the hg19female build of hg19. |
    | DBKEY to assign to data | dbkey column in all_fasta | Identifier of the genome, the UCSC build ID for UCSC builds like hg19. |
    | Name of sequence | name column in all_fasta | Display name of the genome, displayed in Galaxy tool dropdowns. |
  6. Click Execute. In your history, you will see a new dataset for the data manager run. When the job has finished, go back to the Data Manager view on the Galaxy Admin page (Click Local Data).
  7. Click all_fasta under View Tool Data Table Entries

    You should see that sacCer2 has been added to all_fasta.

    Figure: populated all_fasta data table, which now includes sacCer2 in the genome list.
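
The mapping from tool form fields to all_fasta columns described in step 5 can be sketched as a function that assembles one tab-separated line. This is an illustration only, not the data manager's code; the function name is hypothetical and the path value is the one this tutorial's setup uses for sacCer2.

```python
def build_all_fasta_line(value, dbkey, name, path):
    """Join the four all_fasta columns into one tab-separated .loc line."""
    fields = [value, dbkey, name, path]
    for field in fields:
        # A tab or newline inside a field would corrupt the column layout.
        if "\t" in field or "\n" in field:
            raise ValueError(f"field contains a tab or newline: {field!r}")
    return "\t".join(fields)


line = build_all_fasta_line(
    value="sacCer2",  # "ID for sequence" (same as the DBKEY here)
    dbkey="sacCer2",  # "DBKEY to assign to data"
    name="S. cerevisiae June 2008 (SGD/sacCer2)",  # "Name of sequence"
    path="/srv/galaxy/var/tool-data/sacCer2/seq/sacCer2.fa",
)
print(line.split("\t"))
```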

Look at the file system to see the changes to the location file and the newly downloaded reference genome FASTA file.


What did this data manager execution change?

  • An entry for the sacCer2 genome has been added to the all_fasta.loc file for the data_manager_fetch_genome_all_fasta_dbkey data manager. The path to the location file can be found in the “shed” versions of the data manager config (/srv/galaxy/var/config/shed_data_manager_conf.xml) and data table config (/srv/galaxy/var/config/shed_tool_data_table_conf.xml)
  • The sacCer2 reference genome FASTA file has been downloaded to /srv/galaxy/var/tool-data/sacCer2/seq/sacCer2.fa
Code In: Bash

Get the path to the all_fasta.loc file:

grep 'data_manager_fetch_genome_dbkeys_all_fasta.*all_fasta.loc' /srv/galaxy/var/config/shed_tool_data_table_conf.xml
Code Out: Bash
        <file path="/srv/galaxy/var/tool-data/"/>
Code In: Bash

Let’s investigate the contents of the location file identified above (the path to yours may be different if you installed a different version of the data manager):

cat /srv/galaxy/var/tool-data/
Code Out: Bash
#This file lists the locations and dbkeys of all the fasta files
#under the "genome" directory (a directory that contains a directory
#for each build). The script will generate the file
#all_fasta.loc. This file has the format (white space characters are
#TAB characters):
#<unique_build_id>  <dbkey>     <display_name>  <file_path>
#So, all_fasta.loc could look something like this:
#apiMel3    apiMel3 Honeybee (Apis mellifera): apiMel3      /path/to/genome/apiMel3/apiMel3.fa
#hg19canon  hg19        Human (Homo sapiens): hg19 Canonical        /path/to/genome/hg19/hg19canon.fa
#hg19full   hg19        Human (Homo sapiens): hg19 Full         /path/to/genome/hg19/hg19full.fa
#Your all_fasta.loc file should contain an entry for each individual
#fasta file. So there will be multiple fasta files for each build,
#such as with hg19 above.
sacCer2	sacCer2	S. cerevisiae June 2008 (SGD/sacCer2)	/srv/galaxy/var/tool-data/sacCer2/seq/sacCer2.fa
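
The location file shown above is plain text: comment lines start with `#` and every other line holds tab-separated values in the column order declared for the table. A minimal reader, written as a sketch rather than as Galaxy's actual parser, could look like this:

```python
def parse_loc(text, columns, comment_char="#"):
    """Return one dict per data line, keyed by the table's column names."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(comment_char):
            continue  # skip blank lines and comments
        values = line.split("\t")
        if len(values) != len(columns):
            continue  # skip lines that do not match the column layout
        rows.append(dict(zip(columns, values)))
    return rows


SAMPLE = (
    "#<unique_build_id>\t<dbkey>\t<display_name>\t<file_path>\n"
    "sacCer2\tsacCer2\tS. cerevisiae June 2008 (SGD/sacCer2)\t"
    "/srv/galaxy/var/tool-data/sacCer2/seq/sacCer2.fa\n"
)

rows = parse_loc(SAMPLE, ["value", "dbkey", "name", "path"])
print(rows[0]["path"])  # /srv/galaxy/var/tool-data/sacCer2/seq/sacCer2.fa
```

A line that uses spaces instead of TABs would fail the column-count check here, which is the kind of hand-editing mistake data managers help avoid.
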
Code In: Bash

Finally, verify that we have some sequences in the genome fasta file:

grep '^>' /srv/galaxy/var/tool-data/sacCer2/seq/sacCer2.fa
Code Out: Bash

Download and install the BWA data manager

Having the genome is a prerequisite for our ultimate goal, which is to use the sacCer2 genome as a reference genome for the BWA tool. BWA, like many tools, needs an index of the reference genome, and has its own format for that index. Thankfully, the BWA/BWA-MEM data manager will build that index for us.

In this part we will repeat the same process as when we installed the Fetch Genome data manager, except that we will install the BWA/BWA-MEM data manager this time.

Hands On: Install the BWA/BWA-MEM Data Manager
  1. Access the Admin menu from the top bar
  2. Click Install and Uninstall, which can be found on the left, under Tool Management
  3. Enter bwa_mem_index in the search interface
  4. Click on the first hit, having devteam as owner
  5. Click the Install button for the latest revision

Build the BWA index for a reference genome

In this part we will actually build the BWA index for sacCer2. It will automatically be added to our list of available reference genomes in the BWA tool.

Hands On: Build the sacCer2 BWA index
  1. Access the Admin menu from the top bar
  2. Click Local Data, which can be found on the left, under Server
  3. Click BWA-MEM index - builder (Galaxy version 0.0.5) under Installed Data Managers.
    • param-select “Source Fasta Sequence”: S. cerevisiae June 2008 (SGD/sacCer2)
  4. Click Execute.
  5. Verify that the new BWA index for sacCer2 has been built and the .loc file has been filled in. From the Local Data page in the Admin section, click on bwa_mem_indexes under View Tool Data Table Entries

    S. cerevisiae sacCer2 should now appear in the list!


What changes were made by the BWA-MEM index builder?

  • An entry for the sacCer2 BWA/BWA-MEM index has been added to the bwa_mem_index.loc file for the data_manager_bwa_mem_index_builder data manager.
  • A symlink to the sacCer2 genome fasta file installed by the Fetch Genome data manager above was created in the sacCer2 BWA index directory, /srv/galaxy/var/tool-data/sacCer2/bwa_mem_index/sacCer2/.
  • The BWA/BWA-MEM index for the sacCer2 genome has been built.
Code In: Bash
ls -l /srv/galaxy/var/tool-data/sacCer2/bwa_mem_index/sacCer2
Code Out: Bash
total 20800
lrwxrwxrwx 1 galaxy galaxy       20 Dec  9 18:29 sacCer2.fa -> ../../seq/sacCer2.fa
-rw-r--r-- 1 galaxy galaxy       14 Dec  9 18:29 sacCer2.fa.amb
-rw-r--r-- 1 galaxy galaxy      591 Dec  9 18:29 sacCer2.fa.ann
-rw-r--r-- 1 galaxy galaxy 12163076 Dec  9 18:29 sacCer2.fa.bwt
-rw-r--r-- 1 galaxy galaxy  3040750 Dec  9 18:29 sacCer2.fa.pac
-rw-r--r-- 1 galaxy galaxy  6081552 Dec  9 18:29
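
The relative symlink in the listing above points back at the FASTA file installed by the Fetch Genome data manager. A short sketch confirms this by resolving the link target lexically (no file system access is needed, so symlinks elsewhere in the path are not followed):

```python
import posixpath


def resolve_link(link_dir, link_target):
    """Resolve a relative symlink target against the directory that holds the link."""
    return posixpath.normpath(posixpath.join(link_dir, link_target))


index_dir = "/srv/galaxy/var/tool-data/sacCer2/bwa_mem_index/sacCer2"
resolved = resolve_link(index_dir, "../../seq/sacCer2.fa")
print(resolved)  # /srv/galaxy/var/tool-data/sacCer2/seq/sacCer2.fa
```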

Installing and Using Data Managers from the command line with Ephemeris

The same process described in the previous section can also be performed from the command line, e.g. in a CI/CD pipeline, using Ephemeris. For a more in-depth look at Ephemeris, especially in the tool installation context, please see the Galaxy Tool Management with Ephemeris tutorial.

In order to accomplish this, you will need:

  • The URL of your Galaxy server
  • The API key for your account, which must be an admin

Galaxy admin accounts are specified as a comma-separated email list in the admin_users directive of galaxy.yml . If you have set up your Galaxy server using the Galaxy Installation with Ansible tutorial, this is set to

  1. In your browser, open your Galaxy homepage
  2. Log in, or register a new account, if it’s the first time you’re logging in
  3. Go to User -> Preferences in the top menu bar, then click on Manage API key
  4. If there is no current API key available, click on Create a new key to generate it
  5. Copy your API key to somewhere convenient; you will need it throughout this tutorial
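
As a rough illustration of how the server URL and API key are used together, the sketch below builds (but does not send) a request for one tool data table. The URL https://galaxy.example.org is a placeholder, and both the `/api/tool_data/<table>` path and the `x-api-key` header are assumptions that you should check against the API documentation of your Galaxy version.

```python
from urllib.request import Request


def build_tool_data_request(galaxy_url, api_key, table_name):
    """Prepare a GET request for one tool data table; nothing is sent here."""
    url = f"{galaxy_url.rstrip('/')}/api/tool_data/{table_name}"  # assumed endpoint
    return Request(url, headers={"x-api-key": api_key})  # assumed header name


# Placeholder server URL and key; replace both with your own values.
req = build_tool_data_request("https://galaxy.example.org/", "<api-key>", "all_fasta")
print(req.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would perform the call; that step is left out so the sketch runs without a server.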

Install the Fetch Genome Data Manager from the Tool Shed

Hands On: Install the Fetch Genome Data Manager with Ephemeris
  1. Re-activate the virtualenv you created for the ephemeris tool management tutorial.

    Code In: Bash
    . ~/ephemeris_venv/bin/activate

    If the virtualenv does not exist, you might need to re-run the steps to create it:

    python3 -m venv ~/ephemeris_venv
    . ~/ephemeris_venv/bin/activate
    pip install ephemeris
  2. Install the data_manager_fetch_genome_dbkeys_all_fasta data manager tool owned by devteam.

    Code In: Bash

    Be sure to adjust the value of -g appropriately for your Galaxy server, and replace the value of -a with your API key.

    shed-tools install -g -a <api-key> --name data_manager_fetch_genome_dbkeys_all_fasta --owner devteam
    Code Out
    Storing log file in: /tmp/ephemeris_x9xeu8ro
    (1/1) Installing repository data_manager_fetch_genome_dbkeys_all_fasta from devteam to section "None" at revision 4d3eff1bc421 (TRT: 0:00:00.401143)
    	repository data_manager_fetch_genome_dbkeys_all_fasta installed successfully (in 0:00:25.530604) at revision 4d3eff1bc421
    Installed repositories (1): [('data_manager_fetch_genome_dbkeys_all_fasta', '4d3eff1bc421')]
    Skipped repositories (0): []
    Errored repositories (0): []
    All repositories have been installed.
    Total run time: 0:00:25.932659

Download and install a reference genome sequence

Hands On: Download and install sacCer3 with Ephemeris
  1. Create a config file for run-data-managers named fetch-sacCer3.yml:

    data_managers:
      - id:
        params:
          - 'dbkey_source|dbkey': '{{ item.dbkey }}'
          - 'sequence_name': '{{ item.name }}'
          - 'reference_source|reference_source_selector': 'ucsc'
          - 'reference_source|requested_dbkey': '{{ item.dbkey }}'
        items:
          - dbkey: sacCer3
            name: 'S. cerevisiae Apr. 2011 (SacCer_Apr2011/sacCer3)'
        data_table_reload:
          - all_fasta
          - __dbkeys__

    The run-data-managers config file options correspond to the options in the data manager tool XML file. To locate the tool XML file for the Fetch Genomes data manager, you can search for it in the Tool Shed the same way you did when installing it via the UI. You can also open the data manager’s tool form in the UI as if to run it, and, using the drop-down menu at the top right of the tool form, click “See in Tool Shed”. From the Tool Shed, you can click the Development repository link and browse to the tool XML file, data_manager/data_manager_fetch_genome_all_fasta_dbkeys.xml.

    The components of the run-data-managers config file and their purpose:

    • id: Data manager full (shed) tool ID, this can be found in shed_data_manager_conf.xml
    • params: Data manager tool params, these correspond to <param> tags in the tool XML file. Nested parameters are specified using a pipe character (|).
    • param dbkey_source|dbkey: Value of <param name="dbkey" ...> in <conditional name="dbkey_source">.
    • param sequence_name: Value of <param name="sequence_name" ...>.
    • param reference_source|reference_source_selector: Value of <param name="reference_source_selector" ...> in <conditional name="reference_source">.
    • param reference_source|requested_dbkey: Value of <param name="requested_dbkey" ...> in <conditional name="reference_source">.
    • items: A list of variables to template in to params, referenced in param fields with {{ item }}. In the case of genomes, for example, you can run this DM with multiple genomes, or you could give multiple URLs.
    • data_table_reload: Names of the data tables you want to reload after your DMs are finished running. This can be important for subsequent data managers.
  2. Run the Genome Fetch DM with run-data-managers:

    Code In: Bash
    run-data-managers -g -a <api-key> --config fetch-sacCer3.yml
    Code Out
    Storing log file in: /tmp/ephemeris_f6klyy7v
    Running data managers that populate the following source data tables: ['all_fasta']
    Dispatched job 1. Running DM: "" with parameters: {'dbkey_source|dbkey': 'sacCer3', 'dbkey_source|dbkey_name': 'S. cerevisiae Apr. 2011 (SacCer_Apr2011/sacCer3)', 'reference_source|reference_source_selector': 'ucsc', 'reference_source|requested_dbkey': 'sacCer3'}
    Job 1 finished with state ok.
    Running data managers that index sequences.
    Finished running data managers. Results:
    Successful jobs: 1
    Skipped jobs: 0
    Failed jobs: 0
  3. In the Galaxy UI, access the Admin menu from the top bar
  4. Click Local Data, which can be found on the left, under Server
  5. Click all_fasta under View Tool Data Table Entries

    You should see that sacCer3 has been added to all_fasta.

    Figure: populated all_fasta data table with sacCer3.


    Can we view the changes from the command line? Where are they located?

    The all_fasta.loc file contains the entry for our new genome. If you need a reminder on how to locate it, see the What did this data manager execution change? question in the UI section above.

    Code In: Bash
    cat /srv/galaxy/var/tool-data/
    Code Out: Bash
    #This file lists the locations and dbkeys of all the fasta files
    #under the "genome" directory (a directory that contains a directory
    #for each build). The script will generate the file
    #all_fasta.loc. This file has the format (white space characters are
    #TAB characters):
    #<unique_build_id>  <dbkey>     <display_name>  <file_path>
    #So, all_fasta.loc could look something like this:
    #apiMel3    apiMel3 Honeybee (Apis mellifera): apiMel3      /path/to/genome/apiMel3/apiMel3.fa
    #hg19canon  hg19        Human (Homo sapiens): hg19 Canonical        /path/to/genome/hg19/hg19canon.fa
    #hg19full   hg19        Human (Homo sapiens): hg19 Full         /path/to/genome/hg19/hg19full.fa
    #Your all_fasta.loc file should contain an entry for each individual
    #fasta file. So there will be multiple fasta files for each build,
    #such as with hg19 above.
    sacCer2	sacCer2	S. cerevisiae June 2008 (SGD/sacCer2)	/srv/galaxy/var/tool-data/sacCer2/seq/sacCer2.fa
    sacCer3	sacCer3	S. cerevisiae Apr. 2011 (SacCer_Apr2011/sacCer3)	/srv/galaxy/var/tool-data/sacCer3/seq/sacCer3.fa
Warning: run-data-managers is not idempotent!

Unlike shed-tools install, the Ephemeris run-data-managers utility is not idempotent. If you run it a second time on the same set of inputs, you will end up with two entries in your all_fasta data table, with the data from the second run overwriting the data from the first run.

Please see Galaxy issue #15188 for details.
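
One way to guard against duplicate entries is to compare the items you are about to run with the rows already in the data table and drop the ones that are present. The helper below is a hypothetical client-side check, not an Ephemeris feature, and matching on the dbkey column alone is a simplifying assumption.

```python
def items_to_run(items, existing_rows, key="dbkey"):
    """Keep only the items whose key is not yet present in the data table rows."""
    present = {row[key] for row in existing_rows}
    return [item for item in items if item[key] not in present]


# Rows as they might already exist in all_fasta after the steps above.
existing = [
    {"value": "sacCer2", "dbkey": "sacCer2"},
    {"value": "sacCer3", "dbkey": "sacCer3"},
]
# Hypothetical wish list: sacCer3 is already installed, sacCer1 is not.
wanted = [{"dbkey": "sacCer3"}, {"dbkey": "sacCer1"}]

print(items_to_run(wanted, existing))  # [{'dbkey': 'sacCer1'}]
```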

Download and install the BWA data manager

Hands On: Install the BWA/BWA-MEM Data Manager
  1. Install the bwa_mem_index_builder_data_manager data manager tool owned by devteam.

    Code In: Bash
    shed-tools install -g -a <api-key> --name data_manager_bwa_mem_index_builder --owner devteam
    Code Out
    Storing log file in: /tmp/ephemeris_2dyujjvi
    (1/1) Installing repository data_manager_bwa_mem_index_builder from devteam to section "None" at revision 63d5652be07a (TRT: 0:00:00.204350)
    	repository data_manager_bwa_mem_index_builder installed successfully (in 0:00:05.599382) at revision 63d5652be07a
    Installed repositories (1): [('data_manager_bwa_mem_index_builder', '63d5652be07a')]
    Skipped repositories (0): []
    Errored repositories (0): []
    All repositories have been installed.
    Total run time: 0:00:05.804217

Build the BWA index for a reference genome

Hands On: Build the sacCer3 BWA index
  1. Create a config file for run-data-managers named build-sacCer3-bwa.yml:

    data_managers:
      - id:
        params:
          - 'all_fasta_source': '{{ item.dbkey }}'
          - 'sequence_name': '{{ item.name }}'
        items:
          - dbkey: sacCer3
            name: 'S. cerevisiae Apr. 2011 (SacCer_Apr2011/sacCer3)'
        data_table_reload:
          - bwa_mem_indexes
  2. Run the BWA-MEM index builder DM with run-data-managers:

    Code In: Bash
    run-data-managers -g -a <api-key> --config build-sacCer3-bwa.yml
    Code Out
    Storing log file in: /tmp/ephemeris_esecdef4
    Running data managers that populate the following source data tables: ['all_fasta']
    Running data managers that index sequences.
    Dispatched job 2. Running DM: "" with parameters: {'all_fasta_source': 'sacCer3', 'sequence_name': 'S. cerevisiae Apr. 2011 (SacCer_Apr2011/sacCer3)'}
    Job 2 finished with state ok.
    Finished running data managers. Results:
    Successful jobs: 1
    Skipped jobs: 0
    Failed jobs: 0
  3. In the Galaxy UI, access the Admin menu from the top bar
  4. Click Local Data, which can be found on the left, under Server
  5. Click bwa_mem_indexes under View Tool Data Table Entries

    You should see that sacCer3 has been added to bwa_mem_indexes. You can also verify this via the command line using the methods you’ve learned above.
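
The way each entry under items is templated into the params of a run-data-managers config can be sketched in a few lines. This is a simplified stand-in, not Ephemeris's implementation: it only handles `{{ item.<field> }}` placeholders, and it assumes the sequence name placeholder in the config is `{{ item.name }}`.

```python
import re

# Matches placeholders such as "{{ item.dbkey }}" and captures the field name.
PLACEHOLDER = re.compile(r"\{\{\s*item\.(\w+)\s*\}\}")


def render_params(params, item):
    """Fill every placeholder in the params list with values from one item."""
    rendered = {}
    for param in params:
        for key, template in param.items():
            rendered[key] = PLACEHOLDER.sub(lambda m: item[m.group(1)], template)
    return rendered


params = [
    {"all_fasta_source": "{{ item.dbkey }}"},
    {"sequence_name": "{{ item.name }}"},  # assumed placeholder
]
items = [
    {"dbkey": "sacCer3", "name": "S. cerevisiae Apr. 2011 (SacCer_Apr2011/sacCer3)"},
]

# One job per item, mirroring the parameters reported in the run log above.
jobs = [render_params(params, item) for item in items]
print(jobs[0])
```

Adding a second entry to `items` would produce a second job with the same params, which is how one config can install several genomes.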

Verify that BWA can access the new reference data

Finally, we will verify that the BWA tool can see the new genome indexes.

Hands On: Configure Galaxy
  1. Install the BWA and BWA-MEM tools, if needed. If installing with Ephemeris, the repo name for the --name parameter is bwa and the owner is devteam.

    1. Open Galaxy in your browser and type bwa in the tool search box on the left. If “Map with BWA” is among the search results, you can skip the following steps.
    2. Access the Admin menu from the top bar (you need to be logged-in with an email specified in the admin_users setting)
    3. Click “Install and Uninstall”, which can be found on the left, under “Tool Management”
    4. Enter bwa in the search interface
    5. Click on the first hit, having devteam as owner
    6. Click the “Install” button for the latest revision
    7. Enter “Mapping” as the target section and click “OK”.

  2. Click the “Home” icon at the top to return to the Galaxy analysis interface

  3. Load the Map with BWA tool and verify that the sacCer2 genome (if you completed the “Galaxy UI” section) and/or the sacCer3 genome (if you completed the “command line with Ephemeris” section) appears in the param-select “Using reference genome” option.

How cool is that? No editing .loc files, no making sure you’ve got TABs instead of spaces. Fully auto!

