How to reproduce published Galaxy analyses
Authors: Melanie Föll, Anne Fouilloux
Overview
Questions:
- How to reproduce published Galaxy results (workflows and histories)
Objectives:
- Learn how to load published data into Galaxy
- Learn how to run a published Galaxy workflow
- Learn how histories can be inspected and re-used.
Time estimation: 1 hour
Level: Introductory
Published: Aug 25, 2021
Last modification: Oct 15, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00192
Revision: 12
This training will demonstrate how to reproduce analyses performed in the Galaxy framework. Before we start with the hands-on part, we would like to give you some information about Galaxy.
Galaxy is a scientific workflow, data integration and data analysis and publishing platform. Galaxy is an open-source platform for accessible, reproducible, and transparent computational research. While Galaxy was started to allow non-bioinformaticians to analyze DNA sequencing data, it nowadays enables analysis tasks in many different domains including machine learning, ecology, climate science and omics-type analyses. Galaxy is easy to use because it is accessible via a web browser and provides a graphical user interface which enables access to pre-installed tools and large computational resources. In Galaxy, all analyses are stored in so-called histories. The history keeps track of all the tools, tool versions and parameters that were used in the analysis. From such a history, a workflow can be extracted; this workflow can be used to easily repeat the analysis on different data. Both histories and workflows can be shared either privately with colleagues or publicly, for example as part of a published manuscript.
For more background information about Galaxy, have a look at the Galaxy publication (Afgan et al. 2018). In-depth technical details about technologies that enable reproducible analyses within Galaxy are described in Grüning et al. 2018.
What does Galaxy look like?
Many different Galaxy servers exist. Some are public, some are private, some focus on a specific topic and others like the usegalaxy.* servers cover a broad range of tools. To reproduce published results it is highly recommended to use the same Galaxy server that was used in the original study. In the case that this was a private server that is not accessible to you, you might want to use one of the main Galaxy servers: UseGalaxy.fr, UseGalaxy.eu, UseGalaxy.org.au, UseGalaxy.org. To learn more about the different Galaxy servers visit the slides: options for using Galaxy. The particular Galaxy server that you are using may look slightly different than the one shown in this training. Galaxy instance administrators can choose the exact version of Galaxy they would like to offer and can customize its look to some extent. The basic functionality will be rather similar across instances, so don’t worry! In this training we will use the European Galaxy server on which the original analysis was performed and shared.
Hands-on: Log in or register
- Open your favorite browser (Chrome/Chromium, Safari or Firefox, but not Internet Explorer/Edge!)
- Browse to the Galaxy Europe instance or to a Galaxy instance of your choosing
- Choose Login or Register from the navigation bar at the top of the page
If you have previously registered an account with this particular instance of Galaxy (user accounts are not shared between public servers!), proceed by logging in with your registered public name, or email address, and your password.
If you need to create a new account, click on Register here instead.
The Galaxy interface consists of three main parts:
- The available tools are listed on the left
- Your analysis history is recorded on the right
- The central panel will let you run analyses and view outputs
Create a history and load data into it
Each analysis in Galaxy starts by creating a new analysis history and loading data into it. Galaxy supports a huge variety of data types and data sources. Different ways of bringing data into Galaxy are explained in the interface slides. To reproduce published results, the data needs to be loaded from the public repository where the authors have deposited the data. This is most often done by importing data via a web link.
Hands-on: Create history
Make sure you start from an empty analysis history.
To create a new history simply click the new-history icon at the top of the history panel:
Rename your history to be meaningful and easy to find. For instance, you can choose Reproduction of published Galaxy results as the name of your new history.
- Click on galaxy-pencil (Edit) next to the history name (which by default is “Unnamed history”)
- Type the new name:
Reproduction of published Galaxy results
- Click on Save
- To cancel renaming, click the galaxy-undo “Cancel” button
If you do not have the galaxy-pencil (Edit) next to the history name (which can be the case if you are using an older version of Galaxy) do the following:
- Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel
- Type the new name:
Reproduction of published Galaxy results
- Press Enter
Comment: Background about the dataset
The Iris flower data set, also known as Fisher’s or Anderson’s Iris data set, is a multivariate dataset introduced by the British statistician and biologist Ronald Fisher in his 1936 paper (Fisher 1936). Each row of the table represents an iris flower sample, describing its species and the dimensions in centimeters of its botanical parts, the sepals and petals. You can find more detailed information about this dataset on its dedicated Wikipedia page.
Hands-on: Data upload
Import the file iris.csv from Zenodo or from the data library (ask your instructor):
https://zenodo.org/record/1319069/files/iris.csv
- Copy the link location
Click galaxy-upload Upload Data at the top of the tool panel
- Select galaxy-wf-edit Paste/Fetch Data
Paste the link(s) into the text field
Press Start
- Close the window
Rename galaxy-pencil the dataset to iris
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, change the Name field
- Click the Save button
- Check the datatype
- Click on the history item to expand it to get more information.
- The datatype of the iris dataset should be csv.
- Change galaxy-pencil the datatype if it is different than csv.
- Option 1: Datatypes can be autodetected
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, click on the galaxy-chart-select-data Datatypes tab on the top
- Click the Auto-detect button to have Galaxy try to autodetect it.
- Option 2: Datatypes can be manually set
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, click on the galaxy-chart-select-data Datatypes tab on the top
- In the galaxy-chart-select-data Assign Datatype, select csv from “New type” dropdown
- Tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
Add an #iris tag galaxy-tags to the dataset
Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.
To tag a dataset:
- Click on the dataset to expand it
- Click on Add Tags galaxy-tags
- Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
- Press Enter
- Check that the tag appears below the dataset name
Tags beginning with # are special!
They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parentheses correspond to dataset numbers in the figure below):
- a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
- dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
- datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
- datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.
Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.
The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.
More information is in a dedicated #nametag tutorial.
Make sure the tag starts with a hash symbol (#), which will make the tag stick not only to this dataset, but also to any results derived from it. This will help you make sense of your history.
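The propagation rule described above can be sketched as a toy model in Python. This is not Galaxy's implementation: the `run_tool` function and the dictionary layout of a dataset are hypothetical and exist only to illustrate that tags starting with `#` are inherited by derived datasets while other tags are not.

```python
def run_tool(tool_name, inputs):
    """Toy model: return a new dataset that inherits name tags from its inputs."""
    inherited = set()
    for dataset in inputs:
        # Only tags starting with '#' (name tags) propagate to outputs.
        inherited.update(tag for tag in dataset["tags"] if tag.startswith("#"))
    return {"name": tool_name, "tags": sorted(inherited)}


# Hypothetical datasets: one coverage track tagged #plus, one untagged gene list.
coverage_plus = {"name": "coverage (+)", "tags": ["#plus", "draft"]}
genes = {"name": "genes", "tags": []}

peaks_plus = run_tool("Macs2", [coverage_plus])
overlap_plus = run_tool("BedTools Intersect", [peaks_plus, genes])

print(peaks_plus["tags"])
print(overlap_plus["tags"])
```

In this sketch the `#plus` tag travels through both tool runs, whereas the ordinary `draft` tag stays on the original dataset only.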
Some input datasets might need more specialized treatment than explained here. A few data types contain more than one subfile. These are uploaded via the composite data function, which is a separate tab to the right of the regular upload tab. At the bottom, set “composite type” to your file format. For each subfile, a select box appears with a description of which subfile has to be selected there.
Some workflows require input files as dataset collections; in such cases, “Input dataset collection” is shown as input when editing or viewing the workflow in the workflow menu. A collection contains several single datasets of the same type tied together. If a workflow input requires a collection, you’ll need to build a collection out of your files after uploading them. A specialized training explains how to use collections.
- Click on galaxy-selector Select Items at the top of the history panel
- Check all the datasets in your history you would like to include
Click n of N selected and choose Build Dataset List
- Enter a name for your collection
- Click Create collection to build your collection
- Click on the checkmark icon at the top of your history again
In case you want to run a published Galaxy workflow on your own data, you can find explanations about the options to upload your own data in the interface slides.
Import and run a Galaxy workflow
Galaxy workflows may be published either directly via the Galaxy server or on public workflow repositories such as WorkflowHub. Thus the workflow may be provided in one of three ways:
- As a .ga file or URL link, which needs to be imported into Galaxy
- As a link from a personal Galaxy server account, which needs to be added to your own Galaxy account
- As a link that directly starts running the workflow on a specific Galaxy server, which is possible via the WorkflowHub website.
This tutorial follows option 1, but options 2 and 3 are no more difficult.
The following is not part of the training, but is provided in case you received a workflow of interest via option 2 or 3.
Link from a personal Galaxy: In case you received a link from a personal Galaxy user account
- you need to log into exactly the same Galaxy server from where the workflow link is shared, which should be clear from the start of the link, e.g. “https://usegalaxy.eu/…”
- Click on the link, then click the plus symbol (import workflow) in the upper right
- Continue with Step 2 of the following hands-on box
Link from WorkflowHub: In WorkflowHub there is an option to directly run a workflow.
- Make sure that you prepared your input data on the same server as specified in the run button.
- Click on “Run on usegalaxy.eu”
- Select inputs
- Click “Run workflow”
Hands-on: Import and run workflow available as .ga file or link
Import the workflow either via url directly from Zenodo or by uploading the .ga file
https://zenodo.org/record/5090049/files/main_workflow.ga
- Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
- Click on galaxy-upload Import at the top-right of the screen
- Provide your workflow
- Option 1: Paste the URL of the workflow into the box labelled “Archived Workflow URL”
- Option 2: Upload the workflow file in the box labelled “Archived Workflow File”
- Click the Import workflow button
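Before importing, it can be useful to see what a .ga file contains. The following is a minimal sketch, assuming the usual JSON layout of an exported Galaxy workflow (a top-level `name` and a `steps` mapping whose entries carry `type`, `name`, `tool_id` and `tool_version`). The `ga_text` content is a hypothetical, heavily trimmed stand-in and not the workflow used in this tutorial; `summarize_workflow` is an illustrative helper, not part of Galaxy.

```python
import json

# Hypothetical, heavily trimmed stand-in for the contents of a .ga file.
ga_text = """
{
  "a_galaxy_workflow": "true",
  "name": "Example workflow",
  "steps": {
    "0": {"type": "data_input", "name": "Input dataset",
          "tool_id": null, "tool_version": null},
    "1": {"type": "tool", "name": "Example tool",
          "tool_id": "example_tool", "tool_version": "1.0"}
  }
}
"""


def summarize_workflow(text):
    """Return the workflow name and one (index, type, name, version) row per step."""
    workflow = json.loads(text)
    steps = workflow["steps"]
    rows = []
    # Step keys are strings such as "0", "1", ...; sort them numerically.
    for key in sorted(steps, key=int):
        step = steps[key]
        rows.append((int(key), step["type"], step["name"], step.get("tool_version")))
    return workflow["name"], rows


name, rows = summarize_workflow(ga_text)
print(name)
for row in rows:
    print(row)
```

To try this on a downloaded workflow, replace `ga_text` with the text read from the .ga file; the recorded tool versions are what allow the analysis to be repeated with the same tools.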
Start the workflow by clicking on the run symbol in the last column of the workflow overview list
Select the iris dataset as the input dataset.
Run the workflow by clicking on run workflow
Question
How many history items do you have after running the workflow?
12, out of which 4 are shown and 8 hidden (see the top of the history, right under the history name)
Some workflow outputs are considered less important intermediate results and are therefore marked to be hidden in the analysis history once they have turned green. This makes it easier to navigate the main results, which remain visible in the history. Hidden datasets can be made visible individually by clicking on the eye with a slash at the top of the history and then clicking “Unhide” for the individual datasets. To unhide many datasets at once, click galaxy-selector “Select Items” at the top left of the history, select all hidden datasets that you would like to unhide, then click “For n of N selected” and then “Unhide”.
By starting the workflow all jobs are sent to the Galaxy cluster for analysis. Sometimes it can take a bit until the datasets show up in your history. The jobs are processed one after the other or in parallel if the same input is used for several steps. Grey means waiting to run, yellow means running and green means finished. Red means there was an error.
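As a quick reference, the colour coding above can be written down as a small lookup table. This is only an illustrative sketch: the names `STATE_BY_COLOR`, `summarize` and `all_jobs_ended` are hypothetical and do not correspond to anything in Galaxy itself.

```python
# Colour of a history item and what it means, as described in the text.
STATE_BY_COLOR = {
    "grey": "waiting to run",
    "yellow": "running",
    "green": "finished",
    "red": "error",
}


def summarize(colors):
    """Count history items per state, given the colour of each item."""
    counts = {}
    for color in colors:
        state = STATE_BY_COLOR[color]
        counts[state] = counts.get(state, 0) + 1
    return counts


def all_jobs_ended(colors):
    """True when no item is still waiting or running."""
    return all(color in ("green", "red") for color in colors)


history = ["green", "green", "yellow", "grey"]  # hypothetical snapshot
print(summarize(history))
print(all_jobs_ended(history))
```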
When something goes wrong in Galaxy, there are a number of things you can do to find out what it was. Error messages can help you figure out whether it was a problem with one of the settings of the tool, or with the input data, or maybe there is a bug in the tool itself and the problem should be reported. Below are the steps you can follow to troubleshoot your Galaxy errors.
- Expand the red history dataset by clicking on it.
- Sometimes you can already see an error message here
View the error message by clicking on the bug icon galaxy-bug
- Check the logs. Output (stdout) and error logs (stderr) of the tool are available:
- Expand the history item
- Click on the details icon
- Scroll down to the Job Information section to view the 2 logs:
- Tool Standard Output
- Tool Standard Error
- For more information about specific tool errors, please see the Troubleshooting section
- If you are still unsure what the problem is, submit a bug report!
- Click on the bug icon galaxy-bug
- Write down any information you think might help solve the problem
- See this FAQ on how to write good bug reports
- Click galaxy-bug Report button
- Ask for help!
- Where?
- In the GTN Matrix Channel
- In the Galaxy Matrix Channel
- Browse the Galaxy Help Forum to see if others have encountered the same problem before (or post your question).
- When asking for help, it is useful to share a link to your history
Comment: The tool stays grey
This scenario will likely not happen with this training analysis, but might happen with a real workflow. The tool runs are scheduled on the computing infrastructure according to their consumption of cores and memory. Thus, tools that are assigned lots of cores and/or memory need to wait until an appropriate computing slot is available. Several Galaxy servers display the current computational load, which gives you an idea of how busy the server currently is.
Each dataset that turns green can already be inspected even though later datasets are still running. The second part of the training will focus on how to inspect datasets in a history.
Inspecting the analysis history
Each history item represents one file or dataset, except when collections are used. History items are numbered; duplicates are not possible because any type of dataset manipulation will automatically generate a separate dataset in the history. Some tools produce several output files and therefore the number of history items can be larger than the number of steps in a workflow. Dataset names in the analysis history are not relevant for the tool run, so they can be adjusted in order to make the history easier to understand. The default name of a history item is composed of the name of the tool that was run and the history item number(s) of the input file(s), e.g. ‘Unique on data 5’.
Hands-on: Inspect history
Visualize the two scatter plots by clicking on their eye icons (view data)
If you would like to view two or more datasets at once, you can use the Window Manager feature in Galaxy:
- Click on the Window Manager icon galaxy-scratchbook on the top menu bar.
- You should see a little checkmark on the icon now
- View galaxy-eye a dataset by clicking on the eye icon galaxy-eye to view the output
- You should see the output in a window overlaid over Galaxy
- You can resize this window by dragging the bottom-right corner
- Click outside the file to exit the Window Manager
- View galaxy-eye a second dataset from your history
- You should now see a second window with the new dataset
- This makes it easier to compare the two outputs
- Repeat this for as many files as you would like to compare
- You can turn off the Window Manager galaxy-scratchbook by clicking on the icon again
Show all datasets by clicking on hidden on top of your history, right below the history name
Compare the Convert csv to tabular file with the Datamash file side by side using the scratchbook
Track how the Datamash results were obtained by clicking on the Datamash item in the history and then on its i icon (view details). The performed operations can be found in the section Tool parameters
Question
- What are the different Iris species?
- How many lines does the Convert csv to tabular file have?
- Which column was used for grouping during the Datamash operation?
- Which column of the Remove beginning file contains sepal length and which petal length?
- The 3 different Iris species are:
- setosa
- versicolor
- virginica
- 151 lines (by clicking on the file one can see the line count under its name)
- Column 5 (details of Datamash tool: Group by fields - 5)
- Column 1 and 3 (the dataset was generated by removing the header line from data 2, thus the content of the columns is the same as in data file 2)
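The checks behind these answers can also be sketched outside Galaxy. The snippet below is a minimal illustration on a tiny five-row sample laid out like iris.csv, not the full dataset: the header names in `SAMPLE_CSV` are an assumption, and counting rows per group is chosen here only for illustration, since the operations actually performed by the published Datamash step are listed under its Tool parameters.

```python
import csv
import io

# Tiny illustrative sample in the same column layout as iris.csv
# (header names are an assumption; this is not the full dataset).
SAMPLE_CSV = """Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
"""

rows = list(csv.reader(io.StringIO(SAMPLE_CSV)))  # parse the CSV into rows
line_count = len(rows)                            # includes the header line
data = rows[1:]                                   # drop the header line

# Group by column 5 (index 4) and count the rows in each group.
counts = {}
for row in data:
    counts[row[4]] = counts.get(row[4], 0) + 1

# Columns 1 and 3 (indices 0 and 2) hold sepal length and petal length.
pairs = [(float(row[0]), float(row[2])) for row in data]

print(line_count)
print(sorted(counts))
print(pairs[0])
```

On the real file the same line count would include the header, which is why the tabular file has 151 lines for 150 flower samples.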
If you got the same answers as written in the above solution box, then congratulations! You have imported and fully reproduced the results of a previous analysis.
However, sometimes you may wish to do more, like…
Manipulating the analysis
Maybe you are interested in changing some of the original tool parameters to see how this changes the result. The parameter changes can be made either inside the workflow editor, which makes it easy to change many parameters at once (training on Using Workflow Parameters), or directly in the history. A single analysis step can be repeated with new parameters directly in the Galaxy history with the re-run option.
Hands-on: Manipulate the analysis steps
Rerun the Scatterplot to plot Sepal length vs. Petal length
- Expand one of the output datasets of the tool (by clicking on it)
- Click re-run galaxy-refresh the tool
This is useful if you want to run the tool again but with slightly different parameters, or if you just want to check which parameter setting you used.
- “Column to plot on x-axis”:
1
- “Column to plot on y-axis”:
3
Additional Exercise: Import a published analysis history and explore it
Often not only workflows and raw data are published, but also the full Galaxy histories. These histories can either be inspected via their provided link, or imported in order to enable manipulating them in your own Galaxy account.
Hands-on: Importing a history
Import a published history shared via an EU Server account
https://usegalaxy.eu/u/annefou/h/galaxy101-for-everyone-diamond-dataset
- Open the link to the shared history
- Click on the Import this history button on the top left
- Enter a title for the new history
- Click on Copy History
The diamonds dataset comes from the well-known ggplot2 package developed by Hadley Wickham and was initially collected from the Diamond Search Engine in 2008.
The original dataset consists of 53940 specimens of diamonds, for which it lists the prices and various properties.
For this training, we have created a simpler dataset from the original, in which only the five columns relating to the price and the so-called 4 Cs (carat, cut, color and clarity) of diamond characteristics have been retained. The same workflow as before was used to re-analyze this second dataset and create the analysis history, which highlights the re-usability of workflows across inputs.
Comment: The 4 Cs of diamond grading
According to the GIA’s (Gemological Institute of America) diamond grading system
- Carat refers to the weight of the diamond when measured on a scale
- Cut refers to the quality of the cut and can take the grades Fair, Good, Very Good, Premium and Ideal
- Color describes the overall tint, or lack thereof, of the diamond from colorless/white to yellow and is given on a letter scale ranging from D to Z (D being the best, known as colorless).
- Clarity describes the amount and location of naturally occurring “inclusions” found in nearly all diamonds on a scale of eleven grades ranging from Flawless (the ideal situation) to I3 (Included level 3, the worst quality).
Question
- What are the different Cut categories?
- How many lines does the diamonds.csv file have?
- Is the color an important factor for the Diamond price?
- The 5 different Cut categories are:
- Fair
- Good
- Ideal
- Premium
- Very Good
- 53940 lines (by clicking on the file one can see the line count under its name)
- We can create a new scatter plot and use color as a factor (Advanced, column differentiating the different groups: 3). Then, holding carat weight constant, we see on the scatter plot that color is linked to the price of the diamond. So color also explains a lot of the variance found in price!
Conclusion
Well done! You have just reproduced your first analysis in Galaxy.