- Which coding exon has the highest number of single nucleotide polymorphisms (SNPs) on human chromosome 22?
- Familiarize yourself with the basics of Galaxy
- Learn how to obtain data from external sources
- Learn how to run tools
- Learn how histories work
- Learn how to create a workflow
- Learn how to share your work
Time estimation: 1-1.5h
This practical aims to familiarize you with the Galaxy user interface. It will teach you how to perform basic tasks such as importing data, running tools, working with histories, creating workflows, and sharing your work.
In this tutorial, we will:
- Galaxy management
Suppose you get the following question:
Which coding exon has the highest number of single nucleotide polymorphisms (SNPs) on human chromosome 22?
You are thinking “Wow! This is a simple question… I know where to find the data, at the UCSC Genome Browser, but how do I actually compute this?” There is really a straightforward way of answering this question and it is called Galaxy. So let’s try it…
Browse to your Galaxy instance and log in or register. The Galaxy interface consists of three main parts. The available tools are listed on the left, your analysis history is recorded on the right, and the middle pane will show the home page, a tool form or some dataset content.
Hands-on: Create history
Make sure you start from an empty analysis history.
Creating a new history
- Click the gear icon at the top of the history panel
- Select the option Create New from the menu
Rename your history to be meaningful and easy to find. You can do this by clicking on the title of the history (which by default is Unnamed history) and typing Galaxy 101 as the name. Do not forget to hit the Enter key on your keyboard to save it.
Upload exon locations
We are now ready to perform our analysis, but first we need to get some data into our history. You can upload files from your computer, but Galaxy can also fetch data directly from external sources. We will now import a list of all the exon locations on chromosome 22 directly from the UCSC table browser.
Hands-on: Data upload from UCSC
In the tool menu, navigate to
Get Data -> UCSC Main - table browser
You will be taken to the UCSC table browser, which looks something like this:
- clade should be set to
- genome should be set to
- assembly should be set to
Dec. 2013 (GRCh38/hg38)
- group should be set to
Genes and Gene Predictions
- track should be set to
- table should be set to
- region should be changed to
- output format should be changed to
BED - browser extensible data
- Send output to should have the option
Click on the get output button and you will see the next screen:
Change Create one BED record per to Coding Exons and then click on the Send query to Galaxy button.
After this you will see your first history item in Galaxy’s right pane. It will go through the gray (preparing/queued) and yellow (running) states to become green (success):
When the dataset is green, click on the eye icon to view the contents of the file. It should look something like this:
Each line represents an exon. The first three columns are the genomic location, and the fourth column contains the name of the exon.
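To make the column layout concrete, here is a minimal sketch of reading one such line in Python. The example line and the names `BedRecord` and `parse_bed_line` are made up for illustration; they are not part of Galaxy or UCSC. The only outside fact it relies on is that BED start positions are 0-based and end positions are exclusive.

```python
from typing import NamedTuple


class BedRecord(NamedTuple):
    """The first four columns of a BED line (illustrative helper type)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


def parse_bed_line(line: str) -> BedRecord:
    """Split one tab-separated BED line and keep columns 1 to 4."""
    fields = line.rstrip("\n").split("\t")
    return BedRecord(fields[0], int(fields[1]), int(fields[2]), fields[3])


# Hypothetical line, not taken from the real UCSC output.
example = "chr22\t1000\t1200\texon_A\t0\t+"
record = parse_bed_line(example)
print(record.name, record.end - record.start)  # exon name and its length in bases
```

Because the end coordinate is exclusive, the length of the interval is simply `end - start`.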
Let’s rename our dataset to something more recognizable.
- Click on the pencil icon to edit the dataset attributes.
- In the next screen change the name of the dataset to
- Click the Save button at the bottom of the screen.
Your history should now look something like this:
Upload SNP information
Now we have information about the exon locations, but our question was which exon contains the largest number of SNPs, so let’s get some information about SNP locations from UCSC as well:
Hands-on: SNP information
UCSC Main : Return to the UCSC tool
UCSC Main - table browser
Change the setting in group to Variation and again region to
The track setting shows the version of the SNP database to get. In this example it is version 150, but you may select the latest one. Your results may vary slightly from the ones in this tutorial when you select a different version, but in general it is a good idea to select the latest version, as this will contain the most up-to-date SNP information.
Click on the get output button to find a menu similar to this:
Make sure that Create one BED record per is set to Whole Gene (Whole Gene here really means Whole Feature), and click on Send query to Galaxy. You will get your second item in your analysis history.
Now rename your new dataset to SNPs so we can easily remember what the file contains.
Find exons with the highest number of SNPs
Let’s remind ourselves that our objective is to find which exon contains the most SNPs. Therefore we have to join the file with the exon locations with the file containing the SNP locations (here “join” is just a fancy word for printing the SNPs and exons that overlap side-by-side).
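If you are curious what such a join does conceptually, the sketch below pairs every exon with every SNP that overlaps it. It is only an illustration under simplifying assumptions: the data is made up, each record has four columns instead of six, and the nested loop is not how Galaxy's Join tool is implemented.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def inner_join(exons, snps):
    """Return one combined row per (exon, SNP) pair that overlaps.

    Exons without SNPs and SNPs outside exons produce no row,
    which is what makes this an inner join.
    """
    joined = []
    for exon in exons:
        for snp in snps:
            same_chrom = exon[0] == snp[0]
            if same_chrom and overlaps(exon[1], exon[2], snp[1], snp[2]):
                joined.append(exon + snp)  # exon columns first, SNP columns after
    return joined


# Hypothetical records: (chromosome, start, end, name)
exons = [
    ("chr22", 100, 200, "exon_A"),
    ("chr22", 300, 400, "exon_B"),
    ("chr22", 500, 600, "exon_C"),
]
snps = [
    ("chr22", 150, 151, "snp_1"),
    ("chr22", 199, 200, "snp_2"),
    ("chr22", 350, 351, "snp_3"),
    ("chr22", 450, 451, "snp_4"),
]

joined = inner_join(exons, snps)
for row in joined:
    print("\t".join(str(value) for value in row))
```

In this toy data, exon_C has no SNPs and snp_4 lies outside every exon, so neither appears in the output.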
Different Galaxy servers may have tools available under different sections, therefore it is often useful to use the search bar at the top of the tool panel to find your tool.
Hands-on: Finding Exons
Join : Enter the word join in the search bar of the tool panel, and select the tool named Join - the intervals of two datasets side-by-side
Select the Exons dataset as the first dataset and the SNPs dataset as the second dataset, and make sure return is set to INNER JOIN so that only matches are included in the output (i.e. only exons with SNPs in them and only SNPs that fall in exons).
Note: if you scroll down on this page, you will find the help of the tool.
Click the Execute button and view the resulting file (with the eye icon). If everything went okay, you should see a file that looks similar to this:
Remember that variations are possible due to using different versions of UCSC databases. As long as you have similar-looking columns, you did everything right :)
Let’s take a look at this dataset. The first six columns correspond to the exons, and the last six columns correspond to the SNPs. Column 4 contains the exon IDs, and column 10 contains the SNP IDs. In our screenshot you see that the first lines in the file all have the same exon ID but different SNP IDs, meaning these lines represent different SNPs that all overlap the same exon. Therefore we can find the total number of SNPs in an exon simply by counting the number of lines that have the same exon ID in the fourth column.
For the first 3 exons in your file, what is the number of SNPs that fall into that exon?
Count the number of SNPs per exon
We’ve just seen how to count the number of SNPs in each exon, so let’s do this for all the exons in our file.
Hands-on: Counting SNPs
Group : Open the tool
Group - data by a column and perform aggregate operation on other columns
- Select data: select the output dataset from the
- Group by column: Column: 4 (the column with the exon IDs)
- Insert Operation: click on this button, then set Type to Count and set On column to
Make sure your screen looks like the image above and click Execute to perform the grouping. Your output dataset will look something like this:
This file contains only two columns. The first contains the exon IDs, and the second the number of times that exon ID appeared in the file - in other words, how many SNPs were present in that exon.
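The same grouping idea can be sketched in a few lines of Python. This is an illustration on made-up rows (eight columns rather than the twelve of the real joined file), and `count_by_column` is a hypothetical helper, not the Galaxy tool itself.

```python
from collections import Counter


def count_by_column(rows, column):
    """Count how often each value occurs in the given 1-based column."""
    return Counter(row[column - 1] for row in rows)


# Hypothetical joined rows: exon columns first, SNP columns after.
joined_rows = [
    ["chr22", "100", "200", "exon_A", "chr22", "150", "151", "snp_1"],
    ["chr22", "100", "200", "exon_A", "chr22", "199", "200", "snp_2"],
    ["chr22", "300", "400", "exon_B", "chr22", "350", "351", "snp_3"],
]

counts = count_by_column(joined_rows, 4)  # column 4 holds the exon IDs
for exon_id, snp_count in counts.items():
    print(exon_id, snp_count, sep="\t")
```

Each key of `counts` is an exon ID and each value is the number of joined lines, i.e. SNPs, for that exon.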
How many exons are there in total in your file?
Hint: Each line now represents a different exon, so you can see the answer to this when you expand the history item, as in the image above.
Sort the exons by SNPs count
Now we have a list of all exons and the number of SNPs they contain, but we would like to know which exon has the highest number of SNPs. We can do this by sorting the file on the second column.
Sort : Navigate to the tool
Sort - data in ascending or descending order
Make sure that the output of the Group tool from the previous step is selected as input
Set the on column parameter to Column: 2. By default it will select a numerical sort in descending order, which is exactly what we want in this case.
Click Execute and examine the output file.
You should now see the same file as we had before, but the exons with the highest number of SNPs are now on top.
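As a rough illustration of why the sort has to be numerical, here is a small Python sketch on made-up counts. The values are stored as text, as they would be in a tabular file, and the names are hypothetical.

```python
# Hypothetical (exon ID, SNP count) rows, with counts stored as text.
counts = [("exon_A", "2"), ("exon_B", "10"), ("exon_C", "9")]

# Numerical sort, descending: convert column 2 to a number before comparing.
ranked = sorted(counts, key=lambda row: int(row[1]), reverse=True)

# Alphabetical sort, descending: compares the text character by character.
ranked_as_text = sorted(counts, key=lambda row: row[1], reverse=True)

for exon_id, snp_count in ranked:
    print(exon_id, snp_count, sep="\t")
```

With the numerical sort, exon_B (10 SNPs) ends up on top. With the alphabetical sort it would end up last, because the text "10" sorts before "2" and "9".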
Which exon has the highest number of SNPs in your file?
Keep in mind this may depend on your settings when getting the data from UCSC.
Select the top five exons
Let’s say we want a list with just the top 5 exons with the highest number of SNPs.
Hands-on: Select first
Select first : Open the tool
Select first - lines from a dataset
Set Select first to
Make sure that the output of the Sort tool from the previous step is selected as input
Click Execute and examine the output file. This should contain only the first 5 lines of the previous dataset.
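In code, keeping only the first lines of a sorted dataset is a one-liner. The sketch below uses a hypothetical helper `select_first` and made-up rows; it is not the Galaxy tool's implementation.

```python
from itertools import islice


def select_first(lines, n):
    """Return at most the first n items of any iterable of lines."""
    return list(islice(lines, n))


# Hypothetical sorted output: exon ID and SNP count, highest count first.
sorted_lines = [
    "exon_B\t10",
    "exon_C\t9",
    "exon_F\t7",
    "exon_D\t4",
    "exon_G\t3",
    "exon_A\t2",
    "exon_E\t1",
]

top_five = select_first(sorted_lines, 5)
print("\n".join(top_five))
```

Using `islice` rather than a list slice means the helper also works on a file object, reading only as many lines as it needs.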
Recovering exon info
Congratulations! You have now determined which exons on chromosome 22 have the highest number of SNPs, but what else can we learn about them? One way to learn more about a genetic location is to view it in a genome browser. However, in the process of getting our answer, we have lost information about the location of these exons on the chromosome. But fear not, Galaxy saves all of your data, so we can recover this information quite easily.
Hands-on: Compare two Datasets
Compare two Datasets : Open the tool
Compare two Datasets - to find common or distinct rows
Set the parameters to compare column 4 of the exon file with column 1 of the top-5 exons file to find matching rows of the first dataset.
Click Execute and examine your output file. It should contain the locations of your top 5 exons.
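Conceptually, this comparison keeps the rows of the first dataset whose ID also appears in the second dataset. A minimal Python sketch, using made-up data and a hypothetical helper `matching_rows`, could look like this:

```python
def matching_rows(first, second, first_column, second_column):
    """Keep rows of `first` whose value in `first_column` also occurs
    in `second_column` of `second`. Columns are 1-based."""
    wanted = {row[second_column - 1] for row in second}
    return [row for row in first if row[first_column - 1] in wanted]


# Hypothetical exon locations (chromosome, start, end, exon ID).
exons = [
    ("chr22", 100, 200, "exon_A"),
    ("chr22", 300, 400, "exon_B"),
    ("chr22", 500, 600, "exon_C"),
]
# Hypothetical top list (exon ID, SNP count).
top_exons = [("exon_C", 9), ("exon_A", 2)]

locations = matching_rows(exons, top_exons, first_column=4, second_column=1)
for row in locations:
    print("\t".join(str(value) for value in row))
```

Note that in this sketch the result follows the order of the exon file, not the ranking in the top list.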
Displaying data in UCSC genome browser
A good way to learn about these exons is to look at their genomic surroundings. This can be done by using genome browsers. Galaxy can launch a genome browser such as IGV on your local machine, and it can connect to online genome browsers as well. An example of such an online genome browser is the UCSC genome browser.
Hands-on: UCSC genome browser
First, check that the database of your latest history dataset is hg38. If not, click on the pencil icon and modify the Database/Build: field to Human Dec. 2013 (GRCh38/hg38) (hg38).
To visualize the data in the UCSC genome browser, click on the display at UCSC main option, which is visible when you expand the history item.
This will upload the data to UCSC as a custom track. To see your data, look at the User Track near the top. You can enter the coordinates of one of your exons at the top to jump to that location.
UCSC provides a large number of tracks that can help you get a sense of your genomic area. These include common SNPs, repeats, genes, and much more (scroll down to find all possible tracks).
In Galaxy your analyses live in histories such as your current one. Histories can be very large, and you can have as many histories as you want. You can control your histories (switching, copying, sharing, creating a fresh history, etc.) in the Options menu on the top of the history pane (gear symbol):
If you create a new history, your current history does not disappear. If you would like to list all of your histories just choose
Saved Histories from the history menu and you will see a list of all your histories in the center pane:
An alternative overview of your histories can be accessed by clicking on the View all histories button at the top of your history pane (window icon).
Here you see a more detailed view of each history, and can perform the same operations, such as switching to a different history, deleting a history, purging it (permanently deleting it, this action cannot be reversed), or copying datasets and even entire histories.
You can always return to your analysis view by clicking on Analyze Data in the top menu bar.
Convert your analysis history into a workflow
When you look carefully at your history, you can see that it contains all steps of our analysis, from the beginning to the end. By building this history we have actually built a complete record of our analysis with Galaxy preserving all parameter settings applied at every step. Wouldn’t it be nice to just convert this history into a workflow that we’ll be able to execute again and again?
Galaxy makes this very easy with the
Extract workflow option. This means any time you want to build a workflow, you can just perform it manually once, and then convert it to a workflow, so that next time it will be a lot less work to do the same analysis.
Hands-on: Extract workflow
Clean up your history. If you had any failed jobs (red), please remove those datasets from your history by clicking on the x button. This will make the creation of a workflow easier.
Go to the history Options menu (gear symbol) and select the
The center pane will change as shown below and you will be able to choose which steps to include/exclude and how to name the newly created workflow.
Uncheck any steps that shouldn’t be included in the workflow (if any), and rename the workflow to something descriptive, for example
Find exons with the highest number of SNPs.
Click on the Create Workflow button near the top.
You will get a message that the workflow was created. But where did it go?
Click on Workflow in the top menu of Galaxy. Here you have a list of all your workflows. Your newly created workflow should be listed at the top:
The workflow editor
We can examine the workflow in Galaxy’s workflow editor. Here you can view/change the parameter settings of each step, add and remove tools, and connect an output from one tool to the input of another, all in an easy and graphical manner. You can also use this editor to build workflows from scratch.
Hands-on: Extract workflow
Click on the triangle to the right of your workflow name.
Select Edit to launch the workflow editor. You should see something like this:
When you click on a component, you will get a view of all the parameter settings for that tool on the right-hand side of your screen.
Tip: Hiding intermediate steps
When a workflow is executed, the user is usually primarily interested in the final product and not in all intermediate steps. By default all the outputs of a workflow will be shown, but we can explicitly tell Galaxy which outputs to show and which to hide for a given workflow. This behaviour is controlled by the little asterisk next to every output dataset:
If you click on this asterisk for any of the output datasets, then only files with an asterisk will be shown, and all outputs without an asterisk will be hidden. (Note that clicking all outputs has the same effect as clicking none of the outputs, in both cases all the datasets will be shown.)
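The rule described above can be written down as a tiny function. This is only a sketch of the rule as stated in this tip, with hypothetical output names; it is not Galaxy's own code.

```python
def visible_outputs(all_outputs, marked):
    """Return the outputs that will be shown after the workflow runs.

    If no output carries an asterisk, everything is shown.
    Otherwise only the outputs with an asterisk are shown.
    """
    marked_set = set(marked) & set(all_outputs)
    if not marked_set:
        return list(all_outputs)
    return [name for name in all_outputs if name in marked_set]


# Hypothetical output names, one per workflow step.
outputs = ["join", "group", "sort", "select_first", "compare"]

print(visible_outputs(outputs, []))                           # nothing marked
print(visible_outputs(outputs, ["select_first", "compare"]))  # two marked
print(visible_outputs(outputs, outputs))                      # everything marked
```

Marking none and marking all give the same result, which is the behaviour noted in parentheses above.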
Click the asterisk next to
Compare two Datasets tools.
Now, when we run the workflow, we will only see the final two outputs, i.e. the table with the top-5 exons and their SNP counts, and the file with exons ready for viewing in a genome browser. Once you have done this, you will notice that the minimap at the bottom-right corner of your screen will have a colour-coded view of your workflow, with orange boxes representing a tool with an output that will be shown.
If you didn’t specify a name for the input datasets at the beginning, they will be labeled
Input Dataset. In this case you can change the labels now to avoid confusion when using the workflow later on.
In the image above, you see that the top input dataset (with the blue border) connects to the first input of the Join tool, so this corresponds to the exon data.
Click on the box corresponding to the exon input dataset, and change the Label to Exons on the right-hand side of your screen.
Repeat this process for the other input dataset. Name it Features. We used it to find the exons with the highest number of SNPs, but this workflow would also work with other features, so we give it a more generic name.
Let’s also rename the outputs. Click on the Select first tool and in the menu on the right click on Configure Output: 'out_file1' and enter a descriptive name for the output dataset in the
Repeat this for the output of the Compare two Datasets tool.
Save your workflow (important!) by clicking on the gear icon at the top right of the screen, and selecting
Return to the analysis view by clicking on Analyze Data at the top menu bar.
We could validate our newly built workflow by running it on the same input datasets as the ones in the Galaxy 101 history used to extract the workflow, to make sure we obtain the same results.
Run workflow on different data
Now that we have built our workflow, let’s use it on some different data. For example, let’s find out which exons have the highest number of repeat elements.
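One way to think about a workflow is as a function that takes the two inputs (Exons and Features) and always applies the same steps. The sketch below chains simplified stand-ins for the steps of this tutorial and runs them on two made-up feature sets. All names and data are hypothetical, and the code only mirrors the idea, not Galaxy's implementation.

```python
from collections import Counter


def run_workflow(exons, features, top_n=5):
    """Return (top exons with counts, locations of those exons)."""
    # Join: pair each exon with every overlapping feature on the same chromosome.
    joined = [
        (exon, feature)
        for exon in exons
        for feature in features
        if exon[0] == feature[0] and exon[1] < feature[2] and feature[1] < exon[2]
    ]
    # Group: count the joined lines per exon ID.
    counts = Counter(exon[3] for exon, _feature in joined)
    # Sort (descending by count) and Select first.
    top = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:top_n]
    # Compare two Datasets: recover the locations of the top exons.
    top_ids = {exon_id for exon_id, _count in top}
    locations = [exon for exon in exons if exon[3] in top_ids]
    return top, locations


# Hypothetical records: (chromosome, start, end, name)
exons = [("chr22", 100, 200, "exon_A"), ("chr22", 300, 400, "exon_B")]
snps = [
    ("chr22", 150, 151, "snp_1"),
    ("chr22", 160, 161, "snp_2"),
    ("chr22", 350, 351, "snp_3"),
]
repeats = [("chr22", 320, 380, "repeat_1"), ("chr22", 390, 450, "repeat_2")]

top_snps, snp_locations = run_workflow(exons, snps)
top_repeats, repeat_locations = run_workflow(exons, repeats)
print(top_snps)
print(top_repeats)
```

The second call reuses exactly the same steps with a different Features input, which is what running the extracted workflow on repeat data amounts to.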
Hands-on: Run workflow
Create a new history (gear icon) and give it a name.
We will need the list of exons again. We don’t have to get this from UCSC again; we can just copy it from our previous history. The easiest way to do this is to go to the history overview (window icon at the top of the history pane). Here you can just drag and drop datasets from one history to another.
- We wanted to know something about the repetitive elements per exon. We get this data from UCSC.
- assembly should be set to
Dec. 2013 (GRCh38/hg38)
- group parameter should be
- position should be
- leave the rest of the settings to the defaults
Click on get output and then Send query to Galaxy on the next screen.
Open the workflow menu (top menu bar). Find the workflow you made in the previous section, and select the option
The center pane will change to allow you to configure and launch the workflow.
Select appropriate datasets for the inputs as shown below, then scroll down and click
Once the workflow has started you will initially be able to see all its steps:
Because most intermediate steps of the workflow were hidden, once it is finished you will only see the final two datasets. If we want to view the intermediate files after all, we can unhide all hidden datasets by selecting Unhide Hidden Datasets from the history options menu.
Which exon had the highest number of repeats? How many repeats were there?
Share your work
One of the most important features of Galaxy comes at the end of an analysis. When you have published striking findings, it is important that other researchers are able to reproduce your in-silico experiment. Galaxy enables users to easily share their workflows and histories with others.
To share a history, click on the gear symbol in the history pane and select
Share or Publish. On this page you can do 3 things:
- Make History Accessible via Link. This generates a link that you can give out to others. Anybody with this link will be able to view your history.
- Make History Accessible and Publish. This will not only create a link, but will also publish your history. This means your history will be listed under Shared Data → Histories in the top menu.
- Share with a user. This will share the history only with specific users on the Galaxy instance.
Hands-on: Share history and workflow
- Share one of your histories with your neighbour.
- See if you can do the same with your workflow!
- Find the history and/or workflow shared by your neighbour. Histories shared with specific users can be accessed by those users in their history menu (gear icon) under
Histories shared with me.
Well done! You have just performed your first analysis in Galaxy. You also created a workflow from your analysis so you can easily repeat the exact same analysis on other datasets. Additionally you shared your results and methods with others.
- Galaxy provides an easy-to-use graphical user interface for often complex command-line tools
- Galaxy keeps a full record of your analysis in a history
- Workflows enable you to repeat your analysis on different data
- Galaxy can connect to external sources for data import and visualization purposes
- Galaxy provides ways to share your results and methods with others
Congratulations on successfully completing this tutorial!