RNA Seq Counts to Viz in R

Author(s)	Bérénice Batut Fotis E. Psomopoulos Toby Hodges
Reviewers

Overview
Questions:

How can I create neat visualizations of the data?

Objectives:

Describe the role of data, aesthetics, geoms, and layers in ggplot functions.

Customize plot scales, titles, subtitles, themes, fonts, layout, and orientation.

Requirements:

Introduction to Galaxy Analyses

slides Slides: Quality Control

tutorial Hands-on: Quality Control

slides Slides: Mapping

tutorial Hands-on: Mapping

tutorial Hands-on: RStudio in Galaxy

tutorial Hands-on: R basics in Galaxy

tutorial Hands-on: Advanced R in Galaxy

tutorial Hands-on: Reference-based RNA-Seq data analysis

Time estimation: 1 hour

Supporting Materials:

Datasets

FAQs

video Recordings

video Tutorial (February 2021) - 30m

video View All

instances Available on these Galaxies

Known Working

UseGalaxy.eu ✅ ⭐️

UseGalaxy.org.au ✅ ⭐️

Published: Oct 8, 2019

Last modification: Oct 15, 2024

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

purl PURL: https://gxy.io/GTN:T00299

rating Rating: 4.6 (0 recent ratings, 12 all time)

version Revision: 21

This tutorial will show you how to visualise RNA Sequencing Counts with R

Comment

This tutorial is significantly based on the Carpentries “Intro to R and RStudio for Genomics” lesson

With RNA-Seq data analysis, we generated tables containing list of DE genes, their expression, some statistics, etc. We can manipulate these tables using Galaxy, as we saw in some tutorials, e.g. “Reference-based RNA-Seq data analysis”, and create some visualisations.

Sometimes we want to have some customizations on visualization, some complex table manipulations or some statistical analysis. If we can not find a Galaxy tools for that or the right parameters, we may need to use programming languages as R or Python.

It is expected that you are already somewhat familiar with the R basics (how to create variables, create and access lists, etc.) If you are not yet comfortable with those topics, we recommend that you complete the requirements listed at the start of this tutorial first, before proceeding.

If you are starting with this tutorial, you will need to import a dataset:
Hands On: Using datasets from Galaxy
Upload the following URL to Galaxy:
https://zenodo.org/record/3477564/files/annotatedDEgenes.tabular
Note the history ID of the dataset in your Galaxy history
RStudio in Galaxy provides some special functions to import and export from your history:
annotatedDEgenes <- read.csv(gx_get(2), sep="\t") # will import dataset number 2 from your history, use the correct ID for your dataset.
Hands On: Using datasets without Galaxy
Read the tabular file into an object called annotatedDEgenes:
## read in a CSV file and save it as 'annotatedDEgenes'
annotatedDEgenes <- read.csv("https://zenodo.org/record/3477564/files/annotatedDEgenes.tabular", sep="\t")

In this tutorial, we will take the list of DE genes extracted from DESEq2’s output that we generated in the “Reference-based RNA-Seq data analysis” tutorial, manipulate it and create some visualizations.

Agenda

In this tutorial, we will cover:

Visualization

Volcano plot

Volcano plot by chromosomes

Barplot of the number of DE genes

Conclusion

Hands On: Launch RStudio

Depending on which server you are using, you may be able to run RStudio directly in Galaxy. If that is not available, RStudio Cloud can be an alternative.

Currently RStudio in Galaxy is only available on UseGalaxy.eu and UseGalaxy.org

Open the Rstudio tool tool by clicking here to launch RStudio

Click Run Tool

The tool will start running and will stay running permanently

Click on the “User” menu at the top and go to “Active InteractiveTools” and locate the RStudio instance you started.

If RStudio is not available on the Galaxy instance:

Register for RStudio Cloud, or login if you already have an account

Create a new project

Visualization

In RNA-seq data analysis and other omics experiments, visualization are an important step to check the data, their quality and represent the results found for publication.

ggplot2 is a plotting package that makes it simple to create complex plots from data in a data frame. It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. Therefore, we only need minimal changes if the underlying data change or if we decide to change from a bar plot to a scatter plot. This helps in creating publication quality plots with minimal amounts of adjustments and tweaking.

ggplot2 graphics are built step by step by adding new elements, i.e. layers:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
  <GEOM_FUNCTION>() +
  <GEOM_FUNCTION>() +
  <GEOM_FUNCTION>()

Adding layers in this fashion allows for extensive flexibility and customization of plots.

Volcano plot

Hands On: First plot, step by step
Load ggplot2
 library(ggplot2)
You might need to install ggplot2 first
> install.packages("ggplot2")
Bind a plot to a specific data frame
ggplot(data = annotatedDEgenes)
A new window will open in the bottom-right panel
Define a mapping (using the aes aesthetic function), by selecting the variables to be plotted and specifying how to present them in the graph, e.g. as x/y positions or characteristics such as size, shape, color, etc.
 ggplot(data = annotatedDEgenes, aes(x = log2.FC., y = P.value))
X-axis is now the log2 FC and and Y-axis the p-value

Comment: Data format

Some ggplot2 functions need data in the ‘long’ format:

a column for every dimension

a row for every observation

Well-structured data will save you lots of time when making figures with ggplot2.
Add information about the graphical representations of the data in the plot (points, lines, bars) using geoms function added via the + operator
ggplot(data = annotatedDEgenes, aes(x = log2.FC., y = P.value)) +
      geom_point()
Comment: Graphical representations of the data in the plot

ggplot2 offers many different geoms; we will use some common ones today, including:

geom_point() for scatter plots, dot plots, etc.

geom_boxplot() for, well, boxplots!

geom_line() for trend lines, time series, etc.

We should obtain our first ggplot2 plot:

This plot is called a volcano plot, a type of scatterplot that shows statistical significance (P value) versus magnitude of change (fold change). The most upregulated genes are towards the right, the most downregulated genes are towards the left, and the most statistically significant genes are towards the top. With this plot, we can then quickly identify genes with large fold changes that are also statistically significant, i.e. probably the most biologically significant genes.

The current version of the plot is not really informative, mostly due to the high number of p-values close to zero. In order to resolve this, we will apply -log10() to the y-axis values.

Hands On: Volcano plot with log values on y-axis
Create volcano plot with log values on the y-axis
ggplot(data = annotatedDEgenes, aes(x = log2.FC., y = -log10(P.value))) +
     geom_point()

Question: Categorising expression levels

Why are there no points with log2 FC between -1 and 1?

The input file we used are the 130 genes with a significant adjusted p-values (below < 0.05) and an absolute fold change higher than 2 (log2 FC < -1 or log 2 FC > 1).

Comment: The `+` sign

The + in the ggplot2 package is particularly useful because it allows you to modify existing ggplot objects. This means you can easily set up plot templates and conveniently explore different types of plots, so the above plot can also be generated with code like this:
# Assign plot to a variable
de_genes_plot <- ggplot(data = annotatedDEgenes, aes(x = log2.FC., y = -log10(P.value)))

# Draw the plot
de_genes_plot +
    geom_point()
The + sign used to add new layers must be placed at the end of the line containing the previous layer. If, instead, the + sign is added at the beginning of the line containing the new layer, ggplot2 will not add the new layer and will return an error message.
# This is the correct syntax for adding layers
> de_genes_plot +
      geom_point()

# This will not add the new layer and will return an error message
> de_genes_plot
    + geom_point()

Comment: Some extra comments

Anything you put in the ggplot() function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x- and y-axis mapping you set up in aes().

You can also specify mappings for a given geom independently of the mappings defined globally in the ggplot() function.

Building plots with ggplot2 is typically an iterative process. We start by defining the dataset we’ll use, lay out the axes, and choose a geom. We can now modify this plot to extract more information from it.

Hands On: Format plot

Add transparency (alpha) to avoid overplotting

ggplot(data = annotatedDEgenes, aes(x = log2.FC., y = -log10(P.value))) +
     geom_point(alpha = 0.5)

Add colors for all the points

ggplot(data = annotatedDEgenes, aes(x = log2.FC., y = -log10(P.value))) +
    geom_point(alpha = 0.5, color = "blue")

Color point based on their strand
```
 ggplot(data = annotatedDEgenes, aes(x = log2.FC., y = -log10(P.value), color = Strand)) +
     geom_point(alpha = 0.5)
```
We use a vector as an input to the argument color and ggplot2 provides a different color corresponding to different values in the vector.

Add axis labels

ggplot(data = annotatedDEgenes, aes(x = log2.FC., y = -log10(P.value), color = Strand)) +
    geom_point(alpha = 0.5) +
      labs(x = "log2(Fold Change)",
           y = "-log10(P-Value)")

We have now a nice Volcano plot:

Question: Standard error over the `-log10` of the adjusted `P-value`

Create a scatter plot of the standard error over the -log10 of the adjusted P-value with the chromosomes showing in different colors. Make sure to give your plot relevant axis labels.
ggplot(data = annotatedDEgenes, aes(x = -log10(P.adj), y = StdErr, color = Chromosome)) +
  geom_point() +
     labs(x = "-log10(P.adj)",
          y = "StdErr")

Volcano plot by chromosomes

ggplot2 has a special technique called faceting that allows us to split one plot into multiple plots based on a factor included in the dataset. We will use it to split our volcano plot into five panels, one for each chromosome.

Hands On: Volcano plot by chromosomes
Split volcano plot by chromosome using facet_grid
ggplot(data = annotatedDEgenes, aes(x = log2.FC., y = -log10(P.value), color = Strand)) +
  geom_point() +
  labs(x = "log2(Fold Change)",
       y = "-log10(P-Value)") +
  facet_grid(. ~ Chromosome)
Split volcano plot by chromosome using facet_grid with plots stacked vertically
ggplot(data = annotatedDEgenes, aes(x = log2.FC., y = -log10(P.value), color = Strand)) +
  geom_point() +
  labs(x = "log2(Fold Change)",
       y = "-log10(P-Value)") +
  facet_grid(Chromosome ~ .)
The facet_grid geometry allows to explicitly specify how we want your plots to be arranged via formula notation (rows ~ columns). The . can be used as a placeholder that indicates only one row or column).
Add white background using theme_bw() to make the plot more readable when printed
ggplot(data = annotatedDEgenes, aes(x = log2.FC., y = -log10(P.value), color = Strand)) +
  geom_point() +
  labs(x = "log2(Fold Change)",
       y = "-log10(P-Value)") +
  facet_grid(Chromosome ~ .) +
  theme_bw()
In addition to theme_bw(), which changes the plot background to white, ggplot2 comes with several other themes which can be useful to quickly change the look of your visualization.

theme_minimal() and theme_light() are popular

theme_void() can be useful as a starting point to create a new hand-crafted theme

The complete list of themes is available on the ggplot2 website. The ggthemes package provides a wide variety of options (including an Excel 2003 theme). The ggplot2 extensions website provides a list of packages that extend the capabilities of ggplot2, including additional themes.
Remove the grid
ggplot(data = annotatedDEgenes, aes(x = log2.FC., y = -log10(P.value), color = Strand)) +
  geom_point() +
  labs(x = "log2(Fold Change)",
       y = "-log10(P-Value)") +
  facet_grid(Chromosome ~ .) +
  theme_bw() +
  theme(panel.grid = element_blank())

Question: Standard error over the `-log10` of the adjusted `P-value`

Create a scatter plot of the standard error over the -log10 of the adjusted P-value with the chromosomes showing in different colors and one facet per strand. Make sure to give your plot relevant axis labels.
ggplot(data = annotatedDEgenes, aes(x = -log10(P.adj), y = StdErr, color = Chromosome)) +
 geom_point() +
 labs(x = "-log10(P.adj)",
      y = "StdErr") +
 facet_grid(Strand ~ .)

Barplot of the number of DE genes

We would like now to make a barplot showing the number of differentially expressed genes.

Hands On: Barplot of the number of DE genes
Create a barplot with geom_bar function of the number of DE genes for each feature with one plot per chromosome
ggplot(data = annotatedDEgenes, aes(x = Feature, fill = Chromosome)) +
geom_bar() +
facet_grid(Chromosome ~ .)

Question: Remove the legend

The chromosome is labeled on the individual plot facets so we don’t need the legend. Use the help file for geom_bar and any other online resources to remove the legend from the plot.
ggplot(data = annotatedDEgenes, aes(x = Feature, fill = Chromosome)) +
 geom_bar(show.legend = F) +
 facet_grid(Chromosome ~ .)

Question: Create a R script with plots

Take another few minutes to either improve one of the plots generated in this exercise or create a beautiful graph of your own (using the RStudio ggplot2 cheat sheet for inspiration).

Here are some ideas:

Change the size or shape of the plotting symbol

Change the name of the legend

Change the label of the legend

Use a different color palette (see the cookbook here).

Conclusion

Data manipulation and visualization are important parts of any RNA-Seq analysis. Galaxy provides several tools for that as explained in several tutorials:

But sometimes we need more customization and then need to use programming languages as R or Python.

Working with a programming language (especially if it’s your first time) often feels intimidating, but the rewards outweigh any frustrations. An important secret of coding is that even experienced programmers find it difficult and frustrating at times – so if even the best feel that way, why let intimidation stop you? Given time and practice you will soon find it easier and easier to accomplish what you want.

Finally, we won’t lie; R is not the easiest-to-learn programming language ever created. So, don’t get discouraged! The truth is that even with the modest amount of R covered today, you can start using some sophisticated R software packages, and have a general sense of how to interpret an R script. Get through these lessons, and you are on your way to being an accomplished R user!

You've Finished the Tutorial

Key points

When creating plots with ggplot2, think about the graphics in layers (aesthetics, geometry, statistics, scale transformation, and grouping).

Frequently Asked Questions

Have questions about this tutorial? Have a look at the available FAQ pages and support channels

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Bérénice Batut, Fotis E. Psomopoulos, Toby Hodges, RNA Seq Counts to Viz in R (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/rna-seq-counts-to-viz-in-r/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{transcriptomics-rna-seq-counts-to-viz-in-r,
author = "Bérénice Batut and Fotis E. Psomopoulos and Toby Hodges",
	title = "RNA Seq Counts to Viz in R (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/rna-seq-counts-to-viz-in-r/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Funding

These individuals or organisations provided funding support for the development of this resource

Gallantries

This project (2020-1-NL01-KA203-064717) is funded with the support of the Erasmus+ programme of the European Union. Their funding has supported a large number of tutorials within the GTN across a wide array of topics.

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.
shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/transcriptomics/tutorials/rna-seq-counts-to-viz-in-r/tutorial.json | jq .admin_install_yaml -r)
Alternatively you can copy and paste the following YAML
---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools: []

t{ hist[0] | to_stars }} 7

t{ hist[0] | to_stars }} 5

May 2023

5 stars: Liked: The explanation and demonstration were very clear. Disliked: Overall, the video was very good to motivate more practice!

February 2021

5 stars: Liked: I like that it has a clear relation to the NGS topic, this helped to keep me motivated Disliked: Actually it was very well developed, maybe one for heatmaps would be a great addition

5 stars: Liked: well explained Disliked: I would include some links with more options such as color palettes... shapes...

5 stars: Liked: very concise and helpful !

5 stars: Liked: very interesting library to visualize data

October 2019

4 stars: Disliked: Would like to have more emphasis on actual data analysis and visualization. Too much time was spent on data frame functions. Morning session already provided good introduction to data handling basics.

4 stars: Liked: Lot of different ways to manipulate data Disliked: could be more taken into apart, several focuses, more focus to the vizualisation