Visualization of RNA-Seq results with CummeRbund

Overview

question Questions
  • How are RNA-Seq results stored?
  • Why are visualization techniques needed?
  • How can we select the subjects we want for differential gene expression analysis?
objectives Objectives
  • Manage RNA-Seq results
  • Extract the desired subject for differential gene expression analysis
  • Visualize information
requirements Requirements

time Time estimation: 1h

Introduction

RNA-Seq analysis helps researchers annotate new genes and splice variants, and provides cell- and context-specific quantification of gene expression. RNA-Seq data, however, are complex and require both computer science and mathematical knowledge to be managed and interpreted.

Visualization techniques are key to overcoming the complexity of RNA-Seq data, and are valuable tools for gathering information and insights.

Agenda

In this tutorial, we will deal with:

  1. Reasons for visualizing RNA-Seq results
  2. Importing RNA-Seq result data
  3. Filtering and sorting
  4. CummeRbund

Reasons for visualizing RNA-Seq results

To make sense of the available RNA-Seq data, and to get an overview of the condition-specific gene expression levels of the provided samples, we need to visualize our results. Here we will use CummeRbund.

CummeRbund is an open-source tool that simplifies the analysis of a CuffDiff RNA-Seq output. In particular, it helps researchers:

A typical workflow for the visualization of RNA-Seq data involving CummeRbund:

workflow

CummeRbund reads your RNA-Seq results from a SQLite database. This database has to be created using CuffDiff’s SQLite output option.

tip Tip: SQLite output with CuffDiff

Instruct CuffDiff to organize its output in a SQLite database that can be read by CummeRbund.

SQLite output
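
Outside Galaxy, a SQLite file can be inspected with any SQLite client. As a minimal sketch, the following uses Python's standard `sqlite3` module to list the tables a database contains. The toy table name and its columns are hypothetical and do not reflect the actual CuffDiff schema.

```python
import sqlite3


def list_tables(connection):
    """Return the names of all tables in a SQLite database, sorted alphabetically."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Toy in-memory database standing in for a CuffDiff SQLite output.
# The table name and columns are hypothetical, not the real CuffDiff schema.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE toy_isoform_results (isoform_id TEXT, significant TEXT)")
demo.execute("INSERT INTO toy_isoform_results VALUES ('iso_1', 'yes')")
demo.commit()

tables = list_tables(demo)
print(tables)
```

To look at a real file, the in-memory connection would be replaced by `sqlite3.connect(...)` pointing at the downloaded database.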

Importing RNA-Seq result data

hands_on Hands-on: Data upload

  1. Create a new history
  2. Import the CuffDiff SQLite dataset

    • Copy the link location
    • Open the Galaxy Upload Manager
    • Select Paste/Fetch Data
    • Paste the link into the text field
    • Press Start

    comment Comments

    Rename the dataset to “RNA-Seq SQLite result data”

    By default, when data is imported via its link, Galaxy names it with its URL.

CuffDiff’s output data is organized in a SQLite database, so we need to extract it to be able to see what it looks like.

For this tutorial, we are interested in the transcripts that CuffDiff tested for differential expression.

hands_on Hands-on: Extract CuffDiff results

  1. Extract CuffDiff tool with the following parameters
    • “Select tables to output” to Transcript differential expression testing
  2. Inspect the table

    tip Tip: Inspecting the content of a file in Galaxy

    • Click on the eye (“View data”) on the right of the file name in the history
    • Inspect the content of the file in the middle panel

Each entry represents a transcript that CuffDiff tested for differential expression, but not every result is significant. We want to keep only the entries reported as significantly differentially expressed.

question Questions

  1. How can we retain only the significantly differentially expressed genes?
  2. Which column stores this information?

solution Solution

  1. We need to filter on the column storing the record’s significance
  2. Column 14

Filtering and sorting

We now want to highlight the most significant differentially expressed genes in our analysis, and then obtain informative visualizations.

hands_on Hands-on: Extract CuffDiff’s most significant differentially expressed genes

  1. Filter tool with the following parameters
    • “Filter” to the extracted table from the previous step
    • “With following condition” to an appropriate filter over the target column (see questions below when in doubt)

    question Questions

    1. What column stores the information of significance for each record?
    2. Which conditional expression has to be set to filter all records on the selected column?
    3. What happened to the records in the original table?

    solution Solution

    1. column 14
    2. c14=='yes'
    3. All records whose “significant” field was set to “yes” have been retained, while the others have been filtered out
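
For readers who want to see the same logic outside Galaxy, here is a minimal Python sketch of what the `c14=='yes'` condition does to a tab-separated table. The toy rows and their identifiers are made up for illustration. Only the position of the significance column (column 14) comes from the tutorial.

```python
import csv
import io

SIGNIFICANT_COLUMN = 14  # 1-based column number, as used in the tutorial


def keep_significant(rows, column=SIGNIFICANT_COLUMN):
    """Keep rows whose given 1-based column equals 'yes' (like c14=='yes')."""
    return [row for row in rows if len(row) >= column and row[column - 1] == "yes"]


# Toy tab-separated table with 14 columns: a made-up identifier first,
# placeholder values in between, and the significance flag last.
toy_table = (
    "\t".join(["tx_A"] + ["-"] * 12 + ["yes"]) + "\n"
    + "\t".join(["tx_B"] + ["-"] * 12 + ["no"]) + "\n"
    + "\t".join(["tx_C"] + ["-"] * 12 + ["yes"]) + "\n"
)
rows = list(csv.reader(io.StringIO(toy_table), delimiter="\t"))
significant_rows = keep_significant(rows)
print([row[0] for row in significant_rows])
```

The input table is left unchanged; the filter produces a new, smaller list, much as Galaxy creates a new dataset in the history.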

Look at your data. The differential expression values are stored in column 10. We will sort all records in descending order by their value in this column.

  1. Sort tool with the following parameters
    • “Sort Dataset” to the filtered table
    • “on column”, “with flavor” and “everything in” to the appropriate values (see above)

    question Questions

    1. Compared with the table we started from, how many records now make up the significant subset from which we extract information?
    2. What does this shrinking of the number of lines represent?

    solution Solution

    1. Click on the boxes in your history; their small preview highlights the number of lines: from ~140,000 to 219
    2. This reduction is a necessary step toward understanding the biological meaning of our samples: it puts the raw RNA-Seq result data into context and removes the less meaningful records, so that we can focus on what is needed to go from data to information
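
The descending sort on column 10 can be sketched in Python as follows. This is an illustration under stated assumptions rather than what the Galaxy Sort tool runs internally: the toy rows and identifiers are made up, and the values in column 10 are assumed to be numeric strings that `float()` can parse.

```python
def sort_descending(rows, column=10):
    """Sort rows by the numeric value in the given 1-based column, largest first."""
    return sorted(rows, key=lambda row: float(row[column - 1]), reverse=True)


# Toy rows with 10 columns: a made-up identifier first, placeholder values
# in between, and a numeric value in column 10.
toy_rows = [
    ["tx_A"] + ["-"] * 8 + ["1.5"],
    ["tx_B"] + ["-"] * 8 + ["-2.0"],
    ["tx_C"] + ["-"] * 8 + ["3.25"],
]
ranked = sort_descending(toy_rows)
print([row[0] for row in ranked])
```

Sorting numerically rather than alphabetically matters here: as text, "-2.0" would be ordered before "1.5" and "3.25" for reasons unrelated to its value.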

CummeRbund

With CummeRbund we can visualize our RNA-Seq results of interest.

CummeRbund always generates two outputs:

We are interested in visualizing the expression values of all transcripts of the most significant differentially expressed gene we found in the previous section.

hands_on Hands-on: Visualization

  1. CummeRbund tool with the following parameters
    • Click on “Insert plot”
    • “Width” to 800 and “Height” to 600
    • “Plot type” to Expression Plot
    • “Expression levels to plot” to Isoforms
    • “Gene ID” to NDUFV1
    • Your input form parameters should look like the following. If so, click on “Execute”

Expression plot_form

Our first CummeRbund plot is the “Expression Plot”:

Expression plot

The Expression Plot shows the expression of all isoforms of a single gene (NDUFV1), with the FPKM of each replicate displayed.

Our plot has a modest number of isoforms and is therefore already readable. With 5 or 6 isoforms, however, the plot can look very busy. In that case we can switch to another plot type.

hands_on Hands-on: Visualization

  1. CummeRbund tool with the following parameters
    • Click on “Insert plot”
    • “Width” and “Height” to 800x600
    • “Plot type” to Expression Bar Plot
    • “Expression levels to plot” to Isoforms
    • “Gene ID” to NDUFV1

Expression bar plot

Expression Bar Plot of a single gene (NDUFV1), with the FPKM of each replicate displayed.

comment Comment

These plots are also shown in this Galaxy video tutorial.

Would you like to obtain more sophisticated visualizations of your RNA-Seq analysis results? Select different CummeRbund plot options, and look at how their parameters relate to the filtering and sorting operations we performed.

Conclusion

Visualization tools help researchers make sense of data by providing a bird’s-eye view of the underlying analysis results. In this tutorial we reviewed the advantages of visualizing RNA-Seq results with CummeRbund, and gained insight into CuffDiff’s large output by plotting information about the most significant differentially expressed genes in our RNA-Seq analysis.

keypoints Key points

  • Extract information from a SQLite CuffDiff database
  • Filter and sort results to highlight differentially expressed genes of interest
  • Generate publication-ready visualizations for RNA-Seq analysis results

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

congratulations Congratulations on successfully completing this tutorial!

