Comparing inferred cell compositions using MuSiC deconvolution

Overview
Questions:
  • How do the cell type distributions vary in bulk RNA samples across my variable of interest?

  • For example, are beta cell proportions different in the pancreas data from diabetes and healthy patients?

Objectives:
  • Apply the MuSiC deconvolution to samples and compare the cell type distributions

  • Compare the results from analysing different types of input, for example, whether combining disease and healthy references or not yields better results

Time estimation: 1 hour
Published: Jan 20, 2023
Last modification: Nov 9, 2023
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00243
Revision: 8

The goal of this tutorial is to apply bulk RNA deconvolution techniques to a problem with multiple variables - in this case, samples from Type-II diabetes patients are compared with their healthy counterparts. All you need to compare inferred cell compositions are well-annotated, high-quality reference scRNA-seq datasets, transformed into MuSiC-friendly Expression Set objects, and your bulk RNA samples of choice (also transformed into MuSiC-friendly Expression Set objects). For more information on how MuSiC works, you can check out the MuSiC GitHub site or the published article (Wang et al. 2019).
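
To make the idea of deconvolution concrete, here is a minimal sketch for a toy case with only two cell types. It is not MuSiC's algorithm (MuSiC uses a weighted non-negative least squares approach over many genes and cell types; see Wang et al. 2019). The signatures, the mixing fraction and the function name `estimate_proportion` are invented for illustration.

```python
def estimate_proportion(bulk, sig_a, sig_b):
    """Least-squares estimate of the fraction p of cell type A, assuming
    bulk ~ p * sig_a + (1 - p) * sig_b, clipped to the range [0, 1]."""
    diff = [a - b for a, b in zip(sig_a, sig_b)]
    resid = [x - b for x, b in zip(bulk, sig_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("signatures are identical; proportion is not identifiable")
    p = sum(r * d for r, d in zip(resid, diff)) / denom
    return min(1.0, max(0.0, p))


# Hypothetical mean expression of three genes in two cell types
beta_sig = [10.0, 0.0, 5.0]
alpha_sig = [0.0, 8.0, 5.0]

# A made-up bulk sample mixed as 30% beta cells and 70% alpha cells
bulk = [0.3 * b + 0.7 * a for b, a in zip(beta_sig, alpha_sig)]

p_beta = estimate_proportion(bulk, beta_sig, alpha_sig)
print(f"estimated beta fraction: {p_beta:.2f}")
```

Because the toy bulk sample is an exact mixture, the estimate recovers the mixing fraction. Real data are noisy and involve many more cell types, which is why a dedicated tool is used in the rest of this tutorial.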

Comment: Research question
  • How does variable X impact the cell distributions in my samples?
  • Needs: scRNA-seq reference dataset; bulk RNA-seq samples of interest to compare
Agenda

In this tutorial, we will cover:

  1. Data
    1. Get data
  2. Infer cellular composition & compare
    1. Altogether: Deconvolution with a combined sc reference
    2. Like4like: Deconvolution of healthy samples with a healthy reference and diseased samples with a diseased reference
    3. healthyscref: Deconvolution using only healthy cells as a reference
  3. Conclusion

Data

In the standard MuSiC tutorial, we used human pancreas data. We will now use the same single cell reference dataset (Segerstolpe et al. 2016), with its 10 samples from 6 healthy subjects and 4 with Type-II diabetes (T2D), as well as the bulk RNA samples from the same lab (3 healthy, 4 diseased). Both of these datasets were accessed from the public EMBL-EBI repositories and transformed into Expression Set objects in the previous two tutorials. For both the single cell reference and the bulk samples of interest, you have generated Expression Set objects with only T2D samples and with only healthy samples, plus a final everything-combined object for the scRNA reference. We won’t need the combined bulk RNA dataset. The plan is to analyse this data in three ways: using a combined reference (altogether); using a healthy and a diseased reference separately, each matched to its own bulk samples (like4like); or using only the healthy single cell reference (healthyscref), all to identify differences in cellular composition.

Three colours of arrows connect bulk healthy & diseased data sets to a combined single cell (altogether); bulk healthy and single cell healthy & bulk diseased with single cell diseased (like4like); and bulk diseased and healthy with the single cell healthy reference (healthyscref).

Figure 1: Plan of analysis
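
The plan of analysis can also be written down as a small lookup table. The sketch below is only an illustration: the file names are those of the Zenodo record used in this tutorial, while the dictionary layout and the names `BULK`, `SC` and `PLAN` are assumptions made for this example.

```python
# File names come from the tutorial's Zenodo record; the structure is illustrative only.
BULK = {
    "healthy": "ESet_object_bulk_healthy.rdata",
    "T2D": "ESet_object_bulk_T2D.rdata",
}
SC = {
    "combined": "ESet_object_sc_combined.rdata",
    "healthy": "ESet_object_sc_healthy.rdata",
    "T2D": "ESet_object_sc_T2D.rdata",
}

# strategy -> {bulk phenotype: single cell reference used to deconvolve it}
PLAN = {
    "altogether": {"healthy": "combined", "T2D": "combined"},
    "like4like": {"healthy": "healthy", "T2D": "T2D"},
    "healthyscref": {"healthy": "healthy", "T2D": "healthy"},
}

for strategy, pairing in PLAN.items():
    for bulk_key, sc_key in pairing.items():
        print(f"{strategy}: {BULK[bulk_key]} <- {SC[sc_key]}")
```

Every strategy deconvolves both bulk datasets; only the reference paired with each one changes.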

If you have followed the previous tutorials, you will have built your single cell ESet objects and your bulk ESet objects, and you can copy these into a new history now. Otherwise, follow the steps below to import the datasets you’ll need.

There are 3 ways to copy datasets between histories

  1. From the original history

    1. Click on the gear icon at the top of the list of datasets in the history panel
    2. Click on Copy Datasets
    3. Select the desired files

    4. Give a relevant name to the “New history”

    5. Validate by ‘Copy History Items’
    6. Click on the new history name in the green box that has just appeared to switch to this history
  2. Using Show Histories Side-by-Side

    1. Click on the dropdown arrow at the top right of the history panel (History options)
    2. Click on Show Histories Side-by-Side
    3. If your target history is not present
      1. Click on ‘Select histories’
      2. Click on your target history
      3. Validate by ‘Change Selected’
    4. Drag the dataset to copy from its original history
    5. Drop it in the target history
  3. From the target history

    1. Click on User in the top bar
    2. Click on Datasets
    3. Search for the dataset to copy
    4. Click on its name
    5. Click on Copy to current History

Get data

Hands-on: Data upload
  1. Create a new history for this tutorial “Deconvolution: Compare”
  2. Import the files from Zenodo

    • Human single cell RNA ESet objects (tag: #singlecell)

      https://zenodo.org/record/7319925/files/ESet_object_sc_combined.rdata
      https://zenodo.org/record/7319925/files/ESet_object_sc_T2D.rdata
      https://zenodo.org/record/7319925/files/ESet_object_sc_healthy.rdata
      
    • Human bulk RNA ESet objects (tag: #bulk)

      https://zenodo.org/record/7319925/files/ESet_object_bulk_healthy.rdata
      https://zenodo.org/record/7319925/files/ESet_object_bulk_T2D.rdata
      
    • Copy the link location
    • Click Upload Data at the top of the tool panel

    • Select Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

  3. Rename the datasets as needed

  4. Add a tag to each file: #bulk for the bulk datasets and #scrna for the single cell datasets

    Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

    To tag a dataset:

    1. Click on the dataset to expand it
    2. Click on Add Tags
    3. Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
    4. Press Enter
    5. Check that the tag appears below the dataset name

    Tags beginning with # are special!

    They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parenthesis correspond to dataset numbers in the figure below):

    1. a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
    2. dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
    3. datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
    4. datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

    A history without name tags versus history with name tags

    Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

    The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

    More information is in a dedicated #nametag tutorial.
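
For the Data upload step above, the five Zenodo links can be generated rather than copied one by one. This is a minimal sketch that only builds and prints the URLs (it does not contact Zenodo or Galaxy); the helper `tag_for` is hypothetical and simply mirrors the #bulk / #scrna tagging from step 4.

```python
import posixpath
from urllib.parse import urlparse

# Base URL and file names are taken from the upload step above
BASE = "https://zenodo.org/record/7319925/files/"
FILES = [
    "ESet_object_sc_combined.rdata",
    "ESet_object_sc_T2D.rdata",
    "ESet_object_sc_healthy.rdata",
    "ESet_object_bulk_healthy.rdata",
    "ESet_object_bulk_T2D.rdata",
]


def tag_for(filename):
    """Hypothetical helper: pick a name tag based on the file name."""
    return "#bulk" if "_bulk_" in filename else "#scrna"


urls = [BASE + name for name in FILES]
for url in urls:
    name = posixpath.basename(urlparse(url).path)
    print(f"{name}\t{tag_for(name)}\t{url}")
```

The printed URLs can be pasted into the Paste/Fetch Data text field in one go, one link per line.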

Infer cellular composition & compare

It’s finally time!

Altogether: Deconvolution with a combined sc reference

Tools are frequently updated to new versions. Your Galaxy may have multiple versions of the same tool available. By default, you will be shown the latest version of the tool. This may NOT be the same tool used in the tutorial you are accessing. Furthermore, if you use a newer tool in one step, and try using an older tool in the next step… this may fail! To ensure you use the same tool versions of a given tutorial, use the Tutorial mode feature.

  • Open your Galaxy server
  • Click on the curriculum icon in the top menu; this will open the GTN inside Galaxy.
  • Navigate to your tutorial
  • Tool names in tutorials will be blue buttons that open the correct tool for you
  • Note: this does not work for all tutorials (yet)
  • You can click anywhere in the greyed-out area outside of the tutorial box to return to the Galaxy analytical interface
Warning: Not all browsers work!
  • We’ve had some issues with Tutorial mode on Safari for Mac users.
  • Try a different browser if you aren’t seeing the button.

Hands-on: Comparing: altogether
  1. MuSiC Compare (Galaxy version 0.1.1+galaxy4) with the following parameters:
    • In “New scRNA Group”:
      • “Insert New scRNA Group”
        • “Name of scRNA Dataset”: scRNA_set
        • “scRNA Dataset”: ESet_object_sc_combined.rdata (Input dataset)
        • In “Advanced scRNA Parameters”:
          • “Cell Types Label from scRNA dataset”: Inferred cell type - author labels
          • “Samples Identifier from scRNA dataset”: Individual
          • “Comma list of cell types to use from scRNA dataset”: alpha cell,beta cell,delta cell,gamma cell,acinar cell,ductal cell
        • In “Bulk Datasets in scRNA Group”:
          • “Insert Bulk Datasets in scRNA Group”
            • “Name of Bulk Dataset”: Bulk_set:Normal
            • “Bulk RNA Dataset”: ESet_object_bulk_healthy.rdata (Input dataset)
            • “Factor Name”: Disease
          • “Insert Bulk Datasets in scRNA Group”
            • “Name of Bulk Dataset”: Bulk_set:T2D
            • “Bulk RNA Dataset”: ESet_object_bulk_T2D.rdata (Input dataset)
            • “Factor Name”: Disease
  2. To each of the outputs, add the #altogether tag.
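
The nesting of the tool form (one scRNA group containing one or more bulk datasets) can be easier to see when written as a data structure. The sketch below mirrors the parameter values from the hands-on box; the dictionary keys are made up for this illustration and are not the tool's own parameter names.

```python
# Cell type list exactly as entered in the form: comma-separated, no spaces after commas
CELL_TYPES = "alpha cell,beta cell,delta cell,gamma cell,acinar cell,ductal cell"

# Hypothetical key names; values are copied from the hands-on box above
altogether = {
    "name": "scRNA_set",
    "dataset": "ESet_object_sc_combined.rdata",
    "cell_type_label": "Inferred cell type - author labels",
    "sample_id": "Individual",
    # Splitting on commas only keeps names such as "beta cell" intact
    "cell_types": CELL_TYPES.split(","),
    "bulk": [
        {"name": "Bulk_set:Normal",
         "dataset": "ESet_object_bulk_healthy.rdata",
         "factor": "Disease"},
        {"name": "Bulk_set:T2D",
         "dataset": "ESet_object_bulk_T2D.rdata",
         "factor": "Disease"},
    ],
}

print(altogether["cell_types"])
print([b["name"] for b in altogether["bulk"]])
```

Both bulk datasets share the same factor name, which is what allows their results to be compared across Disease in the outputs.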

There are four sets of output files.

  1. Summarised Plots <- This is the most interesting output, because it has the pretty pictures!
  2. Individual Heatmaps <- This kind of does what standard (non-Comparing) MuSiC does for each sample, rather than combining them.
  3. Stats <- This will be very handy if you want to make any statistical calculations, as it contains medians and quartiles
  4. Tables <- This contains the cell proportions found within each sample as well as the number of reads.

Summarised Plots

Examine the output file Summarised Plots (MuSiC). The first few pages are similar to those from the standard deconvolution tool, but now compare across the factor of interest (disease). Among the many visualisations available, our favourite is on page 5 - a comparison of inferred cell proportions across disease.

Graph showing two rows for each cell type (gamma, ductal, delta, beta, alpha, and acinar cells) comparing normal and T2D proportions, either by read or by sample.

Figure 2: Altogether: Resulting Graph

Here we can see that the bulk RNA-seq samples from the T2D patients contain markedly fewer beta cells than their healthy counterparts. This makes sense, since T2D is associated with a loss of functional beta cells, so that’s good!

Individual Heatmaps

Examine the output file Individual Heatmaps (MuSiC). This shows the cell distribution across each of the individual samples, separated by disease factor into two plots, but ultimately isn’t particularly informative.

Heatmap showing vertically four samples (Bulk) and horizontally each cell type (from alpha to ductal), with gradations of colour referring to the proportion of that cell type within each sample.

Figure 3: Altogether: Individual heatmaps

Stats

If you select the Stats dataset, you’ll find it contains four sets of data: Bulk_disease: Read Props, Bulk_disease: Sample Props, Bulk_healthy: Read Props and Bulk_healthy: Sample Props. Examine the file Bulk_disease: Sample Props. This contains summary statistics (Min, quartiles, median, mean, etc.) for each phenotype. This could be quite helpful if you’re trying to statistically identify differences across samples.

Table with rows as cell types and columns as statistics: Min, 1st Qu., Median, Mean.

Figure 4: Altogether: Stats
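
If you want to recompute this kind of summary yourself from a column of proportions, a minimal sketch is shown below. The proportions are placeholders, not results from this tutorial. The column labels follow the style seen in the Stats table; the "3rd Qu." and "Max" columns, and the choice of the "inclusive" quantile method, are assumptions about how such a summary is typically computed rather than a description of the tool's internals.

```python
import statistics


def summarise(values):
    """Return min, quartiles, median, mean and max for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Min": min(values),
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.mean(values),
        "3rd Qu.": q3,
        "Max": max(values),
    }


# Placeholder proportions for one cell type across four samples (not real results)
example_props = [0.10, 0.20, 0.30, 0.40]

summary = summarise(example_props)
for label, value in summary.items():
    print(f"{label}: {value:.3f}")
```

With only three or four samples per phenotype, as in this dataset, such summaries describe the data but give little statistical power on their own.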

Tables

Finally, if you select the Tables dataset, you’ll find it contains three sets of data, Data Table, Matrix of Cell Type Read Counts, and Matrix of Cell Type Sample Proportions.

Examine the file Data Table. This contains the inferred proportions and reads associated with each sample and cell type, along with the factor of interest (Disease). In this tutorial, we tend to use sample proportions rather than read counts, but either works. The other two matrix files are just portions of this data table.

Table with individual rows for each cell type within each sample, with columns of Cell, Factor, CT Prop in Sample and Number of Reads.

Figure 5: Altogether: Data table
Question
  1. Why does the data table contain 42 rows?
  1. The data table contains a row for each cell type within each sample. Since there are 6 cell types and 7 samples, 6*7 = 42 rows.
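
The row count, and the relationship between the long Data Table and the two matrix files, can be sketched as follows. The sample names are hypothetical and the proportions are left empty; only the shape of the tables is illustrated.

```python
from itertools import product

cell_types = ["alpha cell", "beta cell", "delta cell",
              "gamma cell", "acinar cell", "ductal cell"]

# Hypothetical sample names: 3 healthy and 4 T2D bulk samples
samples = {f"Normal_{i}": "normal" for i in range(1, 4)}
samples.update({f"T2D_{i}": "T2D" for i in range(1, 5)})

# Long format: one row per (sample, cell type) pair; values left empty here
rows = [
    {"sample": s, "cell": c, "factor": samples[s], "prop": None}
    for s, c in product(samples, cell_types)
]
print(len(rows))  # 6 cell types * 7 samples = 42

# Wide format: one row per sample, one column per cell type
matrix = {s: {} for s in samples}
for row in rows:
    matrix[row["sample"]][row["cell"]] = row["prop"]
print(len(matrix), "x", len(cell_types))
```

The wide layout corresponds to the matrix files, which hold one value (reads or proportions) per sample and cell type.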

Hopefully, this has been illuminating! Now let’s try two other ways of inferring from a reference and see if it makes a difference.

Like4like: Deconvolution of healthy samples with a healthy reference and diseased samples with a diseased reference

Hands-on: Like4like Inference
  1. MuSiC Compare (Galaxy version 0.1.1+galaxy4) with the following parameters:
    • In “New scRNA Group”:
      • “Insert New scRNA Group”
        • “Name of scRNA Dataset”: scRNA_set:Normal
        • “scRNA Dataset”: ESet_object_sc_healthy.rdata (Input dataset)
        • In “Advanced scRNA Parameters”:
          • “Cell Types Label from scRNA dataset”: Inferred cell type - author labels
          • “Samples Identifier from scRNA dataset”: Individual
          • “Comma list of cell types to use from scRNA dataset”: alpha cell,beta cell,delta cell,gamma cell,acinar cell,ductal cell
        • In “Bulk Datasets in scRNA Group”:
          • “Insert Bulk Datasets in scRNA Group”
            • “Name of Bulk Dataset”: Bulk_set:Normal
            • “Bulk RNA Dataset”: ESet_object_bulk_healthy.rdata (Input dataset)
            • “Factor Name”: Disease
      • “Insert New scRNA Group”
        • “Name of scRNA Dataset”: scRNA_set:T2D
        • “scRNA Dataset”: ESet_object_sc_T2D.rdata (Input dataset)
        • In “Advanced scRNA Parameters”:
          • “Cell Types Label from scRNA dataset”: Inferred cell type - author labels
          • “Samples Identifier from scRNA dataset”: Individual
          • “Comma list of cell types to use from scRNA dataset”: alpha cell,beta cell,delta cell,gamma cell,acinar cell,ductal cell
        • In “Bulk Datasets in scRNA Group”:
          • “Insert Bulk Datasets in scRNA Group”
            • “Name of Bulk Dataset”: Bulk_set:T2D
            • “Bulk RNA Dataset”: ESet_object_bulk_T2D.rdata (Input dataset)
            • “Factor Name”: Disease
  2. Add the #like4like tag to each of the outputs.
Question
  1. How have the cell inferences changed, now that we have changed the scRNA references used?
Two graphs showing two rows for each cell type (gamma, ductal, delta, beta, alpha, and acinar cells) comparing normal and T2D proportions, either by read or by sample, with the top graph labelled #altogether and the bottom labelled #like4like. Differences are more pronounced in the top #altogether graph.

Figure 6: Altogether vs like4like
  1. Overall, our interpretation here is that the differences are less pronounced. It’s interesting to conjecture whether this is an artefact of the analysis, or whether - possibly - the beta cells in the diseased samples are not only fewer, but also contain fewer beta-cell-specific transcripts (reflecting inhibited beta cell function), thereby lowering the bar for the inference of a beta cell and leading to a higher proportion of inferred beta cells.

Let’s try one more inference - this time, we’ll use only healthy cells as a reference, to (theoretically) make a more consistent analysis across the two phenotypes.

healthyscref: Deconvolution using only healthy cells as a reference

Hands-on: Healthy sc reference only inference
  1. MuSiC Compare (Galaxy version 0.1.1+galaxy4) with the following parameters:
    • In “New scRNA Group”:
      • “Insert New scRNA Group”
        • “Name of scRNA Dataset”: scRNA_set:Normal
        • “scRNA Dataset”: ESet_object_sc_healthy.rdata (Input dataset)
        • In “Advanced scRNA Parameters”:
          • “Cell Types Label from scRNA dataset”: Inferred cell type - author labels
          • “Samples Identifier from scRNA dataset”: Individual
          • “Comma list of cell types to use from scRNA dataset”: alpha cell,beta cell,delta cell,gamma cell,acinar cell,ductal cell
        • In “Bulk Datasets in scRNA Group”:
          • “Insert Bulk Datasets in scRNA Group”
            • “Name of Bulk Dataset”: Bulk_set:Normal
            • “Bulk RNA Dataset”: ESet_object_bulk_healthy.rdata (Input dataset)
            • “Factor Name”: Disease
          • “Insert Bulk Datasets in scRNA Group”
            • “Name of Bulk Dataset”: Bulk_set:T2D
            • “Bulk RNA Dataset”: ESet_object_bulk_T2D.rdata (Input dataset)
            • “Factor Name”: Disease
  2. Add the #healthyscref tag to each of the outputs.
Question
  1. How have the cell inferences changed this time?
Three graphs showing two rows for each cell type (gamma, ductal, delta, beta, alpha, and acinar cells) comparing normal and T2D proportions, either by read or by sample, with the top graph labelled #altogether; the middle labelled #like4like; and the bottom labelled #healthyscref. Differences are most pronounced in the bottom #healthyscref graph.

Figure 7: The impact of the single cell reference
  1. If using a like4like inference reduced the difference between the phenotypes, aligning both phenotypes to the same (healthy) reference exacerbated it - even fewer beta cells are inferred in the diseased samples in this analysis.

Overall, it’s important to remember how the inference changes depending on the reference used - for example, a combined reference might have majority healthy samples or diseased samples, so that would impact the inferred cellular compositions.
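
As a quick check of how balanced a combined reference is, you can count the subjects of each phenotype. The sketch below uses the subject numbers given in the Data section (6 healthy, 4 T2D); it counts subjects only, and the number of cells contributed by each subject may differ, which this example does not capture.

```python
from collections import Counter

# Subjects in the combined single cell reference, per the Data section
reference_subjects = ["healthy"] * 6 + ["T2D"] * 4

counts = Counter(reference_subjects)
total = sum(counts.values())
shares = {phenotype: n / total for phenotype, n in counts.items()}

for phenotype, share in shares.items():
    print(f"{phenotype}: {counts[phenotype]} subjects ({share:.0%})")
```

Here the combined reference leans towards healthy subjects, which is worth keeping in mind when interpreting the altogether results.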

Conclusion

Congrats! You’ve made it to the end of this suite of deconvolution tutorials! You’ve learned how to find quality data for reference and for analysis, how to reformat it for deconvolution using MuSiC, and how to compare cellular inferences using multiple kinds of reference datasets. You can find the workflow for this tutorial and an example history.

We hope this helps you in your research!

Workflow editor showing 5 inputs and 3 runs of the MuSiC Compare tool.

Figure 8: MuSiC Compare Tutorial Workflow

This tutorial is part of the https://singlecell.usegalaxy.eu portal (Tekman et al. 2020).

To discuss with like-minded scientists, join our Galaxy Training Network chatspace in Slack and talk with fellow users of Galaxy single cell analysis tools on #single-cell-users.

We also post new tutorials / workflows there from time to time, as well as any other news.

If you’d like to contribute ideas, requests or feedback as part of the wider community building single-cell and spatial resources within Galaxy, you can also join our Single cell & sPatial Omics Community of Practice.

You can request tools here on our Single Cell and Spatial Omics Community Tool Request Spreadsheet.