Filter, plot, and explore single cell RNA-seq data with Seurat

Author(s)	Camila Goclowski
Editor(s)	Pavankumar Videm Helena Rasche
Reviewers
Infrastructure	Pablo Moreno

Overview
Questions:

Is my single cell dataset a quality dataset?

How do I pick thresholds and parameters in my analysis? What’s a “reasonable” number, and will the world collapse if I pick the wrong one?

How do I generate and annotate cell clusters?

Objectives:

Interpret quality control plots to direct parameter decisions

Repeat analysis from matrix to clustering to labelling clusters

Identify decision-making points

Appraise data outputs and decisions

Explain why single cell analysis is an iterative process (i.e. the first plots you generate are not final, but rather you go back and re-analyse your data repeatedly)

Requirements:

Introduction to Galaxy Analyses

slides Slides: An introduction to scRNA-seq data analysis

tutorial Hands-on: Pre-processing of 10X Single-Cell RNA Datasets

Time estimation: 3 hours

Supporting Materials:

Datasets

Workflows

galaxy-history-input Input Histories

usegalaxy.eu

usegalaxy.eu-ARCHIVED
2024-12-10

help How to Use This

galaxy-history-answer Answer Histories

usegalaxy.eu
2024-06-03

usegalaxy.eu - ARCHIVED
2024-12-10

help How to Use This

instances Available on these Galaxies

Known Working

UseGalaxy.eu ✅ ⭐️

UseGalaxy.fr ✅ ⭐️

UseGalaxy.org (Main) ✅ ⭐️

UseGalaxy.org.au ✅ ⭐️

Published: Apr 9, 2024

Last modification: Mar 11, 2025

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

purl PURL: https://gxy.io/GTN:T00438

rating Rating: 4.0 (1 recent ratings, 1 all time)

version Revision: 10

You’ve previously done all the work to make a single cell matrix. Now it’s time to fully process our data using Seurat: remove low quality cells, reduce the many dimensions of data that make it difficult to work with, and ultimately try to define clusters and find some biological meaning and insights! There are many packages for analysing single cell data - Seurat (Satija et al. 2015), Scanpy (Wolf et al. 2018), Monocle (Trapnell et al. 2014), Scater (McCarthy et al. 2017), and many more. We’re working with Seurat because it is well updated, broadly used, and highly trusted within the field of bioinformatics.

Comment

This tutorial is significantly based on the Seurat documentation(Satija et al. 2015) as well as Seurat’s Guided Clustering Tutorial.

Agenda

In this tutorial, we will cover:

Get Data onto Galaxy and generate a Seurat object

Get Data onto Galaxy

Generating a Seurat object

Get Data onto Galaxy

Generating a Seurat object

QC Plots

1. Sex Differences in Counts?

2. Genotype Differences in Counts?

Finding Our Filtering Parameters

Applying our Thresholds

Processing

1. Normalization

2. Identifying Variable Genes

3. Scale Data

Dimensionality Reduction

Let’s Take a Look

Differential Expression Testing: Finding Markers

Biological Interpretations

Keep Digging

Technical Assessment

We’ll provided you with experimental data to analyse from a mouse dataset of fetal growth restriction (Bacon et al. 2018). This is the full dataset generated from this tutorial.

Get Data onto Galaxy and generate a Seurat object

Get Data onto Galaxy

To start, let’s get our dataset loaded into Galaxy.

Hands-on: Choose Your Own Tutorial

This is a "Choose Your Own Tutorial" (CYOT) section (also known as "Choose Your Own Analysis" (CYOA)), where you can select between multiple paths. Click one of the buttons below to select how you want to follow the tutorial

There are multiple ways in which to collect the data for this tutorial. I find it easiest to do so via the EBI Data Retrieval.

EBI Data Retrieval Importing from a history Uploading from Zenodo

Hands On: EBI Data Retrieval

You may retrieve that files necessary to construct a Seurat Object in this way. Doing to will alleviate the necessity to convert AnnData (Python) objects into Seurat (R) objects:

Run EBI SCXA Data Retrieval ( Galaxy version v0.0.2+galaxy2) with the following parameters:

“SC-Atlas experiment accession”: E-MTAB-6945

“Choose the type of matrix to download”: Raw filtered counts

Hands On: Importing from a history

You can import this history

Open the link to the shared history

Click on the Import this history button on the top left

Enter a title for the new history

Click on Copy History

This also alleviates the necessity to convert the AnnData object into a Seurat one, which is an additional step you must complete if you choose to use the next method.

Generating a Seurat object

You now should have imported the matrix.mtx, genes.tsv, barcodes.tsv, and exp_design.tsv files into your Galaxy history. In order for Seurat tools to work, we will have to convert the data into a format that Seurat recognizes. To do so, we will add row and column names to our matrix. In the end, this will leave us with a matrix whose rows are gene names, columns are cell barcodes, and each value in the matrix represent the expression value of a given gene in a given cell.

This can be accomplished via the Read10x step. Read10x tool implements Seurat’s function to create a matrix and add in feature and barcode names simultaneously:

Hands On: Read10X

Run Seurat Read10x ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“Expression matrix in sparse matrix format (.mtx)”: EBI SCXA Data Retrieval on E-MTAB-6945 matrix.mtx (Raw filtered counts)

“Gene table”: EBI SCXA Data Retrieval on EMTAB-6945 genes.tsv (Raw filtered counts)

“Barcode/cell table”: EBI SCXA Data Retrieval on E-MTAB-6945 barcodes.tsv (Raw filtered counts)

“Cell Metadata”: EBI SCXA Data Retrieval on E-MTAB-6945 exp_design.tsv

“Minimum cells to include features”: 5

“Choose the format of the output”: RDS with a Seurat object

Rename galaxy-pencil output Initial Seurat Object

The output of this tool will result in a Seurat object with row/column names as described above.

Get Data onto Galaxy

Hands On: Uploading from Zenodo
Create a new history for this tutorial
Import the AnnData object from Zenodo
https://zenodo.org/record/7053673/files/Mito-counted_AnnData
Copy the link location

Click galaxy-upload Upload Data at the top of the tool panel

Select galaxy-wf-edit Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window
Rename galaxy-pencil the datasets Mito-counted AnnData

Click on the galaxy-pencil pencil icon for the dataset to edit its attributes

In the central panel, change the Name field to Mito-counted AnnData

Click the Save button

Check that the datatype is h5ad

Click on the galaxy-pencil pencil icon for the dataset to edit its attributes

In the central panel, click galaxy-chart-select-data Datatypes tab on the top

In the galaxy-chart-select-data Assign Datatype, select h5ad from “New Type” dropdown

Tip: you can start typing the datatype into the field to filter the dropdown menu

Click the Save button

Generating a Seurat object

When you uploaded your data from Zenodo, it came in AnnData format, thus we will need to convert this to a Seurat Object. This can be easily accomplished using the Seurat FilterCells tool. Simply run the tool without any actual filtering thresholds and with the following parameters:

Hands On: Filter Cells

Run Seurat FilterCells ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“Choose the format of the input”: AnnData

“AnnData file”: Mito-counted AnnData

“Choose the format of the output”: RDS with a Seurat object

You’ve created a Seurat object, congratulations!

QC Plots

Now that we have a complete Seurat object, we can begin the filtering process.

There will be a number of ‘cells’ that are actually just empty droplets or low-quality. There will also be genes that could be sequencing artifacts or that appear with such low frequency that statistical tools will fail to accurately analyse them.

This background noise of both cells and genes not only makes it harder to distinguish real biological information from artifacts, but also makes it computationally demanding to analyze.

We want to filter our cells, but first we need to know what our data looks like. There are a number of subjective decisions to make within scRNA-seq analysis, for instance we now need to make informed decisions about where to set our thresholds (more on that soon!).

We’re going to plot our data a few different ways. Different bioinformaticians might prefer to see the data in different ways, and here we are only generating a few of the plots you can use. Ultimately you need to go with what makes the most sense to you.

So let’s generate some QC plots. First off, let’s visualize the spread of our data:

Hands On: Visualize Counts

Run Plot with Seurat ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Initial Seurat Object (output of Seurat Read10x tool)

“Plot_type_selector”: VlnPlot

“Features”: nCount_RNA

“Log”: Yes

Violin Plot of Counts. — **Figure 1**: Violin Plot of counts.

This plot will show us the spread of cells in our data containing a given number of counts (or transcripts) observed in a given cell. We can use this plot, and others like it in a moment, to help filter out the uninformative cells.

In a similar fashion we can visualize the spread of cells in our data expressing a given number of features (or genes):

Hands On: Visualize Features

Run Plot with Seurat ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Initial Seurat Object (output of Seurat Read10x tool)

“Plot_type_selector”: VlnPlot

“Features”: nFeature_RNA

“Log”: Yes

Violin Plot of Features. — **Figure 2**: Violin Plot of features.

Now, we could just pick filtering thresholds based on these plots, and in a typical pipeline we would also plot the proportion of features that map to the mitochondrial genome (a tool coming soon to do so!). In the meantime, let’s do some QC checks.

We can, and should, ask a number of questions about the quality of our data before conducting any actual analyses. Batch effect, for example carries the potential to alter the conclusions we make. Let’s take a look at whether this may be the case here:

Hands On: Visualize Counts Split by Individual

Run Plot with Seurat ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Initial Seurat Object (output of Seurat Read10x tool)

“Plot_type_selector”: VlnPlot

“Features”: nCount_RNA

“Group by”: Sample.Characteristic.individual.

“Split by”: Sample.Characteristic.individual.

“Log”: Yes

“Fill by”: ident

Violin Plot of Counts Split by Individual. — **Figure 3**: Violin Plot of counts split by individual.

This plot shows us the number of cells split by the individual (mouse) from which the cells came from. Now, depending on your experimental design, batch may be represented by something other than individual–like timepoint or even the wet lab researcher who isolated the cells.

Ideally, we would like to see a relatively even distribution of counts for each individual (or batch) but if there isn’t, fear not, we can regress this variable out in a later step.

In order to accurately assess potential batch effects, use the “group.by” parameter to indicate the variable which differed across experiments. If you are analyzing your own data, try plotting another variable which you know differed across experiments or even just samples.

Now let’s get an idea of how other variables, like sex or genotype of the mice, might be represented across our dataset.

1. Sex Differences in Counts?

Hands On: Visualize Counts Split by Sex

Run Plot with Seurat ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Initial Seurat Object (output of Seurat Read10x tool)

“Plot_type_selector”: VlnPlot

“Features”: nCount_RNA

“Group by”: Sample.Characteristic.sex.

“Split by”: Sample.Characteristic.sex.

“Log”: Yes

“Fill by”: ident

Violin Plot of Counts Split by Sex with males representing more of the counts than females but distributions are about the same across both sexes. — **Figure 4**: Violin Plot of counts split by sex.

2. Genotype Differences in Counts?

Hands On: Visualize Counts Split by Genotype

Run Plot with Seurat ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Initial Seurat Object (output of Seurat Read10x tool)

“Plot_type_selector”: VlnPlot

“Features”: nCount_RNA

“Group by”: Sample.Characteristic.genotype.

“Split by”: Sample.Characteristic.genotype.

“Log”: Yes

“Fill by”: ident

Violin Plot split by Genotype--Mutant versus Control. — **Figure 5**: Violin Plot of counts split by Genotype--Mutant versus Control.

Finding Our Filtering Parameters

Now that we have a better understanding of what our data looks like, we can begin identifying those spurious reads and low quality cells and then remove them.

In a standard workflow, we would plot the percentage of mitochondrial reads (perc.mt) against the total number of reads per cell (nCount_RNA) and number of genes per cell (nFeature_RNA) to get an idea of our thresholds.

Comment: High Mitochondrial Reads

High perc.mt will typically indicate stressed out cells (often due to the extraction, sorting, or sample prep protocols).

However, this is not always necessary, and in fact, filtering based on counts and features (indeed, even just counts alone) will often remove the cells with spuriously high mitochondrial transcripts. These tools (and this tutorial) will soon be updated to allow us to do so–in the meantime, please see Filter, Plot, and Explore single cell RNA-seq data (Seurat, R) Filter, plot and explore single-cell RNA-seq (Scanpy), or Filter, plot and explore single-cell RNA-seq data (Scanpy, Python) if you hope to include perc.mt in your own data analysis adventures.

For now, we will use just transcript and gene counts to filter our data. Let’s take a look back at our nFeature Violin Plot to pick our gene threshold:

Comment: Interpretations

You can see that very few cells in the dataset contain fewer than ~500 genes. Biologically, this makes sense, and the cells appear to be outliers in the data. As such, we will set our lower threshold of genes (nFeature_RNA) at 500.

Now, what about transcripts (nCount_RNA)? Let’s take a look:

Comment: Interpretations

This one is a bit more difficult to visualize but we see a severe drop off in the number of cells that contain fewer than 500 and more than 10,000 transcripts. These will be our nCount thresholds that we filter based on.

These cells won’t tell us much biologically, rather, they will contribute noise that we’ll want to filter out of the data. With that being said, filtering scRNA-seq data will always be an iterative process–so label your work well and be ready to revisit these thresholds if your analyses seem strange down the line.

Applying our Thresholds

It’s time to filter our cells by applying the above thresholds!

In order to include more than one parameter by which to filter, use the “Insert Subsets used to filter cells” button below the first parameter box.

Hands On: Filter Cells

Run Seurat FilterCells ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Initial Seurat Object (output of Seurat Read10x tool)

In “Subsets used to filter cells”:

“Name of Parameter to filter on”: nCount_RNA

“Min value”: 500.0

“Max value”: 10000

param-repeat “Insert Subsets used to filter cells”

“Name of Parameter to filter on”: nFeature_RNA

“Min value”: 500.0

“Max value”: 1000000000.0

“Choose the format of the output”: RDS with a Seurat object

Rename galaxy-pencil output Filtered Seurat Object

In this step we are creating a new Seurat object (notice that the selected output of this tool will be an RDS file as opposed to the png plots we have thus far been creating).

Now, genes that do not appear in any cell, or even in only 1 or 2 cells, may break some analytical tools and will generally not be biologically informative.

Since you’ve removed a whole heap of cells, and the captured genes are sporadic (i.e. a small percentage of the overall transcriptome per cell) this means there are a number of genes still present in your matrix that are not expressed in any of the cells.

The removal of these genes is by no means necessary, but will speed up your analyses. The developers are currently working to enable a means of doing this through the Seurat Tools, but, in the meantime if you are analyzing your own data and would like to filter genes–please see Filter, Plot, and Explore single cell RNA-seq data (Seurat, R) Filter, plot and explore single-cell RNA-seq (Scanpy), or Filter, plot and explore single-cell RNA-seq data (Scanpy, Python).

Processing

Currently, we still have quite big data. We have two issues here

We already saw in our filtering plots that there are differences in how many transcripts and genes have been counted per cell. This technical variable could, and likely will, distract us from identifying true biological differences.
We like to plot things on 2-dimensional X/Y plots. So, for instance, Gapdh could be on one axis, and Actin could be on another, and then each cell is plotted onto that 2D axis based on how many of each transcript they possess.

Although that would be fine, adding in a 3rd dimension (or, indeed, in our case, a dimension for each of the thousands of genes), is a bit trickier.

So, our next steps will be to transform our big data object into something that is easy to analyse and easy to visualize: this is commonly referred to as preprocessing of the data and a typical scRNA-seq preprocessing pipeline will include the following steps:

1. Normalization

What is Normalization?

Normalisation helps reduce the differences between gene and UMI counts by fitting total counts across cells in our data to be comparable to one another. SCTransform regularizes the gene expression profiles via a negative binomial regression while also controlling for overfitting of the data.

Hands On: Normalize Data

Run Seurat NormaliseData ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Filtered Seurat Object (output of Seurat FilterCells tool)

“Normalisation method”: Log Normalise

Rename galaxy-pencil output Normalised Seurat Object

2. Identifying Variable Genes

What are variable genes?

The datasets have loads of genes, but not all of them vary in expression from cell to cell. For instance, housekeeping genes are defined as not changing much from cell to cell, so we could remove these from our data to simplify our analyses.

The find variable genes step flags genes that do vary across cells to expedite future analyses and ensure that we, and Seurat, don’t waste time looking for meaningful differences where they don’t exist.

Hands On: Find Variable Genes

Run Seurat FindVariableGenes ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Normalised Seurat Object (output of Seurat NormaliseData tool)

“Choose the format of the output”: RDS with a Seurat object

Rename galaxy-pencil output Normalised Seurat Object with Variable Features

This tool will output two new pieces of data into our Galaxy history:

a new Seurat object with variable features identified and flagged
a tabular file with a list of these variable genes.

This gene list may be used as a sneak peak into understanding what the dataset will look like! We can begin to understand which genes are going to be driving downstream clustering of our cells and maybe even make some decisions about whether we are happy with our filtering based on this list.

3. Scale Data

Now we will scale the data.

What is scaling?

This is an important step to set up our data for further dimensionality reduction. It will transform the dataset such that all genes have the same variance and a zero mean. It helps negate sequencing depth differences between samples, since the gene levels across the cells become comparable.

Comment: Don't Worry!

Note, that the differences from scaling etc. are not the values you have at the end - i.e. if your cell has average GAPDH levels, it will not appear as a ‘0’ when you calculate gene differences between clusters.

Hands On: Scale Data

Run Seurat ScaleData ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Normalised Seurat Object with Variable Features (output of SeuratFindVariableGenes tool)

“Choose the format of the output”: RDS with a Seurat object

“Genes to use”: Seurat FindVariableGenes on data 12: Variable genes tabular file

“Vars to regress”: nCount_RNA

“Statistical model”: Linear model

Rename galaxy-pencil output Scaled, Preprocessed Seurat Object

You now have a preprocessed Seurat object!

Comment: Regressing Variables

Take note of the “Vars to regress” argument in the above tool. This function allow us to mitigate the effects of confounding factors in our dataset. In true research practice, I often regress out multiple variables including but not limited to perc.mt, cell cycle scoring, and feature count. As currently written, this tool only allows us to regress out a single variable: so feel free to pick another to regress and see how it changes the downstream analyses!

Dimensionality Reduction

Although we’ve made our expression values comparable to one another and our overall dataset less computationally demanding, we still have way too many dimensions (n cells x n genes!).

Transcript changes are not usually singular–which is to say, genes function and exist in pathways and groups. It would be easier to analyse our data if we could group these differences. To address this we will run principal component analysis (PCA).

Comment: What is PCA?

Principal components (PCs) are calculated from highly dimensional data to find the most representative spread in the dataset. So in our highly variable gene dimensions, there will be one line (axis) that yields the most spread and variation across the cells. That will be our first principal component.

We can calculate the first handful of principal components in our data to drastically reduce the number of dimensions:

You’ll notice that the Seurat RunPCA tool is run using the variable features from the previous step. This signficantly decreases the number of genes, and their expression changes, that must be grouped into principal components by this step.

Hands On: Run PCA

Run Seurat RunPCA ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Scaled, Preprocessed Seurat Object (output of Seurat ScaleData tool)

“Choose the format of the output”: RDS with a Seurat object

“Genes to scale”: Seurat FindVariableGenes on data 12: Variable genes tabular file

Rename galaxy-pencil RDS output PCA Processed Seurat Object

This tool will output you with four new datasets into your history:

Seurat RDS which includes all of the following PCA metadata
Embeddings: Principal component values for each of the cells in your dataset
Loadings: Prinicial component values for each of the genes in your dataset
Standard deviations of each principal component coordinates

In order to use the PCA information which was just calculated, we must visualize it. The currently available tools unfortunately do not [YET] carry the capacity to do so, but I will provide the plot so that we may make an informed decision together:

Open image in new tab

Figure 8: Elbow Plot

We can see that there is really not much variation explained past the 9th PC. So we might save ourselves a great deal of time and muddied data by focusing on the top 15 PCs to be conservative. You can also think about it like choosing a threshold of variance explained. Conservatively, 2.5 standard deviations are explained by about 10 of the PCs.

We’re still looking at around 15 dimensions at this point–likely not the easiest to visualize. To make our lives easier, we need to identify how similar a cell is to another cell, across every cell across each of these dimensions.

For this, we will use the k-nearest neighbor (kNN) graph, to identify which cells are close together and which are not.

The kNN graph plots connections between cells if their distance (when plotted in this 10 dimensional space!) is amongst the k-th smallest distances from that cell to other cells. This will be crucial for identifying clusters, and is necessary for plotting a UMAP–which is what will ultimately allow us to visualize our data in 2 dimensions.

Comment: From UMAP developers:

“Larger neighbor values will result in more global structure being preserved at the loss of detailed local structure. In general this parameter should often be in the range 5 to 50, with a choice of 10 to 15 being a sensible default”.

Let’s now use the 15 PC threshold we chose from the Elbowplot and apply it to find neighbors:

Hands On: Find Neighbors

Run Seurat FindNeighbours ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: PCA Processed Seurat Object (output of Seurat RunPCA tool)

“Reduction”: pca

“Dimensions”: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

“Assay”: RNA

Rename galaxy-pencil output Preprocessed Seurat Object with Neighbors

Now we can use the neighborhood graph to identify clusters of cells whose transcriptional profiles appear most similar to one another: we can identify and label clusters:

Hands On: Find Clusters

Run Seurat FindClusters ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Preprocessed Seurat Object with Neighbors (output of Seurat FindNeighbors tool)

In “Advanced Options “

“Resolution”: 0.5

Rename galaxy-pencil output Preprocessed Seurat Object with Clusters

This tool will output two new datasets: as usual, a new Seurat object which includes a metadata column denoting which cluster each cell was assigned to, and a csv file of the same information.

Unfortunately, identifying clusters is not as majestic as biologists often think - the math doesn’t necessarily identify true cell clusters. Every algorithm for identifying cell clusters falls short of a biologist knowing their data, knowing what cells should be there, and proving it in the lab.

So, we’re going to make the best of it as a starting point and see what happens! We will define clusters from the kNN graph, based on how many connections cells have with one another. Roughly, this will depend on a resolution parameter for how granular you want to be.

The resolution parameter available in the advanced options of this tool allows for you, the bioinformatician, to dictate the granularity of the clusters.

For example, a higher clustering resolution dictates increased granularity, and more stringent clusters. That is–cells must more closely resemble one another in order to be grouped into the same cluster than at a lower clustering resolution.

In general, I find it easiest to think of a higher resolution producing more clusters and conversely, a lower resolution will produce less clusters. This parameter is a useful one that you will use often to help decipher how many true populations of cells are present in your data!

Now that we have made note within our object of which cells cluster together, we can start to visualize our data! Two major visualizations for this data type are tSNE and UMAP. We can calculate the coordinates for both prior to visualization. For tSNE, the parameter perplexity can be changed to best represent the data, while for UMAP the main change would be to change the kNN graph above itself, via the Seurat FindNeighbors tool.

UMAP is the most recently developed, and most widely used dimensionality reduction for visualization of principal component data. It has been optimized since tSNE to better preserve global structure and is less computationally demanding.

Hands On: Run UMAP

Run Seurat UMAP ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Preprocessed Seurat Object with Clusters (output of Seurat FindClusters tool)

“Choose the format of the output”: RDS with a Seurat object

“Dims”: 1:15

Rename galaxy-pencil output Final Preprocessed Seurat Object

You now have a completely preprocessed and ready to be analyzed Seurat object–congratulations!

Let’s Take a Look

Now that we have run dimensionality reduction on our dataset, it is ready for visualization. Let’s take a look at what our cells look like in a UMAP projection:

Hands On: Plot UMAP

Run Plot with Seurat ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Final Preprocessed Seurat Object (output of Seurat UMAP tool)

“Plot_type_selector”: DimPlot

“Group by”: RNA_nn_res.0.5

DimPlot colored by 0.5 resolution clusters with a total of 8 clusters--two noticeably different "island" populations and six clustering more closely to one another. — **Figure 9**: DimPlot colored by 0.5 resolution cluster.

Good work! It looks like with a clustering resolution of 0.5, we are able to identify 8 clusters of cells in our data.

We can also look for expression of particular genes and see how those map to our UMAP projection. This is often useful in getting an initial understanding of which clusters might be representative of which cell types.

Hands On: Plot Gapdh

Run Plot with Seurat ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Final Preprocessed Seurat Object (output of Seurat UMAP tool)

“Plot_type_selector”: FeaturePlot

“Features”: Gapdh

FeaturePlot of Gapdh expression shown boradly across the whole dataset. — **Figure 10**: FeaturePlot: Gapdh

We just plotted a housekeeping gene, Gapdh, so the broad expression we observe is expected.

In practice, it is helpful to plot known markers of cell types you expect to be in your dataset. This will give you a first look at how your cells are clustered.

For example, we can plot early T-cell marker Il2ra and get an idea of which cells and/or clusters might resemble the early T-cells:

Hands On: Plot Il2ra

Run Plot with Seurat ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Final Preprocessed Seurat Object (output of Seurat UMAP tool)

“Plot_type_selector”: FeaturePlot

“Features”: Il2ra

FeaturePlot of Il2ra expression most highlky expressed in the lower island population (cluster 2). — **Figure 11**: FeaturePlot: Il2ra

It is a good idea, when analyzing your own data, to plot some markers of cell types you expect to be present. Later on we can also use these FeaturePlots to visualize manual annotation of clusters.

Differential Expression Testing: Finding Markers

Because each cluster of cells was grouped based on similar transcriptome profiles, each cluster will inherently differ from one another based on a set of “marker” genes.

Following an initial look at the DimPlots and FeaturePlots, we can take an even closer look at which genes are driving the clustering.

To do so, we’ll run Seurat’s FindMarkers function, which will compare each identity (in this case cluster) against every other identity within its class (all the other clusters). This function of marker finding is particularly useful in identifying up, or down, regulated genes that drive differences in identity/cluster.

Hands On: Find Markers

Run Seurat FindMarkers ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Final Preprocessed Seurat Object (output of Seurat UMAP tool)

The marker list that has been output by this tool will be useful to us shortly for identifying which cells types are represented by the various clusters.\

Comment: On Finding Markers

Differential expression can be run using any identity class and any two identities within the same class. The tools as of current do not allow for such direct comparisons (but will soon!) and I implore you to try Filter, Plot, and Explore single cell RNA-seq data (Seurat, R) if you hope to conduct marker identification across, say, genotypes.

Biological Interpretations

Now it’s the fun bit! We can see where genes are expressed, and start considering and interpreting the biology of it. At this point, it’s really about what information you want to get from your data–the following is only the tip of the iceberg. However, a brief exploration is good, because it may help give you ideas going forward for your own data. Let’s start interrogating our data!

Let’s take another look at what our clusters look like:

Hands On: Plot UMAP

Run Plot with Seurat ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Final Preprocessed Seurat Object (output of Seurat UMAP tool)

“Plot_type_selector”: DimPlot

“Group by”: RNA_nn_res.0.5

DimPlot colored by 0.5 resolution cluster showing 8 total clusters. — **Figure 12**: DimPlot colored by 0.5 resolution cluster.

Comment: On Cluster Numbering

Note that Seurat’s cluster numbering is based on size alone, so clusters 0 and 1 are not necessarily related, they are just the clusters containing the most cells.

It would be nice to know what these cells are. This analysis (googling all of the marker genes, both checking where the ones you know are and then going through marker tables we generated) is a fun task for any individual experiment, so we’re going to speed past that and nab the assessment from the original paper!

Here are the markers per cell type that the paper uses to classify the avrious cell types which are expected to be present in the data:

Markers	Cell Type
Il2ra	Double negative (early T-cell)
Cd8b1, Cd8a, Cd4	Double positive (middle T-cell)
Cd8b1, Cd8a, Cd4 - high	Double positive (late middle T-cell)
Itm2a	Mature T-cell

We can plot these markers as a means of discerning which cluster might be representing a given cell type. Let’s start with the Mature T-cell marker Itm2a:

Hands On: Plot Itm2a

Run Plot with Seurat ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Final Preprocessed Seurat Object (output of Seurat UMAP tool)

“Plot_type_selector”: FeaturePlot

“Features”: Itm2a

FeaturePlot of Itm2a expression most highly localized to cluster 3. — **Figure 13**: FeaturePlot of Itm2a

We can see that Cluster 3 seems to be most highly expressing this gene, and when we look back at the marker list we created earlier, Itm2a appears as a marker for cluster 3 with an average log fold change of 3! Therefore, we can quite conficently say that cluster 3 is likely to be representing Mature T-cells!

Now what about the opposite end of the spectrum: the double negative early T-cells? We can see from our table about that Il2ra is a known marker of this cell type, let’s see where it’s most strongly expressed in the data:

Hands On: Plot Il2ra

Run Plot with Seurat ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Final Preprocessed Seurat Object (output of Seurat UMAP tool)

“Plot_type_selector”: FeaturePlot

“Features”: Il2ra

FeaturePlot of Il2ra expression mainly in cluster 2. — **Figure 14**: FeaturePlot of Il2ra

Il2ra expression seems to be most prominent in Cluster 2 and as seen in our marker list, has a fold change of 2.79!

It can, and should, act as a sanity check that the most (and least) differentiated cell types expected in the data appear as two “island” clusters.

Now for the intermediate populations–which may be a bit more tricky to deconvolute. Based on our known markers, we see that both the double-positive populations express many of the same genes: Cd8b1, Cd8a, and Cd4. Let’s make sure that the remaining clusters (0, 1, 3, 4, 5, 6, and 7) all express these:

Hands On: Plot Cd8b1

Run Plot with Seurat ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Final Preprocessed Seurat Object (output of Seurat UMAP tool)

“Plot_type_selector”: FeaturePlot

“Features”: Cd8b1

FeaturePlot of Cd8b1 expression expressed across each of the "body" clusters (0, 1, 4, 5, 6, and 7). — **Figure 15**: FeaturePlot of Cd8b1

It looks like clusters 1, 4, 5, and 6 pretty strongly express Cd8b1, now what about Cd8a?

Hands On: Plot Cd8a

Run Plot with Seurat ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Final Preprocessed Seurat Object (output of Seurat UMAP tool)

“Plot_type_selector”: FeaturePlot

“Features”: Cd8a

FeaturePlot of Cd8a expression in the same clsuters as Cd8b1 (0, 1, 4, 5, 6, and 7). — **Figure 16**: FeaturePlot of Cd8a

This looks pretty consistent with the Cd8b1 plot, which is expected as these are markers of a double positive population! The true discernment between the two populations will be the level of Cd4 expression:

Hands On: Plot Cd4

Run Plot with Seurat ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Final Preprocessed Seurat Object (output of Seurat UMAP tool)

“Plot_type_selector”: FeaturePlot

“Features”: Cd4

FeaturePlot of Cd4 expression most highly expressed in cluster 1 and lowly expressed in clusters 4, 5, 6, and 7. — **Figure 17**: FeaturePlot of Cd4

Looks like the top portion of these clusters, and mainly cluster 1 hold the vast majority of Cd4 expression. This, coupled with Cd4 being identified as a marker of cluster 1 in our marker table above tells us pretty confidently that the upper part of this “body” cluster are likely the late middle t-cells!

Here is a breakdown of the cell type labeling we just accomplished:

Clusters	Markers	Cell Type
2	Il2ra	Double negative (early T-cell)
0	Il2ra, Cd8b1, Cd8a	[maybe] Early transitioning T-cell
4, 5, 6, 7	Cd8b1, Cd8a, Cd4	Double positive (middle T-cell)
1	Cd8b1, Cd8a, Cd4 - high	Double positive (late middle T-cell)
3	Itm2a	Mature T-cell

Now we can begin to feel a bit more oriented in exploring our data. The clusters are labelled with cell types, and our object has been processed enough such that we may now begin to answer some real biological questions! Now that we know what we’re dealing with, let’s examine the effect of our variable, real science!

Keep Digging

Are there any differences in genotype? Or in biological terms, is there an impact of growth restriction on T-cell development in the thymus? We can begin to answer this question visually by using the “split.by” parameter in Seurat’s plot functions.

Hands On: Visualize Counts Split by Genotype

Run Plot with Seurat ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Final Preprocessed Seurat Object (output of Seurat UMAP tool)

“Plot_type_selector”: VlnPlot

“Features”: nCount_RNA

“Group by”: RNA_nn_res.0.5

“Split by”: Sample.Characteristic.genotype.

“Log”: Yes

“Fill by”: ident

VlnPlot colored by cluster and split by genotype showing the mutant clusters appearing more sparse than the control. — **Figure 18**: Violin Plot colored by assigned clusters & split by genotype

We can see that there seems to be a decrease in cellcounts across the celltypes in the het mutant… INTERESTING! What next? We might look further at the transcripts present in both those populations, and perhaps also look at the genotype marker table… So much to investigate! But before we set you off to explore to your heart’s delight, let’s also look at this a bit more technically.

Technical Assessment

Is our analysis real? Is it right? Well, we can assess that a little bit.

First thing’s first, is there a batch effect?

Hands On: Visualize Counts Split by Individual

Run Plot with Seurat ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Final Preprocessed Seurat Object (output of Seurat UMAP tool)

“Plot_type_selector”: DimPlot

“Group by”: RNA_nn_res.0.5

“Split by”: Sample.Characteristic.individual.

DimPlot colored by labelled celltype split by individual/batch showing that individual 3 and 4 contribute the vast majority of immature T-cells. — **Figure 19**: DimPlot colored by assigned clusters split by individual/batch

While some differences across batch are expected and nothing to be concerned about, the immature T-cells looks to be mainly comprised of Individuals 3 and 4. There might be a bit of batch effect, so you could consider using batch correction on this dataset. However, if we focus our attention on the other cluster - mature T-cells - where there is batch mixing, we can still assess this biologically even without batch correction.

Additionally, we will also look at the confounding effect of sex:

Hands On: Visualize Counts Split by Sex

Run Plot with Seurat ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Final Preprocessed Seurat Object (output of Seurat UMAP tool)

“Plot_type_selector”: DimPlot

“Group by”: RNA_nn_res.0.5

“Split by”: Sample.Characteristic.sex.

DimPlot colored by cluster and split by sex showing that there are fewere female cells (a product of fewer female samples being present in the data as a whole). — **Figure 20**: DimPlot colored by assigned clusters and split by Sex

We note that the one female sample - unfortunately one of merely three knockout samples - seems to be distributed in the same areas as the knockout samples at large, so luckily, this doesn’t seem to be a confounding factor and we can still learn from our data. Ideally, this experiment would be re-run with either more female samples all around or swapping out this female from the male sample.

Are there any clusters or differences being driven by sequencing depth, a technical factor?

Hands On: Plot Counts

Run Plot with Seurat ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Final Preprocessed Seurat Object (output of Seurat UMAP tool)

“Plot_type_selector”: FeaturePlot

“Features”: nCount_RNA

FeaturePlot colored by counts showing no differences in sequencing depth across the dataset. — **Figure 21**: FeaturePlot colored by counts

There doesn’t visually appear to be any differences in sequencing depth across the clusters, but let’s check out some of those other variables we grouped by:

Hands On: Plot Features

Run Plot with Seurat ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Final Preprocessed Seurat Object (output of Seurat UMAP tool)

“Plot_type_selector”: FeaturePlot

“Features”: nCount_RNA

“split.by”: Sample.Characteristic.individual.

FeaturePlot colored by counts & Split by Individual showing that Individual 3 seems to have been most deeply sequenced. — **Figure 22**: FeaturePlot colored by counts & split by individual

There we go! This might explain the dramatic shift in early to middle T-Cell between wildtype and knockout cells–the leftmost early to middle T-cells simply have a higher sequencing depth represented by Individual 3 (UMIs/cell) than the ones on the right side. Well, that explains some of the sub-cluster that we’re seeing in that splurge (specifically this likely accounts for the discernment between clusters 0, 4, 5, 6, and 7).

Luckily, and importantly, we don’t see the double negative or mature T-cells being similarly affected. So, although, this variable of sequencing depth, or moreso, Individual, might be something to regress out somehow, it doesn’t seem to be impacting our dataset such that we cannot draw meaningful insights.

The less you can regress/modify your data, in general, the better–you want to stay as true as you can to the raw data, and only use maths to correct your data when you really need to (and not to create insights where there are none!).

Do you think we processed these samples well enough? We have seen in the previous images that these clusters are not very tight or distinct, so we could consider stronger filtering. Let’s take a look at gene expression of a gene we know should not be expressed in tCells as a sanity check:

Hands On: Plot Hba-a1

Run Plot with Seurat ( Galaxy version 4.0.4+galaxy0) with the following parameters:

“RDS file”: Final Preprocessed Seurat Object (output of Seurat UMAP tool)

“Plot_type_selector”: FeaturePlot

“Features”: Hba-a1

FeaturePlot of Hba-a1 showing sparse expression across the whole dataset. — **Figure 23**: FeaturePlot of Hba-a1

Hemoglobin–a red blood cell marker that should NOT be found in T-cells–appears throughout the entire dataset in low numbers and as a likely marker of Cluster 6. This suggests that some background noise may have been introduced by the media the cells were in. We might consider in the wet lab trying to get a purer, happier sample, with less background or in the dry lab, we can take advantage of techniques such as SoupX or others to remove this technical noise.

Adjusting the filtering settings (increasing minimum counts/cell, etc.) is often the place to start in these scenarios.

Do you think the clustering is appropriate? i.e. are there single clusters that you think should be separate, and multiple clusters that could be combined?

Important to note, lest all bioinformaticians combine forces to attack the biologists: just because a cluster doesn’t look like a cluster by eye is NOT enough to say it’s not a cluster! But looking at the biology here, we struggled to find marker genes to distinguish the double positive populations, which we know are also affected by depth of sequencing. That’s a reasonable argument that Clusters 0, 4, 5, 6, and 7 might not be all that different. Maybe we need more depth of sequencing across all those cells, or to compare these explicitly to each other (consider variations on FindMarkers!).

However, the late double positive cluster is both seemingly leaving the larger body of clusters and also has fewer knockout cells, so we might go and look at what those cells are expressing in the marker genes. If we look at the mature T-cells further, we can see that their marker gene–Itm2a–is only expressed in half of the cluster. You might consider sub-clustering this to investigate further, either through changing the resolution or through analysing this cluster alone.

If we look at the differences between genotypes alone (so the pseudo-bulk), we can see that many, if not most, of the genes in that list are actually ribosomal. This could be a housekeeping background, it might be cell cycle related, it may be biological, or some combination of all three. You might consider investigating the cycling status of the cells, or even regressing this out (which is what the authors did).

Ultimately, there are quite a lot ways to analyse your single-cell data, both within the confines of this tutorial (the many parameters that could be changed throughout) and outside of it (batch correction, sub-clustering, cell-cycle scoring, inferred trajectories, etc.) Most analyses will still yield the same general output, though: there are fewer knockout cells in the mature T-cell population, suggesting some sort of abberant development of T-cells in the Igf2-p0 hets.

congratulations Congratulations! You have interpreted your plots in several important ways! You might want to consult your results with this control history, and check out the workflow for this tutorial.

You've Finished the Tutorial

Frequently Asked Questions

Have questions about this tutorial? Have a look at the available FAQ pages and support channels

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

References

Trapnell, C., D. Cacchiarelli, J. Grimsby, P. Pokharel, S. Li et al., 2014 The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nature Biotechnology 32: 381–386. 10.1038/nbt.2859
Satija, R., J. A. Farrell, D. Gennert, A. F. Schier, and A. Regev, 2015 Spatial reconstruction of single-cell gene expression data. Nature Biotechnology 33: 495–502. 10.1038/nbt.3192
McCarthy, D. J., K. R. Campbell, A. T. L. Lun, and Q. F. Wills, 2017 Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33: 10.1093/bioinformatics/btw777
Bacon, W. A., R. S. Hamilton, Z. Yu, J. Kieckbusch, D. Hawkes et al., 2018 Single-Cell Analysis Identifies Thymic Maturation Delay in Growth-Restricted Neonatal Mice. Frontiers in Immunology 9: 10.3389/fimmu.2018.02523
Wolf, F. A., P. Angerer, and F. J. Theis, 2018 SCANPY: large-scale single-cell gene expression data analysis. Genome Biology 19: 10.1186/s13059-017-1382-0

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Camila Goclowski, Filter, plot, and explore single cell RNA-seq data with Seurat (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scrna-case_FilterPlotandExplore_SeuratTools/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{single-cell-scrna-case_FilterPlotandExplore_SeuratTools,
author = "Camila Goclowski",
	title = "Filter, plot, and explore single cell RNA-seq data with Seurat (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scrna-case_FilterPlotandExplore_SeuratTools/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Congratulations on successfully completing this tutorial!

Do you want to extend your knowledge?
Follow one of our recommended follow-up trainings:

tutorial Hands-on: Inferring single cell trajectories with Monocle3

slides Slides: GO Enrichment Analysis on Single-Cell RNA-Seq Data

tutorial Hands-on: GO Enrichment Analysis on Single-Cell RNA-Seq Data

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/single-cell/tutorials/scrna-case_FilterPlotandExplore_SeuratTools/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: retrieve_scxa
  owner: ebi-gxa
  revisions: 72b6e1747e37
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: seurat_filter_cells
  owner: ebi-gxa
  revisions: cd9f564bd63f
  tool_panel_section_label: Seurat
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: seurat_find_clusters
  owner: ebi-gxa
  revisions: d541e9e62beb
  tool_panel_section_label: Seurat
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: seurat_find_markers
  owner: ebi-gxa
  revisions: 07fc9f4841b3
  tool_panel_section_label: Seurat
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: seurat_find_neighbours
  owner: ebi-gxa
  revisions: ba897c57bdd9
  tool_panel_section_label: Seurat
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: seurat_find_variable_genes
  owner: ebi-gxa
  revisions: 948fc0276b4f
  tool_panel_section_label: Seurat
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: seurat_normalise_data
  owner: ebi-gxa
  revisions: 3f49bc9c9213
  tool_panel_section_label: Seurat
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: seurat_plot
  owner: ebi-gxa
  revisions: 95d79f1134f0
  tool_panel_section_label: HCA-Scanpy
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: seurat_read10x
  owner: ebi-gxa
  revisions: f5db0e853f57
  tool_panel_section_label: Seurat
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: seurat_run_pca
  owner: ebi-gxa
  revisions: 8ded486b0014
  tool_panel_section_label: Seurat
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: seurat_run_umap
  owner: ebi-gxa
  revisions: b9424c715a0d
  tool_panel_section_label: HCA-Scanpy
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: seurat_scale_data
  owner: ebi-gxa
  revisions: 5e9ba303f9e1
  tool_panel_section_label: Seurat
  tool_shed_url: https://toolshed.g2.bx.psu.edu/

t{ hist[0] | to_stars }} 1

September 2024

4 stars: Disliked: I think the image plotting is a bit tricky. Especially, FeaturePlot colored by counts & split by individual, the component graphs so thin