Pseudobulk Analysis with Decoupler and EdgeR
Author(s) |
![]() |
Editor(s) |
![]() |
Reviewers |
|
OverviewQuestions:
Objectives:
How does pseudobulk analysis help in understanding cell-type-specific gene expression changes?
What steps are required to prepare single-cell data (e.g., clustering, annotation, and metadata addition) for pseudobulk analysis?
How can we use pseudobulk data prepared with Decoupler to perform differential expression analysis using edgeR in Galaxy?
Requirements:
Understand the principles of pseudobulk analysis in single-cell data
Understand and generate the pseudobulk expression matrix with Decoupler
Perform differential expression analysis using edgeR
- Introduction to Galaxy Analyses
- slides Slides: Clustering 3K PBMCs with Scanpy
- tutorial Hands-on: Clustering 3K PBMCs with Scanpy
Time estimation: 4 hoursSupporting Materials:
- Datasets
- Workflows
- galaxy-history-answer Answer Histories
- FAQs
- video Recordings
- instances Available on these Galaxies
Published: Feb 12, 2025Last modification: May 9, 2025License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MITpurl PURL: https://gxy.io/GTN:T00514version Revision: 8
Pseudobulk analysis is a powerful technique that bridges the gap between single-cell and bulk RNA-seq data. It involves aggregating gene expression data from groups of cells within the same biological replicate, such as a mouse or patient, typically based on clustering or cell type annotations (Murphy and Skene 2022).
A key advantage of this approach in differential expression (DE) analysis is that it avoids treating individual cells as independent samples, which can underestimate variance and lead to inflated significance or overly optimistic p-values (Squair et al. 2021). This occurs because cells from the same biological replicate are inherently more similar to each other than cells from different samples. By grouping data into pseudobulk samples, the analysis aligns with the experimental design, as in bulk RNA-seq, leading to more reliable and robust statistical results (Murphy and Skene 2022).
Beyond enhancing statistical validity, pseudobulk analysis enables the identification of cell-type-specific gene expression and functional changes across biological conditions. It balances the detailed resolution of single-cell data with the statistical power of bulk RNA-seq, providing insights into the functional transcriptomic landscape relevant to biological questions. Overall, for DE analysis in multi-sample single-cell experiments, pseudobulk approaches demonstrate superior performance compared to single-cell-specific DE methods (Squair et al. 2021).
In this tutorial, we will guide you through a pseudobulk analysis workflow using the Decoupler (Mompel et al. 2022) and edgeR (Liu et al. 2015) tools available in Galaxy. These tools facilitate functional and differential expression analysis, and their output can be integrated with other Galaxy tools to visualize results, such as creating Volcano Plots, which we will also cover in this tutorial.
AgendaIn this tutorial, we will cover:
Get the Data!
Overview of the Data
The dataset used in this tutorial is an AnnData
object containing a subset of single-cell RNA-seq data, specifically including non-classical monocytes (NCMs) and plasmacytoid dendritic cells (pDCs) from bone marrow and peripheral blood samples. This subset was selected to illustrate differential gene expression between two immune cell types across two biologically distinct tissues. A portion of the data corresponding to bone marrow was derived from a published study Rettkowski et al. 2025.
In this tutorial, we will focus on these two cell populations and explore how their expression profiles differ between bone marrow and blood using pseudobulk analysis. The dataset includes three bone marrow samples and seven blood samples.
Pseudobulk analysis is an advanced approach in single-cell data analysis. It involves aggregating single-cell expression data by group (e.g., by cell type and condition) to perform bulk-style analyses while preserving single-cell resolution through metadata.
For this tutorial, we assume familiarity with standard single-cell data formats, such as AnnData
or Seurat
objects, and prior experience with single-cell workflows, particularly clustering and cell type annotation. We will use the decoupler
tool to perform the pseudobulk aggregation and downstream differential analysis Lab and Contributors.
If youâre new to these concepts, we recommend exploring the following tutorials before performing pseudobulk analysis:
- Clustering 3K PBMCs with Scanpy: Learn how to perform clustering and cell type annotation using Galaxyâs Scanpy and Anndata tool suites.
- Clustering 3K PBMCs with Seurat: Explore an alternative approach to clustering and annotation using Seurat within Galaxy.
- Combining single-cell datasets after pre-processing: Learn how to merge multiple single-cell datasets into one AnnData object and enrich it with experimental metadata.
The data object, which you will import from Zenodo into Galaxy via the provided link, has been preprocessed, analysed, and annotated. It includes the following key observation metadata:
- annotated: The assigned cell type.
- condition: The experimental condition or sample group.
- batch: The sample identifier.
- tissue: Indicates the tissue of origin (bone marrow or peripheral blood).
Data Upload
Hands On: Upload your data
- Create a new history for this tutorial and name it âPseudobulk DE Analysis with edgeRâ
Import the AnnData file from Zenodo:
https://zenodo.org/records/15275834/files/ncm_pdcs_subset.h5ad
- Copy the link location
Click galaxy-upload Upload Data at the top of the tool panel
- Select galaxy-wf-edit Paste/Fetch Data
Paste the link(s) into the text field
Press Start
- Close the window
As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:
- Go into Libraries (left panel)
- Navigate to the correct folder as indicated by your instructor.
- On most Galaxies tutorial data will be provided in a folder named GTN - Material â> Topic Name -> Tutorial Name.
- Select the desired files
- Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu
In the pop-up window, choose
- âSelect historyâ: the history you want to import the data to (or create a new one)
- Click on Import
- Rename the dataset: âAnnData for Pseudobulk.â
Ensure that the datatype is correct. It should be an AnnData object (
h5ad
).
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, click galaxy-chart-select-data Datatypes tab on the top
- In the galaxy-chart-select-data Assign Datatype, select
datatypes
from âNew typeâ dropdown
- Tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
Before we dive into the analysis, lets inspect our AnnData file
with tools in galaxy to get familiar with the structure of the data.
Hands On: Inspect AnnData Object
- Inspect AnnData ( Galaxy version 0.10.9+galaxy1) tool with the following parameters:
- param-file Annotated data matrix:
AnnData for Pseudobulk
(Input dataset obtained > from Zenodo)`- âWhat to inspect?â:
General information about the object
Question
- How many cells and genes are there in the
AnnData
?- What other information can be retrive by the
Inspect AnnData
tool?
[n_obs x n_vars] 9118 Ă 18926
There are 9,118 observations, representing cells, and 18,926 variables, representing genes expressed in total.
- When selecting General Inspection, you can view the labels for all entries in
[obs]
,[var]
,[obsm]
,[varm]
, and[uns]
. This includes the labels described above within the observation field, such asannotated
,condition
,batch
, andtissue
, which will be needed as inputs for the pseudobulk tool. To view more specific details in the AnnData object, select a different parameter under âWhat to inspect?â.
Generation of the Pseudobulk Count Matrix with Decoupler
In this step, our goal is to perform a âbioinformatic cell sortingâ based on the annotated clusters of the single-cell data.
To start a pseudobulk analysis, ensure that the AnnData
object contains all the necessary metadata for pseudobulking. This includes key annotations such as annotated
or cell_type
, condition
, disease
, and batch
. These labels may vary depending on how the dataset was originally annotated.
Most importantly, the AnnData
object must include a layer containing the raw gene expression counts. This layer is essential for accurate aggregation and downstream differential expression analysis. Raw counts are allows to generate accurate pseudobulk aggregates. Since single-cell data is typically normalized after annotation, itâs important to preserve the raw counts in the AnnData object before normalization steps if you would like to perform a pseudobulk analysis later on. These raw counts are directly used by tools like Decoupler to generate the pseudobulk count matrix. Note that normalized count matrices should not be used with Decoupler, even if the tool appears to process them successfully.
If your AnnData object lacks raw counts, you can use the AnnData Operations tool on the Single Cell Galaxy instance. This tool allows you to copy the
.X
matrix into a new layer that you may assign a new label, for e.g.,counts
orraw_counts
, making it available as a parameter for Decoupler.Importantly, ensure that the copying of this matrix as a raw counts layer is done carefully and correctly. To verify that your AnnData object contains the necessary raw counts layer, you can use the Inspect AnnData Object tool. This tool helps confirm the presence of raw counts and other essential metadata in your AnnData object before proceeding with pseudobulk analysis.
Hands On: Decoupler Pseudobulk
- Decoupler pseudo-bulk ( Galaxy version 1.4.0+galaxy8) tool with the following parameters:
- param-file Input AnnData file:
AnnData for Pseudobulk
(Input dataset obtained > from Zenodo)- âProduce a list of genes to filter out per contrast?â:
No
- âObs Fields to Mergeâ: (Leave empty if not applicable)
- âGroupby columnâ:
annotated
(Column containing cell type annotations)- âSample Key columnâ:
batch
(Column containing individual sample identifiers)- âLayerâ:
raw_counts
(Layer containing raw gene expression counts)- âFactor Fieldsâ:
tissue
(Column inadata.obs
specifying experimental factors. For edgeR, the first field should be the main contrast field, followed by covariates.)- âUse Rawâ:
No
- âProduce AnnData with Pseudo-bulkâ: (Optional, if yes, it will generate an h5ad output file with pseudobulks)
- âMinimum Cells::
10
- âProduce plots::
Yes
- âMinimum Countsâ:
10
- âMinimum Total Countsâ:
1000
- âMinimum Counts Per Gene Per Contrast Fieldâ:
20
(Genes with fewer counts in specific contrasts are flagged in a separate file but not excluded from the results.)- âEnable Filtering by Expressionâ:
Yes
- âPlot Samples Figsizeâ:
13 13
- âPlot Filtering Figsizeâ:
13 13
Comment: Performing DEG within ClustersImportant! The count matrix retrieved from this tool includes all of our samples
batch
aggregated byannotated
, which can be identified by the column headers. If you want to perform comparisons of conditions within clusters of each individual cell type, you will need to subset the relevant columns of the matrix and use them as your new count matrix. We will demonstrate this in the last section of this tutorial with detailed hands-on steps.
Question
- How many outputs does the Decoupler tool generate?
- What do the outputs represent? What is the interpretation of the plots?
- Which output(s) will we use in edgeR for differential expression analysis?
- The Decoupler tool generates multiple outputs, including:
- Pseudobulk Count Matrix (tabular file)
- Samples Metadata (Factor file) (tabular file)
- Genes Metadata (tabular file)
- Pseudobulk Plot (PNG format)
- Filter by Expression Plot (PNG format)
- Genes to Ignore by Contrast Field (tabular file)
- Pseudobulk AnnData file (If chosen, h5ad file)
- The output files contain the following:
- Pseudobulk Count Matrix: Contains the raw count aggregates for each pseudobulk sample.
- Samples Metadata (Factor File): Provides metadata annotated for each sample, including factors or annotations added to the AnnData object.
- Genes Metadata: Includes gene-related information such as gene symbols, Ensembl IDs, dispersion values, etc.
- Genes to Ignore: Lists of genes that could be excluded or should be carefully considered for specific contrasts. This file contains a contrast field and the corresponding genes written as gene symbols.
- Pseudobulk Plot: A visual representation of the pseudobulk data.
- Filter by Expression Plot: Illustrates the expression filtering applied to the data.
- Pseudobulk AnnData file: An AnnData file that contains the aggregated pseudobulks.
- The pseudobulk count matrix is the primary input required for analysis using edgeR, a tool designed for differential expression analysis. The Samples Metadata is another file that will serve as an input for the edgeR tool.
Sanitation Steps - Part 1
The next steps will help you refine your data for easier handling. We will use standard galaxy tools, like: Replace Text, Remove columns and Text reformatting.
Hands On: Replace Text and Remove Columns
- Replace Text ( Galaxy version 9.3+galaxy1) with the following parameters:
- param-file âFile to processâ:
count_matrix
(output of Decoupler pseudo-bulk tool)- In âReplacementâ:
- param-repeat âInsert Replacementâ
- âFind patternâ:
[ --+*^]+
- âReplace with:â:
_
- Remove columns ( Galaxy version 1.0) with the following parameters:
- param-file âTabular fileâ:
genes_metadata
(output of Decoupler pseudo-bulk tool)- In âSelect Columnsâ:
- param-repeat âInsert Select Columnsâ
- âHeader nameâ:
start
- param-repeat âInsert Select Columnsâ
- âHeader nameâ:
end
- param-repeat âInsert Select Columnsâ
- âHeader nameâ:
width
- Replace Text ( Galaxy version 9.3+galaxy1) with the following parameters:
- param-file âFile to processâ:
samples_metadata
(output of Decoupler pseudo-bulk tool)- In âReplacementâ:
- param-repeat âInsert Replacementâ
- âFind patternâ:
[ --+*^]+
- âReplace with:â:
_
- Replace Text ( Galaxy version 9.3+galaxy1) with the following parameters:
- param-file âFile to processâ:
outfile
(output of Replace Text: Sample Metadata Step tool)- In âReplacementâ:
- param-repeat âInsert Replacementâ
- âin columnâ:
c2
- âFind patternâ:
^([0-9])(.+)
- âReplace withâ:
GG_\\1\\2
In the previous steps, the following modifications were made to the files:
- Replace Text: Replaces special characters (
[ --+*^]+
) with underscores (_
) in thecount_matrix
file. Rename output to Count matrix.- Remove Columns: Removes the
start
,end
, andwidth
columns from thegenes_metadata
file. Rename output to Genes Metadata.- Replace Text: Replaces special characters (
[ --+*^]+
) with underscores (_
) in thesamples_metadata
file.Replace Text in Column: Modifies values in column
c2
ofoutfile from step 3
by prefixingGG_
to numbers at the beginning of the string. Rename output to Samples MetadataRegular expressions are a standardized way of describing patterns in textual data. They can be extremely useful for tasks such as finding and replacing data. They can be a bit tricky to master, but learning even just a few of the basics can help you get the most out of Galaxy.
Finding
Below are just a few examples of basic expressions:
Regular expression Matches abc
an occurrence of abc
within your data(abc|def)
abc
ordef
[abc]
a single character which is either a
,b
, orc
[^abc]
a character that is NOT a
,b
, norc
[a-z]
any lowercase letter [a-zA-Z]
any letter (upper or lower case) [0-9]
numbers 0-9 \d
any digit (same as [0-9]
)\D
any non-digit character \w
any alphanumeric character \W
any non-alphanumeric character \s
any whitespace \S
any non-whitespace character .
any character \.
literal . (period) {x,y}
between x and y repetitions ^
the beginning of the line $
the end of the line Note: you see that characters such as
*
,?
,.
,+
etc have a special meaning in a regular expression. If you want to match on those characters, you can escape them with a backslash. So\?
matches the question mark character exactly.Examples
Regular expression matches \d{4}
4 digits (e.g. a year) chr\d{1,2}
chr
followed by 1 or 2 digits.*abc$
anything with abc
at the end of the line^$
empty line ^>.*
Line starting with >
(e.g. Fasta header)^[^>].*
Line not starting with >
(e.g. Fasta sequence)Replacing
Sometimes you need to capture the exact value you matched on, in order to use it in your replacement, we do this using capture groups
(...)
, which we can refer to using\1
,\2
etc for the first and second captured values. If you want to refer to the whole match, use&
.
Regular expression Input Captures chr(\d{1,2})
chr14
\1 = 14
(\d{2}) July (\d{4})
24 July 1984 \1 = 24
,\2 = 1984
An expression like
s/find/replacement/g
indicates a replacement expression, this will search (s
) for any occurrence offind
, and replace it withreplacement
. It will do this globally (g
) which means it doesnât stop after the first match.Example:
s/chr(\d{1,2})/CHR\1/g
will replacechr14
withCHR14
etc.You can also use replacement modifier such as convert to lower case
\L
or upper case\U
. Example:s/.*/\U&/g
will convert the whole text to upper case.Note: In Galaxy, you are often asked to provide the find and replacement expressions separately, so you donât have to use the
s/../../g
structure.There is a lot more you can do with regular expressions, and there are a few different flavours in different tools/programming languages, but these are the most important basics that will already allow you to do many of the tasks you might need in your analysis.
Tip: RegexOne is a nice interactive tutorial to learn the basics of regular expressions.
Tip: Regex101.com is a great resource for interactively testing and constructing your regular expressions, it even provides an explanation of a regular expression if you provide one.
Tip: Cyrilex is a visual regular expression tester.
Generating the Contrast File
This file will be used as the contrast input file in the edgeR tool.
Hands On: Creating a Contrast File for edgeR
- Text Reformatting ( Galaxy version 9.3+galaxy1) with the following parameters:
- param-file âFile to processâ:
Samples Metadata
(output generated from the Replace Text tool step).- âAWK Programâ:
BEGIN { print "header" } NR > 1 { if (!seen[$2]++) words[++count]=$2 } END { for (i=1; i<=count; i++) for (j=i+1; j<=count; j++) print words[i]"-"words[j] }
Rename to Contrast File
Comment: Explanation of the AWK ProgramThis AWK script performs the following:
- Initializes the
header
variable for the output file.- Processes the input file (
NR > 1
skips the header line).- Tracks unique elements in column 2 (
seen[$2]++
) to avoid duplicates.- Creates pairs of unique elements and calculates their difference, outputting the contrast for downstream edgeR analysis.
- Important! When using the âText reformatting with awkâ tool, make sure the code is written all in one line, without any line breaks. Otherwise, it wonât work as expected.
Question
- How does your contrast file look, and what contrast is generated by the tool?
- The contrast file is a simple tab-delimited text file. It contains:
- A header in the first row that labels the column.
- The contrast, such as
bonemarrow-pbmcs
, written in the second row.When working with your own data, the contrast file will look different, as it will reflect the specific contrasts in your dataset but the header should be the same
Differential Expression Analysis (DE) with edgeR
The tool edgeR is commonly used for differential expression (DE) analysis of pseudobulk aggregates in single-cell RNA-seq studies (Chen et al. 2024). It calculates gene dispersions based on the total counts for each gene and adjusts these estimates using an approach called empirical Bayes. This method improves reliability by pooling information across all genes in the dataset to stabilize dispersion estimates, particularly for genes with low counts or limited data. By âborrowing strengthâ from the behavior of other genes, empirical Bayes reduces the risk of overestimating or underestimating variability, ensuring more robust and accurate results (Chen et al. 2024). Differential expression is then evaluated using an exact test for the negative binomial distribution, which is similar to Fisherâs exact test but modified to handle the variability typically seen in RNA-seq data (Chen et al. 2024).
In addition to exact tests, edgeR employs generalized linear models (GLMs), which allow for the analysis of more complex experimental designs. GLMs model gene counts as a function of experimental conditions (e.g., treatment groups or time points) and estimate how these conditions influence gene expression. Likelihood-based methods, such as quasi-likelihood (QL) approaches, are central to this framework. Standard likelihood methods evaluate the fit of the model to the data, while quasi-likelihood methods add an extra layer by explicitly accounting for biological and technical variability, improving the reliability of DE results. These models allow edgeR to identify subtle expression differences while controlling for overdispersion in the data (Chen and Smyth 2016).
Several plots can be generated to assist in understanding the data and the results of the analysis, including Multidimensional scaling (MDS), Biological Coefficient of Variation plot (BCV), Quasi-Likelihood (QL), and Mean-Difference plot o MA (MD) plots. These visualizations provide insights into sample relationships, variability, and differential expression, and will be explained further in the tutorial. With these concepts in mind, letâs now perform our DE analysis using our edgeR tool in Galaxy!
Hands On: Run a DGE Analysis with edgeR
- edgeR ( Galaxy version 3.36.0+galaxy5) with the following parameters:
- âCount Files or Matrix?â:
Single Count Matrix
- param-file âCount Matrixâ:
Count Matrix
(output of Replace Text: Count Matrix tool)- âInput factor information from file?â:
Yes
- param-file âFactor Fileâ:
Samples Metadata
(output of Replace Text: Creating Factor File tool)- âUse Gene Annotations?â:
Yes
- param-file âGene Annotationsâ:
Genes Metadata
(output of Remove Columns: Gene Metadata tool)- âFormula for linear modelâ:
~ 0 + tissue
(Customize this formula based on your dataâs contrast names and factor file. Ensure the formula matches EdgeRâs syntax and uses only elements from the factor file.~ 0 + factor_Aor
~ 0 + factor_A + factor_B:factor_C)- âInput contrasts manually or through a file?â:
file
- param-file âContrasts Fileâ:
Contrast File
(output of Text Reformatting: Creating a Contrast File for edgeR tool)- In âFilter Low Countsâ:
- âFilter lowly expressed genes?â:
No
Comment: edgeR Tool OverviewUse this tool to perform DGE analysis using edgeR. It is highly customizable, allowing you to input your count matrix, factor information, and gene annotations prepared during earlier steps. The contrast file should be generated from the previous step, Text Reformatting: Creating a Contrast File for edgeR. The model formula should be defined based on your dataset and must include factors that were already specified in the contrast file.
Question
- What are the output(s) of the edgeR tool?
- How can we interpret our output result file?
- Output files:
- edgeR Tables: This file includes columns with information such as:
- Gene Symbol: The identifier for each gene.
- Log Fold Change (logFC): Represents the direction and magnitude of differential expression. Positive values indicate upregulation, and negative values indicate downregulation.
- Raw p-value (PValue): The unadjusted statistical significance value for each gene, indicating the likelihood of observing the result under the null hypothesis.
- False Discovery Rate (FDR): The adjusted p-value using the Benjamini-Hochberg correction method to account for multiple testing and control the expected proportion of false positives.
- (Optional): If additional outputs, such as ânormalized counts,â are selected in the edgeR tool, the table will also include these counts.
- edgeR Report: This is an HTML file containing various visualizations and summaries of the analysis, such as:
- MDS Plot (Multidimensional Scaling): Visualizes the relationships between samples based on gene expression profiles.
- BCV Plot (Biological Coefficient of Variation): Displays the estimated dispersion for each gene.
- QL Plot (Quasi-Likelihood): Shows the quasi-likelihood dispersions for the dataset.
- MD Plot (Mean-Difference Plot): Compares the mean expression of genes versus their log fold change.
- Interpreting the Results:
- The table contains the results of the differential expression analysis.
- The first column typically lists the gene symbols from the dataset.
- The logFC indicates the direction and magnitude of differential expression:
- Upregulated genes: Genes with higher expression in the first group of the contrast (e.g., âbone marrowâ group in a âbonemarrow-pbmcsâ contrast).
- Downregulated genes: Genes with lower expression in the first group of the contrast (e.g., âpbmcsâ group in a âbonemarrow-pbmcsâ contrast).
- The raw p-value represents the statistical significance of the result for each gene before adjustment for multiple comparisons. Lower values indicate stronger evidence against the null hypothesis.
- The FDR is the adjusted p-value, calculated using the Benjamini-Hochberg method, which helps control for false positives when testing many genes. Genes with an FDR below a threshold (e.g., 0.05) are considered statistically significant.
Plot Interpretations:
- MDS Plot: Displays relationships between samples based on gene expression profiles. Samples that cluster closely are more similar in their expression. Use this to identify whether samples separate by biological condition or to detect potential batch effects.
- BCV Plot: Shows the dispersion for each gene, with higher values indicating greater variability. This is useful for assessing how variability is modeled in the dataset.
- QL Plot: Highlights the quasi-likelihood dispersions, which represent variability modeled during statistical testing. Proper dispersion modeling ensures robust differential expression analysis.
- MD Plot: Visualizes the mean expression levels against log fold change for each gene. Genes far from the center indicate stronger differential expression, with points above or below the horizontal line showing upregulated or downregulated genes, respectively.
Sanitation Steps - Part 2
After performing the differential expression analysis with edgeR, we will clean the data to prepare it for visualization. This involves extracting dataset from a collection, removing unnecessary columns, standardizing text, and splitting the file if needed. We will use the Extract dataset, Remove columns ( Galaxy version 1.0).
Hands On
- Extract dataset with the following parameters:
- param-file âInput listâ:
Tables from edgeR data
(output from the edgeR tool tool)Extract dataset will allow us extract the table output from edgeR which was retrived as a collection.
- Remove columns ( Galaxy version 1.0) with the following parameters:
- param-file âTabular fileâ:
outTables
(output of edgeR tool)- In âSelect Columnsâ:
- param-repeat âInsert Select Columnsâ
- âHeader nameâ:
X
(In other datasets, the header of the GeneID columns may look different)- param-repeat âInsert Select Columnsâ
- âHeader nameâ:
logFC
- param-repeat âInsert Select Columnsâ
- âHeader nameâ:
PValue
- param-repeat âInsert Select Columnsâ
- âHeader nameâ:
FDR
- âKeep named columnsâ:
Yes
- âRename output file toâ:
edgeR_DEG_summary
Remove columns to filter out unnecessary columns from the edgeR output and retain only the essential ones for analysis: Gene ID (
id
), Log Fold Change (logFC
), P-value (PValue
), and False Discovery Rate (FDR
).
Volcano Plot
In this step, we will use the sanitized output from the previous steps to generate a Volcano Plot, which visualizes the relationship between statistical significance (P-value) and fold change (LogFC) for differentially expressed genes (DEGs). The input file for the Volcano Plot must include four essential columns: FDR (adjusted P-value), P-value (raw), Log Fold Change, and Gene Symbols (Labels). As long as these columns are present, the Volcano Plot can be generated successfully.
Hands On: Create a Volcano Plot of the DEG
- Volcano Plot ( Galaxy version 0.0.6) with the following parameters:
- param-file âSpecify an input fileâ:
edgeR_DEG_summary
(tabular file output from Remove Columns tool)- âFile has header?â:
Yes
- âFDR (adjusted P value) column numberâ:
c4
- âP value (raw) column numberâ:
c3
- âLog Fold Change column numberâ:
c2
- âLabels column numberâ:
c1
- âLogFC threshold to colourâ:
0.58
- âPoints to labelâ:
Significant
- âOnly label top most significantâ:
40
- In âPlot Optionsâ:
- âPlot titleâ:
Differential Expression Volcano Plot
The Volcano Plot highlights genes with statistical significance (low P-values) and large fold changes. Genes meeting the significance thresholds are typically colored and labeled for easier identification.
What are the results of the Volcano Plot?
Letâs take a moment to interpret the Volcano Plot:
Question
- What is the significance of genes located at the extremes of the plot (e.g., high LogFC and low P-value)?
- How many genes meet the significance threshold in this analysis?
- What do the differentially expressed genes reveal about immune cells in bone marrow versus peripheral blood?
- Genes at the extremes of the volcano plot are highly significant and show strong differential expression. Investigating their biological roles may provide valuable insights into the underlying mechanisms.
- This analysis identified a substantial number of differentially expressed genes (DEGs), highlighting distinct expression profiles between immune cells from bone marrow and peripheral blood. Specifically, 188 genes were upregulated in bone marrow and 136 in PBMCs.
- These findings reveal tissue-specific immune cell behaviour, pointing to transcriptional specialization between bone marrow and peripheral blood compartments.
Subsetting Samples from the Original AnnData Object
In our initial analysis, we observed transcriptional differences between bone marrow and peripheral blood samples. However, since the dataset includes more than one immune cell type (pDCs and NCMs), itâs not immediately clear which population is primarily driving these differences.
To gain further insight, itâs helpful to perform pseudobulk analysis on each cell type separately. By subsetting the data by cluster and repeating the differential analysis (as we did with the full dataset at the start of this tutorial), we can better assess the individual contribution of each cell type.
In the following step, weâll demonstrate how to subset the AnnData object by cell type before reapplying the pseudobulk workflow. In the hands-on example, we will filter for pDCs. To extract NCMs instead, you can follow the exact same steps, but set the âValueâ parameter to Non_Classical_Monocyte
, which corresponds to the annotation of that cell type.
If you would like to extract all annotated clusters at once, for example to analyse each of them independently, refer to the tip box below titled âSplit AnnData object by cluster or other observation key into a collectionâ.
Extracting observations of interest as AnnData object
Hands On: Use Manipulate AnnData Tools to extract observations
- Scanpy filter ( Galaxy version 1.10.2+galaxy3) with the following parameters:
- âAnnotated data matrixâ:
AnnData for Pseudobulk
(your preprocessed, analysed, and annotated AnnData object)- âMethod used for filteringâ:
Filter on any column of observations or variables
- âWhat to filter?â:
Observations (obs)
(select this to filter cells)- âType of filtering?â:
By key (column) values
- âKey to filterâ:
annotated
(the column that contains the cell type annotations)- âType of value to filterâ:
Text
- âFilterâ:
equal to
- âValueâ:
pDC
(the cluster name for the cell type you want to extract)Comment: Pre-extracted clusters also availableIn addition to performing the filtering yourself, the
pDCs
andMonocytes_NC
clusters have also been pre-extracted and are available for direct download.
This allows you to jump ahead and run the downstream workflows independently for each cluster, as described later in this tutorial.Import the pre-filtered datasets from Zenodo:
https://zenodo.org/records/15275834/files/Monocytes_NC_subset.h5ad https://zenodo.org/records/15275834/files/pDCs_subset.h5ad
After using the Scanpy filter ( Galaxy version 1.10.2+galaxy3) to extract the cell type of interest, return to the Pseudobulk with Decoupler step at the beginning of this tutorial. You can now repeat the same steps using this smaller AnnData object, which contains only the selected cell type (e.g., pDCs). The resulting analysis will reveal differential gene expression between conditions (e.g., COVID-19 vs. healthy) for that specific cell type only.
Question
- What data is included in the new pseudobulk count matrix for the pDCs only? How is this matrix structured, and what do the column labels represent?
- How many samples are included in the pDCs AnnData? How many of these are from bone marrow and how many from peripheral blood?
- After performing pseudobulk analysis on pDCs and NCMs independently, how do the volcano plots compare? Do they reveal differentially expressed genes between tissues? If so, which cell type appears to drive the tissue-specific differences?
- The new count matrix includes the original 1,373 rows, each representing a gene, with gene identifiers listed in the first column. The remaining columns correspond to individual samples, such as bmacute_pDC and pbmcsacuteone_pDC.
- This dataset includes a total of eleven samples: three from bone marrow and eight from peripheral blood.
- The analysis of pDCs and NCMs reveals distinct patterns of differentially expressed genes (DEGs) between tissues. Based on the results, NCMs show stronger transcriptional differences and appear to be the main contributors to the tissue-specific expression patterns observed in the full dataset.
You can split an
AnnData
object into multiple objects at once based on the values of a givenobs
key using the: Manipulate AnnData ( Galaxy version 0.10.9+galaxy1).This tool automatically creates a collection containing one
AnnData
object for each unique value in the selected observation key.For example, to split the dataset by cluster annotation, you can use the key
louvain
, or in the case of this tutorial,annotated
. Each resulting element in the collection will correspond to a different cluster or cell type, which can then be analysed independently.
Recommendations
- Data Preprocessing:
- Ensure raw counts are preserved in your AnnData object before pseudobulk analysis.
- Verify the presence of important metadata fields to ensure meaningful grouping and contrasts.
- Focus on Subsets:
- Analyse subsets of cell types or clusters for more granular insights. For example, pseudobulk analysis focused on specific immune cell populations can reveal subtle but important expression changes.
- Quality Control:
- Always inspect the outputs of Decoupler (e.g., pseudobulk plots and filtering summaries) to confirm data integrity and address outliers.
- Use edgeRâs diagnostic plots (e.g., MDS, BCV) to check for batch effects or unexpected variability.
- Follow-Up Analysis:
- Consider functional analysis (e.g., GO enrichment, pathway analysis) for the differentially expressed genes to link expression changes to biological processes.
- Use visualization tools, such as heatmaps or boxplots, to further explore expression trends in key genes.
- Future Applications:
- Pseudobulk analysis can be extended to multi-condition experiments, time-course data, or other cell-type-specific investigations.