Single-cell ATAC-seq standard processing with SnapATAC2
Author(s) | Timon Schlegel |
Editor(s) | Pavankumar Videm Björn Grüning |
Tester(s) | Pavankumar Videm |
Reviewers |
OverviewQuestions:Objectives:
What does ATAC-seq data tell us about the cell?
Which steps are necessary to cluster the cells of single-cell ATAC-seq data?
Why is dimension reduction important for the analysis of single-cell data?
Requirements:
Learn how single-cell ATAC-seq data is processed
Create a count-matrix from a 10X fragment file
Perform filtering, dimension reduction and clustering on AnnData matrices
Generate and filter a cell-by-gene matrix
Identify marker genes for the clusters and annotate the cell types
- Introduction to Galaxy Analyses
- tutorial Hands-on: Pre-processing of 10X Single-Cell ATAC-seq Datasets
Time estimation: 4 hoursLevel: Intermediate IntermediateSupporting Materials:Published: Jul 11, 2024Last modification: Aug 9, 2024License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MITpurl PURL: https://gxy.io/GTN:T00445version Revision: 2
Single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) analysis is a method to decipher the chromatin states of the analyzed cells. In general, genes are only expressed in accessible (i.e. “open”) chromatin and not in closed chromatin. By analyzing which genomic sites have an open chromatin state, cell-type specific patterns of gene accessibility can be determined. scATAC-seq is particularly useful for analyzing tissue containing different cell populations, such as peripheral blood mononuclear cells (PBMCs).
In this tutorial we will analyze scATAC-seq data using the tool suites SnapATAC2 (Zhang et al. 2024) and Scanpy (Wolf et al. 2018). With both of these tool suites we will perform preprocessing, clustering and identification of scATAC-seq datasets from 10x Genomics. The analysis will be performed using a dataset of PBMCs containing ~4,620 single nuclei.
Did you know we have a unique Single Cell Omics Lab with all our single cell tools highlighted to make it easier to use on Galaxy? We recommend this site for all your single cell analysis needs, particularly for newer users.
The Single Cell Omics Lab currently uses the main European Galaxy infrastructure and power, it’s just organised better for users of particular analyses…like single cell!
Try it out!
- subdomain Europe: Single Cell Omics Lab
- subdomain USA: Single Cell Omics Lab
- subdomain Australia: Single Cell Omics Lab
Tools are frequently updated to new versions. Your Galaxy may have multiple versions of the same tool available. By default, you will be shown the latest version of the tool. This may NOT be the same tool used in the tutorial you are accessing. Furthermore, if you use a newer tool in one step, and try using an older tool in the next step… this may fail! To ensure you use the same tool versions of a given tutorial, use the Tutorial mode feature.
- Open your Galaxy server
- Click on the curriculum icon on the top menu, this will open the GTN inside Galaxy.
- Navigate to your tutorial
- Tool names in tutorials will be blue buttons that open the correct tool for you
- Note: this does not work for all tutorials (yet)
- You can click anywhere in the grey-ed out area outside of the tutorial box to return back to the Galaxy analytical interface
Warning: Not all browsers work!
- We’ve had some issues with Tutorial mode on Safari for Mac users.
- Try a different browser if you aren’t seeing the button.
CommentThis tutorial is significantly based on “Standard pipeline” tutorial from SnapATAC2. It can be seen as the scATAC-seq counterpart to the scRNA-seq tutorial Clustering 3K PBMCs with Scanpy.
AgendaIn this tutorial, we will cover:
scATAC-seq with 10X Genomics
ATAC-seq utilizes a hyperactive Tn5 transposase (Kia et al. 2017) to ligate adaptors to genome fragments, created by the transposase. Performing ATAC-seq on individual cells used to be an expensive and time-consuming labour. The 10X Chromium NextGEM system made scATAC-seq a cost-effective method for gaining high-resolution data with a simple protocol. After the transposition of nuclei in bulk, individual nuclei are put into Gel beads in Emulsion (GEM), containing unique 10x cell barcodes and sequencing adaptors for Illumina sequencing.
Data
The 5k PBMCs dataset for this tutorial is available for free from 10X Genomics. The blood samples were collected from a healthy donor and were prepared following the Chromium Next GEM scATAC-seq protocol. After sequencing on Illumina NovaSeq, the reads were processed by the Cell Ranger ATAC 2.0.0 pipeline from 10X to generate a Fragments File.
The Fragments File is a tabular file in a BED-like format, containing information about the position of the fragments on the chromosome and their corresponding 10x cell barcodes.
SnapATAC2 requires 3 input files for the standard pathway of processing:
fragments_file.tsv
: A tabular file containing the chromosome positions of the fragments and their corresponding 10x cell barcodes.chrom_sizes.txt
: A tabular file of the number of bases of each chromosome of the reference genomegene_annotation.gtf.gz
: A tabular file listing genomic features of the reference genome (GENCODE GRCh38)
- A chromosome sizes file can be generated using the tool Compute sequence length ( Galaxy version 1.0.3).
- The reference genome can either be selected from cached genomes or uploaded to the galaxy history.
Comment
- This tutorial starts with a
fragment
file.- SnapATAC2 also accepts mapped reads in a
BAM
file.- To learn how to get a
fragment
file orBAM
file from raw.FASTQ
-reads, please check out the tutorial “Pre-processing of 10X Single-Cell ATAC-seq Datasets”- If you would like to start the analysis with a
BAM
file, you can expand the details section “Details: Creating a fragment file”.
Get Data
Hands-on: Data upload
- Create a new history for this tutorial
Import the
fragments_file.tsv
,chrom_sizes.txt
andgene_annotation.gtf.gz
from Zenodo or from the shared data libraryhttps://zenodo.org/records/12707159/files/atac_pbmc_5k_nextgem_fragments.tsv.gz https://zenodo.org/records/12707159/files/chrom_sizes.txt https://zenodo.org/records/12707159/files/gencode.v46.annotation.gtf.gz
- Copy the link location
Click galaxy-upload Upload Data at the top of the tool panel
- Select galaxy-wf-edit Paste/Fetch Data
Paste the link(s) into the text field
Press Start
- Close the window
As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:
- Go into Data (top panel) then Data libraries
- Navigate to the correct folder as indicated by your instructor.
- On most Galaxies tutorial data will be provided in a folder named GTN - Material –> Topic Name -> Tutorial Name.
- Select the desired files
- Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu
In the pop-up window, choose
- “Select history”: the history you want to import the data to (or create a new one)
- Click on Import
- Rename the datasets
- galaxy-pencil Rename the file
atac_pbmc_5k_nextgem_fragments.tsv
tofragments_file.tsv
- galaxy-pencil Rename the file
gencode.v46.annotation.gtf.gz
togene_annotation.gtf.gz
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, change the Name field
- Click the Save button
- Inspect
chrom_sizes
andfragments_file
Question
- How many chromosomes are in
chrom_sizes
?- In which column are the cell barcodes stored in the
fragments_file
?
- There are 25 chromosomes. The 22 autosomes (Chr. 1-22), both sex chromosomes (Chr. X and Y) and the small circular mitochondrial chromosome (Chr. M).
- The cell barcodes are unique 16 bp oligos, located in column 4.
Creating a fragment file
Hands-on: fragment file
- Import the file
BAM_500-PBMC
from Zenodo or from the shared data libraryhttps://zenodo.org/records/12707159/files/BAM_500-PBMC.bam
- This dataset contains mapped reads in the
BAM
format.- It was generated by following the tutorial “Pre-processing of 10X Single-Cell ATAC-seq Datasets” until the output of Map with BWA-MEM ( Galaxy version 0.7.18)
- SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
- “Method used for preprocessing”:
Convert a BAM file to a fragment file, using 'pp.make_fragment_file'
- param-file “File name of the BAM file”:
BAM_500-PBMC
(Input dataset)- param-toggle “Indicate whether the BAM file contain paired-end reads”:
Yes
- “How to extract barcodes from BAM records?”:
From read names using regular expressions
- “Extract barcodes from read names of BAM records using regular expressions”:
(................):
Comment
- Not every regular expression type is supported.
- This expression selects 16 characters if they are followed by a colon. Only the cell barcodes of the
BAM
file will match.- Rename the generated file to
Fragments 500 PBMC
- Now you can continue with either the
fragments_file
from earlier or the new fileFragments 500 PBMC
.
- galaxy-info The tool
pp.make_fragment_file
tool has implemented additional quality control (QC) measures. These filter out larger fragments, which will be noticeable in the log-scale fragment size distribution.- galaxy-info Please note that
Fragments 500 PBMC
only contains 500 PBMCs and thus the clustering will produce different outputs compared to the outputs generated byfragments_file
(with 5k PBMC).
Preprocessing
Preprocessing of the scATAC-seq data contained in the fragment
file with SnapATAC2 begins with importing the files and computing basic QC metrics.
SnapATAC2 compresses and stores the fragments into an AnnData
object.
AnnData
The AnnData
format was initially developed for the Scanpy
package and is now a widely accepted data format to
store annotated data matrices in a space-efficient manner.
Import files to SnapATAC2
Hands-on: Create an AnnData object
- SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
- “Method used for preprocessing”:
Import data fragment files and compute basic QC metrics, using 'pp.import_data'
- param-file “Fragment file, optionally compressed with gzip or zstd”:
fragments_file.tsv
(Input dataset)- param-file “A tabular file containing chromosome names and sizes”:
chrom_sizes.txt
(Input dataset)- param-toggle “Whether the fragment file has been sorted by cell barcodes”:
No
- This tool requires the fragment file to be sorted according to cell barcodes.
- If pp.make_fragment_file tool was used to generate the fragment file, this has automatically been done.
- Otherwise, the setting “sorted by cell barcodes” should remain
No
.Rename the generated file to
Anndata 5k PBMC
- Check that the format is
h5ad
Because the AnnData
format is an extension of the HDF5 format, i.e. a binary format, an AnnData
object can not be inspected directly in Galaxy by clicking on the galaxy-eye (View data) icon.
Instead, we need to use a dedicated tool from the AnnData suite.
Hands-on: Inspect an AnnData object
- Inspect AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
- param-file “Annotated data matrix”:
Anndata 5k PBMC
- “What to inspect?”:
General information about the object
galaxy-eye Inspect the generated file
QuestionAnnData object with n_obs × n_vars = 14232 × 0 obs: 'n_fragment', 'frac_dup', 'frac_mito' uns: 'reference_sequences' obsm: 'fragment_paired'
- How many observations are there? What do they represent?
- How many variables are there?
- There are 14232 observations, representing the cells.
- There are 0 variables, representing genomic regions. This is because genome-wide 500-bp bins are only added after initial filtering.
- Many toolsets producing outputs in AnnData formats in Galaxy, provide the general information by default:
- Click on the name of the dataset in the history to expand it.
General Anndata information will be given in the expanded box:
e.g.
[n_obs x n_vars] - 14232 × 0
details This feature isn’t the most reliable and might display:
[n_obs x n_vars] - 1 × 1
- In such cases and for more specific queries, Inspect AnnData ( Galaxy version 0.10.3+galaxy0) is required.
Calculate and visualize QC metrics
Hands-on: Fragment-size distribution
- SnapATAC2 Plotting ( Galaxy version 2.6.4+galaxy1) with the following parameters:
- “Method used for plotting”:
Plot fragment size distribution, using 'pl.frag_size_distr'
- param-file “Annotated data matrix”:
Anndata 5k PBMC
(output of pp.import_data tool)- galaxy-eye Inspect the
.png
output- SnapATAC2 Plotting ( Galaxy version 2.6.4+galaxy1) with the following parameters:
- “Method used for plotting”:
Plot fragment size distribution, using 'pl.frag_size_distr'
- param-file “Annotated data matrix”:
Anndata 5k PBMC
(output of pp.import_data tool)- “Change the y-axis (fragment counts) to log scale”:
Yes
- galaxy-eye Inspect the
.png
outputQuestion
- What distinct features do the plots have? And what do they represent?
- Which fragments are generally from open chromatin?
- 3 peaks are clearly visible (at <100-bp, ~200-bp and ~400-bp). The smallest fragments are from nucleosome-free regions, while the larger peaks of 200- and 400-bp contain mono- and di-nucleosome fragments, respectively.
- The small fragments (<100-bp) are from open chromatin reads, since the Tn5 transposase could easily access the loosely packed DNA (Yan et al. 2020).
The transcription start site enrichment (TSSe) is another important QC metric. Nucleosome-free fragments are expected to be enriched at transcription start sites (TSS). TSSe shows increased fragmentation of chromatin around the TSS. This suggests open and accessible nucleosome-free chromatin.
TSSe is used as a QC metric, since an increased enrichment around TSS regions suggests that the experiment has captured biological meaningful genomic features. TSSe scores of individual cells can be calculated using SnapATAC2’s metrics.tsse function.
Hands-on: Calculate and Plot TSSe
- SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
- “Method used for preprocessing”:
Compute the TSS enrichment score (TSSe) for each cell, using 'metrics.tsse'
- param-file “Annotated data matrix”:
Anndata 5k PBMC
(output of pp.import_data tool)- param-file “GTF/GFF file containing the gene annotation”:
gene_annotation.gtf.gz
(Input dataset)Rename the generated file to
Anndata 5k PBMC TSSe
or add the tag galaxy-tagsTSSe
to the dataset:Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.
To tag a dataset:
- Click on the dataset to expand it
- Click on Add Tags galaxy-tags
- Add tag text. Tags starting with
#
will be automatically propagated to the outputs of tools using this dataset (see below).- Press Enter
- Check that the tag appears below the dataset name
Tags beginning with
#
are special!They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parenthesis correspond to dataset numbers in the figure below):
- a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
- dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for
+
and-
strands. This generates two datasets (4 and 5 for plus and minus, respectively);- datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
- datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.
Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.
The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with
#plus
and#minus
, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.More information is in a dedicated #nametag tutorial.
- SnapATAC2 Plotting ( Galaxy version 2.6.4+galaxy1) with the following parameters:
- “Method used for plotting”:
Plot the TSS enrichment vs. number of fragments density figure, using 'pl.tsse'
- param-file “Annotated data matrix”:
Anndata 5k PBMC TSSe
(output of metrics.tsse tool)- galaxy-eye Inspect the
.png
outputHigh-quality cells can be identified in the plot of TSSe scores against a number of unique fragments for each cell.
Question
- Where are high-quality cells located in the plot?
- Based on this plot, how should the filter be set?
- The cells in the upper right are high-quality cells, enriched for TSS. Fragments in the lower left represent low-quality cells or empty droplets and should be filtered out.
- Setting the minimum number of counts at 5,000 and the minimum TSS enrichment to 10.0 is an adequate filter.
Filtering
Based on the TSSe plot the cells can be filtered by TSSe and fragment counts.
Hands-on: Filter cells
- SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
- “Method used for preprocessing”:
Filter cell outliers based on counts and numbers of genes expressed, using 'pp.filter_cells'
- param-file “Annotated data matrix”:
Anndata 5k PBMC TSSe
(output of metrics.tsse tool)- “Minimum number of counts required for a cell to pass filtering”:
5000
- “Minimum TSS enrichemnt score required for a cell to pass filtering”:
10.0
- “Maximum number of counts required for a cell to pass filtering”:
100000
- Rename the generated file to
Anndata 5k PBMC TSSe filtered
or add the tag galaxy-tagsfiltered
to the datasetgalaxy-eye Inspect the general information of the
.h5ad
outputQuestionAnnData object with n_obs × n_vars = 4564 × 0 obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse' uns: 'reference_sequences' obsm: 'fragment_paired'
- How have the observations changed, compared to the first
Anndata 5k PBMC
AnnData file?- What does this tell us about the quality of the data?
There are only 4564 observations, compared to the initial 14232 observations.
And the
obs: 'tsse'
has been added (but already during metrics.tssetool)The empty droplets and low-quality cells have been filtered out, leaving us with 4564 high-quality cells.
Feature selection
Currently, our AnnData matrix does not contain any variables. The variables will be added in the following step with the function pp.add_tile_matrix. This creates a cell-by-bin matrix containing insertion counts across genome-wide 500-bp bins.
After creating the variables, the most accessible features are selected.
Hands-on: Select features
- SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
- “Method used for preprocessing”:
Generate cell by bin count matrix, using 'pp.add_tile_matrix'
- param-file “Annotated data matrix”:
Anndata 5k PBMC TSSe filtered
(output of pp.filter_cells tool)- “The size of consecutive genomic regions used to record the counts”:
500
- “The strategy to compute feature counts”:
"insertion": based on the number of insertions that overlap with a region of interest
- SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
- “Method used for preprocessing”:
Perform feature selection, using 'pp.select_features'
- param-file “Annotated data matrix”:
Anndata tile_matrix
(output of pp.add_tile_matrix tool)- “Number of features to keep”:
250000
Comment: Select features
- Including more features improves resolution and can reveal finer details, but it may also introduce noise.
- To optimize results, experiment with the
n_features
parameter to find the most appropriate value for your dataset.
- To demonstrate the differences when selecting features, the following UMAP plots are the outputs from processing with a number of features between 1,000 and 500,000.
- Fewer features result in fewer, but larger clusters. Selecting a lot of features will output more granular clusters and the compute time will increase.
- At this step you can provide a blacklist or whitelist to specifically select relevant features.
- For example the ENCODE Blacklist (Amemiya et al. 2019) can be applied here.
- Rename the generated file to
Anndata 5k PBMC select_features
or add the tag galaxy-tagsselect_features
to the datasetgalaxy-eye Inspect the general information of the
.h5ad
outputQuestionAnnData object with n_obs × n_vars = 4564 × 6062095 obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse' var: 'count', 'selected' uns: 'reference_sequences' obsm: 'fragment_paired'
- How did
n_vars
change compared toAnndata 5k PBMC TSSe filtered
- Where are the selected features stored in the count matrix?
- There are 6,062,095 variables, compared to 0 in
TSSe filtered
.- Selected features are stored in
var: 'selected'
.
Doublet removal
Doublets are removed by calling a customized scrublet (Wolock et al. 2019) algorithm. pp.scrublet will identify potential doublets and the function pp.filter_doublets removes them.
- During single-cell sequencing, multiple nuclei can be encapsulated into the same 10x gel bead. The resulting multiplets (>97% of which are doublets) produce sequences from both cells.
- Doublets can confound the results by appearing as “new” clusters or artifactual intermediary cell states.
- These problematic doublets are called neotypic doublets, since they appear as “new” cell types.
- Scrublet (Single-cell Remover of Doublets) is an algorithm which can detect neotypic doublets that produce false results.
- The algorithm first simulates doublets by combining random pairs of observed cell features.
- The observed features of the “cells” are then compared to the simulated doublets and scored on their doublet probability.
- SnapATAC2’s pp.filter_doublets then removes all cells with a doublet probability >50%.
Hands-on: Scrublet
- SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
- “Method used for preprocessing”:
Compute probability of being a doublet using the scrublet algorithm, using 'pp.scrublet'
- param-file “Annotated data matrix”:
Anndata 5k PBMC select_features
(output of pp.select_features tool)- SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
- “Method used for preprocessing”:
Remove doublets according to the doublet probability or doublet score, using 'pp.filter_doublets'
- param-file “Annotated data matrix”:
Anndata scrublet
(output of pp.scrublet tool)- “Threshold for doublet probability”:
0.5
- Rename the generated file to
Anndata 5k PBMC filter_doublets
or add the tag galaxy-tagsfilter_doublets
to the datasetgalaxy-eye Inspect the general information of the
.h5ad
outputQuestionAnnData object with n_obs × n_vars = 4430 × 6062095 obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse', 'doublet_probability', 'doublet_score' var: 'count', 'selected' uns: 'doublet_rate', 'reference_sequences', 'scrublet_sim_doublet_score' obsm: 'fragment_paired'
- What was removed by pp.filter_doublets?
- Where are the new annotations stored?
- Cell doublets were removed.
n_obs
was reduced from 4564 to 4430 cells.- The outputs of pp.scrublet are stored in observations
obs: 'doublet_probability', 'doublet_score'
and in unstructured annotationsuns: 'scrublet_sim_doublet_score'
. The output of pp.filter_doublets is stored inuns: 'doublet_rate'
.
Dimension reduction
Dimension reduction is a very important step during the analysis of single cell data. During this, the complex multi-dimensional data is projected into lower-dimensional space, while the lower-dimensional embedding of the complex data retains as much information as possible. Dimension reduction enables batch correction, data visualization and quicker downstream analysis since the data is more simplified and the memory usage is reduced (Zhang et al. 2024).
- Dimension reduction algorithms can be either linear or non-linear.
Linear methods are generally computationally efficient and well scalable.
A popular linear dimension reduction algorithm is:
- PCA (Principle Component Analysis), implemented in Scanpy (please check out our Scanpy tutorial for an explanation).
- Nonlinear methods however are well suited for multimodal and complex datasets.
- in contrast to linear methods, which often preserve global structures, non-linear methods have a locality-preserving character.
- This makes non-linear methods relatively insensitive to outliers and noise while emphasizing natural clusters in the data (Belkin and Niyogi 2003)
- As such, they are implemented in many algorithms to visualize the data in 2 dimensions (f.ex. UMAP embedding).
- The nonlinear dimension reduction algorithm, through matrix-free spectral embedding, used in SnapATAC2 is a very fast and memory efficient non-linear algorithm (Zhang et al. 2024).
- Spectral embedding utilizes an iterative algorithm to calculate the spectrum (eigenvalues and eigenvectors) of a matrix without computing the matrix itself.
- For a simple introduction into spectral embedding and how it compares to PCA, please check out the blog post “On Laplacian Eigenmaps for Dimensionality Reduction” by Juan Orduz.
Spectral embedding
The dimension reduction, produced by the algorithm tl.spectral, is required for later steps, such as plotting and clustering.
Hands-on: Spectral embedding
- SnapATAC2 Clustering ( Galaxy version 2.6.4+galaxy1) with the following parameters:
- “Dimension reduction and Clustering”:
Perform dimension reduction using Laplacian Eigenmap, using 'tl.spectral'
- param-file “Annotated data matrix”:
Anndata 5k PBMC filter_doublets
(output of pp.filter_doublets tool)- “Distance metric”:
cosine
Comment: Distance metric
- The fast and well scalable “matrix-free spectral embedding” algorithm depends on the distance metric:
cosine
- Rename the generated file to
Anndata 5k PBMC spectral
or add the tag galaxy-tagsspectral
to the datasetgalaxy-eye Inspect the general information of the
.h5ad
outputQuestionAnnData object with n_obs × n_vars = 4430 × 6062095 obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse', 'doublet_probability', 'doublet_score' var: 'count', 'selected' uns: 'doublet_rate', 'reference_sequences', 'scrublet_sim_doublet_score', 'spectral_eigenvalue' obsm: 'fragment_paired', 'X_spectral'
Where are the new annotations stored?
The outputs of tl.spectral are stored in unstructured annotations
uns: 'spectral_eigenvalue'
and as multidimensional observationsobsm: 'X_spectral'
.
UMAP embedding
With the already reduced dimensionality of the data stored in X_spectral
, the cells can be further embedded (i.e. transformed into lower dimensions) with Uniform Manifold Approximation and Projection (UMAP).
UMAP projects the cells and their relationship to each other into 2-dimensional space, which can be easily visualized (McInnes et al. 2018).
Hands-on: UMAP embedding
- SnapATAC2 Clustering ( Galaxy version 2.6.4+galaxy1) with the following parameters:
- “Dimension reduction and Clustering”:
Compute Umap, using 'tl.umap'
- param-file “Annotated data matrix”:
Anndata 5k PBMC spectral
(output of tl.spectral tool)- “Use the indicated representation in ‘.obsm’“:
X_spectral
- Rename the generated file to
Anndata 5k PBMC UMAP
or add the tag galaxy-tagsUMAP
to the dataset
Clustering
During clustering, cells that share similar accessibility profiles are organized into clusters. SnapATAC2 utilizes graph-based community clustering with the Leiden algorithm (Traag et al. 2019). This method takes the k-nearest neighbor (KNN) graph as input data and produces well-connected communities.
Community clustering
Hands-on: Clustering analysis
- SnapATAC2 Clustering ( Galaxy version 2.6.4+galaxy1) with the following parameters:
- “Dimension reduction and Clustering”:
Compute a neighborhood graph of observations, using 'pp.knn'
- param-file “Annotated data matrix”:
Anndata 5k PBMC UMAP
(output of tl.umap tool)- SnapATAC2 Clustering ( Galaxy version 2.6.4+galaxy1) with the following parameters:
- “Dimension reduction and Clustering”:
Cluster cells into subgroups, using 'tl.leiden'
- param-file “Annotated data matrix”:
Anndata knn
(output of pp.knn tool)- “Whether to use the Constant Potts Model (CPM) or modularity”:
modularity
Comment: CPM or modularity
- make sure you selected
modularity
- the clusters produced by
CPM
are not represented well in the UMAP projections- Rename the generated file to
Anndata 5k PBMC leiden
or add the tag galaxy-tagsleiden
to the datasetgalaxy-eye Inspect the general information of the
.h5ad
outputQuestionAnnData object with n_obs × n_vars = 4564 × 6062095 obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse', 'doublet_probability', 'doublet_score', 'leiden' var: 'count', 'selected' uns: 'doublet_rate', 'reference_sequences', 'scrublet_sim_doublet_score', 'spectral_eigenvalue' obsm: 'X_spectral', 'X_umap', 'fragment_paired' obsp: 'distances'
Where are the leiden clusters stored in the AnnData?
The clusters are stored in
obs: 'leiden'
Plotting of clusters
Now that we have produced UMAP embeddings of our cells and have organized the cells into leiden clusters, we can now visualize this information with pl.umap.
Hands-on: Plotting the clusters
- SnapATAC2 Plotting ( Galaxy version 2.6.4+galaxy1) with the following parameters:
- “Method used for plotting”:
Plot the UMAP embedding, using 'pl.umap'
- param-file “Annotated data matrix”:
Anndata 5k PBMC leiden
(output of tl.leiden tool)- “Color”:
leiden
- “Height of the plot”:
500
- galaxy-eye Inspect the
.png
outputQuestion
- How many leiden clusters were discovered?
- What does the distance of clusters to each other tell us about their chromatin states?
- There are 13 leiden clusters.
- Clusters in close proximity (f.ex. clusters 0 and 5) share a similar chromatin accessibility profile (and very likely also a similar cell type), compared to a cluster further away (f.ex. cluster 9).
Cell cluster annotation
After clustering the cells, they must be annotated. This categorizes the clusters into known cell types. Manual Cell Annotation requires known marker genes and varying expression profiles of the marker genes among clusters.
Luckily, the marker genes for PBMCs are known (Sun et al. 2019). Marker genes for other single cell datasets can also be found in databases such as PanglaoDB (Franzén et al. 2019). Using the known marker genes we can now annotate our clusters manually.
Gene matrix
Since our data currently doesn’t contain gene information, we have to create a cell-by-gene activity matrix using the function pp.make_gene_matrix.
Hands-on: Task description
- SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
- “Method used for preprocessing”:
Generate cell by gene activity matrix, using 'pp.make_gene_matrix'
- param-file “Annotated data matrix”:
Anndata 5k PBMC leiden
(output of tl.leiden tool)- param-file “GTF/GFF file containing the gene annotation”:
gene_annotation.gtf.gz
(Input dataset)- Rename the generated file to
Anndata 5k PBMC gene_matrix
or add the tag galaxy-tagsgene_matrix
to the dataset
- Please note that pp.make_gene_matrix removes all annotations except those stored in
obs
.- Therefore it might be necessary to remove propagating tags galaxy-tags (tags starting with
#
) fromAnndata 5k PBMC gene_matrix
.
- Tags can be removed by expanding the dataset with a tag and clicking galaxy-cross next to the tag.
galaxy-eye Inspect the general information of the
.h5ad
outputQuestionAnnData object with n_obs × n_vars = 4430 × 60606 obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse', 'doublet_probability', 'doublet_score', 'leiden'
What does
n_vars
represent inAnndata 5k PBMC gene_matrix
and what did it represent inAnndata 5k PBMC leiden
?
- The variables now represent accessible genes. There are 60606 accessible genes in our samples. In
Anndata 5k PBMC leiden
and all earlier AnnData the variables represented the 6062095 fixed-sized genomic bins.
Imputation with Scanpy and MAGIC
Similar to scRNA-seq data, the cell-by-gene-activity matrix is very sparse. Additionally, high gene variance between cells, due to technical confounders, could impact the downstream analysis. In scRNA-seq, filtering and normalization are therefore required to produce a high-quality gene matrix.
Since the cell-by-gene-activity matrix resembles the cell-by-gene-expression matrix of scRNA-seq, we can use the tools of the Scanpy (Wolf et al. 2018) tool suite to continue with our data.
- The count matrices of single-cell data are sparse and noisy.
- Confounding issues, such as “dropout” effects, where some mRNA or DNA-segments are not detected although they are present in the cell, also result in some cells missing important cell-type defining features.
- These problems can obscure the data, as only the strongest gene-gene relationships are still detectable.
- The Markov Affinity-based Graph Imputation of Cells (MAGIC) algorithm (van Dijk et al. 2018) tries to solve these issues by filling in missing data from some cells with transcript information from similar cells.
- The algorithm calculates the likely gene expression of a single cell, based on similar cells and fills in the missing data to produce the expected expression.
- MAGIC achieves this by building a graph from the data and using data diffusion to smooth out the noise.
Hands-on: Filter and normalize
- Filter ( Galaxy version 1.9.6+galaxy3) with the following parameters:
- param-file “Annotated data matrix”:
Anndata 5k PBMC gene_matrix
(output of pp.make_gene_matrix tool)- “Method used for filtering”:
Filter genes based on number of cells or counts, using 'pp.filter_genes'
- param-select “Filter”:
Minimum number of cells expressed
- “Minimum number of cells expressed required for a gene to pass filtering”:
5
- Normalize ( Galaxy version 1.9.6+galaxy3) with the following parameters:
- param-file “Annotated data matrix”:
Anndata filter_genes
(output of Filter tool)- “Method used for normalization”:
Normalize counts per cell, using 'pp.normalize_total'
- param-toggle “Exclude (very) highly expressed genes for the computation of the normalization factor (size factor) for each cell”:
No
- Inspect and manipulate ( Galaxy version 1.9.6+galaxy3) with the following parameters:
- param-file “Annotated data matrix”:
Anndata normalize
(output of Normalize tool)- “Method used for inspecting”:
Logarithmize the data matrix, using 'pp.log1p'
- Normalize ( Galaxy version 1.9.6+galaxy3) with the following parameters:
- param-file “Annotated data matrix”:
Anndata log1p
(output of log1p tool)- “Method used for normalization”:
Denoising using Markov Affinity-based Graph Imputation of Cells (MAGIC) API 'external.pp.magic'
- “Denoised genes to return”:
All genes
- “Which solver to use”:
"approximate", is faster that performs imputation in the PCA space and then projects back to the gene space
Comment:
- Choosing the setting
Which solver to use: 'exact'
will result in a output file with better resolution.- This is not necessary for our purposes, since the compute time also increases with this setting.
Rename the generated file to
Anndata 5k PBMC magic
or add the tag galaxy-tagsmagic
to the datasetgalaxy-eye Inspect the general information of the
.h5ad
outputQuestionAnnData object with n_obs × n_vars = 4430 × 55106 obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse', 'doublet_probability', 'doublet_score', 'leiden', 'n_genes', 'n_counts' var: 'n_cells', 'n_counts' uns: 'log1p'
- How did
n_vars
change, compared toAnndata 5k PBMC gene_matrix
?- Which data was ‘lost’, compared to
Anndata 5k PBMC leiden
, and must be added to the file in order to produce UMAP plots?
- The number of genes was reduced from 60606 to 55106 by the filtering. Additional annotations were added, such as:
obs: 'n_genes', 'n_counts'
,var: 'n_cells', 'n_counts'
anduns: 'log1p'
.- The UMAP embeddings
obsm: 'X_umap'
are missing and should be added to the Anndata in the next step. WithoutX_umap
it won’t be possible to visualize the plots.
Copy-over embeddings
Hands-on: Copy UMAP embedding
- AnnData Operations ( Galaxy version 1.9.3+galaxy0) with the following parameters:
- param-file “Input object in hdf5 AnnData format”:
Anndata 5k PBMC magic
(output of external.pp.magic tool)- “Copy embeddings (such as UMAP, tSNE)”:
Yes
- “Keys from embeddings to copy”:
X_umap
- param-file “IAnnData objects with embeddings to copy”:
Anndata 5k PBMC leiden
Comment: Annotations to copy
- This tutorial only focuses on producing an UMAP plot with marker-genes.
- If further analysis, with tools requiring more annotations, is intended, these can be added in a similar way as shown above.
- f.ex. Peak and Motif Analysis with Snapatac2 peaks and motif ( Galaxy version 2.6.4+galaxy1) requires annotations from
uns
.- It is also possible to leave the input “Keys from embeddings to copy” empty, to copy all annotations of a given category such as
obsm
.Rename the generated file to
Anndata 5k PBMC magic UMAP
or add the tag galaxy-tagsUMAP
to the dataset- galaxy-eye Inspect the general information of the
.h5ad
output, to check ifobsm
containsX_umap
AnnData object with n_obs × n_vars = 4430 × 55106 obs: obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse', 'doublet_probability', 'doublet_score', 'leiden', 'n_genes', 'n_counts' var: 'n_cells', 'n_counts' uns: 'log1p' obsm: 'X_umap'
Visualize gene activity of marker genes
The gene activity of selected marker genes can now be visualized with Scanpy.
Hands-on: Plot marker genes
- Plot with Scanpy ( Galaxy version 1.9.6+galaxy3) with the following parameters:
- param-file “Annotated data matrix”:
output_h5ad
(output of AnnData Operations tool)- “Method used for plotting”:
Embeddings: Scatter plot in UMAP basis, using 'pl.umap'
- “Keys for annotations of observations/cells or variables/genes”:
leiden, CD3D, CD8A, CD4, MS4A1, NKG7, CD14, FCER1A
- param-toggle “Show edges?”:
No
- In “Plot attributes”
- “Number of panels per row”:
2
Question
- Are the marker genes selectively expressed in the clusters?
- Which marker genes have overlapping expression profiles? And what could that imply?
- Some marker genes, such as
MS4A1
orCD8A
, are only expressed in a few clusters (clusters 6+11 and clusters 1+8, respectively).- The marker gene
CD3D
is expressed in multiple clusters (1, 2, 4, 7 and 8). Overlapping expression profiles imply similar cell types since similar cell types have similar marker genes upregulated. In this case,CD3D
expression classifies the cells in these clusters as T-cells.
Manual cluster annotation
Comparison of marker gene expression in our clusters with a table of canonical marker genes (Sun et al. 2019), enables us to annotate the clusters manually.
Cell type | Marker genes |
---|---|
CD8+ T cells | CD3D+, CD8A+, CD4- |
CD4+ T cells | CD3D+, CD8A-, CD4+ |
B cells | CD3D-, MS4A1+ |
Natural killer (NK) cells | CD3D-, NKG7+ |
Monocytes | CD3D-, CD14+ |
Dendritic cells | CD3D-, FCER1A+ |
These canonical marker genes can match the clusters to known cell types:
Cluster | Cell type |
---|---|
0 | Monocytes |
1 | CD8+ T cells |
2 | CD4+ T cells |
3 | Monocytes |
4 | CD4+ T cells |
5 | Monocytes |
6 | B cells |
7 | CD4+ T cells |
8 | CD8+ T cells |
9 | NK cells |
10 | Monocytes |
11 | B cells |
12 | Dendritic cells |
CommentNote that some clusters contain subtypes (f.ex. the annotated B cell clusters contain both naive and memory B cells). The cell-type annotation can be refined by choosing more specific marker genes.
To manually annotate the Leiden clusters, we will need to perform multiple steps:
- Inspect the key-indexed observations of
Anndata 5k PBMC gene_matrix magic UMAP
- Cut the Leiden annotations out of the table
- Upload a replace file containing the new cell type annotations for the Leiden clusters
- Replace the values of the cluster annotation with cell type annotation
- Add the cell type annotation to the AnnData
- Plot the annotated cell types
Hands-on: Manual annotation
- Inspect AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
- param-file “Annotated data matrix”:
Anndata 5k PBMC gene_matrix magic UMAP
- “What to inspect?”:
Key-indexed observations annotation
- Rename the generated file to
5k PBMC observations
galaxy-eye Inspect the generated file
QuestionIn which column is the Leiden annotation located?
The Leiden annotation is in column 8.
Column 1 Column 2 Column 3 Column 4 Column 5 Column 6 Column 7 Column 8 Column 9 Column 10 ”” n_fragment frac_dup frac_mito tsse doublet_probability doublet_score leiden n_genes n_counts AAACGAAAGACGTCAG-1 22070 0.5219425551271499 0.0 30.43315066436454 0.004634433324822066 0.009276437847866418 8 52303 16521.599844068267 AAACGAAAGATTGACA-1 10500 0.5345125681606597 0.0 29.10551296093465 0.004668403569267374 0.001088139281828074 1 54501 15020.42495602328 AAACGAAAGGGTCCCT-1 19201 0.5101785714285714 0.0 19.90011850347046 0.004634433324822066 0.009276437847866418 5 54212 16294.751533305309 AAACGAACAATTGTGC-1 13242 0.487399837417257 0.0 29.060913705583758 0.004660125753854076 0.0022172949002217295 7 53530 15456.629863655084 - Cut columns ( Galaxy version 1.0.2) with the following parameters:
- param-select “Cut columns”:
c8
- param-file “From”:
5k PBMC observations
(output of Inspect AnnData tool)- Create a new tabular file from the following
leiden cell_type 0 Monocytes 1 CD8_Tcells 2 CD4_Tcells 3 Monocytes 4 CD4_Tcells 5 Monocytes 6 Bcells 7 CD4_Tcells 8 CD8_Tcells 9 NKcells 10 Monocytes 11 Bcells 12 Dendritic_cells
- Click galaxy-upload Upload Data at the top of the tool panel
- Select galaxy-wf-edit Paste/Fetch Data at the bottom
Paste the file contents into the text field
- Change the dataset name from “New File” to
replace_file
- Change Type from “Auto-detect” to
tabular
* From the Settings menu (galaxy-gear) select Convert spaces to tabs* Press Start and Close the window
- The first column of the replace file contains the “old” annotations and the second column contains the “new” annotation.
- warning Spaces in the new annotations can lead to errors. Please use underscores (
_
) instead.- Replace column ( Galaxy version 0.2) with the following parameters:
- param-file “File in which you want to replace some values”:
Cut columns leiden
(output of Cut columns tool)- param-file “Replace information file”:
replace_file
- “Which column should be replaced?:
Column: 1
- Rename the generated file to
Cell type annotation
and galaxy-eye inspect the file to check if the replacement was successful
- check if the datatype is set to
tabular
you may need to change the datatype manually
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, click galaxy-chart-select-data Datatypes tab on the top
- In the galaxy-chart-select-data Assign Datatype, select
tabular
from “New type” dropdown
- Tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
- Manipulate AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
- param-file “Annotated data matrix”:
Anndata 5k PBMC gene_matrix magic UMAP
- “Function to manipulate the object”:
Add new annotation(s) for observations or variables
- “What to annotate”:
Observations (obs)
- param-file “Table with new annotations”:
Cell type annotation
(output of Replace column tool)Rename the generated file to
Anndata 5k PBMC gene_matrix magic cell_type
or add the tag galaxy-tagscell_type
to the dataset- Plot with Scanpy ( Galaxy version 1.9.6+galaxy3) with the following parameters:
- param-file “Annotated data matrix”:
Anndata 5k PBMC gene_matrix magic cell_type
(output of Manipulate AnnData tool)- “Method used for plotting”:
Embeddings: Scatter plot in UMAP basis, using 'pl.umap'
- “Keys for annotations of observations/cells or variables/genes”:
cell_type
- param-toggle “Show edges?”:
No
- In “Plot attributes”
- “Location of legend”:
on data
- param-toggle “Draw a frame around the scatter plot?”:
Yes
- galaxy-eye Inspect the
.png
outputQuestion
- Are clusters with the same assigned cell type located close to each other?
Are the expected cell type percentages of PBMCs visible in the annotated plot?
Cell Type Expected Percentage CD3+ T cells 50-70% CD4+ T cells 25-45% CD8+ T cells 10-25% B cells 5-15% NK cells <5% Monocytes 10-30% Dendritic cells <2%
- Yes. B-cells are far away from NK and T cells. Only the myeloid lineage of monocytes and dendritic cells are located close to each other. This might be due to the common progenitor cell lineage and thus a similar chromatin profile.
- Yes. The most common cell types are T cells, followed by Monocytes and B cells. NK cells and Dendritic cells only make up a small percentage of PBMCs.
Conclusion
congratulations Well done, you’ve made it to the end! You might want to consult your results with this control history, or check out the full workflow for this tutorial.
In this tutorial, we produced a count matrix of scATAC-seq reads in the AnnData
format and performed:
- Preprocessing:
- Plotting the fragment-size distributions
- Calculating and plotting TSSe scores
- Filtering cells and selecting features (fixed-size genomic bins)
- Dimension reduction through Spectral embedding and UMAP
- Clustering of cells via the Leiden method
- Cluster annotation
- Producing and filtering a cell by gene activity matrix
- Data normalization and imputation with Scanpy and the MAGIC algorithm
- Visualizing marker genes in the clusters
- Manually annotating the cell types with selected marker genes
The scATAC-seq analysis can now continue with downstream analysis, for example differential peak analysis.
The SnapATAC2 tools for differential peak analysis are already accessible on Galaxy. However, there are no GTN trainings available yet. Until such a tutorial is uploaded, you can visit the SnapATAC2 documentation for a tutorial on differential peak analysis. And check out our example history and the exemplary workflow.
The tools are available in Galaxy under SnapATAC2 Peaks and Motif ( Galaxy version 2.6.4+galaxy1).
- If you want to continue with differential peak analysis, please make sure that the AnnData object with the annotated cell types contains unspecified annotations for the reference sequences (
uns: 'reference_sequences'
).
- The section Copy-over embeddings explains how to copy annotations from one AnnData object to another.