Single-cell ATAC-seq standard processing with SnapATAC2

Author(s)	Timon Schlegel
Editor(s)	Pavankumar Videm Björn Grüning
Tester(s)	Pavankumar Videm
Reviewers

Overview
Questions:

What does ATAC-seq data tell us about the cell?

Which steps are necessary to cluster the cells of single-cell ATAC-seq data?

Why is dimension reduction important for the analysis of single-cell data?

Objectives:

Learn how single-cell ATAC-seq data is processed

Create a count-matrix from a 10X fragment file

Perform filtering, dimension reduction and clustering on AnnData matrices

Generate and filter a cell-by-gene matrix

Identify marker genes for the clusters and annotate the cell types

Requirements:

Introduction to Galaxy Analyses

tutorial Hands-on: Pre-processing of 10X Single-Cell ATAC-seq Datasets

Time estimation: 4 hours

Level: Intermediate Intermediate

Supporting Materials:

Datasets

Workflows

FAQs

Published: Jul 11, 2024

Last modification: Jul 2, 2025

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

purl PURL: https://gxy.io/GTN:T00445

version Revision: 5

Single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) analysis is a method to decipher the chromatin states of the analyzed cells. In general, genes are only expressed in accessible (i.e. “open”) chromatin and not in closed chromatin. By analyzing which genomic sites have an open chromatin state, cell-type specific patterns of gene accessibility can be determined. scATAC-seq is particularly useful for analyzing tissue containing different cell populations, such as peripheral blood mononuclear cells (PBMCs).

In this tutorial we will analyze scATAC-seq data using the tool suites SnapATAC2 (Zhang et al. 2024) and Scanpy (Wolf et al. 2018). With both of these tool suites we will perform preprocessing, clustering and identification of scATAC-seq datasets from 10x Genomics. The analysis will be performed using a dataset of PBMCs containing ~4,620 single nuclei.

Did you know we have a unique Single Cell Omics Lab with all our single cell tools highlighted to make it easier to use on Galaxy? We recommend this site for all your single cell analysis needs, particularly for newer users.

The Single Cell Omics Lab is a different view of the underlying Galaxy server that organises tools and resources better for single-cell users! It also provides a platform for communities to engage and connect; distribute more targeted news and events; and highlight community-specific funding sources.

Try it out!

subdomain Europe: Single Cell Omics Lab

subdomain USA: Single Cell Omics Lab

subdomain Australia: Single Cell Omics Lab

Tools are frequently updated to new versions. Your Galaxy may have multiple versions of the same tool available. By default, you will be shown the latest version of the tool. This may NOT be the same tool used in the tutorial you are accessing. Furthermore, if you use a newer tool in one step, and try using an older tool in the next step… this may fail! To ensure you use the same tool versions of a given tutorial, use the Tutorial mode feature.

Open your Galaxy server

Click on the curriculum icon on the top menu, this will open the GTN inside Galaxy.

Navigate to your tutorial

Tool names in tutorials will be blue buttons that open the correct tool for you

Note: this does not work for all tutorials (yet)

You can click anywhere in the grey-ed out area outside of the tutorial box to return back to the Galaxy analytical interface

Warning: Not all browsers work!

We’ve had some issues with Tutorial mode on Safari for Mac users.

Try a different browser if you aren’t seeing the button.

Comment

This tutorial is significantly based on “Standard pipeline” tutorial from SnapATAC2. It can be seen as the scATAC-seq counterpart to the scRNA-seq tutorial Clustering 3K PBMCs with Scanpy.

Agenda

In this tutorial, we will cover:

scATAC-seq with 10X Genomics

Data

Get Data

Creating a fragment file

Preprocessing

AnnData

Import files to SnapATAC2

Calculate and visualize QC metrics

Filtering

Feature selection

Doublet removal

Dimension reduction

Spectral embedding

UMAP embedding

Clustering

Community clustering

Plotting of clusters

Cell cluster annotation

Gene matrix

Imputation with Scanpy and MAGIC

Copy-over embeddings

Visualize gene activity of marker genes

Manual cluster annotation

Conclusion

scATAC-seq with 10X Genomics

ATAC-seq utilizes a hyperactive Tn5 transposase (Kia et al. 2017) to ligate adaptors to genome fragments, created by the transposase. Performing ATAC-seq on individual cells used to be an expensive and time-consuming labour. The 10X Chromium NextGEM system made scATAC-seq a cost-effective method for gaining high-resolution data with a simple protocol. After the transposition of nuclei in bulk, individual nuclei are put into Gel beads in Emulsion (GEM), containing unique 10x cell barcodes and sequencing adaptors for Illumina sequencing.

Library Preparation. — **Figure 1**: An overview of the 10X single-nuclei ATAC-seq library preparation

Data

The 5k PBMCs dataset for this tutorial is available for free from 10X Genomics. The blood samples were collected from a healthy donor and were prepared following the Chromium Next GEM scATAC-seq protocol. After sequencing on Illumina NovaSeq, the reads were processed by the Cell Ranger ATAC 2.0.0 pipeline from 10X to generate a Fragments File.

The Fragments File is a tabular file in a BED-like format, containing information about the position of the fragments on the chromosome and their corresponding 10x cell barcodes.

SnapATAC2 requires 3 input files for the standard pathway of processing:

fragments_file.tsv: A tabular file containing the chromosome positions of the fragments and their corresponding 10x cell barcodes.
chrom_sizes.txt: A tabular file of the number of bases of each chromosome of the reference genome
gene_annotation.gtf.gz: A tabular file listing genomic features of the reference genome (GENCODE GRCh38)

A chromosome sizes file can be generated using the tool Compute sequence length ( Galaxy version 1.0.3).

The reference genome can either be selected from cached genomes or uploaded to the galaxy history.

Comment

This tutorial starts with a fragment file.

SnapATAC2 also accepts mapped reads in a BAM file.

To learn how to get a fragment file or BAM file from raw .FASTQ-reads, please check out the tutorial “Pre-processing of 10X Single-Cell ATAC-seq Datasets”

If you would like to start the analysis with a BAM file, you can expand the details section “Details: Creating a fragment file”.

Get Data

Hands On: Data upload
Create a new history for this tutorial
Import the fragments_file.tsv, chrom_sizes.txt and gene_annotation.gtf.gz from Zenodo or from the shared data library
https://zenodo.org/records/12707159/files/atac_pbmc_5k_nextgem_fragments.tsv.gz
https://zenodo.org/records/12707159/files/chrom_sizes.txt
https://zenodo.org/records/12707159/files/gencode.v46.annotation.gtf.gz
Copy the link location

Click galaxy-upload Upload Data at the top of the tool panel

Select galaxy-wf-edit Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window

As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

Go into Libraries (left panel)

Navigate to the correct folder as indicated by your instructor.

On most Galaxies tutorial data will be provided in a folder named GTN - Material –> Topic Name -> Tutorial Name.

Select the desired files

Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu

In the pop-up window, choose

“Select history”: the history you want to import the data to (or create a new one)

Click on Import
Rename the datasets

galaxy-pencil Rename the file atac_pbmc_5k_nextgem_fragments.tsv to fragments_file.tsv

galaxy-pencil Rename the file gencode.v46.annotation.gtf.gz to gene_annotation.gtf.gz

Click on the galaxy-pencil pencil icon for the dataset to edit its attributes

In the central panel, change the Name field

Click the Save button

Inspect chrom_sizes and fragments_file

Question

How many chromosomes are in chrom_sizes?

In which column are the cell barcodes stored in the fragments_file?

There are 25 chromosomes. The 22 autosomes (Chr. 1-22), both sex chromosomes (Chr. X and Y) and the small circular mitochondrial chromosome (Chr. M).

The cell barcodes are unique 16 bp oligos, located in column 4.

Creating a fragment file

Hands On: fragment file
Import the file BAM_500-PBMC from Zenodo or from the shared data library
https://zenodo.org/records/12707159/files/BAM_500-PBMC.bam
This dataset contains mapped reads in the BAM format.

It was generated by following the tutorial “Pre-processing of 10X Single-Cell ATAC-seq Datasets” until the output of Map with BWA-MEM ( Galaxy version 0.7.18)
SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:

“Method used for preprocessing”: Convert a BAM file to a fragment file, using 'pp.make_fragment_file'

param-file “File name of the BAM file”: BAM_500-PBMC (Input dataset)

param-toggle “Indicate whether the BAM file contain paired-end reads”: Yes

“How to extract barcodes from BAM records?”: From read names using regular expressions

“Extract barcodes from read names of BAM records using regular expressions”: (................):

Comment

Not every regular expression type is supported.

This expression selects 16 characters if they are followed by a colon. Only the cell barcodes of the BAM file will match.

Rename the generated file to Fragments 500 PBMC

Now you can continue with either the fragments_file from earlier or the new file Fragments 500 PBMC.

galaxy-info The tool pp.make_fragment_file tool has implemented additional quality control (QC) measures. These filter out larger fragments, which will be noticeable in the log-scale fragment size distribution.

galaxy-info Please note that Fragments 500 PBMC only contains 500 PBMCs and thus the clustering will produce different outputs compared to the outputs generated by fragments_file (with 5k PBMC).

Preprocessing

Preprocessing of the scATAC-seq data contained in the fragment file with SnapATAC2 begins with importing the files and computing basic QC metrics.

SnapATAC2 compresses and stores the fragments into an AnnData object.

AnnData

The AnnData format was initially developed for the Scanpy package and is now a widely accepted data format to store annotated data matrices in a space-efficient manner.

Anndata format

Import files to SnapATAC2

Hands On: Create an AnnData object

SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:

“Method used for preprocessing”: Import data fragment files and compute basic QC metrics, using 'pp.import_data'

param-file “Fragment file, optionally compressed with gzip or zstd”: fragments_file.tsv (Input dataset)

param-file “A tabular file containing chromosome names and sizes”: chrom_sizes.txt (Input dataset)

param-toggle “Whether the fragment file has been sorted by cell barcodes”: No

This tool requires the fragment file to be sorted according to cell barcodes.

If pp.make_fragment_file tool was used to generate the fragment file, this has automatically been done.

Otherwise, the setting “sorted by cell barcodes” should remain No.

Rename the generated file to Anndata 5k PBMC

Check that the format is h5ad

Because the AnnData format is an extension of the HDF5 format, i.e. a binary format, an AnnData object can not be inspected directly in Galaxy by clicking on the galaxy-eye (View data) icon. Instead, we need to use a dedicated tool from the AnnData suite.

Hands On: Inspect an AnnData object
Inspect AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:

param-file “Annotated data matrix”: Anndata 5k PBMC

“What to inspect?”: General information about the object
galaxy-eye Inspect the generated file
Question
AnnData object with n_obs × n_vars = 14232 × 0
 obs: 'n_fragment', 'frac_dup', 'frac_mito'
 uns: 'reference_sequences'
 obsm: 'fragment_paired'
How many observations are there? What do they represent?

How many variables are there?

There are 14232 observations, representing the cells.

There are 0 variables, representing genomic regions. This is because genome-wide 500-bp bins are only added after initial filtering.
Many toolsets producing outputs in AnnData formats in Galaxy, provide the general information by default:
Click on the name of the dataset in the history to expand it.
General Anndata information will be given in the expanded box:

e.g.
[n_obs x n_vars]
-    14232 × 0
details This feature isn’t the most reliable and might display:
[n_obs x n_vars]
-    1 × 1
In such cases and for more specific queries, Inspect AnnData ( Galaxy version 0.10.3+galaxy0) is required.

Calculate and visualize QC metrics

Hands On: Fragment-size distribution

SnapATAC2 Plotting ( Galaxy version 2.6.4+galaxy1) with the following parameters:

“Method used for plotting”: Plot fragment size distribution, using 'pl.frag_size_distr'

param-file “Annotated data matrix”: Anndata 5k PBMC (output of pp.import_data tool)

galaxy-eye Inspect the .png output

SnapATAC2 Plotting ( Galaxy version 2.6.4+galaxy1) with the following parameters:

“Method used for plotting”: Plot fragment size distribution, using 'pl.frag_size_distr'

param-file “Annotated data matrix”: Anndata 5k PBMC (output of pp.import_data tool)

“Change the y-axis (fragment counts) to log scale”: Yes

galaxy-eye Inspect the .png output

Question

What distinct features do the plots have? And what do they represent?

Which fragments are generally from open chromatin?

3 peaks are clearly visible (at <100-bp, ~200-bp and ~400-bp). The smallest fragments are from nucleosome-free regions, while the larger peaks of 200- and 400-bp contain mono- and di-nucleosome fragments, respectively.

The small fragments (<100-bp) are from open chromatin reads, since the Tn5 transposase could easily access the loosely packed DNA (Yan et al. 2020).

The transcription start site enrichment (TSSe) is another important QC metric. Nucleosome-free fragments are expected to be enriched at transcription start sites (TSS). TSSe shows increased fragmentation of chromatin around the TSS. This suggests open and accessible nucleosome-free chromatin.

TSSe is used as a QC metric, since an increased enrichment around TSS regions suggests that the experiment has captured biological meaningful genomic features. TSSe scores of individual cells can be calculated using SnapATAC2’s metrics.tsse function.

Hands On: Calculate and Plot TSSe

SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:

“Method used for preprocessing”: Compute the TSS enrichment score (TSSe) for each cell, using 'metrics.tsse'

param-file “Annotated data matrix”: Anndata 5k PBMC (output of pp.import_data tool)

param-file “GTF/GFF file containing the gene annotation”: gene_annotation.gtf.gz (Input dataset)

Rename the generated file to Anndata 5k PBMC TSSe or add the tag galaxy-tags TSSe to the dataset:

Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

To tag a dataset:

Click on the dataset to expand it

Click on Add Tags galaxy-tags

Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).

Press Enter

Check that the tag appears below the dataset name

Tags beginning with # are special!

They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parenthesis correspond to dataset numbers in the figure below):

a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;

dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);

datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;

datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

More information is in a dedicated #nametag tutorial.

SnapATAC2 Plotting ( Galaxy version 2.6.4+galaxy1) with the following parameters:

“Method used for plotting”: Plot the TSS enrichment vs. number of fragments density figure, using 'pl.tsse'

param-file “Annotated data matrix”: Anndata 5k PBMC TSSe (output of metrics.tsse tool)

galaxy-eye Inspect the .png output

High-quality cells can be identified in the plot of TSSe scores against a number of unique fragments for each cell.

Question

Where are high-quality cells located in the plot?

Based on this plot, how should the filter be set?

The cells in the upper right are high-quality cells, enriched for TSS. Fragments in the lower left represent low-quality cells or empty droplets and should be filtered out.

Setting the minimum number of counts at 5,000 and the minimum TSS enrichment to 10.0 is an adequate filter.

Filtering

Based on the TSSe plot the cells can be filtered by TSSe and fragment counts.

Hands On: Filter cells
SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:

“Method used for preprocessing”: Filter cell outliers based on counts and numbers of genes expressed, using 'pp.filter_cells'

param-file “Annotated data matrix”: Anndata 5k PBMC TSSe (output of metrics.tsse tool)

“Minimum number of counts required for a cell to pass filtering”: 5000

“Minimum TSS enrichemnt score required for a cell to pass filtering”: 10.0

“Maximum number of counts required for a cell to pass filtering”: 100000

Rename the generated file to Anndata 5k PBMC TSSe filtered or add the tag galaxy-tags filtered to the dataset
galaxy-eye Inspect the general information of the .h5ad output
Question
AnnData object with n_obs × n_vars = 4564 × 0
 obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse'
 uns: 'reference_sequences'
 obsm: 'fragment_paired'
How have the observations changed, compared to the first Anndata 5k PBMC AnnData file?

What does this tell us about the quality of the data?

There are only 4564 observations, compared to the initial 14232 observations.

And the obs: 'tsse' has been added (but already during metrics.tssetool)

The empty droplets and low-quality cells have been filtered out, leaving us with 4564 high-quality cells.

Feature selection

Currently, our AnnData matrix does not contain any variables. The variables will be added in the following step with the function pp.add_tile_matrix. This creates a cell-by-bin matrix containing insertion counts across genome-wide 500-bp bins.

After creating the variables, the most accessible features are selected.

Hands On: Select features
SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:

“Method used for preprocessing”: Generate cell by bin count matrix, using 'pp.add_tile_matrix'

param-file “Annotated data matrix”: Anndata 5k PBMC TSSe filtered (output of pp.filter_cells tool)

“The size of consecutive genomic regions used to record the counts”: 500

“The strategy to compute feature counts”: "insertion": based on the number of insertions that overlap with a region of interest

SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:

“Method used for preprocessing”: Perform feature selection, using 'pp.select_features'

param-file “Annotated data matrix”: Anndata tile_matrix (output of pp.add_tile_matrix tool)

“Number of features to keep”: 250000

Comment: Select features

Including more features improves resolution and can reveal finer details, but it may also introduce noise.

To optimize results, experiment with the n_features parameter to find the most appropriate value for your dataset.

To demonstrate the differences when selecting features, the following UMAP plots are the outputs from processing with a number of features between 1,000 and 500,000.

Fewer features result in fewer, but larger clusters. Selecting a lot of features will output more granular clusters and the compute time will increase.

Open image in new tab

Figure 2: UMAP plots with different selected features

At this step you can provide a blacklist or whitelist to specifically select relevant features.

For example the ENCODE Blacklist (Amemiya et al. 2019) can be applied here.

Rename the generated file to Anndata 5k PBMC select_features or add the tag galaxy-tags select_features to the dataset
galaxy-eye Inspect the general information of the .h5ad output
Question
AnnData object with n_obs × n_vars = 4564 × 6062095
 obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse'
 var: 'count', 'selected'
 uns: 'reference_sequences'
 obsm: 'fragment_paired'
How did n_vars change compared to Anndata 5k PBMC TSSe filtered

Where are the selected features stored in the count matrix?

There are 6,062,095 variables, compared to 0 in TSSe filtered.

Selected features are stored in var: 'selected'.

Doublet removal

Doublets are removed by calling a customized scrublet (Wolock et al. 2019) algorithm. pp.scrublet will identify potential doublets and the function pp.filter_doublets removes them.

During single-cell sequencing, multiple nuclei can be encapsulated into the same 10x gel bead. The resulting multiplets (>97% of which are doublets) produce sequences from both cells.

Doublets can confound the results by appearing as “new” clusters or artifactual intermediary cell states.

These problematic doublets are called neotypic doublets, since they appear as “new” cell types.

Scrublet (Single-cell Remover of Doublets) is an algorithm which can detect neotypic doublets that produce false results.

The algorithm first simulates doublets by combining random pairs of observed cell features.

The observed features of the “cells” are then compared to the simulated doublets and scored on their doublet probability.

SnapATAC2’s pp.filter_doublets then removes all cells with a doublet probability >50%.

Open image in new tab

Figure 3: Scrublet simulates expected doublets and produces doublet scores for each cell. (Wolock et al. 2019)

Hands On: Scrublet
SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:

“Method used for preprocessing”: Compute probability of being a doublet using the scrublet algorithm, using 'pp.scrublet'

param-file “Annotated data matrix”: Anndata 5k PBMC select_features (output of pp.select_features tool)

SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:

“Method used for preprocessing”: Remove doublets according to the doublet probability or doublet score, using 'pp.filter_doublets'

param-file “Annotated data matrix”: Anndata scrublet (output of pp.scrublet tool)

“Threshold for doublet probability”: 0.5

Rename the generated file to Anndata 5k PBMC filter_doublets or add the tag galaxy-tags filter_doublets to the dataset
galaxy-eye Inspect the general information of the .h5ad output
Question
AnnData object with n_obs × n_vars = 4430 × 6062095
 obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse', 'doublet_probability', 'doublet_score'
 var: 'count', 'selected'
 uns: 'doublet_rate', 'reference_sequences', 'scrublet_sim_doublet_score'
 obsm: 'fragment_paired'
What was removed by pp.filter_doublets?

Where are the new annotations stored?

Cell doublets were removed. n_obs was reduced from 4564 to 4430 cells.

The outputs of pp.scrublet are stored in observations obs: 'doublet_probability', 'doublet_score' and in unstructured annotations uns: 'scrublet_sim_doublet_score'. The output of pp.filter_doublets is stored in uns: 'doublet_rate'.

Dimension reduction

Dimension reduction is a very important step during the analysis of single cell data. During this, the complex multi-dimensional data is projected into lower-dimensional space, while the lower-dimensional embedding of the complex data retains as much information as possible. Dimension reduction enables batch correction, data visualization and quicker downstream analysis since the data is more simplified and the memory usage is reduced (Zhang et al. 2024).

Dimension reduction algorithms can be either linear or non-linear.

Linear methods are generally computationally efficient and well scalable.

A popular linear dimension reduction algorithm is:

PCA (Principle Component Analysis), implemented in Scanpy (please check out our Scanpy tutorial for an explanation).

Nonlinear methods however are well suited for multimodal and complex datasets.

in contrast to linear methods, which often preserve global structures, non-linear methods have a locality-preserving character.

This makes non-linear methods relatively insensitive to outliers and noise while emphasizing natural clusters in the data (Belkin and Niyogi 2003)

As such, they are implemented in many algorithms to visualize the data in 2 dimensions (f.ex. UMAP embedding).

The nonlinear dimension reduction algorithm, through matrix-free spectral embedding, used in SnapATAC2 is a very fast and memory efficient non-linear algorithm (Zhang et al. 2024).

Spectral embedding utilizes an iterative algorithm to calculate the spectrum (eigenvalues and eigenvectors) of a matrix without computing the matrix itself.

For a simple introduction into spectral embedding and how it compares to PCA, please check out the blog post “On Laplacian Eigenmaps for Dimensionality Reduction” by Juan Orduz.

Spectral embedding

The dimension reduction, produced by the algorithm tl.spectral, is required for later steps, such as plotting and clustering.

Hands On: Spectral embedding
SnapATAC2 Clustering ( Galaxy version 2.6.4+galaxy1) with the following parameters:

“Dimension reduction and Clustering”: Perform dimension reduction using Laplacian Eigenmap, using 'tl.spectral'

param-file “Annotated data matrix”: Anndata 5k PBMC filter_doublets (output of pp.filter_doublets tool)

“Distance metric”: cosine

Comment: Distance metric

The fast and well scalable “matrix-free spectral embedding” algorithm depends on the distance metric: cosine

Rename the generated file to Anndata 5k PBMC spectral or add the tag galaxy-tags spectral to the dataset
galaxy-eye Inspect the general information of the .h5ad output
Question
AnnData object with n_obs × n_vars = 4430 × 6062095
 obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse', 'doublet_probability', 'doublet_score'
 var: 'count', 'selected'
 uns: 'doublet_rate', 'reference_sequences', 'scrublet_sim_doublet_score', 'spectral_eigenvalue'
 obsm: 'fragment_paired', 'X_spectral'
Where are the new annotations stored?

The outputs of tl.spectral are stored in unstructured annotations uns: 'spectral_eigenvalue' and as multidimensional observations obsm: 'X_spectral'.

UMAP embedding

With the already reduced dimensionality of the data stored in X_spectral, the cells can be further embedded (i.e. transformed into lower dimensions) with Uniform Manifold Approximation and Projection (UMAP). UMAP projects the cells and their relationship to each other into 2-dimensional space, which can be easily visualized (McInnes et al. 2018).

Hands On: UMAP embedding

SnapATAC2 Clustering ( Galaxy version 2.6.4+galaxy1) with the following parameters:

“Dimension reduction and Clustering”: Compute Umap, using 'tl.umap'

param-file “Annotated data matrix”: Anndata 5k PBMC spectral (output of tl.spectral tool)

“Use the indicated representation in ‘.obsm’“: X_spectral

Rename the generated file to Anndata 5k PBMC UMAP or add the tag galaxy-tags UMAP to the dataset

Clustering

During clustering, cells that share similar accessibility profiles are organized into clusters. SnapATAC2 utilizes graph-based community clustering with the Leiden algorithm (Traag et al. 2019). This method takes the k-nearest neighbor (KNN) graph as input data and produces well-connected communities.

Community clustering

Hands On: Clustering analysis
SnapATAC2 Clustering ( Galaxy version 2.6.4+galaxy1) with the following parameters:

“Dimension reduction and Clustering”: Compute a neighborhood graph of observations, using 'pp.knn'

param-file “Annotated data matrix”: Anndata 5k PBMC UMAP (output of tl.umap tool)

SnapATAC2 Clustering ( Galaxy version 2.6.4+galaxy1) with the following parameters:

“Dimension reduction and Clustering”: Cluster cells into subgroups, using 'tl.leiden'

param-file “Annotated data matrix”: Anndata knn (output of pp.knn tool)

“Whether to use the Constant Potts Model (CPM) or modularity”: modularity

Comment: CPM or modularity

make sure you selected modularity

the clusters produced by CPM are not represented well in the UMAP projections

Rename the generated file to Anndata 5k PBMC leiden or add the tag galaxy-tags leiden to the dataset
galaxy-eye Inspect the general information of the .h5ad output
Question
AnnData object with n_obs × n_vars = 4564 × 6062095
 obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse', 'doublet_probability', 'doublet_score', 'leiden'
 var: 'count', 'selected'
 uns: 'doublet_rate', 'reference_sequences', 'scrublet_sim_doublet_score', 'spectral_eigenvalue'
 obsm: 'X_spectral', 'X_umap', 'fragment_paired'
 obsp: 'distances'
Where are the leiden clusters stored in the AnnData?

The clusters are stored in obs: 'leiden'

Plotting of clusters

Now that we have produced UMAP embeddings of our cells and have organized the cells into leiden clusters, we can now visualize this information with pl.umap.

Hands On: Plotting the clusters

SnapATAC2 Plotting ( Galaxy version 2.6.4+galaxy1) with the following parameters:

“Method used for plotting”: Plot the UMAP embedding, using 'pl.umap'

param-file “Annotated data matrix”: Anndata 5k PBMC leiden (output of tl.leiden tool)

“Color”: leiden

“Height of the plot”: 500

galaxy-eye Inspect the .png output

Question

How many leiden clusters were discovered?

What does the distance of clusters to each other tell us about their chromatin states?

There are 13 leiden clusters.

Clusters in close proximity (f.ex. clusters 0 and 5) share a similar chromatin accessibility profile (and very likely also a similar cell type), compared to a cluster further away (f.ex. cluster 9).

Cell cluster annotation

After clustering the cells, they must be annotated. This categorizes the clusters into known cell types. Manual Cell Annotation requires known marker genes and varying expression profiles of the marker genes among clusters.

Luckily, the marker genes for PBMCs are known (Sun et al. 2019). Marker genes for other single cell datasets can also be found in databases such as PanglaoDB (Franzén et al. 2019). Using the known marker genes we can now annotate our clusters manually.

Gene matrix

Since our data currently doesn’t contain gene information, we have to create a cell-by-gene activity matrix using the function pp.make_gene_matrix.

Hands On: Task description
SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:

“Method used for preprocessing”: Generate cell by gene activity matrix, using 'pp.make_gene_matrix'

param-file “Annotated data matrix”: Anndata 5k PBMC leiden (output of tl.leiden tool)

param-file “GTF/GFF file containing the gene annotation”: gene_annotation.gtf.gz (Input dataset)

Rename the generated file to Anndata 5k PBMC gene_matrix or add the tag galaxy-tags gene_matrix to the dataset

Please note that pp.make_gene_matrix removes all annotations except those stored in obs.

Therefore it might be necessary to remove propagating tags galaxy-tags (tags starting with #) from Anndata 5k PBMC gene_matrix.

Tags can be removed by expanding the dataset with a tag and clicking galaxy-cross next to the tag.
galaxy-eye Inspect the general information of the .h5ad output
Question
AnnData object with n_obs × n_vars = 4430 × 60606
 obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse', 'doublet_probability', 'doublet_score', 'leiden'
What does n_vars represent in Anndata 5k PBMC gene_matrix and what did it represent in Anndata 5k PBMC leiden?

The variables now represent accessible genes. There are 60606 accessible genes in our samples. In Anndata 5k PBMC leiden and all earlier AnnData the variables represented the 6062095 fixed-sized genomic bins.

Imputation with Scanpy and MAGIC

Similar to scRNA-seq data, the cell-by-gene-activity matrix is very sparse. Additionally, high gene variance between cells, due to technical confounders, could impact the downstream analysis. In scRNA-seq, filtering and normalization are therefore required to produce a high-quality gene matrix.

Since the cell-by-gene-activity matrix resembles the cell-by-gene-expression matrix of scRNA-seq, we can use the tools of the Scanpy (Wolf et al. 2018) tool suite to continue with our data.

The count matrices of single-cell data are sparse and noisy.

Confounding issues, such as “dropout” effects, where some mRNA or DNA-segments are not detected although they are present in the cell, also result in some cells missing important cell-type defining features.

These problems can obscure the data, as only the strongest gene-gene relationships are still detectable.

The Markov Affinity-based Graph Imputation of Cells (MAGIC) algorithm (van Dijk et al. 2018) tries to solve these issues by filling in missing data from some cells with transcript information from similar cells.

The algorithm calculates the likely gene expression of a single cell, based on similar cells and fills in the missing data to produce the expected expression.

MAGIC achieves this by building a graph from the data and using data diffusion to smooth out the noise.

Open image in new tab

Figure 4: MAGIC restores noisy and sparse single-cell data using diffusion geometry (van Dijk et al. 2018)

Hands On: Filter and normalize
Filter ( Galaxy version 1.9.6+galaxy3) with the following parameters:

param-file “Annotated data matrix”: Anndata 5k PBMC gene_matrix (output of pp.make_gene_matrix tool)

“Method used for filtering”: Filter genes based on number of cells or counts, using 'pp.filter_genes'

param-select “Filter”: Minimum number of cells expressed

“Minimum number of cells expressed required for a gene to pass filtering”: 5

Normalize ( Galaxy version 1.9.6+galaxy3) with the following parameters:

param-file “Annotated data matrix”: Anndata filter_genes (output of Filter tool)

“Method used for normalization”: Normalize counts per cell, using 'pp.normalize_total'

param-toggle “Exclude (very) highly expressed genes for the computation of the normalization factor (size factor) for each cell”: No

Inspect and manipulate ( Galaxy version 1.9.6+galaxy3) with the following parameters:

param-file “Annotated data matrix”: Anndata normalize (output of Normalize tool)

“Method used for inspecting”: Logarithmize the data matrix, using 'pp.log1p'

Normalize ( Galaxy version 1.9.6+galaxy3) with the following parameters:

param-file “Annotated data matrix”: Anndata log1p (output of log1p tool)

“Method used for normalization”: Denoising using Markov Affinity-based Graph Imputation of Cells (MAGIC) API 'external.pp.magic'

“Denoised genes to return”: All genes

“Which solver to use”: "approximate", is faster that performs imputation in the PCA space and then projects back to the gene space

Comment:

Choosing the setting Which solver to use: 'exact' will result in a output file with better resolution.

This is not necessary for our purposes, since the compute time also increases with this setting.

Rename the generated file to Anndata 5k PBMC magic or add the tag galaxy-tags magic to the dataset
galaxy-eye Inspect the general information of the .h5ad output
Question
AnnData object with n_obs × n_vars = 4430 × 55106
 obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse', 'doublet_probability', 'doublet_score', 'leiden', 'n_genes', 'n_counts'
 var: 'n_cells', 'n_counts'
 uns: 'log1p'
How did n_vars change, compared to Anndata 5k PBMC gene_matrix?

Which data was ‘lost’, compared to Anndata 5k PBMC leiden, and must be added to the file in order to produce UMAP plots?

The number of genes was reduced from 60606 to 55106 by the filtering. Additional annotations were added, such as: obs: 'n_genes', 'n_counts', var: 'n_cells', 'n_counts' and uns: 'log1p'.

The UMAP embeddings obsm: 'X_umap' are missing and should be added to the Anndata in the next step. Without X_umap it won’t be possible to visualize the plots.

Copy-over embeddings

Hands On: Copy UMAP embedding

AnnData Operations ( Galaxy version 1.9.3+galaxy0) with the following parameters:

param-file “Input object in hdf5 AnnData format”: Anndata 5k PBMC magic (output of external.pp.magic tool)

“Copy embeddings (such as UMAP, tSNE)”: Yes

“Keys from embeddings to copy”: X_umap

param-file “IAnnData objects with embeddings to copy”: Anndata 5k PBMC leiden

Comment: Annotations to copy

This tutorial only focuses on producing an UMAP plot with marker-genes.

If further analysis, with tools requiring more annotations, is intended, these can be added in a similar way as shown above.

f.ex. Peak and Motif Analysis with Snapatac2 peaks and motif ( Galaxy version 2.6.4+galaxy1) requires annotations from uns.

It is also possible to leave the input “Keys from embeddings to copy” empty, to copy all annotations of a given category such as obsm.

Rename the generated file to Anndata 5k PBMC magic UMAP or add the tag galaxy-tags UMAP to the dataset

galaxy-eye Inspect the general information of the .h5ad output, to check if obsm contains X_umap
  AnnData object with n_obs × n_vars = 4430 × 55106
   obs: obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse', 'doublet_probability', 'doublet_score', 'leiden', 'n_genes', 'n_counts'
   var: 'n_cells', 'n_counts'
   uns: 'log1p'
   obsm: 'X_umap'

Visualize gene activity of marker genes

The gene activity of selected marker genes can now be visualized with Scanpy.

Hands On: Plot marker genes

Plot with Scanpy ( Galaxy version 1.9.6+galaxy3) with the following parameters:

param-file “Annotated data matrix”: output_h5ad (output of AnnData Operations tool)

“Method used for plotting”: Embeddings: Scatter plot in UMAP basis, using 'pl.umap'

“Keys for annotations of observations/cells or variables/genes”: leiden, CD3D, CD8A, CD4, MS4A1, NKG7, CD14, FCER1A

param-toggle “Show edges?”: No

In “Plot attributes”

“Number of panels per row”: 2

Question

Are the marker genes selectively expressed in the clusters?

Which marker genes have overlapping expression profiles? And what could that imply?

Some marker genes, such as MS4A1 or CD8A, are only expressed in a few clusters (clusters 6+11 and clusters 1+8, respectively).

The marker gene CD3D is expressed in multiple clusters (1, 2, 4, 7 and 8). Overlapping expression profiles imply similar cell types since similar cell types have similar marker genes upregulated. In this case, CD3D expression classifies the cells in these clusters as T-cells.

Manual cluster annotation

Comparison of marker gene expression in our clusters with a table of canonical marker genes (Sun et al. 2019), enables us to annotate the clusters manually.

Cell type	Marker genes
CD8+ T cells	CD3D+, CD8A+, CD4-
CD4+ T cells	CD3D+, CD8A-, CD4+
B cells	CD3D-, MS4A1+
Natural killer (NK) cells	CD3D-, NKG7+
Monocytes	CD3D-, CD14+
Dendritic cells	CD3D-, FCER1A+

These canonical marker genes can match the clusters to known cell types:

Cluster	Cell type
0	Monocytes
1	CD8+ T cells
2	CD4+ T cells
3	Monocytes
4	CD4+ T cells
5	Monocytes
6	B cells
7	CD4+ T cells
8	CD8+ T cells
9	NK cells
10	Monocytes
11	B cells
12	Dendritic cells

Comment

Note that some clusters contain subtypes (f.ex. the annotated B cell clusters contain both naive and memory B cells). The cell-type annotation can be refined by choosing more specific marker genes.

To manually annotate the Leiden clusters, we will need to perform multiple steps:

Inspect the key-indexed observations of Anndata 5k PBMC gene_matrix magic UMAP
Cut the Leiden annotations out of the table
Upload a replace file containing the new cell type annotations for the Leiden clusters
Replace the values of the cluster annotation with cell type annotation
Add the cell type annotation to the AnnData
Plot the annotated cell types

Hands On: Manual annotation
Inspect AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:

param-file “Annotated data matrix”: Anndata 5k PBMC gene_matrix magic UMAP

“What to inspect?”: Key-indexed observations annotation

Rename the generated file to 5k PBMC observations

galaxy-eye Inspect the generated file

Question

In which column is the Leiden annotation located?

The Leiden annotation is in column 8.

Column 1 Column 2 Column 3 Column 4 Column 5 Column 6 Column 7 Column 8 Column 9 Column 10

”” n_fragment frac_dup frac_mito tsse doublet_probability doublet_score leiden n_genes n_counts

AAACGAAAGACGTCAG-1 22070 0.5219425551271499 0.0 30.43315066436454 0.004634433324822066 0.009276437847866418 8 52303 16521.599844068267

AAACGAAAGATTGACA-1 10500 0.5345125681606597 0.0 29.10551296093465 0.004668403569267374 0.001088139281828074 1 54501 15020.42495602328

AAACGAAAGGGTCCCT-1 19201 0.5101785714285714 0.0 19.90011850347046 0.004634433324822066 0.009276437847866418 5 54212 16294.751533305309

AAACGAACAATTGTGC-1 13242 0.487399837417257 0.0 29.060913705583758 0.004660125753854076 0.0022172949002217295 7 53530 15456.629863655084

Cut columns ( Galaxy version 1.0.2) with the following parameters:

param-select “Cut columns”: c8

param-file “From”: 5k PBMC observations (output of Inspect AnnData tool)
Create a new tabular file from the following
leiden cell_type
0 Monocytes
1 CD8_Tcells
2 CD4_Tcells
3 Monocytes
4 CD4_Tcells
5 Monocytes
6 Bcells
7 CD4_Tcells
8 CD8_Tcells
9 NKcells
10 Monocytes
11 Bcells
12 Dendritic_cells
Click galaxy-upload Upload Data at the top of the tool panel

Select galaxy-wf-edit Paste/Fetch Data at the bottom

Paste the file contents into the text field

Change the dataset name from “New File” to replace_file

Change Type from “Auto-detect” to tabular* From the Settings menu (galaxy-gear) select Convert spaces to tabs* Press Start and Close the window

The first column of the replace file contains the “old” annotations and the second column contains the “new” annotation.

warning Spaces in the new annotations can lead to errors. Please use underscores (_) instead.
Replace column ( Galaxy version 0.2) with the following parameters:

param-file “File in which you want to replace some values”: Cut columns leiden (output of Cut columns tool)

param-file “Replace information file”: replace_file

“Which column should be replaced?: Column: 1

Rename the generated file to Cell type annotation and galaxy-eye inspect the file to check if the replacement was successful

check if the datatype is set to tabular

you may need to change the datatype manually

Click on the galaxy-pencil pencil icon for the dataset to edit its attributes

In the central panel, click galaxy-chart-select-data Datatypes tab on the top

In the galaxy-chart-select-data Assign Datatype, select tabular from “New Type” dropdown

Tip: you can start typing the datatype into the field to filter the dropdown menu

Click the Save button

Manipulate AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:

param-file “Annotated data matrix”: Anndata 5k PBMC gene_matrix magic UMAP

“Function to manipulate the object”: Add new annotation(s) for observations or variables

“What to annotate”: Observations (obs)

param-file “Table with new annotations”: Cell type annotation(output of Replace column tool)

Rename the generated file to Anndata 5k PBMC gene_matrix magic cell_type or add the tag galaxy-tags cell_type to the dataset

Plot with Scanpy ( Galaxy version 1.9.6+galaxy3) with the following parameters:

param-file “Annotated data matrix”: Anndata 5k PBMC gene_matrix magic cell_type (output of Manipulate AnnData tool)

“Method used for plotting”: Embeddings: Scatter plot in UMAP basis, using 'pl.umap'

“Keys for annotations of observations/cells or variables/genes”: cell_type

param-toggle “Show edges?”: No

In “Plot attributes”

“Location of legend”: on data

param-toggle “Draw a frame around the scatter plot?”: Yes

galaxy-eye Inspect the .png output
Question

Are clusters with the same assigned cell type located close to each other?

Are the expected cell type percentages of PBMCs visible in the annotated plot?

Cell Type Expected Percentage

CD3+ T cells 50-70%

CD4+ T cells 25-45%

CD8+ T cells 10-25%

B cells 5-15%

NK cells <5%

Monocytes 10-30%

Dendritic cells <2%

Yes. B-cells are far away from NK and T cells. Only the myeloid lineage of monocytes and dendritic cells are located close to each other. This might be due to the common progenitor cell lineage and thus a similar chromatin profile.

Yes. The most common cell types are T cells, followed by Monocytes and B cells. NK cells and Dendritic cells only make up a small percentage of PBMCs.

Column 1	Column 2	Column 3	Column 4	Column 5	Column 6	Column 7	Column 8	Column 9	Column 10
””	n_fragment	frac_dup	frac_mito	tsse	doublet_probability	doublet_score	leiden	n_genes	n_counts
AAACGAAAGACGTCAG-1	22070	0.5219425551271499	0.0	30.43315066436454	0.004634433324822066	0.009276437847866418	8	52303	16521.599844068267
AAACGAAAGATTGACA-1	10500	0.5345125681606597	0.0	29.10551296093465	0.004668403569267374	0.001088139281828074	1	54501	15020.42495602328
AAACGAAAGGGTCCCT-1	19201	0.5101785714285714	0.0	19.90011850347046	0.004634433324822066	0.009276437847866418	5	54212	16294.751533305309
AAACGAACAATTGTGC-1	13242	0.487399837417257	0.0	29.060913705583758	0.004660125753854076	0.0022172949002217295	7	53530	15456.629863655084

Cell Type	Expected Percentage
CD3+ T cells	50-70%
CD4+ T cells	25-45%
CD8+ T cells	10-25%
B cells	5-15%
NK cells	<5%
Monocytes	10-30%
Dendritic cells	<2%

Conclusion

congratulations Well done, you’ve made it to the end! You might want to consult your results with this control history, or check out the full workflow for this tutorial.

In this tutorial, we produced a count matrix of scATAC-seq reads in the AnnData format and performed:

Preprocessing:
1. Plotting the fragment-size distributions
2. Calculating and plotting TSSe scores
3. Filtering cells and selecting features (fixed-size genomic bins)
Dimension reduction through Spectral embedding and UMAP
Clustering of cells via the Leiden method
Cluster annotation
1. Producing and filtering a cell by gene activity matrix
2. Data normalization and imputation with Scanpy and the MAGIC algorithm
3. Visualizing marker genes in the clusters
4. Manually annotating the cell types with selected marker genes

The scATAC-seq analysis can now continue with downstream analysis, for example differential peak analysis.

The SnapATAC2 tools for differential peak analysis are already accessible on Galaxy. However, there are no GTN trainings available yet. Until such a tutorial is uploaded, you can visit the SnapATAC2 documentation for a tutorial on differential peak analysis. And check out our example history and the exemplary workflow.

The tools are available in Galaxy under SnapATAC2 Peaks and Motif ( Galaxy version 2.6.4+galaxy1).

If you want to continue with differential peak analysis, please make sure that the AnnData object with the annotated cell types contains unspecified annotations for the reference sequences (uns: 'reference_sequences').

The section Copy-over embeddings explains how to copy annotations from one AnnData object to another.

You've Finished the Tutorial

Key points

Single-cell ATAC-seq can identify open chromatin sites

Dimension reduction is required to simplify the data while preserving important information about the relationships of cells to each other.

Clusters of similar cells can be annotated to their respective cell-types

Frequently Asked Questions

Have questions about this tutorial? Have a look at the available FAQ pages and support channels

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

References

Belkin, M., and P. Niyogi, 2003 Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 15: 1373–1396. 10.1162/089976603321780317
Kia, A., C. Gloeckner, T. Osothprarop, N. Gormley, E. Bomati et al., 2017 Improved genome sequencing using an engineered transposase. BMC Biotechnology 17: 10.1186/s12896-016-0326-1
McInnes, L., J. Healy, and J. Melville, 2018 UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. 10.48550/ARXIV.1802.03426 https://arxiv.org/abs/1802.03426
Wolf, F. A., P. Angerer, and F. J. Theis, 2018 SCANPY: large-scale single-cell gene expression data analysis. Genome Biology 19: 10.1186/s13059-017-1382-0
Dijk, D. van, R. Sharma, J. Nainys, K. Yim, P. Kathail et al., 2018 Recovering Gene Interactions from Single-Cell Data Using Data Diffusion. Cell 174: 716–729.e27. 10.1016/j.cell.2018.05.061
Amemiya, H. M., A. Kundaje, and A. P. Boyle, 2019 The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Scientific Reports 9: 10.1038/s41598-019-45839-z
Franzén, O., L.-M. Gan, and J. L. M. Björkegren, 2019 PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database 2019: 10.1093/database/baz046
Sun, Z., L. Chen, H. Xin, Y. Jiang, Q. Huang et al., 2019 A Bayesian mixture model for clustering droplet-based single-cell transcriptomic data from population studies. Nature Communications 10: 10.1038/s41467-019-09639-3
Traag, V. A., L. Waltman, and N. J. van Eck, 2019 From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports 9: 10.1038/s41598-019-41695-z
Wolock, S. L., R. Lopez, and A. M. Klein, 2019 Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. Cell Systems 8: 281–291.e9. 10.1016/j.cels.2018.11.005
Yan, F., D. R. Powell, D. J. Curtis, and N. C. Wong, 2020 From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis. Genome Biology 21: 10.1186/s13059-020-1929-3
Zhang, K., N. R. Zemke, E. J. Armand, and B. Ren, 2024 A fast, scalable and versatile tool for analysis of single-cell omics data. Nature Methods 21: 217–227. 10.1038/s41592-023-02139-9

Glossary

PBMCs: peripheral blood mononuclear cells
QC: quality control
TSS: transcription start sites
TSSe: transcription start site enrichment
UMAP: Uniform Manifold Approximation and Projection
scATAC-seq: Single-cell Assay for Transposase-Accessible Chromatin using sequencing

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Timon Schlegel, Single-cell ATAC-seq standard processing with SnapATAC2 (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scatac-standard-processing-snapatac2/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{single-cell-scatac-standard-processing-snapatac2,
author = "Timon Schlegel",
	title = "Single-cell ATAC-seq standard processing with SnapATAC2 (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scatac-standard-processing-snapatac2/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Funding

These individuals or organisations provided funding support for the development of this resource

EuroScienceGateway was funded by the European Union programme Horizon Europe (HORIZON-INFRA-2021-EOSC-01-04) under grant agreement number 101057388 and by UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee grant number 10038963.

de.KCD

Congratulations on successfully completing this tutorial!

Do you want to extend your knowledge?
Follow one of our recommended follow-up trainings:

tutorial Hands-on: Importing files from public atlases

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/single-cell/tutorials/scatac-standard-processing-snapatac2/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: replace_column_by_key_value_file
  owner: bgruening
  revisions: d533e4b75800
  tool_panel_section_label: Text Manipulation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: bwa
  owner: devteam
  revisions: 4a196b9c72c2
  tool_panel_section_label: Mapping
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: cut_columns
  owner: devteam
  revisions: cec635fab700
  tool_panel_section_label: Single Cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: fasta_compute_length
  owner: devteam
  revisions: 7d37cfda8e00
  tool_panel_section_label: FASTA/FASTQ
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: anndata_ops
  owner: ebi-gxa
  revisions: a4774b7b2e85
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: anndata_inspect
  owner: iuc
  revisions: 6f0d0c784f09
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: anndata_manipulate
  owner: iuc
  revisions: ed4996a16f7f
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_filter
  owner: iuc
  revisions: d636ce5cde16
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_inspect
  owner: iuc
  revisions: f54f0f0598ad
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_normalize
  owner: iuc
  revisions: d844935c906c
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_plot
  owner: iuc
  revisions: f7446fd8d5e7
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: snapatac2_clustering
  owner: iuc
  revisions: d3ea0ba3d066
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: snapatac2_peaks_and_motif
  owner: iuc
  revisions: 64fa083411f1
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: snapatac2_plotting
  owner: iuc
  revisions: 05bd4db20227
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: snapatac2_preprocessing
  owner: iuc
  revisions: c5c94e01a1b5
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/

No feedback has been recieved yet for this training. Be the first one by filling in the feedback form.