Single-cell ATAC-seq standard processing with SnapATAC2

Overview
Creative Commons License: CC-BY Questions:
  • What does ATAC-seq data tell us about the cell?

  • Which steps are necessary to cluster the cells of single-cell ATAC-seq data?

  • Why is dimension reduction important for the analysis of single-cell data?

Objectives:
  • Learn how single-cell ATAC-seq data is processed

  • Create a count-matrix from a 10X fragment file

  • Perform filtering, dimension reduction and clustering on AnnData matrices

  • Generate and filter a cell-by-gene matrix

  • Identify marker genes for the clusters and annotate the cell types

Requirements:
Time estimation: 4 hours
Level: Intermediate Intermediate
Supporting Materials:
Published: Jul 11, 2024
Last modification: Aug 9, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
purl PURL: https://gxy.io/GTN:T00445
version Revision: 2

Single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) analysis is a method to decipher the chromatin states of the analyzed cells. In general, genes are only expressed in accessible (i.e. “open”) chromatin and not in closed chromatin. By analyzing which genomic sites have an open chromatin state, cell-type specific patterns of gene accessibility can be determined. scATAC-seq is particularly useful for analyzing tissue containing different cell populations, such as peripheral blood mononuclear cells (PBMCs).

In this tutorial we will analyze scATAC-seq data using the tool suites SnapATAC2 (Zhang et al. 2024) and Scanpy (Wolf et al. 2018). With both of these tool suites we will perform preprocessing, clustering and identification of scATAC-seq datasets from 10x Genomics. The analysis will be performed using a dataset of PBMCs containing ~4,620 single nuclei.

Did you know we have a unique Single Cell Omics Lab with all our single cell tools highlighted to make it easier to use on Galaxy? We recommend this site for all your single cell analysis needs, particularly for newer users.

The Single Cell Omics Lab currently uses the main European Galaxy infrastructure and power, it’s just organised better for users of particular analyses…like single cell!

Try it out! All your histories/workflows/logins from the general European Galaxy server will be there!

Tools are frequently updated to new versions. Your Galaxy may have multiple versions of the same tool available. By default, you will be shown the latest version of the tool. This may NOT be the same tool used in the tutorial you are accessing. Furthermore, if you use a newer tool in one step, and try using an older tool in the next step… this may fail! To ensure you use the same tool versions of a given tutorial, use the Tutorial mode feature.

  • Open your Galaxy server
  • Click on the curriculum icon on the top menu, this will open the GTN inside Galaxy.
  • Navigate to your tutorial
  • Tool names in tutorials will be blue buttons that open the correct tool for you
  • Note: this does not work for all tutorials (yet) gif showing how GTN-in-Galaxy works
  • You can click anywhere in the grey-ed out area outside of the tutorial box to return back to the Galaxy analytical interface
Warning: Not all browsers work!
  • We’ve had some issues with Tutorial mode on Safari for Mac users.
  • Try a different browser if you aren’t seeing the button.

Comment

This tutorial is significantly based on “Standard pipeline” tutorial from SnapATAC2. It can be seen as the scATAC-seq counterpart to the scRNA-seq tutorial Clustering 3K PBMCs with Scanpy.

Agenda

In this tutorial, we will cover:

  1. scATAC-seq with 10X Genomics
  2. Data
    1. Get Data
    2. Creating a fragment file
  3. Preprocessing
    1. AnnData
    2. Import files to SnapATAC2
    3. Calculate and visualize QC metrics
    4. Filtering
    5. Feature selection
    6. Doublet removal
  4. Dimension reduction
    1. Spectral embedding
    2. UMAP embedding
  5. Clustering
    1. Community clustering
    2. Plotting of clusters
  6. Cell cluster annotation
    1. Gene matrix
    2. Imputation with Scanpy and MAGIC
    3. Copy-over embeddings
    4. Visualize gene activity of marker genes
    5. Manual cluster annotation
  7. Conclusion

scATAC-seq with 10X Genomics

ATAC-seq utilizes a hyperactive Tn5 transposase (Kia et al. 2017) to ligate adaptors to genome fragments, created by the transposase. Performing ATAC-seq on individual cells used to be an expensive and time-consuming labour. The 10X Chromium NextGEM system made scATAC-seq a cost-effective method for gaining high-resolution data with a simple protocol. After the transposition of nuclei in bulk, individual nuclei are put into Gel beads in Emulsion (GEM), containing unique 10x cell barcodes and sequencing adaptors for Illumina sequencing.

Library Preparation. Open image in new tab

Figure 1: An overview of the 10X single-nuclei ATAC-seq library preparation

Data

The 5k PBMCs dataset for this tutorial is available for free from 10X Genomics. The blood samples were collected from a healthy donor and were prepared following the Chromium Next GEM scATAC-seq protocol. After sequencing on Illumina NovaSeq, the reads were processed by the Cell Ranger ATAC 2.0.0 pipeline from 10X to generate a Fragments File.

The Fragments File is a tabular file in a BED-like format, containing information about the position of the fragments on the chromosome and their corresponding 10x cell barcodes.

SnapATAC2 requires 3 input files for the standard pathway of processing:

  • fragments_file.tsv: A tabular file containing the chromosome positions of the fragments and their corresponding 10x cell barcodes.
  • chrom_sizes.txt: A tabular file of the number of bases of each chromosome of the reference genome
  • gene_annotation.gtf.gz: A tabular file listing genomic features of the reference genome (GENCODE GRCh38)
  • A chromosome sizes file can be generated using the tool Compute sequence length ( Galaxy version 1.0.3).
  • The reference genome can either be selected from cached genomes or uploaded to the galaxy history.
Comment

Get Data

Hands-on: Data upload
  1. Create a new history for this tutorial
  2. Import the fragments_file.tsv, chrom_sizes.txt and gene_annotation.gtf.gz from Zenodo or from the shared data library

    https://zenodo.org/records/12707159/files/atac_pbmc_5k_nextgem_fragments.tsv.gz
    https://zenodo.org/records/12707159/files/chrom_sizes.txt
    https://zenodo.org/records/12707159/files/gencode.v46.annotation.gtf.gz
    
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    1. Go into Data (top panel) then Data libraries
    2. Navigate to the correct folder as indicated by your instructor.
      • On most Galaxies tutorial data will be provided in a folder named GTN - Material –> Topic Name -> Tutorial Name.
    3. Select the desired files
    4. Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu
    5. In the pop-up window, choose

      • “Select history”: the history you want to import the data to (or create a new one)
    6. Click on Import

  3. Rename the datasets
    • galaxy-pencil Rename the file atac_pbmc_5k_nextgem_fragments.tsv to fragments_file.tsv
    • galaxy-pencil Rename the file gencode.v46.annotation.gtf.gz to gene_annotation.gtf.gz
    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field
    • Click the Save button

  4. Inspect chrom_sizes and fragments_file
Question
  1. How many chromosomes are in chrom_sizes?
  2. In which column are the cell barcodes stored in the fragments_file?
  1. There are 25 chromosomes. The 22 autosomes (Chr. 1-22), both sex chromosomes (Chr. X and Y) and the small circular mitochondrial chromosome (Chr. M).
  2. The cell barcodes are unique 16 bp oligos, located in column 4.

Creating a fragment file

Hands-on: fragment file
  1. Import the file BAM_500-PBMC from Zenodo or from the shared data library
    https://zenodo.org/records/12707159/files/BAM_500-PBMC.bam
    
  2. SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
    • “Method used for preprocessing”: Convert a BAM file to a fragment file, using 'pp.make_fragment_file'
      • param-file “File name of the BAM file”: BAM_500-PBMC (Input dataset)
      • param-toggle “Indicate whether the BAM file contain paired-end reads”: Yes
      • “How to extract barcodes from BAM records?”: From read names using regular expressions
        • “Extract barcodes from read names of BAM records using regular expressions”: (................):
    Comment
    • Not every regular expression type is supported.
    • This expression selects 16 characters if they are followed by a colon. Only the cell barcodes of the BAM file will match.
  3. Rename the generated file to Fragments 500 PBMC
  4. Now you can continue with either the fragments_file from earlier or the new file Fragments 500 PBMC.
    • galaxy-info The tool pp.make_fragment_file tool has implemented additional quality control (QC) measures. These filter out larger fragments, which will be noticeable in the log-scale fragment size distribution.
    • galaxy-info Please note that Fragments 500 PBMC only contains 500 PBMCs and thus the clustering will produce different outputs compared to the outputs generated by fragments_file (with 5k PBMC).

Preprocessing

Preprocessing of the scATAC-seq data contained in the fragment file with SnapATAC2 begins with importing the files and computing basic QC metrics.

SnapATAC2 compresses and stores the fragments into an AnnData object.

AnnData

The AnnData format was initially developed for the Scanpy package and is now a widely accepted data format to store annotated data matrices in a space-efficient manner.

Anndata format

Import files to SnapATAC2

Hands-on: Create an AnnData object
  1. SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
    • “Method used for preprocessing”: Import data fragment files and compute basic QC metrics, using 'pp.import_data'
      • param-file “Fragment file, optionally compressed with gzip or zstd”: fragments_file.tsv (Input dataset)
      • param-file “A tabular file containing chromosome names and sizes”: chrom_sizes.txt (Input dataset)
      • param-toggle “Whether the fragment file has been sorted by cell barcodes”: No
    • This tool requires the fragment file to be sorted according to cell barcodes.
    • If pp.make_fragment_file tool was used to generate the fragment file, this has automatically been done.
      • Otherwise, the setting “sorted by cell barcodes” should remain No.
  2. Rename the generated file to Anndata 5k PBMC

  3. Check that the format is h5ad

Because the AnnData format is an extension of the HDF5 format, i.e. a binary format, an AnnData object can not be inspected directly in Galaxy by clicking on the galaxy-eye (View data) icon. Instead, we need to use a dedicated tool from the AnnData suite.

Hands-on: Inspect an AnnData object
  1. Inspect AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
    • param-file “Annotated data matrix”: Anndata 5k PBMC
    • “What to inspect?”: General information about the object
  2. galaxy-eye Inspect the generated file

    Question
    AnnData object with n_obs × n_vars = 14232 × 0
     obs: 'n_fragment', 'frac_dup', 'frac_mito'
     uns: 'reference_sequences'
     obsm: 'fragment_paired'
    
    1. How many observations are there? What do they represent?
    2. How many variables are there?
    1. There are 14232 observations, representing the cells.
    2. There are 0 variables, representing genomic regions. This is because genome-wide 500-bp bins are only added after initial filtering.
    • Many toolsets producing outputs in AnnData formats in Galaxy, provide the general information by default:
      • Click on the name of the dataset in the history to expand it.
      • General Anndata information will be given in the expanded box:

        e.g.

        [n_obs x n_vars]
        -    14232 × 0
        
      • details This feature isn’t the most reliable and might display:

        [n_obs x n_vars]
        -    1 × 1
        
    • In such cases and for more specific queries, Inspect AnnData ( Galaxy version 0.10.3+galaxy0) is required.

Calculate and visualize QC metrics

Hands-on: Fragment-size distribution
  1. SnapATAC2 Plotting ( Galaxy version 2.6.4+galaxy1) with the following parameters:
    • “Method used for plotting”: Plot fragment size distribution, using 'pl.frag_size_distr'
      • param-file “Annotated data matrix”: Anndata 5k PBMC (output of pp.import_data tool)
  2. galaxy-eye Inspect the .png output
  3. SnapATAC2 Plotting ( Galaxy version 2.6.4+galaxy1) with the following parameters:
    • “Method used for plotting”: Plot fragment size distribution, using 'pl.frag_size_distr'
      • param-file “Annotated data matrix”: Anndata 5k PBMC (output of pp.import_data tool)
      • “Change the y-axis (fragment counts) to log scale”: Yes
  4. galaxy-eye Inspect the .png output

fragment_size_distribution.

log_fragment_size_distribution.

Question
  1. What distinct features do the plots have? And what do they represent?
  2. Which fragments are generally from open chromatin?
  1. 3 peaks are clearly visible (at <100-bp, ~200-bp and ~400-bp). The smallest fragments are from nucleosome-free regions, while the larger peaks of 200- and 400-bp contain mono- and di-nucleosome fragments, respectively.
  2. The small fragments (<100-bp) are from open chromatin reads, since the Tn5 transposase could easily access the loosely packed DNA (Yan et al. 2020).

The transcription start site enrichment (TSSe) is another important QC metric. Nucleosome-free fragments are expected to be enriched at transcription start sites (TSS). TSSe shows increased fragmentation of chromatin around the TSS. This suggests open and accessible nucleosome-free chromatin.

TSSe is used as a QC metric, since an increased enrichment around TSS regions suggests that the experiment has captured biological meaningful genomic features. TSSe scores of individual cells can be calculated using SnapATAC2’s metrics.tsse function.

Hands-on: Calculate and Plot TSSe
  1. SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
    • “Method used for preprocessing”: Compute the TSS enrichment score (TSSe) for each cell, using 'metrics.tsse'
      • param-file “Annotated data matrix”: Anndata 5k PBMC (output of pp.import_data tool)
      • param-file “GTF/GFF file containing the gene annotation”: gene_annotation.gtf.gz (Input dataset)
  2. Rename the generated file to Anndata 5k PBMC TSSe or add the tag galaxy-tags TSSe to the dataset:

    Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

    To tag a dataset:

    1. Click on the dataset to expand it
    2. Click on Add Tags galaxy-tags
    3. Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
    4. Press Enter
    5. Check that the tag appears below the dataset name

    Tags beginning with # are special!

    They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parenthesis correspond to dataset numbers in the figure below):

    1. a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
    2. dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
    3. datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
    4. datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

    A history without name tags versus history with name tags

    Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

    The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

    More information is in a dedicated #nametag tutorial.

  3. SnapATAC2 Plotting ( Galaxy version 2.6.4+galaxy1) with the following parameters:
    • “Method used for plotting”: Plot the TSS enrichment vs. number of fragments density figure, using 'pl.tsse'
      • param-file “Annotated data matrix”: Anndata 5k PBMC TSSe (output of metrics.tsse tool)
  4. galaxy-eye Inspect the .png output

TSSe plot against number of unique fragments. High-quality cells can be identified in the plot of TSSe scores against a number of unique fragments for each cell.

Question
  1. Where are high-quality cells located in the plot?
  2. Based on this plot, how should the filter be set?
  1. The cells in the upper right are high-quality cells, enriched for TSS. Fragments in the lower left represent low-quality cells or empty droplets and should be filtered out.
  2. Setting the minimum number of counts at 5,000 and the minimum TSS enrichment to 10.0 is an adequate filter.

Filtering

Based on the TSSe plot the cells can be filtered by TSSe and fragment counts.

Hands-on: Filter cells
  1. SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
    • “Method used for preprocessing”: Filter cell outliers based on counts and numbers of genes expressed, using 'pp.filter_cells'
      • param-file “Annotated data matrix”: Anndata 5k PBMC TSSe (output of metrics.tsse tool)
      • “Minimum number of counts required for a cell to pass filtering”: 5000
      • “Minimum TSS enrichemnt score required for a cell to pass filtering”: 10.0
      • “Maximum number of counts required for a cell to pass filtering”: 100000
  2. Rename the generated file to Anndata 5k PBMC TSSe filtered or add the tag galaxy-tags filtered to the dataset
  3. galaxy-eye Inspect the general information of the .h5ad output

    Question
    AnnData object with n_obs × n_vars = 4564 × 0
     obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse'
     uns: 'reference_sequences'
     obsm: 'fragment_paired'
    
    1. How have the observations changed, compared to the first Anndata 5k PBMC AnnData file?
    2. What does this tell us about the quality of the data?
    1. There are only 4564 observations, compared to the initial 14232 observations.

      And the obs: 'tsse' has been added (but already during metrics.tssetool)

    2. The empty droplets and low-quality cells have been filtered out, leaving us with 4564 high-quality cells.

Feature selection

Currently, our AnnData matrix does not contain any variables. The variables will be added in the following step with the function pp.add_tile_matrix. This creates a cell-by-bin matrix containing insertion counts across genome-wide 500-bp bins.

After creating the variables, the most accessible features are selected.

Hands-on: Select features
  1. SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
    • “Method used for preprocessing”: Generate cell by bin count matrix, using 'pp.add_tile_matrix'
      • param-file “Annotated data matrix”: Anndata 5k PBMC TSSe filtered (output of pp.filter_cells tool)
      • “The size of consecutive genomic regions used to record the counts”: 500
      • “The strategy to compute feature counts”: "insertion": based on the number of insertions that overlap with a region of interest
  2. SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
    • “Method used for preprocessing”: Perform feature selection, using 'pp.select_features'
      • param-file “Annotated data matrix”: Anndata tile_matrix (output of pp.add_tile_matrix tool)
      • “Number of features to keep”: 250000
    Comment: Select features
    • Including more features improves resolution and can reveal finer details, but it may also introduce noise.
      • To optimize results, experiment with the n_features parameter to find the most appropriate value for your dataset.
    • To demonstrate the differences when selecting features, the following UMAP plots are the outputs from processing with a number of features between 1,000 and 500,000.
    • Fewer features result in fewer, but larger clusters. Selecting a lot of features will output more granular clusters and the compute time will increase.
    Different number of features UMAP. Open image in new tab

    Figure 2: UMAP plots with different selected features
  3. Rename the generated file to Anndata 5k PBMC select_features or add the tag galaxy-tags select_features to the dataset
  4. galaxy-eye Inspect the general information of the .h5ad output

    Question
    AnnData object with n_obs × n_vars = 4564 × 6062095
     obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse'
     var: 'count', 'selected'
     uns: 'reference_sequences'
     obsm: 'fragment_paired'
    
    1. How did n_vars change compared to Anndata 5k PBMC TSSe filtered
    2. Where are the selected features stored in the count matrix?
    1. There are 6,062,095 variables, compared to 0 in TSSe filtered.
    2. Selected features are stored in var: 'selected'.

Doublet removal

Doublets are removed by calling a customized scrublet (Wolock et al. 2019) algorithm. pp.scrublet will identify potential doublets and the function pp.filter_doublets removes them.

  • During single-cell sequencing, multiple nuclei can be encapsulated into the same 10x gel bead. The resulting multiplets (>97% of which are doublets) produce sequences from both cells.
  • Doublets can confound the results by appearing as “new” clusters or artifactual intermediary cell states.
    • These problematic doublets are called neotypic doublets, since they appear as “new” cell types.
  • Scrublet (Single-cell Remover of Doublets) is an algorithm which can detect neotypic doublets that produce false results.
    • The algorithm first simulates doublets by combining random pairs of observed cell features.
    • The observed features of the “cells” are then compared to the simulated doublets and scored on their doublet probability.
    • SnapATAC2’s pp.filter_doublets then removes all cells with a doublet probability >50%.
Doublet removal with scrublet. Open image in new tab

Figure 3: Scrublet simulates expected doublets and produces doublet scores for each cell. (Wolock et al. 2019)
Hands-on: Scrublet
  1. SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
    • “Method used for preprocessing”: Compute probability of being a doublet using the scrublet algorithm, using 'pp.scrublet'
      • param-file “Annotated data matrix”: Anndata 5k PBMC select_features (output of pp.select_features tool)
  2. SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
    • “Method used for preprocessing”: Remove doublets according to the doublet probability or doublet score, using 'pp.filter_doublets'
      • param-file “Annotated data matrix”: Anndata scrublet (output of pp.scrublet tool)
      • “Threshold for doublet probability”: 0.5
  3. Rename the generated file to Anndata 5k PBMC filter_doublets or add the tag galaxy-tags filter_doublets to the dataset
  4. galaxy-eye Inspect the general information of the .h5ad output

    Question
    AnnData object with n_obs × n_vars = 4430 × 6062095
     obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse', 'doublet_probability', 'doublet_score'
     var: 'count', 'selected'
     uns: 'doublet_rate', 'reference_sequences', 'scrublet_sim_doublet_score'
     obsm: 'fragment_paired'
    
    1. What was removed by pp.filter_doublets?
    2. Where are the new annotations stored?
    1. Cell doublets were removed. n_obs was reduced from 4564 to 4430 cells.
    2. The outputs of pp.scrublet are stored in observations obs: 'doublet_probability', 'doublet_score' and in unstructured annotations uns: 'scrublet_sim_doublet_score'. The output of pp.filter_doublets is stored in uns: 'doublet_rate'.

Dimension reduction

Dimension reduction is a very important step during the analysis of single cell data. During this, the complex multi-dimensional data is projected into lower-dimensional space, while the lower-dimensional embedding of the complex data retains as much information as possible. Dimension reduction enables batch correction, data visualization and quicker downstream analysis since the data is more simplified and the memory usage is reduced (Zhang et al. 2024).

  • Dimension reduction algorithms can be either linear or non-linear.
  • Linear methods are generally computationally efficient and well scalable.

    A popular linear dimension reduction algorithm is:

    • PCA (Principle Component Analysis), implemented in Scanpy (please check out our Scanpy tutorial for an explanation).
  • Nonlinear methods however are well suited for multimodal and complex datasets.
    • in contrast to linear methods, which often preserve global structures, non-linear methods have a locality-preserving character.
    • This makes non-linear methods relatively insensitive to outliers and noise while emphasizing natural clusters in the data (Belkin and Niyogi 2003)
    • As such, they are implemented in many algorithms to visualize the data in 2 dimensions (f.ex. UMAP embedding).
  • The nonlinear dimension reduction algorithm, through matrix-free spectral embedding, used in SnapATAC2 is a very fast and memory efficient non-linear algorithm (Zhang et al. 2024).
    • Spectral embedding utilizes an iterative algorithm to calculate the spectrum (eigenvalues and eigenvectors) of a matrix without computing the matrix itself.
  • For a simple introduction into spectral embedding and how it compares to PCA, please check out the blog post “On Laplacian Eigenmaps for Dimensionality Reduction” by Juan Orduz.

Spectral embedding

The dimension reduction, produced by the algorithm tl.spectral, is required for later steps, such as plotting and clustering.

Hands-on: Spectral embedding
  1. SnapATAC2 Clustering ( Galaxy version 2.6.4+galaxy1) with the following parameters:
    • “Dimension reduction and Clustering”: Perform dimension reduction using Laplacian Eigenmap, using 'tl.spectral'
      • param-file “Annotated data matrix”: Anndata 5k PBMC filter_doublets (output of pp.filter_doublets tool)
      • “Distance metric”: cosine
    Comment: Distance metric
    • The fast and well scalable “matrix-free spectral embedding” algorithm depends on the distance metric: cosine
  2. Rename the generated file to Anndata 5k PBMC spectral or add the tag galaxy-tags spectral to the dataset
  3. galaxy-eye Inspect the general information of the .h5ad output

    Question
    AnnData object with n_obs × n_vars = 4430 × 6062095
     obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse', 'doublet_probability', 'doublet_score'
     var: 'count', 'selected'
     uns: 'doublet_rate', 'reference_sequences', 'scrublet_sim_doublet_score', 'spectral_eigenvalue'
     obsm: 'fragment_paired', 'X_spectral'
    

    Where are the new annotations stored?

    The outputs of tl.spectral are stored in unstructured annotations uns: 'spectral_eigenvalue' and as multidimensional observations obsm: 'X_spectral'.

UMAP embedding

With the already reduced dimensionality of the data stored in X_spectral, the cells can be further embedded (i.e. transformed into lower dimensions) with Uniform Manifold Approximation and Projection (UMAP). UMAP projects the cells and their relationship to each other into 2-dimensional space, which can be easily visualized (McInnes et al. 2018).

Hands-on: UMAP embedding
  1. SnapATAC2 Clustering ( Galaxy version 2.6.4+galaxy1) with the following parameters:
    • “Dimension reduction and Clustering”: Compute Umap, using 'tl.umap'
      • param-file “Annotated data matrix”: Anndata 5k PBMC spectral (output of tl.spectral tool)
      • “Use the indicated representation in ‘.obsm’“: X_spectral
  2. Rename the generated file to Anndata 5k PBMC UMAP or add the tag galaxy-tags UMAP to the dataset

Clustering

During clustering, cells that share similar accessibility profiles are organized into clusters. SnapATAC2 utilizes graph-based community clustering with the Leiden algorithm (Traag et al. 2019). This method takes the k-nearest neighbor (KNN) graph as input data and produces well-connected communities.

Community clustering

Hands-on: Clustering analysis
  1. SnapATAC2 Clustering ( Galaxy version 2.6.4+galaxy1) with the following parameters:
    • “Dimension reduction and Clustering”: Compute a neighborhood graph of observations, using 'pp.knn'
      • param-file “Annotated data matrix”: Anndata 5k PBMC UMAP (output of tl.umap tool)
  2. SnapATAC2 Clustering ( Galaxy version 2.6.4+galaxy1) with the following parameters:
    • “Dimension reduction and Clustering”: Cluster cells into subgroups, using 'tl.leiden'
      • param-file “Annotated data matrix”: Anndata knn (output of pp.knn tool)
      • “Whether to use the Constant Potts Model (CPM) or modularity”: modularity
    Comment: CPM or modularity
    • make sure you selected modularity
    • the clusters produced by CPM are not represented well in the UMAP projections
  3. Rename the generated file to Anndata 5k PBMC leiden or add the tag galaxy-tags leiden to the dataset
  4. galaxy-eye Inspect the general information of the .h5ad output

    Question
    AnnData object with n_obs × n_vars = 4564 × 6062095
     obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse', 'doublet_probability', 'doublet_score', 'leiden'
     var: 'count', 'selected'
     uns: 'doublet_rate', 'reference_sequences', 'scrublet_sim_doublet_score', 'spectral_eigenvalue'
     obsm: 'X_spectral', 'X_umap', 'fragment_paired'
     obsp: 'distances'
    

    Where are the leiden clusters stored in the AnnData?

    The clusters are stored in obs: 'leiden'

Plotting of clusters

Now that we have produced UMAP embeddings of our cells and have organized the cells into leiden clusters, we can now visualize this information with pl.umap.

Hands-on: Plotting the clusters
  1. SnapATAC2 Plotting ( Galaxy version 2.6.4+galaxy1) with the following parameters:
    • “Method used for plotting”: Plot the UMAP embedding, using 'pl.umap'
      • param-file “Annotated data matrix”: Anndata 5k PBMC leiden (output of tl.leiden tool)
      • “Color”: leiden
      • “Height of the plot”: 500
  2. galaxy-eye Inspect the .png output

umap_leiden_clustering.

Question
  1. How many leiden clusters were discovered?
  2. What does the distance of clusters to each other tell us about their chromatin states?
  1. There are 13 leiden clusters.
  2. Clusters in close proximity (f.ex. clusters 0 and 5) share a similar chromatin accessibility profile (and very likely also a similar cell type), compared to a cluster further away (f.ex. cluster 9).

Cell cluster annotation

After clustering the cells, they must be annotated. This categorizes the clusters into known cell types. Manual Cell Annotation requires known marker genes and varying expression profiles of the marker genes among clusters.

Luckily, the marker genes for PBMCs are known (Sun et al. 2019). Marker genes for other single cell datasets can also be found in databases such as PanglaoDB (Franzén et al. 2019). Using the known marker genes we can now annotate our clusters manually.

Gene matrix

Since our data currently doesn’t contain gene information, we have to create a cell-by-gene activity matrix using the function pp.make_gene_matrix.

Hands-on: Task description
  1. SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
    • “Method used for preprocessing”: Generate cell by gene activity matrix, using 'pp.make_gene_matrix'
      • param-file “Annotated data matrix”: Anndata 5k PBMC leiden (output of tl.leiden tool)
      • param-file “GTF/GFF file containing the gene annotation”: gene_annotation.gtf.gz (Input dataset)
  2. Rename the generated file to Anndata 5k PBMC gene_matrix or add the tag galaxy-tags gene_matrix to the dataset
    • Please note that pp.make_gene_matrix removes all annotations except those stored in obs.
    • Therefore it might be necessary to remove propagating tags galaxy-tags (tags starting with #) from Anndata 5k PBMC gene_matrix.
      • Tags can be removed by expanding the dataset with a tag and clicking galaxy-cross next to the tag.
  3. galaxy-eye Inspect the general information of the .h5ad output

    Question
    AnnData object with n_obs × n_vars = 4430 × 60606
     obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse', 'doublet_probability', 'doublet_score', 'leiden'
    

    What does n_vars represent in Anndata 5k PBMC gene_matrix and what did it represent in Anndata 5k PBMC leiden?

    • The variables now represent accessible genes. There are 60606 accessible genes in our samples. In Anndata 5k PBMC leiden and all earlier AnnData the variables represented the 6062095 fixed-sized genomic bins.

Imputation with Scanpy and MAGIC

Similar to scRNA-seq data, the cell-by-gene-activity matrix is very sparse. Additionally, high gene variance between cells, due to technical confounders, could impact the downstream analysis. In scRNA-seq, filtering and normalization are therefore required to produce a high-quality gene matrix.

Since the cell-by-gene-activity matrix resembles the cell-by-gene-expression matrix of scRNA-seq, we can use the tools of the Scanpy (Wolf et al. 2018) tool suite to continue with our data.

  • The count matrices of single-cell data are sparse and noisy.
  • Confounding issues, such as “dropout” effects, where some mRNA or DNA-segments are not detected although they are present in the cell, also result in some cells missing important cell-type defining features.
    • These problems can obscure the data, as only the strongest gene-gene relationships are still detectable.
  • The Markov Affinity-based Graph Imputation of Cells (MAGIC) algorithm (van Dijk et al. 2018) tries to solve these issues by filling in missing data from some cells with transcript information from similar cells.
    • The algorithm calculates the likely gene expression of a single cell, based on similar cells and fills in the missing data to produce the expected expression.
      • MAGIC achieves this by building a graph from the data and using data diffusion to smooth out the noise.
Imputation with the MAGIC algorithm. Open image in new tab

Figure 4: MAGIC restores noisy and sparse single-cell data using diffusion geometry (van Dijk et al. 2018)
Hands-on: Filter and normalize
  1. Filter ( Galaxy version 1.9.6+galaxy3) with the following parameters:
    • param-file “Annotated data matrix”: Anndata 5k PBMC gene_matrix (output of pp.make_gene_matrix tool)
    • “Method used for filtering”: Filter genes based on number of cells or counts, using 'pp.filter_genes'
      • param-select “Filter”: Minimum number of cells expressed
        • “Minimum number of cells expressed required for a gene to pass filtering”: 5
  2. Normalize ( Galaxy version 1.9.6+galaxy3) with the following parameters:
    • param-file “Annotated data matrix”: Anndata filter_genes (output of Filter tool)
    • “Method used for normalization”: Normalize counts per cell, using 'pp.normalize_total'
    • param-toggle “Exclude (very) highly expressed genes for the computation of the normalization factor (size factor) for each cell”: No
  3. Inspect and manipulate ( Galaxy version 1.9.6+galaxy3) with the following parameters:
    • param-file “Annotated data matrix”: Anndata normalize (output of Normalize tool)
    • “Method used for inspecting”: Logarithmize the data matrix, using 'pp.log1p'
  4. Normalize ( Galaxy version 1.9.6+galaxy3) with the following parameters:
    • param-file “Annotated data matrix”: Anndata log1p (output of log1p tool)
    • “Method used for normalization”: Denoising using Markov Affinity-based Graph Imputation of Cells (MAGIC) API 'external.pp.magic'
      • “Denoised genes to return”: All genes
      • “Which solver to use”: "approximate", is faster that performs imputation in the PCA space and then projects back to the gene space
    Comment:
    • Choosing the setting Which solver to use: 'exact' will result in a output file with better resolution.
    • This is not necessary for our purposes, since the compute time also increases with this setting.
  5. Rename the generated file to Anndata 5k PBMC magic or add the tag galaxy-tags magic to the dataset

  6. galaxy-eye Inspect the general information of the .h5ad output

    Question
    AnnData object with n_obs × n_vars = 4430 × 55106
     obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse', 'doublet_probability', 'doublet_score', 'leiden', 'n_genes', 'n_counts'
     var: 'n_cells', 'n_counts'
     uns: 'log1p'
    
    1. How did n_vars change, compared to Anndata 5k PBMC gene_matrix?
    2. Which data was ‘lost’, compared to Anndata 5k PBMC leiden, and must be added to the file in order to produce UMAP plots?
    1. The number of genes was reduced from 60606 to 55106 by the filtering. Additional annotations were added, such as: obs: 'n_genes', 'n_counts', var: 'n_cells', 'n_counts' and uns: 'log1p'.
    2. The UMAP embeddings obsm: 'X_umap' are missing and should be added to the Anndata in the next step. Without X_umap it won’t be possible to visualize the plots.

Copy-over embeddings

Hands-on: Copy UMAP embedding
  1. AnnData Operations ( Galaxy version 1.9.3+galaxy0) with the following parameters:
    • param-file “Input object in hdf5 AnnData format”: Anndata 5k PBMC magic (output of external.pp.magic tool)
    • “Copy embeddings (such as UMAP, tSNE)”: Yes
      • “Keys from embeddings to copy”: X_umap
      • param-file “IAnnData objects with embeddings to copy”: Anndata 5k PBMC leiden
    Comment: Annotations to copy
    • This tutorial only focuses on producing an UMAP plot with marker-genes.
    • If further analysis, with tools requiring more annotations, is intended, these can be added in a similar way as shown above.
      • f.ex. Peak and Motif Analysis with Snapatac2 peaks and motif ( Galaxy version 2.6.4+galaxy1) requires annotations from uns.
    • It is also possible to leave the input “Keys from embeddings to copy” empty, to copy all annotations of a given category such as obsm.
  2. Rename the generated file to Anndata 5k PBMC magic UMAP or add the tag galaxy-tags UMAP to the dataset

  3. galaxy-eye Inspect the general information of the .h5ad output, to check if obsm contains X_umap
  AnnData object with n_obs × n_vars = 4430 × 55106
   obs: obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse', 'doublet_probability', 'doublet_score', 'leiden', 'n_genes', 'n_counts'
   var: 'n_cells', 'n_counts'
   uns: 'log1p'
   obsm: 'X_umap'

Visualize gene activity of marker genes

The gene activity of selected marker genes can now be visualized with Scanpy.

Hands-on: Plot marker genes
  1. Plot with Scanpy ( Galaxy version 1.9.6+galaxy3) with the following parameters:
    • param-file “Annotated data matrix”: output_h5ad (output of AnnData Operations tool)
    • “Method used for plotting”: Embeddings: Scatter plot in UMAP basis, using 'pl.umap'
      • “Keys for annotations of observations/cells or variables/genes”: leiden, CD3D, CD8A, CD4, MS4A1, NKG7, CD14, FCER1A
      • param-toggle “Show edges?”: No
      • In “Plot attributes”
        • “Number of panels per row”: 2

umap_leiden_marker_gene_clustering.

Question
  1. Are the marker genes selectively expressed in the clusters?
  2. Which marker genes have overlapping expression profiles? And what could that imply?
  1. Some marker genes, such as MS4A1 or CD8A, are only expressed in a few clusters (clusters 6+11 and clusters 1+8, respectively).
  2. The marker gene CD3D is expressed in multiple clusters (1, 2, 4, 7 and 8). Overlapping expression profiles imply similar cell types since similar cell types have similar marker genes upregulated. In this case, CD3D expression classifies the cells in these clusters as T-cells.

Manual cluster annotation

Comparison of marker gene expression in our clusters with a table of canonical marker genes (Sun et al. 2019), enables us to annotate the clusters manually.

Cell type Marker genes
CD8+ T cells CD3D+, CD8A+, CD4-
CD4+ T cells CD3D+, CD8A-, CD4+
B cells CD3D-, MS4A1+
Natural killer (NK) cells CD3D-, NKG7+
Monocytes CD3D-, CD14+
Dendritic cells CD3D-, FCER1A+

These canonical marker genes can match the clusters to known cell types:

Cluster Cell type
0 Monocytes
1 CD8+ T cells
2 CD4+ T cells
3 Monocytes
4 CD4+ T cells
5 Monocytes
6 B cells
7 CD4+ T cells
8 CD8+ T cells
9 NK cells
10 Monocytes
11 B cells
12 Dendritic cells
Comment

Note that some clusters contain subtypes (f.ex. the annotated B cell clusters contain both naive and memory B cells). The cell-type annotation can be refined by choosing more specific marker genes.

To manually annotate the Leiden clusters, we will need to perform multiple steps:

  1. Inspect the key-indexed observations of Anndata 5k PBMC gene_matrix magic UMAP
  2. Cut the Leiden annotations out of the table
  3. Upload a replace file containing the new cell type annotations for the Leiden clusters
  4. Replace the values of the cluster annotation with cell type annotation
  5. Add the cell type annotation to the AnnData
  6. Plot the annotated cell types
Hands-on: Manual annotation
  1. Inspect AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
    • param-file “Annotated data matrix”: Anndata 5k PBMC gene_matrix magic UMAP
    • “What to inspect?”: Key-indexed observations annotation
  2. Rename the generated file to 5k PBMC observations
  3. galaxy-eye Inspect the generated file

    Question

    In which column is the Leiden annotation located?

    The Leiden annotation is in column 8.

    Column 1 Column 2 Column 3 Column 4 Column 5 Column 6 Column 7 Column 8 Column 9 Column 10
    ”” n_fragment frac_dup frac_mito tsse doublet_probability doublet_score leiden n_genes n_counts
    AAACGAAAGACGTCAG-1 22070 0.5219425551271499 0.0 30.43315066436454 0.004634433324822066 0.009276437847866418 8 52303 16521.599844068267
    AAACGAAAGATTGACA-1 10500 0.5345125681606597 0.0 29.10551296093465 0.004668403569267374 0.001088139281828074 1 54501 15020.42495602328
    AAACGAAAGGGTCCCT-1 19201 0.5101785714285714 0.0 19.90011850347046 0.004634433324822066 0.009276437847866418 5 54212 16294.751533305309
    AAACGAACAATTGTGC-1 13242 0.487399837417257 0.0 29.060913705583758 0.004660125753854076 0.0022172949002217295 7 53530 15456.629863655084
  4. Cut columns ( Galaxy version 1.0.2) with the following parameters:
    • param-select “Cut columns”: c8
    • param-file “From”: 5k PBMC observations (output of Inspect AnnData tool)
  5. Create a new tabular file from the following
    leiden cell_type
    0 Monocytes
    1 CD8_Tcells
    2 CD4_Tcells
    3 Monocytes
    4 CD4_Tcells
    5 Monocytes
    6 Bcells
    7 CD4_Tcells
    8 CD8_Tcells
    9 NKcells
    10 Monocytes
    11 Bcells
    12 Dendritic_cells
    
    • Click galaxy-upload Upload Data at the top of the tool panel
    • Select galaxy-wf-edit Paste/Fetch Data at the bottom
    • Paste the file contents into the text field

    • Change the dataset name from “New File” to replace_file
    • Change Type from “Auto-detect” to tabular* From the Settings menu (galaxy-gear) select Convert spaces to tabs* Press Start and Close the window

    • The first column of the replace file contains the “old” annotations and the second column contains the “new” annotation.
    • warning Spaces in the new annotations can lead to errors. Please use underscores (_) instead.
  6. Replace column ( Galaxy version 0.2) with the following parameters:
    • param-file “File in which you want to replace some values”: Cut columns leiden (output of Cut columns tool)
    • param-file “Replace information file”: replace_file
    • “Which column should be replaced?: Column: 1
  7. Rename the generated file to Cell type annotation and galaxy-eye inspect the file to check if the replacement was successful
    • check if the datatype is set to tabular
    • you may need to change the datatype manually

      • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
      • In the central panel, click galaxy-chart-select-data Datatypes tab on the top
      • In the galaxy-chart-select-data Assign Datatype, select tabular from “New type” dropdown
        • Tip: you can start typing the datatype into the field to filter the dropdown menu
      • Click the Save button

  8. Manipulate AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
    • param-file “Annotated data matrix”: Anndata 5k PBMC gene_matrix magic UMAP
    • “Function to manipulate the object”: Add new annotation(s) for observations or variables
      • “What to annotate”: Observations (obs)
      • param-file “Table with new annotations”: Cell type annotation(output of Replace column tool)
  9. Rename the generated file to Anndata 5k PBMC gene_matrix magic cell_type or add the tag galaxy-tags cell_type to the dataset

  10. Plot with Scanpy ( Galaxy version 1.9.6+galaxy3) with the following parameters:
    • param-file “Annotated data matrix”: Anndata 5k PBMC gene_matrix magic cell_type (output of Manipulate AnnData tool)
    • “Method used for plotting”: Embeddings: Scatter plot in UMAP basis, using 'pl.umap'
    • “Keys for annotations of observations/cells or variables/genes”: cell_type
    • param-toggle “Show edges?”: No
    • In “Plot attributes”
      • “Location of legend”: on data
      • param-toggle “Draw a frame around the scatter plot?”: Yes
  11. galaxy-eye Inspect the .png output

UMAP annotated cell types.

Question
  1. Are clusters with the same assigned cell type located close to each other?
  2. Are the expected cell type percentages of PBMCs visible in the annotated plot?

    Cell Type Expected Percentage
    CD3+ T cells 50-70%
    CD4+ T cells 25-45%
    CD8+ T cells 10-25%
    B cells 5-15%
    NK cells <5%
    Monocytes 10-30%
    Dendritic cells <2%
  1. Yes. B-cells are far away from NK and T cells. Only the myeloid lineage of monocytes and dendritic cells are located close to each other. This might be due to the common progenitor cell lineage and thus a similar chromatin profile.
  2. Yes. The most common cell types are T cells, followed by Monocytes and B cells. NK cells and Dendritic cells only make up a small percentage of PBMCs.

Conclusion

congratulations Well done, you’ve made it to the end! You might want to consult your results with this control history, or check out the full workflow for this tutorial.

In this tutorial, we produced a count matrix of scATAC-seq reads in the AnnData format and performed:

  1. Preprocessing:
    1. Plotting the fragment-size distributions
    2. Calculating and plotting TSSe scores
    3. Filtering cells and selecting features (fixed-size genomic bins)
  2. Dimension reduction through Spectral embedding and UMAP
  3. Clustering of cells via the Leiden method
  4. Cluster annotation
    1. Producing and filtering a cell by gene activity matrix
    2. Data normalization and imputation with Scanpy and the MAGIC algorithm
    3. Visualizing marker genes in the clusters
    4. Manually annotating the cell types with selected marker genes

SnapATAC2 processing pipeline.

The scATAC-seq analysis can now continue with downstream analysis, for example differential peak analysis.

The SnapATAC2 tools for differential peak analysis are already accessible on Galaxy. However, there are no GTN trainings available yet. Until such a tutorial is uploaded, you can visit the SnapATAC2 documentation for a tutorial on differential peak analysis. And check out our example history and the exemplary workflow.

The tools are available in Galaxy under SnapATAC2 Peaks and Motif ( Galaxy version 2.6.4+galaxy1).

  • If you want to continue with differential peak analysis, please make sure that the AnnData object with the annotated cell types contains unspecified annotations for the reference sequences (uns: 'reference_sequences').
    • The section Copy-over embeddings explains how to copy annotations from one AnnData object to another.