Combining single cell datasets after pre-processing

Overview
Questions:
  • I have some AnnData files from different samples that I want to combine into a single file. How can I combine these and label them within the object?

Objectives:
  • Combine data matrices from different samples in the same experiment

  • Label the metadata for downstream processing

Time estimation: 1 hour
Published: Sep 8, 2022
Last modification: Dec 14, 2023
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00246
Revision: 28

This tutorial will take you from the multiple AnnData outputs of the previous tutorial to a single, combined AnnData object, ready for all the fun downstream processing. We will also look at how to add in metadata (for instance, SEX or GENOTYPE) for analysis later on.

Agenda

In this tutorial, we will cover:

  1. Get Data
  2. Important tips for easier analysis
    1. Concatenating objects
  3. Adding batch metadata
  4. Mitochondrial reads
  5. Pulling single cell data from public resources
  6. Conclusion

Get Data

The sample data is a subset of the reads in a mouse dataset of fetal growth restriction (Bacon et al. 2018; see the study in Single Cell Expression Atlas and the project submission). Each of the 7 samples (N701 to N707) has been run through the workflow from the Alevin tutorial.

You can access the data for this tutorial in multiple ways:

  1. Your own history - If you’re feeling confident that you successfully ran a workflow on all 7 samples from the previous tutorial, and that your resulting 7 AnnData objects look right (you can compare with the answer key history), then you can use those! To avoid a million-line history, I recommend dragging the resultant datasets into a fresh history.

    There are 3 ways to copy datasets between histories

    1. From the original history

      1. Click on the galaxy-gear icon which is on the top of the list of datasets in the history panel
      2. Click on Copy Datasets
      3. Select the desired files

      4. Give a relevant name to the “New history”

      5. Validate by ‘Copy History Items’
      6. Click on the new history name in the green box that has just appeared to switch to this history
    2. Using the galaxy-columns Show Histories Side-by-Side

      1. Click on the galaxy-dropdown dropdown arrow top right of the history panel (History options)
      2. Click on galaxy-columns Show Histories Side-by-Side
      3. If your target history is not present
        1. Click on ‘Select histories’
        2. Click on your target history
        3. Validate by ‘Change Selected’
      4. Drag the dataset to copy from its original history
      5. Drop it in the target history
    3. From the target history

      1. Click on User in the top bar
      2. Click on Datasets
      3. Search for the dataset to copy
      4. Click on its name
      5. Click on Copy to current History

  2. Importing from a history

    1. Open the link to the shared history
    2. Click on the new-history Import history button on the top right
    3. Enter a title for the new history
    4. Click on Import

galaxy-eye Inspect the param-file Experimental Design text file. This shows you how each N70X corresponds to a sample, and whether that sample was from a male or female. This will be important metadata to add to our sample, which we will add very similarly to how you added the gene_name and mito metadata previously!

Important tips for easier analysis

Tools are frequently updated to new versions. Your Galaxy may have multiple versions of the same tool available. By default, you will be shown the latest version of the tool. This may NOT be the same tool used in the tutorial you are accessing. Furthermore, if you use a newer tool in one step, and try using an older tool in the next step… this may fail! To ensure you use the same tool versions of a given tutorial, use the Tutorial mode feature.

  • Open your Galaxy server
  • Click on the curriculum icon on the top menu; this will open the GTN inside Galaxy.
  • Navigate to your tutorial
  • Tool names in tutorials will be blue buttons that open the correct tool for you
  • Note: this does not work for all tutorials (yet)
  • You can click anywhere in the greyed-out area outside of the tutorial box to return to the Galaxy analytical interface
Warning: Not all browsers work!
  • We’ve had some issues with Tutorial mode on Safari for Mac users.
  • Try a different browser if you aren’t seeing the button.

Did you know we have a unique Galaxy instance with all our single cell tools highlighted to make it easier to use? We recommend this instance for all your single cell analysis needs, particularly for newer users.

This subdomain uses the main European Galaxy infrastructure and power; it’s just organised better for users of particular analyses…like single cell!

Try it out! All your histories/workflows/logins from the general European Galaxy instance will be there!

Concatenating objects

Hands-on: Concatenating AnnData objects
  1. Manipulate AnnData ( Galaxy version 0.7.5+galaxy1) with the following parameters:
    • param-file “Annotated data matrix”: N701-400k
    • “Function to manipulate the object”: Concatenate along the observations axis
    • param-file “Annotated data matrix to add”: Select all the other matrix files from bottom to top, N707 to N702
    Warning: N707 to N702!

    You are adding files to N701, so do not add N701 to itself!

    • “Join method”: Intersection of variables
    • “Key to add the batch annotation to obs”: batch
    • “Separator to join the existing index names with the batch category”: -
  2. Rename galaxy-pencil output Combined Object
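
To make these settings concrete, here is a minimal pure-Python sketch of what joining along the observations axis with an intersection of variables does. It is not the Manipulate AnnData tool's code: the toy samples, gene names and the `concatenate` helper are hypothetical, and a real AnnData object stores far more than counts.

```python
def concatenate(samples, batch_key="batch", separator="-"):
    """Stack cells from several samples, keeping only shared genes.

    `samples` is a list of dicts shaped like {cell_barcode: {gene: count}}.
    The position of each sample in the list becomes its batch label.
    """
    # Intersection of variables: keep genes present in every sample.
    gene_sets = [
        set().union(*(counts.keys() for counts in sample.values()))
        for sample in samples
    ]
    shared_genes = sorted(set.intersection(*gene_sets))

    combined = {}
    for batch, sample in enumerate(samples):
        for barcode, counts in sample.items():
            # Join the existing index name with the batch category.
            new_name = f"{barcode}{separator}{batch}"
            combined[new_name] = {
                "counts": {g: counts.get(g, 0) for g in shared_genes},
                batch_key: str(batch),
            }
    return shared_genes, combined


# Toy stand-ins for two samples (hypothetical barcodes and counts).
n701 = {"AAAC": {"Gapdh": 3, "Actb": 1, "Xist": 0}}
n702 = {"AAAC": {"Gapdh": 5, "Actb": 2}, "TTTG": {"Gapdh": 0, "Actb": 4}}

genes, cells = concatenate([n701, n702])
print(genes)          # ['Actb', 'Gapdh']
print(sorted(cells))  # ['AAAC-0', 'AAAC-1', 'TTTG-1']
```

Note how the barcode AAAC appears in both samples but stays distinguishable thanks to the separator and batch suffix, and how Xist is dropped because only one sample contains it.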

Now let’s look at what we’ve done! Unfortunately, AnnData objects are quite complicated, so the galaxy-eye won’t help us too much here. Instead, we’re going to use a tool to look into our object from now on.

Hands-on: Inspecting AnnData Objects
  1. Inspect AnnData ( Galaxy version 0.7.5+galaxy1) with the following parameters:
    • param-file “Annotated data matrix”: Combined object
    • “What to inspect?”: General information about the object
  2. Inspect AnnData ( Galaxy version 0.7.5+galaxy1) with the following parameters:
    • param-file “Annotated data matrix”: Combined object
    • “What to inspect?”: Key-indexed observations annotation (obs)
  3. Inspect AnnData ( Galaxy version 0.7.5+galaxy1) with the following parameters:
    • param-file “Annotated data matrix”: Combined object
    • “What to inspect?”: Key-indexed annotation of variables/features (var)

Now have a look at the three Inspect AnnData outputs.

Question
  1. How many cells do you have now?
  2. Where is batch information stored?
Solution
  1. If you look at the General information output, you can see there are now 338 cells, as the matrix is now 338 cells x 35734 genes. You can also see this in the file sizes of the obs (cells) and var (genes) outputs.
  2. Under Key-indexed observations annotation (obs). Different versions of the Manipulate tool will put the batch column in different locations. The tool version in this course puts batch in the 8th column. Batch refers to the order in which the matrices were added. The files are added from the bottom of the history upwards, so be careful how you set up your histories when running this (i.e. if your first dataset is N703 and the second is N701, N703 will be labelled batch 0 and N701 batch 1!)

Adding batch metadata

I set up the example history with the earliest indices at the bottom.

The files are numbered such that dataset #1 is N701, dataset #2 is N702, etc., up through #7 as N707. This puts N707 at the top and N701 at the bottom in the Galaxy history.

Figure 1: Correct history ordering for combining datasets in order

Therefore, when it is all concatenated together, the batch appears as follows:

Index  Batch  Genotype  Sex
N701   0      wildtype  male
N702   1      knockout  male
N703   2      knockout  female
N704   3      wildtype  male
N705   4      wildtype  male
N706   5      wildtype  male
N707   6      knockout  male

If you used Zenodo to import files, they may not have imported in order (i.e. N701 to N707, ascending). In that case, you will need to tweak the parameters of the next tools appropriately to label your batches correctly!
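
If your samples were concatenated in a different order, the batch numbers to search for will change. The following Python sketch derives the “Find pattern” strings from a given concatenation order and the experimental design in the table above. The `DESIGN` dictionary and the `find_patterns` helper are hypothetical names used for illustration only; they are not part of any Galaxy tool.

```python
# Experimental design from the table above: sample -> (genotype, sex).
DESIGN = {
    "N701": ("wildtype", "male"),
    "N702": ("knockout", "male"),
    "N703": ("knockout", "female"),
    "N704": ("wildtype", "male"),
    "N705": ("wildtype", "male"),
    "N706": ("wildtype", "male"),
    "N707": ("knockout", "male"),
}


def find_patterns(concat_order, field):
    """Return {label: 'i|j|k'} for the given field ('genotype' or 'sex').

    `concat_order` lists samples in the order they were concatenated,
    so the position of each sample is its batch number.
    """
    column = {"genotype": 0, "sex": 1}[field]
    groups = {}
    for batch, sample in enumerate(concat_order):
        label = DESIGN[sample][column]
        groups.setdefault(label, []).append(str(batch))
    return {label: "|".join(batches) for label, batches in groups.items()}


in_order = ["N701", "N702", "N703", "N704", "N705", "N706", "N707"]
print(find_patterns(in_order, "sex"))
# {'male': '0|1|3|4|5|6', 'female': '2'}
print(find_patterns(in_order, "genotype"))
# {'wildtype': '0|3|4|5', 'knockout': '1|2|6'}
```

Passing a shuffled list (for example one that starts with N703) gives the patterns you would need to enter instead.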

The two critical pieces of metadata in this experiment are sex and genotype. I will later want to color my cell plots by these parameters, so I want to add them in now!

Hands-on: Labelling sex
  1. Replace Text in a specific column ( Galaxy version 1.1.3) with the following parameters:
    • param-file “File to process”: output of Inspect AnnData: Key-indexed observations annotation (obs) tool
    • “1. Replacement”

      • “in column”: Column: 8 - or whichever column batch is in
      • “Find pattern”: 0|1|3|4|5|6
      • “Replace with”: male
    • + Insert Replacement
    • “2. Replacement”

      • “in column”: Column: 8
      • “Find pattern”: 2
      • “Replace with”: female
    • + Insert Replacement
    • “3. Replacement”

      • “in column”: Column: 8
      • “Find pattern”: batch
      • “Replace with”: sex

    Run this tool - we will use the output in Step 2.

    From the output of Step 1 we want only the single column containing the sex information - we will ultimately add this into the cell annotation in the AnnData object.

  2. Cut columns from a table with the following parameters:
    • “Cut columns”: c8
    • “Delimited by”: Tab
    • param-file “From”: output of Replace text tool
  3. Rename galaxy-pencil output Sex metadata
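
The replace-then-cut logic above can be sketched in a few lines of Python. This is only an illustration under assumptions: the toy obs table has batch in column 3 rather than column 8, the other column names and values are invented, and `replace_in_column` and `cut_column` are hypothetical helpers, not the Galaxy tools themselves.

```python
import re


def replace_in_column(rows, column, replacements):
    """Apply (pattern, replacement) pairs, in order, to one 1-based column."""
    out = []
    for row in rows:
        cells = list(row)
        for pattern, replacement in replacements:
            cells[column - 1] = re.sub(pattern, replacement, cells[column - 1])
        out.append(cells)
    return out


def cut_column(rows, column):
    """Keep a single 1-based column, like cutting c8 from the obs table."""
    return [[row[column - 1]] for row in rows]


# Toy obs table: the real one has batch in column 8; here it is column 3.
obs = [
    ["index", "n_genes", "batch"],
    ["AAAC-0", "812", "0"],
    ["CCGT-2", "640", "2"],
    ["TTTG-6", "977", "6"],
]

sex_rules = [("0|1|3|4|5|6", "male"), ("2", "female"), ("batch", "sex")]
sex_metadata = cut_column(replace_in_column(obs, 3, sex_rules), 3)
print(sex_metadata)  # [['sex'], ['male'], ['female'], ['male']]
```

The third rule renames the header, so the cut column arrives with a meaningful name. In this sketch the patterns are unanchored, so with ten or more batches a pattern such as 1 would also match inside 10; anchored patterns like ^1$ would be safer in that situation.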

That was so fun, let’s do it all again but for genotype!

Hands-on: Labelling genotype
  1. Replace Text in a specific column ( Galaxy version 1.1.3) with the following parameters:
    • param-file “File to process”: output of Inspect AnnData: Key-indexed observations annotation (obs) tool
    • “1. Replacement”

      • “in column”: Column: 8
      • “Find pattern”: 0|3|4|5
      • “Replace with”: wildtype
    • + Insert Replacement
    • “2. Replacement”

      • “in column”: Column: 8
      • “Find pattern”: 1|2|6
      • “Replace with”: knockout
    • + Insert Replacement
    • “3. Replacement”

      • “in column”: Column: 8
      • “Find pattern”: batch
      • “Replace with”: genotype

    Now we want only the column containing the genotype information - we will ultimately add this into the cell annotation in the AnnData object.

  2. Cut columns from a table with the following parameters:
    • “Cut columns”: c8
    • “Delimited by”: Tab
    • param-file “From”: output of Replace text tool
  3. Rename galaxy-pencil output Genotype metadata

You might want to do this with all sorts of different metadata - which labs handled the samples, which days they were run, etc. Once you’ve added all your metadata columns, we can add them together before plugging them into the AnnData object itself.

Hands-on: Combining metadata columns
  1. Paste two files side by side with the following parameters:
    • param-file “Paste”: Genotype metadata
    • param-file “and”: Sex metadata
    • “Delimit by”: Tab
  2. Rename galaxy-pencil output Cell Metadata
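
As a rough picture of what pasting side by side means, the sketch below joins two single-column files row by row with a tab. The `paste` helper and the toy values are hypothetical; raising an error on unequal lengths is a choice made for this sketch and may differ from how the Galaxy tool behaves.

```python
def paste(left_lines, right_lines, delimiter="\t"):
    """Join two equal-length lists of lines side by side."""
    if len(left_lines) != len(right_lines):
        raise ValueError("both files must have the same number of rows")
    return [
        f"{left}{delimiter}{right}"
        for left, right in zip(left_lines, right_lines)
    ]


# Toy columns, header first, one row per cell in the same order.
genotype = ["genotype", "wildtype", "knockout", "knockout"]
sex = ["sex", "male", "male", "female"]

cell_metadata = paste(genotype, sex)
print(cell_metadata)
# ['genotype\tsex', 'wildtype\tmale', 'knockout\tmale', 'knockout\tfemale']
```

Because rows are matched purely by position, both columns must come from the same obs table in the same cell order.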

Let’s add it to the AnnData object!

Hands-on: Adding metadata to AnnData object
  1. Manipulate AnnData ( Galaxy version 0.7.5+galaxy1) with the following parameters:
    • param-file “Annotated data matrix”: Combined object
    • “Function to manipulate the object”: Add new annotation(s) for observations or variables
    • “What to annotate?”: Observations (obs)
    • param-file “Table with new annotations”: Cell Metadata

Woohoo! We’re there! You can run an Inspect AnnData ( Galaxy version 0.7.5+galaxy1) to check now, but I want to clean up this AnnData object just a bit more first. It would be a lot nicer if ‘batch’ meant something, rather than ‘the order in which the Manipulate AnnData tool added my datasets’.

Hands-on: Labelling batches
  1. Manipulate AnnData ( Galaxy version 0.7.5+galaxy1) with the following parameters:
    • param-file “Annotated data matrix”: output of Manipulate AnnData - Add new annotations tool
    • “Function to manipulate the object”: Rename categories of annotation
    • “Key for observations or variables annotation”: batch
    • “Comma-separated list of new categories”: N701,N702,N703,N704,N705,N706,N707
  2. Rename galaxy-pencil output Batched Object
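
The renaming step can be pictured as a positional mapping from the old batch categories to the new names. The sketch below assumes that the comma-separated list is applied in category order (0 becomes the first name, 1 the second, and so on); `rename_categories` here is a hypothetical helper, not the tool's implementation.

```python
def rename_categories(values, new_categories):
    """Replace each category by the new name at the same sorted position."""
    old_categories = sorted(set(values), key=int)
    if len(old_categories) != len(new_categories):
        raise ValueError("need exactly one new name per existing category")
    mapping = dict(zip(old_categories, new_categories))
    return [mapping[v] for v in values]


# Toy batch column for eight cells (hypothetical values).
batch = ["0", "0", "1", "2", "6", "3", "4", "5"]
names = "N701,N702,N703,N704,N705,N706,N707".split(",")
print(rename_categories(batch, names))
# ['N701', 'N701', 'N702', 'N703', 'N707', 'N704', 'N705', 'N706']
```

This is why the order of the list matters: if your datasets were concatenated in a different order, the names must be listed in that same order.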

Huzzah! We are JUST about there. However, while we’ve been focussing on our cell metadata (sample, batch, genotype, etc.) to relabel the ‘observations’ in our object…

Mitochondrial reads

Do you remember when we mentioned mitochondria early on in this tutorial? And how, in single cell samples, mitochondrial RNA is often an indicator of stress during dissociation? We should probably use our column of true/false in the gene annotation to tell us something about the cells. You will need to do this whether you have combined FASTQ files or are analysing just one.

Hands-on: Calculating mitochondrial RNA in cells
  1. AnnData Operations ( Galaxy version 1.8.1+galaxy0) with the following parameters:
    • param-file “Input object in hdf5 AnnData format”: Batched Object
    • “Format of output object”: AnnData format
    • “Gene symbols field in AnnData”: NA.
    • “Flag genes that start with these names”: Insert Flag genes that start with these names
    • “Starts with”: True
    • “Var name”: mito
  2. Rename galaxy-pencil output Annotated Object
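
To illustrate the idea behind this step, the sketch below flags genes whose annotation value starts with True and then computes one plausible per-cell summary, the percentage of counts coming from flagged genes. Both helpers, the gene annotation and the counts are hypothetical, and the percentage is an assumption about the kind of metric involved rather than a description of the tool's internals.

```python
def flag_genes(gene_field, starts_with):
    """Return the set of genes whose annotation value starts with a prefix."""
    return {
        gene for gene, value in gene_field.items()
        if value.startswith(starts_with)
    }


def percent_flagged(cell_counts, flagged):
    """Percentage of a cell's counts that come from flagged genes."""
    total = sum(cell_counts.values())
    if total == 0:
        return 0.0
    hits = sum(n for gene, n in cell_counts.items() if gene in flagged)
    return 100.0 * hits / total


# Hypothetical gene annotation column holding "True"/"False" strings.
is_mito = {"mt-Co1": "True", "mt-Nd1": "True", "Gapdh": "False", "Actb": "False"}
mito = flag_genes(is_mito, "True")

# Hypothetical counts for a single cell.
cell = {"mt-Co1": 5, "mt-Nd1": 5, "Gapdh": 30, "Actb": 10}
print(sorted(mito))                 # ['mt-Co1', 'mt-Nd1']
print(percent_flagged(cell, mito))  # 20.0
```

A cell with a high value here would be a candidate for removal in later filtering, since it may have been stressed during dissociation.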

Well done! I strongly suggest having a play with the Inspect AnnData tool on your final Pre-processed object to see the wealth of information that has been added. You are now ready to move along to further filtering! There is a cheat that may save you time in the future though…

Pulling single cell data from public resources

If you happen to be interested in analysing publicly available data, particularly from the Single Cell Expression Atlas, you may be interested in the following tool (Moreno et al. 2020), which combines all these steps into one! For this tutorial, the dataset can be seen at the EBI with the experiment ID E-MTAB-6945.

Hands-on: Retrieving data from Single Cell Expression Atlas
  1. EBI SCXA Data Retrieval ( Galaxy version v0.0.2+galaxy2) with the following parameters:
    • “SC-Atlas experiment accession”: E-MTAB-6945
    • “Choose the type of matrix to download”: Raw filtered counts

    Now we need to transform this into an AnnData object.

  2. Scanpy Read10x ( Galaxy version 1.8.1+galaxy0) with the following parameters:
    • “Expression matrix in sparse matrix format (.mtx)”: EBI SCXA Data Retrieval on E-MTAB-6945 matrix.mtx (Raw filtered counts)
    • “Gene table”: EBI SCXA Data Retrieval on E-MTAB-6945 genes.tsv (Raw filtered counts)
    • “Barcode/cell table”: EBI SCXA Data Retrieval on E-MTAB-6945 barcodes.tsv (Raw filtered counts)
    • “Cell metadata table”: EBI SCXA Data Retrieval on E-MTAB-6945 exp_design.tsv
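
For readers curious about what the three matrix files contain, here is a minimal sketch that reads a 10x-style triplet into a nested dictionary. It assumes a coordinate-format Matrix Market file with genes as rows and cells as columns; the `read_mtx_triplet` helper, the gene identifiers and the barcodes are all made up for illustration. It is not the Scanpy Read10x tool's code and it ignores the cell metadata table.

```python
def read_mtx_triplet(mtx_text, genes_text, barcodes_text):
    """Build {barcode: {gene_id: count}} from 10x-style text files.

    Assumes a coordinate-format Matrix Market file with genes as rows
    and cells as columns, both 1-indexed.
    """
    genes = [ln.split("\t")[0] for ln in genes_text.splitlines() if ln]
    barcodes = [ln.split("\t")[0] for ln in barcodes_text.splitlines() if ln]

    # Skip the banner and comment lines, which start with '%'.
    data = [ln for ln in mtx_text.splitlines() if ln and not ln.startswith("%")]
    n_genes, n_cells, n_entries = (int(x) for x in data[0].split())
    if (n_genes, n_cells) != (len(genes), len(barcodes)):
        raise ValueError("matrix dimensions do not match the gene/barcode tables")

    cells = {barcode: {} for barcode in barcodes}
    for line in data[1:1 + n_entries]:
        row, col, value = line.split()
        cells[barcodes[int(col) - 1]][genes[int(row) - 1]] = float(value)
    return cells


# Toy file contents (placeholder gene identifiers, not real Ensembl IDs).
mtx = """%%MatrixMarket matrix coordinate real general
3 2 3
1 1 4
3 1 1
2 2 7
"""
genes = "ENSMUSG01\tGapdh\nENSMUSG02\tActb\nENSMUSG03\tXist\n"
barcodes = "AAAC\nTTTG\n"

cells = read_mtx_triplet(mtx, genes, barcodes)
print(cells)
# {'AAAC': {'ENSMUSG01': 4.0, 'ENSMUSG03': 1.0}, 'TTTG': {'ENSMUSG02': 7.0}}
```

Only non-zero entries are stored, which is why sparse formats suit single cell count matrices so well.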

It’s important to note that this matrix is processed somewhat through the SCXA pipeline, which is quite similar to this tutorial, and it contains any and all metadata provided by their pipeline as well as the authors (for instance, more cell or gene annotations).

Conclusion

Figure 2: Workflow - Combining datasets

You’ve reached the end of this session! You may be interested in seeing an example history and workflow. Note that, depending on how you are running the workflow, you may need to change the column containing the batch metadata. The final object containing the total reads can be found in this Galaxy History on UseGalaxy EU.

To talk with like-minded scientists, join our Matrix/Element chatroom for fellow users of Galaxy single cell analysis tools!

We also post new tutorials / workflows there from time to time, as well as any other news.