Combining single cell datasets after pre-processing

Author(s): Wendi Bacon, Jonathan Manning
Editor(s): Helena Rasche
Tester(s): Julia Jakiela
Reviewers: Pavankumar Videm, Saskia Hiltemann, Björn Grüning, Marisa Loach, Wendi Bacon, Helena Rasche, Bérénice Batut, Mehmet Tekman, Pablo Moreno, Julia Jakiela
Overview
Creative Commons License: CC-BY
Questions:
  • I have some AnnData files from different samples that I want to combine into a single file. How can I combine these and label them within the object?

Objectives:
  • Combine data matrices from different samples in the same experiment

  • Label the metadata for downstream processing

Requirements:
Time estimation: 1 hour
Supporting Materials:
Published: Sep 8, 2022
Last modification: Oct 28, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00246
Revision: 19

This tutorial will take you from the multiple AnnData outputs of the previous tutorial to a single, combined AnnData object, ready for all the fun downstream processing. We will also look at how to add in metadata (for instance, SEX or GENOTYPE) for analysis later on.

Agenda

In this tutorial, we will cover:

  1. Get Data
  2. Important tips for easier analysis
    1. Concatenating objects
  3. Adding batch metadata
  4. Mitochondrial reads
  5. Pulling single cell data from public resources
  6. Conclusion

Get Data

The sample data is a subset of the reads in a mouse dataset of fetal growth restriction (Bacon et al. 2018; see the study in Single Cell Expression Atlas and the project submission). Each of the 7 samples (N701 to N707) has been run through the workflow from the Alevin tutorial.

You can access the data for this tutorial in multiple ways:

  1. Your own history - If you’re feeling confident that you successfully ran a workflow on all 7 samples from the previous tutorial, and that your resulting 7 AnnData objects look right (you can compare with the answer key history), then you can use those! To avoid a million-line history, I recommend dragging the resultant datasets into a fresh history.

    There are 3 ways to copy datasets between histories:

    1. From the original history

      1. Click on the galaxy-gear icon which is on the top of the list of datasets in the history panel
      2. Click on Copy Datasets
      3. Select the desired files

      4. Give a relevant name to the “New history”

      5. Validate by ‘Copy History Items’
      6. Click on the new history name in the green box that has just appeared to switch to this history
    2. Using the galaxy-columns Show Histories Side-by-Side

      1. Click on the galaxy-dropdown dropdown arrow top right of the history panel (History options)
      2. Click on galaxy-columns Show Histories Side-by-Side
      3. If your target history is not present
        1. Click on ‘Select histories’
        2. Click on your target history
        3. Validate by ‘Change Selected’
      4. Drag the dataset to copy from its original history
      5. Drop it in the target history
    3. From the target history

      1. Click on User in the top bar
      2. Click on Datasets
      3. Search for the dataset to copy
      4. Click on its name
      5. Click on Copy to current History

  2. Importing from a history
    1. Open the link to the shared history
    2. Click on the Import this history button on the top left
    3. Enter a title for the new history
    4. Click on Copy History

    • If you want to import the history to another Galaxy server, check how to do it below!

    Transfer a Single Dataset

    At the sender Galaxy server, set the history to a shared state, then directly capture the galaxy-link link for a dataset and paste the URL into the Upload tool at the receiver Galaxy server.

    Transfer an Entire History

    Have an account at two different Galaxy servers, and be logged into both.

    At the sender Galaxy server

    1. Navigate to the history you want to transfer, and set the history to a shared state.
    2. Click into the History Options menu in the history panel.
    3. Select from the menu galaxy-history-archive Export History to File.
    4. For How do you want to export this History?, choose the option to direct download.
    5. Click on Generate direct download.
    6. Allow the archive generation process to complete. *
    7. Copy the galaxy-link link for your new archive.

    At the receiver Galaxy server

    1. Confirm that you are logged into your account.
    2. Click on Data in the top menu, and choose Histories to reach your Saved Histories.
    3. Click on Import history in the grey button on the top right.
    4. Paste in your link’s URL from step 7 of the sender list.
    5. Click on Import History.
    6. Allow the archive import process to complete. *
    7. The transferred history will be uncompressed and added to your Saved Histories.

    * For step 6 in both lists: it is OK to navigate away for other tasks during processing. If enabled, Galaxy will send you status notifications.

    tip If the history to transfer is large, you may copy just your important datasets into a new history, and create the archive from that new smaller history. Clearing away deleted and purged datasets will make all histories smaller and faster to archive and transfer!

  3. Uploading from Zenodo (see below)
Hands-on: Data upload for 7 files
  1. Create a new history for this tutorial (if you’re not importing the history above)
  2. Import the different AnnData files and the experimental design table from Zenodo.

    https://zenodo.org/records/10852529/files/Experimental_Design.tabular.tabular
    https://zenodo.org/records/10852529/files/N701-400k-AnnData.h5ad
    https://zenodo.org/records/10852529/files/N702-400k-AnnData.h5ad
    https://zenodo.org/records/10852529/files/N703-400k-AnnData.h5ad
    https://zenodo.org/records/10852529/files/N704-400k-AnnData.h5ad
    https://zenodo.org/records/10852529/files/N705-400k-AnnData.h5ad
    https://zenodo.org/records/10852529/files/N706-400k-AnnData.h5ad
    https://zenodo.org/records/10852529/files/N707-400k-AnnData.h5ad
    
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

  3. Rename the datasets
  4. Check that the AnnData files datatype is h5ad, otherwise you will need to change each file to h5ad!

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click galaxy-chart-select-data Datatypes tab on the top
    • In the galaxy-chart-select-data Assign Datatype, select h5ad from the “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

Inspect the Experimental Design text file. This shows you how each N70X corresponds to a sample, and whether that sample was from a male or female. This is important metadata, which we will add to our object in much the same way as you added the gene_name and mito metadata previously!

Important tips for easier analysis

Tools are frequently updated to new versions. Your Galaxy may have multiple versions of the same tool available. By default, you will be shown the latest version of the tool. This may NOT be the same tool used in the tutorial you are accessing. Furthermore, if you use a newer tool in one step, and try using an older tool in the next step… this may fail! To ensure you use the same tool versions of a given tutorial, use the Tutorial mode feature.

  • Open your Galaxy server
  • Click on the curriculum icon on the top menu, this will open the GTN inside Galaxy.
  • Navigate to your tutorial
  • Tool names in tutorials will be blue buttons that open the correct tool for you
  • Note: this does not work for all tutorials (yet)
  • You can click anywhere in the greyed-out area outside of the tutorial box to return to the Galaxy analytical interface
Warning: Not all browsers work!
  • We’ve had some issues with Tutorial mode on Safari for Mac users.
  • Try a different browser if you aren’t seeing the button.

Did you know we have a unique Single Cell Omics Lab with all our single cell tools highlighted to make it easier to use on Galaxy? We recommend this site for all your single cell analysis needs, particularly for newer users.

The Single Cell Omics Lab currently uses the main European Galaxy infrastructure and power. It’s just organised better for users of particular analyses… like single cell!

Try it out!

When something goes wrong in Galaxy, there are a number of things you can do to find out what it was. Error messages can help you figure out whether it was a problem with one of the settings of the tool, or with the input data, or maybe there is a bug in the tool itself and the problem should be reported. Below are the steps you can follow to troubleshoot your Galaxy errors.

  1. Expand the red history dataset by clicking on it.
    • Sometimes you can already see an error message here
  2. View the error message by clicking on the bug icon galaxy-bug

  3. Check the logs. Output (stdout) and error logs (stderr) of the tool are available:
    • Expand the history item
    • Click on the details icon
    • Scroll down to the Job Information section to view the 2 logs:
      • Tool Standard Output
      • Tool Standard Error
    • For more information about specific tool errors, please see the Troubleshooting section
  4. If you are still unsure what the problem is, submit a bug report!
    • Click on the bug icon galaxy-bug
    • Write down any information you think might help solve the problem
      • See this FAQ on how to write good bug reports
    • Click galaxy-bug Report button
  5. Ask for help!

Concatenating objects

Hands-on: Concatenating AnnData objects
  1. Manipulate AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
    • param-file “Annotated data matrix”: N701-400k
    • “Function to manipulate the object”: Concatenate along the observations axis
    • param-file “Annotated data matrix to add”: Select all the other matrix files from bottom to top, N702 to N707
    Comment

    If you imported files from Zenodo instead of using the input history, yours might not be in the same order as ours. Since the files will be concatenated in the order that you click, it will be helpful if you click them in the same order, from N702 to N707. This will ensure your samples are given the same batch numbers as we got in this tutorial, which will help when we’re adding in metadata later!

    Warning: Don't add N701!

    You are adding files to N701, so do not add N701 to itself!

    • “Join method”: Intersection of variables
    • “Key to add the batch annotation to obs”: batch
    • “Separator to join the existing index names with the batch category”: -
  2. Rename galaxy-pencil output Combined Object
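
To make the three settings above more concrete, here is a minimal pure-Python sketch of what "Intersection of variables", a batch key and a "-" separator amount to. It is an illustration under assumptions, not the Galaxy tool's implementation: the toy barcodes, gene names, counts and the concatenate helper are all made up.

```python
# Toy stand-ins for two per-sample matrices (hypothetical barcodes and counts).
samples = [
    {"genes": ["Gapdh", "Actb", "Xist"], "cells": {"AAAC": [5, 2, 0]}},
    {"genes": ["Actb", "Gapdh"], "cells": {"AAAC": [7, 1], "TTTG": [0, 3]}},
]


def concatenate(samples, separator="-"):
    """Stack cells from all samples, keeping only genes shared by every sample."""
    # "Intersection of variables": keep genes present in every sample.
    shared = set(samples[0]["genes"])
    for sample in samples[1:]:
        shared &= set(sample["genes"])
    genes = [g for g in samples[0]["genes"] if g in shared]

    batch_of, matrix = {}, {}
    for batch, sample in enumerate(samples):
        column = {g: i for i, g in enumerate(sample["genes"])}
        for barcode, counts in sample["cells"].items():
            # Join the existing index name with the batch number.
            name = f"{barcode}{separator}{batch}"
            batch_of[name] = str(batch)
            matrix[name] = [counts[column[g]] for g in genes]
    return genes, batch_of, matrix


genes, batch_of, matrix = concatenate(samples)
```

Note how the barcode AAAC appears in both samples but stays unique after the batch number is appended, and how Xist is dropped because only one sample contains it. In the Python anndata library, anndata.concat with join="inner", label="batch" and index_unique="-" performs a comparable operation; whether the Galaxy tool calls it in exactly this way is an assumption.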

Now let’s look at what we’ve done! Unfortunately, AnnData objects are quite complicated, so the galaxy-eye won’t help us too much here. Instead, we’re going to use a tool to look into our object from now on.

Hands-on: Inspecting AnnData Objects
  1. Inspect AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
    • param-file “Annotated data matrix”: Combined Object
    • “What to inspect?”: General information about the object
  2. Inspect AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
    • param-file “Annotated data matrix”: Combined Object
    • “What to inspect?”: Key-indexed observations annotation (obs)
  3. Inspect AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
    • param-file “Annotated data matrix”: Combined Object
    • “What to inspect?”: Key-indexed annotation of variables/features (var)

Now have a look at the three Inspect AnnData outputs.

Question
  1. How many cells do you have now?
  2. Where is batch information stored?
Solution
  1. In the General information output, the matrix is now 331 cells x 35734 genes, so there are 331 cells. The sizes of the obs (cells) and var (genes) outputs show the same thing.
  2. Batch information is stored under Key-indexed observations annotation (obs). Different versions of the Manipulate tool might put the batch column in different locations. The tool version in this course puts batch in the 8th column. Batch refers to the order in which the matrices were added. The files are added from the bottom of the history upwards, so be careful how you set up your histories when running this (e.g. if your first dataset is N703 and the second is N701, N703 will be labelled batch 0 and N701 batch 1!)

Adding batch metadata

I set up the example history with the earliest indices at the bottom.

The files are numbered such that dataset #1 is N701, dataset #2 is N702, etc., up through #7 as N707. This puts N707 at the top and N701 at the bottom in the Galaxy history.

Figure 1: Correct history ordering for combining datasets in order

Therefore, when it is all concatenated together, the batch appears as follows:

Index  Batch  Genotype  Sex
N701   0      wildtype  male
N702   1      knockout  male
N703   2      knockout  female
N704   3      wildtype  male
N705   4      wildtype  male
N706   5      wildtype  male
N707   6      knockout  male

If you used Zenodo to import files, they may not have imported in order (i.e. N701 to N707, ascending). In that case, you will need to tweak the parameters of the next tools appropriately to label your batches correctly!
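
If your import order differs, one way to work out the adjusted values is to derive them from the table above. The sketch below is only an illustration: the find_patterns helper is hypothetical, and it assumes batch numbers are assigned 0, 1, 2… in the order the samples were concatenated. It produces the "|"-separated batch lists used as "Find pattern" values in the Replace Text steps.

```python
# Sample annotations copied from the table in this tutorial.
design = {
    "N701": {"genotype": "wildtype", "sex": "male"},
    "N702": {"genotype": "knockout", "sex": "male"},
    "N703": {"genotype": "knockout", "sex": "female"},
    "N704": {"genotype": "wildtype", "sex": "male"},
    "N705": {"genotype": "wildtype", "sex": "male"},
    "N706": {"genotype": "wildtype", "sex": "male"},
    "N707": {"genotype": "knockout", "sex": "male"},
}


def find_patterns(order, design, field):
    """Group batch numbers by a metadata value and join each group with '|'."""
    groups = {}
    for batch, sample in enumerate(order):
        groups.setdefault(design[sample][field], []).append(str(batch))
    return {value: "|".join(batches) for value, batches in groups.items()}


# The order used in this tutorial: N701 first, N707 last.
order = ["N701", "N702", "N703", "N704", "N705", "N706", "N707"]
sex_patterns = find_patterns(order, design, "sex")
genotype_patterns = find_patterns(order, design, "genotype")
```

With the tutorial's order this gives 0|1|3|4|5|6 for male and 2 for female. Passing your own order list (for example, starting with N703) shows which batch numbers to use instead.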

The two critical pieces of metadata in this experiment are sex and genotype. I will later want to color my cell plots by these parameters, so I want to add them in now!

Hands-on: Labelling sex
  1. Replace Text in a specific column ( Galaxy version 9.3+galaxy0) with the following parameters:
    • param-file “File to process”: output of Inspect AnnData: Key-indexed observations annotation (obs)
    • “1. Replacement”

      • “in column”: Column: 8 - or whichever column batch is in
      • “Find pattern”: 0|1|3|4|5|6
      • “Replace with”: male
    • + Insert Replacement
    • “2. Replacement”

      • “in column”: Column: 8
      • “Find pattern”: 2
      • “Replace with”: female
    • + Insert Replacement
    • “3. Replacement”

      • “in column”: Column: 8
      • “Find pattern”: batch
      • “Replace with”: sex

    Run this tool - we will use the output in Step 2.

    From the output of Step 1 we want only the single column containing the sex information - we will ultimately add this into the cell annotation in the AnnData object.

  2. Cut columns from a table with the following parameters:
    • “Cut columns”: c8
    • “Delimited by”: Tab
    • param-file “From”: output of Replace text tool
  3. Rename galaxy-pencil output Sex metadata
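
The replace-then-cut logic above can be sketched in a few lines of Python. This is a hedged illustration rather than what the Galaxy tools run internally: the toy table has batch in column 3 instead of column 8, the barcodes and values are invented, and replace_in_column and cut_column are hypothetical helpers.

```python
import re

# Toy obs table (tab-separated); the real table has batch in column 8.
obs_text = "index\tn_genes\tbatch\nAAAC-0\t120\t0\nCCGT-2\t98\t2\nTTTG-6\t143\t6\n"

# The three replacements from the hands-on, applied in order.
replacements = [
    (r"0|1|3|4|5|6", "male"),
    (r"2", "female"),
    (r"batch", "sex"),
]


def replace_in_column(text, column, replacements):
    """Apply each (pattern, replacement) pair, in order, to one 1-based column."""
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        for pattern, new in replacements:
            fields[column - 1] = re.sub(pattern, new, fields[column - 1])
        rows.append(fields)
    return rows


def cut_column(rows, column):
    """Keep a single 1-based column, like cutting c8 in the hands-on."""
    return [row[column - 1] for row in rows]


rows = replace_in_column(obs_text, 3, replacements)
sex_metadata = cut_column(rows, 3)
```

The third replacement turns the header batch into sex, so the cut column carries its own column name. In this sketch the patterns are unanchored, so with ten or more batches a pattern such as 1 would also match inside 10; anchoring the pattern (for example ^(0|1)$) is one way to guard against that.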

That was so fun, let’s do it all again but for genotype!

Hands-on: Labelling genotype
  1. Replace Text in a specific column ( Galaxy version 9.3+galaxy0) with the following parameters:
    • param-file “File to process”: output of Inspect AnnData: Key-indexed observations annotation (obs) tool
    • “1. Replacement”

      • “in column”: Column: 8
      • “Find pattern”: 0|3|4|5
      • “Replace with”: wildtype
    • + Insert Replacement
    • “2. Replacement”

      • “in column”: Column: 8
      • “Find pattern”: 1|2|6
      • “Replace with”: knockout
    • + Insert Replacement
    • “3. Replacement”

      • “in column”: Column: 8
      • “Find pattern”: batch
      • “Replace with”: genotype

    Now we want only the column containing the genotype information - we will ultimately add this into the cell annotation in the AnnData object.

  2. Cut columns from a table with the following parameters:
    • “Cut columns”: c8
    • “Delimited by”: Tab
    • param-file “From”: output of Replace text tool
  3. Rename galaxy-pencil output Genotype metadata

You might want to do this with all sorts of different metadata - which labs handled the samples, which days they were run, etc. Once you’ve added all your metadata columns, we can add them together before plugging them into the AnnData object itself.

Hands-on: Combining metadata columns
  1. Paste two files side by side with the following parameters:
    • param-file “Paste”: Genotype metadata
    • param-file “and”: Sex metadata
    • “Delimit by”: Tab
  2. Rename galaxy-pencil output Cell Metadata
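
Pasting side by side simply joins the two single-column files row by row. A minimal sketch, assuming both columns keep the same row order as the obs table they were cut from (the paste helper and the toy values are illustrative only):

```python
# Toy single-column files, each with its header in the first row.
genotype = ["genotype", "wildtype", "knockout"]
sex = ["sex", "male", "female"]


def paste(left, right, delimiter="\t"):
    """Join two columns row by row; refuse columns of different lengths."""
    if len(left) != len(right):
        raise ValueError("columns must have the same number of rows")
    return [f"{a}{delimiter}{b}" for a, b in zip(left, right)]


cell_metadata = paste(genotype, sex)
```

The length check is an addition of this sketch: because rows are matched purely by position, a mismatch in row count would mean the metadata no longer lines up with the cells.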

Let’s add it to the AnnData object!

Hands-on: Adding metadata to AnnData object
  1. Manipulate AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
    • param-file “Annotated data matrix”: Combined Object
    • “Function to manipulate the object”: Add new annotation(s) for observations or variables
    • “What to annotate?”: Observations (obs)
    • param-file “Table with new annotations”: Cell Metadata

Woohoo! We’re there! You can run an Inspect AnnData ( Galaxy version 0.10.3+galaxy0) to check now, but I want to clean up this AnnData object just a bit more first. It would be a lot nicer if ‘batch’ meant something, rather than ‘the order in which the Manipulate AnnData tool added my datasets’.

Hands-on: Labelling batches
  1. Manipulate AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
    • param-file “Annotated data matrix”: output of Manipulate AnnData - Add new annotations tool
    • “Function to manipulate the object”: Rename categories of annotation
    • “Key for observations or variables annotation”: batch
    • “Comma-separated list of new categories”: N701,N702,N703,N704,N705,N706,N707
  2. Rename galaxy-pencil output Batched Object
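
As a rough picture of this renaming step, the sketch below assumes the new names are assigned to the existing categories by position (0 becomes N701, 1 becomes N702, and so on). That positional behaviour is an assumption about the tool, and the toy batch column is made up.

```python
# The comma-separated list entered in the tool form.
new_categories = "N701,N702,N703,N704,N705,N706,N707".split(",")

# Existing batch categories, in order: "0" to "6".
old_categories = [str(i) for i in range(len(new_categories))]

# Positional mapping from old category to new category.
mapping = dict(zip(old_categories, new_categories))

# Toy batch column for five cells.
batch_column = ["0", "0", "2", "6", "3"]
renamed = [mapping[batch] for batch in batch_column]
```

Because the mapping is positional, the list must follow the order in which your samples were actually concatenated; if your order differed from N701 to N707, the list should be reordered to match.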

Huzzah! We are JUST about there. However, while we’ve been focussing on our cell metadata (sample, batch, genotype, etc.) to relabel the ‘observations’ in our object…

Mitochondrial reads

Do you remember when we mentioned mitochondria early on in this tutorial? And how, in single cell samples, mitochondrial RNA is often an indicator of stress during dissociation? We should probably do something with the true/false column in the gene annotation, since it can tell us something about the cells. You will need to do this whether you have combined FASTQ files or are analysing just one.

Hands-on: Calculating mitochondrial RNA in cells
  1. AnnData Operations ( Galaxy version 1.9.3+galaxy0) with the following parameters:
    • param-file “Input object in hdf5 AnnData format”: Batched Object
    • “Format of output object”: AnnData format
    • “Gene symbols field in AnnData”: NA.
    • “Flag genes that start with these names”: Insert Flag genes that start with these names
    • “Starts with”: True
    • “Var name”: mito
  2. Rename galaxy-pencil output Annotated Object
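
To illustrate why a per-gene true/false column is useful at the cell level, here is a small sketch that turns the mito flag into a per-cell percentage of counts. It assumes the flag is used this way; the gene names, counts and the pct_flagged helper are invented for the example.

```python
# Toy gene annotation: the mito column is stored as text, as in the var table.
genes = ["Gapdh", "mt-Co1", "Actb", "mt-Nd1"]
mito = ["False", "True", "False", "True"]

# Toy counts per cell, in the same gene order.
counts = {
    "AAAC-N701": [8, 1, 10, 1],
    "TTTG-N703": [2, 6, 0, 2],
}


def pct_flagged(counts, flags, starts_with="True"):
    """Percentage of each cell's counts that come from flagged genes."""
    flagged = [str(flag).startswith(starts_with) for flag in flags]
    result = {}
    for cell, row in counts.items():
        total = sum(row)
        hits = sum(count for count, is_hit in zip(row, flagged) if is_hit)
        result[cell] = 100.0 * hits / total if total else 0.0
    return result


pct_mito = pct_flagged(counts, mito)
```

Here the second cell has most of its counts in flagged genes, which is the kind of cell later filtering steps would look at closely. Scanpy's calculate_qc_metrics reports a comparable pct_counts_mito column when given qc_vars=["mito"]; that the Galaxy tool relies on it in this way is an assumption.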

Well done! I strongly suggest you have a play with the Inspect AnnData tool on your final Pre-processed object to see the wealth of information that has been added. You are now ready to move along to further filtering! There is a cheat that may save you time in the future though…

Pulling single cell data from public resources

If you happen to be interested in analysing publicly available data, particularly from the Single Cell Expression Atlas, you may be interested in the following tool (Moreno et al. 2020), which combines all these steps into one! For this tutorial, the dataset can be seen at the EBI with the experiment ID E-MTAB-6945. It’s important to note that this matrix is processed somewhat through the SCXA pipeline, which is quite similar to this tutorial, and it contains any and all metadata provided by their pipeline as well as the authors (for instance, more cell or gene annotations).

If you wish, you can take a closer look at how to pull this dataset from the Single Cell Expression Atlas by following another tutorial: Importing files from public atlases.

Conclusion


Figure 2: Workflow - Combining datasets

You’ve reached the end of this session! You may be interested in seeing an example history and workflow. Note that the workflow might require changing the column containing the batch metadata, depending on how you are running it, and checking the order of the combined datasets. The final object containing the total reads can be found in this Galaxy History on UseGalaxy EU.

To discuss with like-minded scientists, join our Galaxy Training Network chatspace in Slack and discuss with fellow users of Galaxy single cell analysis tools on #single-cell-users.

We also post new tutorials / workflows there from time to time, as well as any other news.

If you’d like to contribute ideas, requests or feedback as part of the wider community building single-cell and spatial resources within Galaxy, you can also join our Single cell & sPatial Omics Community of Practice.

You can request tools here on our Single Cell and Spatial Omics Community Tool Request Spreadsheet.