Generating a single cell matrix using Alevin
Author(s) | Wendi Bacon Jonathan Manning |
Editor(s) | Helena Rasche |
Tester(s) | Julia Jakiela |
OverviewQuestions:Objectives:
I have some single cell FASTQ files I want to analyse. Where do I start?
Requirements:
Generate a cellxgene matrix for droplet-based single cell sequencing data
Interpret quality control (QC) plots to make informed decisions on cell thresholds
Find relevant information in GTF files for the particulars of their study, and include this in data matrix metadata
- Introduction to Galaxy Analyses
- slides Slides: An introduction to scRNA-seq data analysis
- tutorial Hands-on: Understanding Barcodes
Time estimation: 2 hoursSupporting Materials:
- Datasets
- Workflows
- galaxy-history-input Input Histories
- galaxy-history-answer Answer Histories
- FAQs
- video Recordings
- instances Available on these Galaxies
Published: Mar 3, 2021Last modification: Oct 28, 2024License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MITpurl PURL: https://gxy.io/GTN:T00245rating Rating: 4.0 (1 recent ratings, 5 all time)version Revision: 19
This tutorial will take you from raw FASTQ files to a cell x gene data matrix in AnnData format. What’s a data matrix, and what’s AnnData format? Well you’ll find out! Importantly, this is the first step in processing single cell data in order to start analysing it. Currently you have a bunch of strings of ATGGGCTT
etc. in your sequencing files, and what you need to know is how many cells you have and what genes appear in those cells. These steps are the most computationally heavy in the single cell world, as you’re starting with 100s of millions of reads, each with 4 lines of text. Later on in analysis, this data becomes simple gene counts such as ‘Cell A has 4 GAPDHs’, which is a lot easier to store! Because of this data overload, we have downsampled the FASTQ files to speed up the analysis a bit. Saying that, you’re still having to map loads of reads to the massive murine genome, so get yourself a cup of coffee and prepare to analyse!
AgendaIn this tutorial, we will cover:
Generating a matrix
In this section, we will show you the principles of the initial phase of single-cell RNA-seq analysis: generating expression measures in a matrix. We’ll concentrate on droplet-based (rather than plate-based) methodology, since this is the process with most differences with respect to conventional approaches developed for bulk RNA-seq.
Droplet-based data consists of three components: cell barcodes, unique molecular identifiers (UMIs) and cDNA reads. To generate cell-wise quantifications we need to:
- Process cell barcodes, working out which ones correspond to ‘real’ cells, which to sequencing artefacts, and possibly correct any barcodes likely to be the product of sequencing errors by comparison to more frequent sequences.
- Map biological sequences to the reference genome or transcriptome.
- ‘De-duplicate’ using the UMIs.
This used to be a complex process involving multiple algorithms, or was performed with technology-specific methods (such as 10X’s ‘Cellranger’ tool) but is now much simpler thanks to the advent of a few new methods. When selecting methodology for your own work you should consider:
- STARsolo - a droplet-based scRNA-seq-specific variant of the popular genome alignment method STAR. Produces results very close to those of Cellranger (which itself uses STAR under the hood).
- Kallisto/ bustools - developed by the originators of the transcriptome quantification method, Kallisto.
- Alevin - another transcriptome analysis method developed by the authors of the Salmon tool.
We’re going to use Alevin Srivastava et al. 2019 for demonstration purposes, but we do not endorse one method over another.
Get Data
We’ve provided you with some example data to play with, a small subset of the reads in a mouse dataset of fetal growth restriction Bacon et al. 2018 (see the study in Single Cell Expression Atlas and the project submission). This is a study using the Drop-seq chemistry, however this tutorial is almost identical to a 10x chemistry. We will point out the one tool parameter change you will need to run 10x samples. This data is not carefully curated, standard tutorial data - it’s real, it’s messy, it desperately needs filtering, it has background RNA running around, and most of all it will give you a chance to practice your analysis as if this data were yours.
Down-sampled reads and some associated annotation can be imported below. How did I downsample these FASTQ files? Check out this history to find out!
Additionally, to map your reads, you will need a transcriptome to align against (a FASTA) as well as the gene information for each transcript (a gtf) file. You can download these for your species of interest from Ensembl. These files are included in the data import step below. Keep in mind, these are big files, so the fastest way to get these into your Galaxy account is through importing them by history.
Hands-on: Option 1: Data upload - Import history
Import history from: example input history
- Open the link to the shared history
- Click on the Import this history button on the top left
- Enter a title for the new history
- Click on Copy History
Rename galaxy-pencil the the history to your name of choice.
Hands-on: Option 2: Data upload - Add to history
- Create a new history for this tutorial
Import the Experimental Design table, sequencing reads 1 & 2, the GTF and fasta files from Zenodo
https://zenodo.org/record/4574153/files/Experimental_Design.tabular https://zenodo.org/record/4574153/files/Mus_musculus.GRCm38.100.gtf.gff https://zenodo.org/record/4574153/files/Mus_musculus.GRCm38.cdna.all.fa.fasta https://zenodo.org/record/4574153/files/SLX-7632.TAAGGCGA.N701.s_1.r_1.fq-400k.fastq https://zenodo.org/record/4574153/files/SLX-7632.TAAGGCGA.N701.s_1.r_2.fq-400k.fastq
- Copy the link location
Click galaxy-upload Upload Data at the top of the tool panel
- Select galaxy-wf-edit Paste/Fetch Data
Paste the link(s) into the text field
Press Start
- Close the window
- Rename galaxy-pencil the datasets: Change
SLX-7632.TAAGGCGA.N701.s_1.r_1.fq-400k
toN701-Read1
andSLX-7632.TAAGGCGA.N701.s_1.r_2.fq-400k.fastq
toN701-Read2
.
QuestionHave a look at the files you now have in your history.
- Which of the FASTQ files do you think contains the barcode sequences?
- Given the chemistry this study should have, are the barcode/UMI reads the correct length?
- What is the ‘N701’ referring to?
- Read 1 (SLX-7632.TAAGGCGA.N701.s_1.r_1.fq-400k) contains the cell barcode and UMI because it is significantly shorter (indeed, 20 bp!) compared to the longer, r_2 transcript read. For ease, it’s better to rename these files N701-Read1 and N701-Read2.
- You can see Read 1 is only 20 bp long, which for original Drop-Seq is 12 bp for cell barcode and 8 bp for UMI. This is correct! Be warned - 10x Chromium (and many technologies) change their chemistry over time, so particularly when you are accessing public data, you want to check and make sure you have your numbers correct!
- ‘N701’ is referring to an index read. This sample was run alongside 6 other samples, each denoted by an Illumina Nextera Index (N70X). Later, this will tell you batch information. If you look at the ‘Experimental Design’ file, you’ll see that the N701 sample was from a male wildtype neonatal thymus.
Important tips for easier analysis
Tools are frequently updated to new versions. Your Galaxy may have multiple versions of the same tool available. By default, you will be shown the latest version of the tool. This may NOT be the same tool used in the tutorial you are accessing. Furthermore, if you use a newer tool in one step, and try using an older tool in the next step… this may fail! To ensure you use the same tool versions of a given tutorial, use the Tutorial mode feature.
- Open your Galaxy server
- Click on the curriculum icon on the top menu, this will open the GTN inside Galaxy.
- Navigate to your tutorial
- Tool names in tutorials will be blue buttons that open the correct tool for you
- Note: this does not work for all tutorials (yet)
- You can click anywhere in the grey-ed out area outside of the tutorial box to return back to the Galaxy analytical interface
Warning: Not all browsers work!
- We’ve had some issues with Tutorial mode on Safari for Mac users.
- Try a different browser if you aren’t seeing the button.
Did you know we have a unique Single Cell Omics Lab with all our single cell tools highlighted to make it easier to use on Galaxy? We recommend this site for all your single cell analysis needs, particularly for newer users.
The Single Cell Omics Lab currently uses the main European Galaxy infrastructure and power, it’s just organised better for users of particular analyses…like single cell!
Try it out!
- subdomain Europe: Single Cell Omics Lab
- subdomain USA: Single Cell Omics Lab
- subdomain Australia: Single Cell Omics Lab
When something goes wrong in Galaxy, there are a number of things you can do to find out what it was. Error messages can help you figure out whether it was a problem with one of the settings of the tool, or with the input data, or maybe there is a bug in the tool itself and the problem should be reported. Below are the steps you can follow to troubleshoot your Galaxy errors.
- Expand the red history dataset by clicking on it.
- Sometimes you can already see an error message here
View the error message by clicking on the bug icon galaxy-bug
- Check the logs. Output (stdout) and error logs (stderr) of the tool are available:
- Expand the history item
- Click on the details icon
- Scroll down to the Job Information section to view the 2 logs:
- Tool Standard Output
- Tool Standard Error
- For more information about specific tool errors, please see the Troubleshooting section
- Submit a bug report! If you are still unsure what the problem is.
- Click on the bug icon galaxy-bug
- Write down any information you think might help solve the problem
- See this FAQ on how to write good bug reports
- Click galaxy-bug Report button
- Ask for help!
- Where?
- In the User community chatspace in Slack in our #single-cell-users channel
- In the GTN Matrix Channel
- In the Galaxy Matrix Channel
- Browse the Galaxy Help Forum to see if others have encountered the same problem before (or post your question).
- When asking for help, it is useful to share a link to your history
Generate a transcript to gene map
Gene-level, rather than transcript-level, quantification is standard in scRNA-seq, which means that the expression level of alternatively spliced RNA molecules are combined to create gene-level values. Droplet-based scRNA-seq techniques only sample one end each transcript, so lack the full-molecule coverage that would be required to accurately quantify different transcript isoforms.
To generate gene-level quantifications based on transcriptome quantification, Alevin and similar tools require a conversion between transcript and gene identifiers. We can derive a transcript-gene conversion from the gene annotations available in genome resources such as Ensembl. The transcripts in such a list need to match the ones we will use later to build a binary transcriptome index. If you were using spike-ins, you’d need to add these to the transcriptome and the transcript-gene mapping.
In your example data you will see the murine reference annotation as retrieved from Ensembl in GTF format. This annotation contains gene, exon, transcript and all sorts of other information on the sequences. We will use these to generate the transcript/ gene mapping by passing that information to a tool that extracts just the transcript identifiers we need.
QuestionWhich of the ‘attributes’ in the last column of the GTF files contains the transcript and gene identifiers?
The file is organised such that the last column (headed ‘Group’) contains a wealth of information in the format: attribute1 “information associated with attribute 1”;attribute2 “information associated with attribute 2” etc.
gene_id and transcript_id are each followed by “ensembl gene_id” and “ensembl transcript_id”
It’s now time to parse the GTF file using the rtracklayer package in R. This parsing will give us a conversion table with a list of transcript identifiers and their corresponding gene identifiers for counting. Additionally, because we will be generating our own binary index (more later!), we also need to input our FASTA so that it can be filtered to only contain transcriptome information found in the GTF.
Hands-on: Generate a filtered FASTA and transcript-gene map
- GTF2GeneList ( Galaxy version 1.52.0+galaxy0) with the following parameters:
- param-file “Ensembl GTF file”:
GTF file in the history
galaxy-history- “Feature type for which to derive annotation”:
transcript
(Your sequences are transcript sequencing, so this is your starting point)- “Field to place first in output table”:
transcript_id
(This is accessing the column you identified above!)- “Suppress header line in output?”:
Yes
(The next tool (Alevin) does not expect a header)- “Comma-separated list of field names to extract from the GTF (default: use all fields)”:
transcript_id,gene_id
(This calls the first column to be the transcript_id, and the second the gene_id. Thus, your key can turn transcripts into genes)- “Append version to transcript identifiers?”:
Yes
(The Ensembl FASTA files usually have these, and since we need the FASTA transcriptome and the GTF gene information to work together, we need to append these!)- “Flag mitochondrial features?”:
No
- “Provide a cDNA file for extracting annotations and/ or possible filtering?”:
Yes
- param-file “FASTA-format cDNA/transcript file”:
FASTA file in your history
galaxy-history- “Annotation field to match with sequences”:
transcript_id
- “Filter the cDNA file to match the annotations?”:
Yes
Rename galaxy-pencil the annotation table to
Map
- Rename galaxy-pencil the uncompressed filtered FASTA file to
Filtered FASTA
Generate a transcriptome index & quantify!
Alevin collapses the steps involved in dealing with dscRNA-seq into a single process. Such tools need to compare the sequences in your sample to a reference containing all the likely transcript sequences (a ‘transcriptome’). This will contain the biological transcript sequences known for a given species, and perhaps also technical sequences such as ‘spike ins’ if you have those.
To be able to search a transcriptome quickly, Alevin needs to convert the text (FASTA) format sequences into something it can search quickly, called an ‘index’. The index is in a binary rather than human-readable format, but allows fast lookup by Alevin. Because the types of biological and technical sequences we need to include in the index can vary between experiments, and because we often want to use the most up-to-date reference sequences from Ensembl or NCBI, we can end up re-making the indices quite often. Making these indices is time-consuming! Have a look at the uncompressed FASTA to see what it starts with.
We now have:
- Barcode/ UMI reads
- cDNA reads
- transcript/ gene mapping
- filtered FASTA
We can now run Alevin. In some public instances, Alevin won’t show up if you search for it. Instead, you may have to click the Single Cell tab at the left and scroll down to the Alevin tool. Alternatively, use Tutorial Mode as described above and you’ll easily navigate to all the tools, and their versions will all be the tried and tested ones of this tutorial. It’s often a good idea to check your tool versions. To identify which version of a tool you are using, select tool-versions ‘Versions’ and choose the appropriate version. In this case the tutorial was built with Alevin Galaxy Version 1.9.0+galaxy2, but will also work with the versions named in the Hands-on sections.
Hands-on: Running Alevin
Alevin ( Galaxy version 1.10.1+galaxy0)
QuestionTry to fill in the parameters of Alevin using what you know!
The Salmon documentation on ‘Fragment Library Types’ and running the Alevin command is here: salmon.readthedocs.io/en/latest/library_type.html and salmon.readthedocs.io/en/latest/alevin.html. These links will help here, although keep in mind the image there is drawn with the RNA 5’ on top, whereas in this scRNA-seq protocol, the polyA is captured by its 3’ tail and thus effectively the bottom or reverse strand…)
- “Select a reference transcriptome from your history or use a built-in index?”:
Use one from the history
- You are going to generate the binary index using your filtered FASTA!
- param-file “Transcripts FASTA file”:
Filtered FASTA
- “Single or paired-end reads?”:
Paired-end
- param-file “Mate pair 1”:
N701-Read1
- param-file “Mate pair 2”:
N701-Read2
- “Specify the strandedness of the reads”:
Infer automatically (A)
- “Type of single-cell protocol”:
DropSeq Single Cell protocol
- param-file “Transcript to gene map file”:
Map
- In “Extra output files”:
- param-check
Salmon Quant log file
param-check
Features used by the CB classification and their counts at each cell level (--dumpFeatures)
- Of course you are welcome to select more options and explore the output files (warning warning: “Per cell level parsimonious Umi graph (–dumpUmiGraph)” will generate over 2 thousand single files), but for this tutorial you will only need to select those specified.
- In “Advanced options”:
- “Dump cell v transcripts count matrix in MTX format”: galaxy-toggle
Yes
Comment: What if I'm running a 10x sample?The main parameter that needs changing for a 10X Chromium sample is the ‘Protocol’ parameter of Alevin. Just select the correct 10x Chemistry there instead.
Comment: Alevin file namesYou will notice that the names of the output files of Alevin are written in a certain convention, mentioning which tool was used and on which files, for example: “Alevin on data X, data Y, and others: whitelist”. Remember that you can always rename the files if you wish! For simplicity, when we refer to those files in the tutorial, we skip the information about tool and only use the second part of the name - in this case it would be simply “whitelist”.
This tool will take a while to run. Alevin produces many file outputs, not all of which we’ll use. You can refer to the Alevin documentation if you’re curious what they all are, but we’re most interested in is:
- the matrix itself (per-cell gene-count matrix (MTX) - the count by gene and cell)
- the row (cell/ barcode) identifiers (row index (CB-ids)) and
- the column (gene) labels (column headers (gene-ids)).
QuestionAfter you’ve run Alevin, galaxy-eye look through all the different files. Can you find:
- The Mapping Rate?
- How many cells are present in the matrix output?
- Inspect galaxy-eye the file param-file Salmon log file. You can see the mapping rate is a paltry
25.45%
. This is a terrible mapping rate. Why might this be? Remember this was downsampled, and specifically by taking only the last 400,000 reads of the FASTQ file. The overall mapping rate of the file is more like 50%, which is still quite poor, but for early Drop-Seq samples and single-cell data in general, you might expect a slightly poorer mapping rate. 10x samples are much better these days! This is real data, not test data, after all!- Inspect galaxy-eye the file param-file row index (CB-ids), and you can see it has
2163
lines. The rows refer to the cells in the cell x gene matrix. According to this (rough) estimate, your sample has 2163 cells in it!
Warning: Choose the appropriate input going forward!Make certain to use per-cell gene-count matrix (MTX) file going forward.
congratulations Congratulations - you’ve made an expression matrix! We could almost stop here. But it’s sensible to do some basic QC, and one of the things we can do is look at a barcode rank plot.
Basic QC
The question we’re looking to answer here, is: “do we mostly have a single cell per droplet”? That’s what experimenters are normally aiming for, but it’s not entirely straightforward to get exactly one cell per droplet. Sometimes almost no cells make it into droplets, other times we have too many cells in each droplet. At a minimum, we should easily be able to distinguish droplets with cells from those without.
Hands-on: Generate a raw barcode QC plot
- Droplet barcode rank plot ( Galaxy version 1.6.1+galaxy2) with the following parameters:
- “Input MTX-format matrix?”:
No
- param-file “A two-column tab-delimited file, with barcodes in the first column and frequencies in the second”: Output of Alevin
raw CB classification frequencies
- “Label to place in plot title”:
Barcode rank plot (raw barcode frequencies)
- Rename galaxy-pencil the image output
Barcode Plot - raw barcode frequencies
Now, the image generated here (400k) isn’t the most informative - but you are dealing with a fraction of the reads! If you run the total sample (so identical steps above, but with significantly more time!) you’d get the image below.
This is our own formulation of the barcode plot based on a discussion we had with community members. The left hand plots with the smooth lines are the main plots, showing the UMI counts for individual cell barcodes ranked from high to low. We expect a sharp drop-off between cell-containing droplets and ones that are empty or contain only cell debris. Now, this data is not an ideal dataset, so for perspective, in an ideal world with a very clean 10x run, data will look a bit more like the following taken from the lung atlas (see the study in Single Cell Expression Atlas and the project submission).
In that plot, you can see the clearer ‘knee’ bend, showing the cut-off between empty droplets and cell-containing droplets.
The right hand plots are density plots from the first one, and the thresholds are generated either using dropletUtils or by the method described in the discussion mentioned above. We could use any of these thresholds to select cells, assuming that anything with fewer counts is not a valid cell. By default, Alevin does something similar, and we can learn something about that by plotting just the barcodes Alevin retains.
Hands-on: Generate Alevin's barcode QC plot
- Droplet barcode rank plot ( Galaxy version 1.6.1+galaxy2) with the following parameters:
- “Input MTX-format matrix?”:
Yes
- “Matrix-market format matrix file, with cells by column (overrides –barcode-frequencies if supplied)”:
per-cell gene-count matrix (MTX)
- “For use with –mtx-matrix: force interpretation of matrix to assume cells are by row, rather than by column (default)”:
Yes
- “Label to place in plot title”:
Barcode rank plot (Alevin-processed)
- Rename galaxy-pencil the image output
Barcode Plot - Alevin processed barcodes
And the full sample looks like:
And to round this off, here’s the lung atlas plot.
You should see a completely vertical drop-off where Alevin has trunctated the distribution (after excluding any cell barcode that had <10 UMI, Alevin then chose a threshold based off the curve and removed all barcodes with fewer UMIs).
In experiments with relatively simple characteristics, this ‘knee detection’ method works relatively well. But some populations (such as our sample!) present difficulties. For instance, sub-populations of small cells may not be distinguished from empty droplets based purely on counts by barcode. Some libraries produce multiple ‘knees’ for multiple sub-populations. The emptyDrops method has become a popular way of dealing with this. emptyDrops still retains barcodes with very high counts, but also adds in barcodes that can be statistically distinguished from the ambient profiles, even if total counts are similar. In order to ultimately run emptyDrops (or indeed, whatever tool you like that accomplishes biologically relevant thresholding), we first need to re-run Alevin, but prevent it from applying its own less than ideal thresholds.
To use emptyDrops effectively, we need to go back and re-run Alevin, stopping it from applying it’s own thresholds. Click the re-run icon galaxy-refresh on any Alevin output in your history, because almost every parameter is the same as before, except you need to change the following:
Generate an unprocessed matrix in a usable format
Hands-on: Stopping Alevin from thresholding
- Alevin ( Galaxy version 1.10.1+galaxy0) (Click re-run on the last Alevin output)
- “Advanced options”
- “Fraction of cellular barcodes to keep”: ‘1’ - i.e. keep them all!
- “Minimum frequency for a barcode to be considered”: ‘3’ - This will only remove cell barcodes with a frequency of less than 3, a low bar to pass but useful way of avoiding processing a bunch of almost certainly empty barcodes
- Expand one of the output datasets of the tool (by clicking on it)
- Click re-run galaxy-refresh the tool
This is useful if you want to run the tool again but with slightly different paramters, or if you just want to check which parameter setting you used.
QuestionHow many cells are in the output now?
22952
cells are in the quants_mat_rows now! Far more than the Alevin-filtered2163
. This needs some serious filtering with EmptyDrops!
Alevin outputs MTX format, which we can pass to the dropletUtils package and run emptyDrops. Unfortunately the matrix is in the wrong orientation for tools expecting files like those produced by 10X software (which dropletUtils does). We need to ‘transform’ the matrix such that cells are in columns and genes are in rows.
Warning: Be careful!Don’t mix up files from the different Alevin runs! Use the later run, which has higher numbers in the history!
Hands-on: Transform matrix
- salmonKallistoMtxTo10x ( Galaxy version 0.0.1+galaxy6) with the following parameters:
- param-file “.mtx-format matrix”:
per-cell gene-count matrix (MTX)
(output of Alevin tool)- param-file “Tab-delimited genes file”:
column headers (gene-ids)
(output of Alevin tool)- param-file “Tab-delimited barcodes file”:
row index (CB-ids)
(output of Alevin tool)- Rename galaxy-pencil ‘salmonKallistoMtxTo10x….:genes’ to
Gene table
- Rename galaxy-pencil ‘salmonKallistoMtxTo10x….:barcodes’ to
Barcode table
- Rename galaxy-pencil ‘salmonKallistoMtxTo10x….:matrix’ to
Matrix table
The output is a matrix in the correct orientation for the rest of our tools. However, our matrix is looking a bit sparse - for instance, click on Gene table
. I don’t know about you, but I’d struggle to have a good biological discussion using only Ensembl gene_ids! What I’d really like is the more understandable ‘GAPDH’ or other gene acronym, as well as information on mitochondrial genes so that I can assess if my cells were stressed out or not. In order to prepare our data for emptyDrops, we’re going to combine this information into an object, and it’s easiest to add in that information now.
Adding in Gene metadata
QuestionWhere can we find this gene information?
Our old friend the GTF file!
QuestionWhich of the ‘attributes’ in the last column of that file contains the gene acronym?
gene_name
We’re now going to re-run galaxy-refresh the tool that extracts information from our GTF file.
Hands-on: Generate gene information
- GTF2GeneList ( Galaxy version 1.52.0+galaxy0) with the following parameters:
- “Feature type for which to derive annotation”:
gene
- “Field to place first in output table”:
gene_id
- “Suppress header line in output?”:
Yes
- “Comma-separated list of field names to extract from the GTF (default: use all fields)”:
gene_id,gene_name,mito
- “Append version to transcript identifiers?”:
Yes
- “Flag mitochondrial features?”:
Yes
- note, this will auto-fill a bunch of acronyms for searching in the GTF for mitochondrial associated genes. This is good!- “Filter the cDNA file to match the annotations?”:
No
- we don’t need to, we’re done with the FASTA!- Check that the output file type is
tabular
. If not, change the file type by clicking the ‘Edit attributes’galaxy-pencil on the dataset in the history (as if you were renaming the file.) Then clickDatatypes
and type intabular
. ClickChange datatype
.)- Rename galaxy-pencil the annotation table to
Gene Information
Inspect galaxy-eye the Gene Information object in the history. Now you have made a new key for gene_id, with gene name and a column of mitochondrial information (false = not mitochondrial, true = mitochondrial). We need to add this information into the salmonKallistoMtxTo10x output ‘Gene table’. But we need to keep ‘Gene table’ in the same order, since it is referenced in the ‘Matrix table’ by row.
Hands-on: Combine MTX Gene Table with Gene Information
- Join two Datasets with the following parameters:
- “Join”:
Gene Table
- “Using column”:
Column: 1
- “with”:
Gene Information
- “and column”:
Column: 1
- “Keep lines of first input that do not join with second input”:
Yes
- “Keep lines of first input that are incomplete”:
Yes
- “Fill empty columns”:
No
- “Keep the header lines”:
No
If you inspect galaxy-eye the object, you’ll see we have joined these tables and now have quite a few gene_id repeats. Let’s take those out, while keeping the order of the original ‘Gene Table’.
- Cut columns from a table with the following parameters:
- “Cut columns”:
c1,c4,c5
- “Delimited by”:
Tab
- “From”: output of Join two Datasets tool
- Rename output
Annotated Gene Table
Inspect galaxy-eye your Annotated Gene Table
. That’s more like it! You now have gene_id
, gene_name
, and mito
. Now let’s get back to your journey to emptyDrops and sophisticated thresholding of empty droplets!
emptyDrops
emptyDrops Lun et al. 2019 works with a specific form of R object called a SingleCellExperiment. We need to convert our transformed MTX files into that form, using the DropletUtils Read10x tool:
Hands-on: Converting to SingleCellExperiment format
- DropletUtils Read10x ( Galaxy version 1.0.4+galaxy0) with the following parameters:
- param-file “Expression matrix in sparse matrix format (.mtx)”:
Matrix table
- param-file “Gene Table”:
Annotated Gene Table
- param-file “Barcode/cell table”:
Barcode table
- “Should metadata file be added?”:
No
- Rename galaxy-pencil output:
SCE Object
Fantastic! Now that our matrix is combined into an object, specifically the SingleCellExperiment format, we can now run emptyDrops! Let’s get rid of those background droplets containing no cells!
Hands-on: Emptydrops
- DropletUtils emptyDrops ( Galaxy version 1.0.4+galaxy0) with the following parameters:
- param-file “SingleCellExperiment rdata object”:
SCE Object
- “Should barcodes estimated to have no cells be removed from the output object?”:
Yes
Rename galaxy-pencil
serialised SingleCellExperiment
output asEmptied-Object
- Rename galaxy-pencil
tabular output
asEmptied-Tabular Output
QuestionHow many cell barcodes remain after the emptyDrops treatment? Why might that be?
If you click on the
Emptied-Object
in the galaxy-history history, the text in that window says38 barcodes
or something similar to that - there is an element of random in the algorithm, so yours might differ slightly. Why is this so low?? And why might the number be different? Consider…is this a complete set of data?Remember this is a subsampled dataset. If you look carefully at the parameters of emptyDrops, you’ll see it set a minimum threshold at 100 UMI. If you look at the barcode plots above for the 400k read sample, you’ll see this is far too stringent for this subsampled data! To satisfy your curiosity, this minimum threshold would yield
4332
barcodes for the total sample. Also, the number may vary slightly as the output depends on a large number of random iterations.
We will nevertheless proceed with your majestic annotated expression matrix of 38 cells, ready to go for further processing and analysis! However, the next tutorials we will link to use a tool suite called Scanpy Wolf et al. 2018. You need to convert this SingleCellExperiment object into a format called annData
, which is a variant of a file format called hdf5
.
Hands-on: Converting to AnnData format
- SCEasy Converter ( Galaxy version 0.0.7+galaxy2) with the following parameters:
- “Convert From / To”:
SingleCellExperiment to AnnData
- param-file “Input object in sce,rds,rdata.sce format”:
Emptied-Object
If the dataset does not show up in the corresponding input field or displays as ‘unavailable’, don’t worry - try dragging the dataset from the history panel and dropping it into the input field. If this still doesn’t work, then you can change the datatype to rdata.sce.
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, click galaxy-chart-select-data Datatypes tab on the top
- In the galaxy-chart-select-data Assign Datatype, select your desired datatype from “New type” dropdown
- Tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
- Rename galaxy-pencil output
N701-400k-AnnData
congratulations Congrats! Your object is ready to for the scanpy pipeline! You can can check your work against the example history. You can also compare how the subsampled datasets you’ve generated compare with the total sample.
However, it may be that you want to combine this object with others like it, for instance, maybe you ran 5 samples, and you are starting with 10 FASTQ files…
Analysing multiple FASTQ files
This sample was originally one of seven. So to run the other 12 downsampled FASTQ files, you can use a workflow and run them all at once! All these samples are going to take a while, so go and have several cups of tea… Or, better yet, I have run them myself. To combine the resultant files into a single matrix, you can look at the next tutorial in this case study: Combining datasets after pre-processing
Mitochondrial flagging
We have assumed you will be combining multiple files - but if that’s not the case, you’ll need to perform this step to turn your column of true
and false
labelling the mitochondrial genes into some metrics telling you the % of mitochondrial genes in each cell. You can follow that step here: Mitochondrial calculations.
Conclusion
We have:
- Examined raw read data, annotations and necessary input files for quantification.
- Run Alevin in two different parameterisations, both allowing Alevin to make its own calls on what constitutes empty droplets, and applying emptyDrops instead.
- Deployed barcode rank plots as a way of quickly assessing the signal present in droplet datasets.
- Applied the necessary conversion to pass these data to downstream processes.
feedback To discuss with like-minded scientists, join our Galaxy Training Network chatspace in Slack and discuss with fellow users of Galaxy single cell analysis tools on #single-cell-users
We also post new tutorials / workflows there from time to time, as well as any other news.
point-right If you’d like to contribute ideas, requests or feedback as part of the wider community building single-cell and spatial resources within Galaxy, you can also join our Single cell & sPatial Omics Community of Practice.
tool You can request tools here on our Single Cell and Spatial Omics Community Tool Request Spreadsheet