Differential abundance testing of small RNAs

Overview

question Questions
• What small RNAs are expressed?
• What RNA features have significantly different numbers of small RNAs targeting them between two conditions?
objectives Objectives
• Process small RNA-seq datasets to determine quality and reproducibility.
• Filter out contaminants (e.g. rRNA reads) in small RNA-seq datasets.
• Differentiate between subclasses of small RNAs based on their characteristics.
• Identify differently abundant small RNAs and their targets.
requirements Requirements

time Time estimation: 3h

Introduction

Small, noncoding RNA (sRNA) molecules, typically 18-40nt in length, are key features of post-transcriptional regulatory mechanisms governing gene expression. Through interactions with protein cofactors, these tiny sRNAs typically function by perfectly or imperfectly basepairing with substrate RNA molecules, and then eliciting downstream effects such as translation inhibition or RNA degradation. Different subclasses of sRNAs - e.g. microRNAs (miRNAs), Piwi-interaction RNAs (piRNAs), and endogenous short interferring RNAs (siRNAs) - exhibit unique characteristics, and their relative abundances in biological contexts can indicate whether they are active or not. In this tutorial, we will examine expression of the piRNA subclass of sRNAs and their targets in Drosophila melanogaster.

The data used in this tutorial are from polyphosphatase-treated sRNA sequencing (sRNA-seq) experiments in Drosophila. The goal of this study was to determine how siRNA expression changes in flies treated with RNA interference (RNAi) to knock down Symplekin, which is a component of the core cleaveage completx. To that end, in addition to sRNA-seq, mRNA-seq experiments were performed to determine whether targets of differentially expressed siRNAs were also differentially expressed. The original published study can be found here. Because of the long processing time for the large original files - which contained 7-22 million reads each - we have downsampled the original input data to include only a subset of usable reads.

Analysis strategy

In this exercise we will identify what small RNAs are present in Drosophila Dmel-2 tissue culture cells treated with either RNAi against core cleavage complex component Symplekin or blank (control) RNAi (published data available in GEO at GSE82128). In this study, biological triplicate sRNA- and mRNA-seq libraries were sequenced for both RNAi conditions. After removing contaminant ribosomal RNA (rRNA) and miRNA reads, we will identify and quantify endogenous siRNAs from sequenced reads and test for differential abundance. We will follow a popular small RNA analysis pipeline developed for piRNAs by the Phillip Zamore Lab and the ZLab (Zhiping Weng) at UMass Med School called PiPipes. Although PiPipes was developed for analysis of piRNAs, many of the basical principles can be applied to other classes of small RNAs.

It is of note that this tutorial uses datasets that have been de-multiplexed so that the input data files are a single FASTQ formatted file for each sample. Because sRNAs are typically much smaller than fragments generated for RNA-seq or other types of deep sequencing experiments, single-end sequencing strategies are almost always used to sequence sRNAs. This tutorial uses the Collections feature of Galaxy to orgainze each set of replicates into a single group, making tool form submission easier and ensuring reproducibility.

Agenda

In this tutorial, we will deal with:

Due to the large size of the original sRNA-seq datasets, we have downsampled them to only inlcude a subset of reads. These datasets are avaialble at Zenodo where you can find the FASTQ files corresponding to replicate sRNA-seq experiments and additional annotation files for the Drosophila melanogaster genome version dm3.

hands_on Hands-on: Data upload and organization

1. Create a new history and name it something meaningful (e.g. sRNA-seq tutorial)
2. Open the Data Upload Manager by selecting Get Data from the Tool Panel and clicking Upload File
3. Select Paste/Fetch Data
4. Copy each link for the read (.fastqsanger), annotation (.tab), and reference sequence (.fa) files, and paste each link into a separate text field
• Set the datatype of the read (.fastqsanger) files to fastq
• Set the datatype of the annotation (.tab) file to tab and assign the Genome as dm3
• Set the datatype of the reference (.fa) files to fasta and assign the Genome as dm3
5. Click Start
6. Rename the files in your history to something meaningful (e.g. Blank_RNAi_sRNA_rep1.fastq)
7. Build a Dataset list for each set of replicate FASTQ files
• Click the Operations on multiple datasets check box at the top of the history panel
• Check the three boxes next to the blank RNAi (control) sRNA-seq samples
• Click For all selected… and choose Build dataset list
• Ensure that only the three blank samples are selected, and enter a name for the new collection (e.g. Blank RNAi sRNA-seq)
• Click Create list
• Repeat for the three Symplekin RNAi samples

Read quality scores (phred scores) in FASTQ-formatted data can be encoded by one of a few different encoding schemes. Most Galaxy tools assume that input FASTQ files are using the Sanger/Illumina 1.9 encoding scheme. If the input FASTQ files are using an alternate encoding scheme, then some tools will not interpret the quality score encodings correctly. It is good practice to confirm the quality encoding scheme of your data and then convert to Sanger/Illumina 1.9, if necessary. We can check the quality encoding scheme using the FastQC tool (further described in the NGS-QC tutorial).

hands_on Hands-on: Quality checking

1. FastQC tool: Run FastQC on each collection of FASTQ read files to assess the overall read/base quality and quality score encoding scheme using the following parameters:
• Short read data from your current history: Click the “Dataset collection” tab and then select the blank RNAi sRNA-seq dataset collection
2. Repeat for the Symplekin RNAi dataset collection

question Questions

1. What quality score encoding scheme is being used for each sample?
2. What is the read length for each sample?
3. What does the base/read quality look like for each sample?

solution Solution

1. All samples use the Illumina 1.9 quality encoding scheme, so we do not need to convert.
2. All samples have a read length of 50 nt.
3. The base quality across the entire length of the reads is good (phred score > 28 for the most part).
4. Yes, the “Illumina Universal Adapter” is present.

THE FOLLOWING STEP IS NOT NEEDED FOR THIS TUTORIAL. But, if your data are not in Sanger / Illumina 1.9 format, then complete the following step.

1. FASTQ Groomer tool: Run FASTQ Groomer on each collection of FASTQ read files to convert the quality scores from Illumina 1.5 to Sanger/Illumina 1.9 encoding using the following parameters:
• File to groom: Click the “Dataset collection” tab and then select the control RNAi sRNA-seq dataset collection
• Input FASTQ quality scores type: Illumina 1.3-1.7

If you are following this tutorial OR your data were already in Sanger / Illumina 1.9 format: Changes the datatype of each input file to ‘fastqsanger’.

If your data were not in Sanger / Illumina 1.9 format: After FASTQ Groomer finishes, click on the groomed dataset collection name and then click on the name of one of the datasets. You should see that the format is fastqsanger instead of fastq, meaning you have successfully converted the quality score encoding scheme.

Go back to the FASTQC output and scroll down to the “Adapter Content” section. You can see that Illumina Universal Adapters are present in >80% of the reads. The next step is to remove these artificial adaptors because they are not part of the biological sRNAs. If a different adapter is present, you can update the Adapter sequence to be trimmed off in the Trim Galore! step described below.

sRNA-seq library preparation involves adding an artificial adaptor sequence to both the 5’ and 3’ ends of the small RNAs. While the 5’ adaptor anchors reads to the sequencing surface and thus are not sequenced, the 3’ adaptor is typically sequenced immediately following the sRNA sequence. In the example datasets here, the 3’ adaptor sequence is identified as the Illumina Universal Adapter, and needs to be removed from each read before aligning to a reference. We will be using the Galaxy tool Trim Galore! which implements the cutadapt tool for adapter trimming.

1. Trim Galore! tool: Run Trim Galore! on each collection of FASTQ read files to remove Illumina adapters from the 3’ ends of reads with the following parameters:
• Is this library paired- or single-end?: Single-end
• Reads in FASTQ format: Click the “Dataset collection” tab and then select the blank RNAi sRNA-seq dataset
• Trim Galore! advanced settings: Full parameter list
• Overlap with adapter sequence required to trim a sequence: 6
• Generate a report file: Yes

2. Repeat for the Symplekin RNAi dataset collection

We don’t want to trim for quality because the adapter-trimmed sequences represent a full small RNA molecule, and we want to maintain the integrity of the entire molecule. We increase the minimum read length required to keep a read because small RNAs can potentially be shorter than 20 nt (the default value). We can check out the generated report file for any sample and see the command for the tool, a summary of the total reads processed and number of reads with an adapter identified, and histogram data of the length of adaptor trimmed. We also see that a very small percentage of low-quality bases have been trimmed

3. FastQC tool: Run FastQC on each collection of trimmed FASTQ read files to assess whether adapters were successfully removed.

question Questions

1. What is the read length?

solution Solution

1. The read lengths range from 12 to 50 nt after trimming.
2. No, Illumina Universal Adaptors are no longer present. No other adapters are present.

An interesting thing to note from our FastQC results is the Sequence Length Distribution results. While many RNA-seq experiments have normal distribution of read lengths, an unusual spike at 22nt is observed in our data. This spike represents the large set of endogenous siRNAs that occur in the cell line used in this study, and is actually confirmation that our dataset captures the biological molecule we are interested in.

Now that we have converted to fastqsanger format and trimmed our reads of the Illumina Universal Adaptors, we will align our trimmed reads to reference Drosophila rRNA and miRNA sequences (dm3) to remove these artifacts. Interestingly, Drosophila have a short 2S rRNA sequence that is 30nt long and typically co-migrates with sRNA populations during gel electrophoresis. rRNAs make up a very large proportion of all non-coding RNAs, and thus need to be removed. Oftentimes, experimental approaches can be utilized to deplete or avoid capture of rRNAs, but these methods are not always 100% efficient. We also want to remove any miRNA sequences, as these are not relevant to our analysis. After removing rRNA and miRNA reads, we will analyze the remaining reads, which may be siRNA or piRNA sequences.

To first identify rRNA-originating reads (which we are not interested in in this case), we will align the reads to a reference set of rRNA sequences using HISAT2, an accurate and fast tool for aligning reads to a reference.

hands_on Hands-on: Heirarchical alignment to rRNA and miRNA reference sequences

1. HISAT2 tool: Run HISAT2 on each collection of trimmed reads to align to reference rRNA sequences with the following parameters:
• Reads: Click the “Dataset collection” tab and then select the blank RNAi sRNA-seq dataset of trimmed FASTQ files
• Source for the reference genome to align against: Use a genome from history
• Select the reference genome: Dmel_rRNA_sequences.fa
• Spliced alignment parameters: Specify spliced alignment parameters
• Specify strand-specific information: Second Strand (F/FR)

2. Repeat for the Symplekin RNAi dataset collection

We now need to extract the unaligned reads from the output BAM file for aligning to reference miRNA sequences. We can do this by using the Filter SAM or BAM, output SAM or BAM tool to obtain reads with a bit flag = 4 (meaning the read is unaligned) and then converting the filtered BAM file to FASTQ format with the Convert from BAM to FastQ tool.

3. Filter SAM or BAM, output SAM or BAM tool: Run Filter SAM or BAM, output SAM or BAM on each collection of HISAT2 output BAM files with the following parameters:
• SAM or BAM file to filter: Click the “Dataset collection” tab and then select the blank RNAi sRNA-seq dataset of aligned HISAT2 BAM files
• Filter on bitwise flag: Yes
• Only output alignments with all of these flag bits set: Check the box next to “The read in unmapped”
4. Repeat for the Symplekin RNAi dataset collection

5. Convert from BAM to FastQ tool: Run Convert from BAM to FastQ on each collection of filtered HISAT2 output BAM files with the following parameters:
• Convert the following BAM file to FASTQ: Click the “Dataset collection” tab and then select the blank RNAi sRNA-seq dataset of filtered HISAT2 BAM files
6. Click “Execute”
7. Repeat for the Symplekin RNAi dataset collection

Next we will align the non-rRNA reads to a known set of miRNA hairpin sequences to identify miRNA reads.

8. HISAT2 tool: Run HISAT2 on each collection of filtered HISAT2 output FASTQ files to align non-rRNA reads to reference miRNA hairpin sequences using the following parameters:
• Reads: Click the “Dataset collection” tab and then select the blank sRNA-seq dataset of non-rRNA FASTQ files
• Source for the reference genome to align against: Use a genome from history
• Select the reference genome: Dmel_miRNA_sequences.fa
• Spliced alignment parameters: Specify spliced alignment parameters
• Specify strand-specific information: Second Strand (F/FR)
9. Repeat for the Symplekin RNAi dataset collection

For this tutorial we are not interested in miRNA reads, so we need to extract unaligned reads from the output BAM files. To do this, repeat the Filter SAM or BAM, output SAM or BAM and Convert from BAM to FastQ steps for each dataset collection. Finally, rename the converted FASTQ files something meaningful (e.g. “non-r/miRNA control RNAi sRNA-seq”).

In some instances, miRNA-aligned reads are the desired output for downstream analyses, for example, if we are investigating how loss of key miRNA pathway components affect levels of pre-miRNA and mature miRNAs. In this case, the BAM output of HISAT2 can directly be used in subsequent tools as it contains the subset of reads aligned to miRNA sequences.

Small RNA subclass distinction

In Drosophila, non-miRNA small RNAs are typically divided into two major groups: endogenous siRNAs which are 20-22nt long and piRNAs which are 23-29nt long. We want to analyze these sRNA subclasses independently, so next we are going to filter the non-r/miRNA reads based on length using the Manipulate FASTQ tool.

hands_on Hands-on: Extract subclasses

1. Manipulate FASTQ tool: Run Manipulate FASTQ on each collection of non-r/miRNA reads to identify siRNAs (20-22nt) using the following parameters.
• FASTQ File: Click the “Dataset collection” tab and then select the blank RNAi sRNA-seq dataset of non-r/miRNA FASTQ files
• Match Reads by: Set to “Sequence Content”
•  Match by: Enter: ^.{12,19}$^.{23,50}$
• Manipulate Reads on: Set to “Miscellaneous actions”
• Miscellaneous Manipulation Type: Set to “Remove Read”
2. Repeat Manipulate FASTQ for the Symplekin RNAi dataset collection
3. Rename each resulting dataset collection something meaningful (i.e. “blank RNAi - siRNA reads (20-22nt)”)

The regular expression in the Match by parameter tells the tool to identify sequences that are length 12-19 or 23-50 (inclusive), and the Miscellaneous Manipulation Type parameter tells the tool to remove these sequences. What remains are sequences of length 20-22nt. We will now repeat these steps to identify sequences in the size-range of piRNAs (23-29nt).

4. Manipulate FASTQ tool: Run Manipulate FASTQ on each collection of non-r/miRNA reads to identify sequences in the size-range of piRNAs (23-29nt) using the following parameters.
• FASTQ File: Click the “Dataset collection” tab and then select the blank RNAi sRNA-seq dataset of non-r/miRNA FASTQ files
• Match Reads by: Set to “Sequence Content”
•  Match by: Enter: ^.{12,22}$^.{30,50}$
• Manipulate Reads on: Set to “Miscellaneous actions”
• Miscellaneous Manipulation Type: Set to “Remove Read”
5. Repeat Manipulate FASTQ for the Symplekin RNAi dataset collection
6. Rename each resulting dataset collection something meaningful (i.e. “blank RNAi - piRNA reads (23-29nt)”)

7. FastQC tool: Run FastQC on each collection of siRNA and piRNA read files to confirm the correct read lengths.

We see in the above image (for Blank RNAi, replicate 3) that we have ~32k reads that are 20-22nt long in our Reads_20-22_nt_length_siRNAs file and ~16k reads that are 23-29nt long in our Reads_23-29_nt_length_piRNAs file, as expected. Given that we had ~268k trimmed reads in the Blank RNAi replicate 3 file, siRNAs represent ~12.1% of the reads and piRNAs represent ~6.2% of the reads in this experiment. There is no distinct peak in the piRNA range (you can find the example of the real piRNA profile in Figure 1 from Colin D. Malone et al, 2009.

The next step in our analysis pipeline is to identify which RNA features - e.g. protein-coding mRNAs, transposable elements - the siRNAs and piRNAs align to. Most fly piRNAs originate from and thus align to transposable elements (TEs) but some also originate from genome piRNA clusters or protein-coding genes. siRNAs are thought to inhibit TE mobility and potentially target mRNAs for degradation via an RNA-interference (RNAi) mechanism. Ultimately, we want to know whether an RNA feature (TE, piRNA cluster, mRNA, etc.) has significantly different numbers of siRNAs or piRNAs targeting it. To determine this, we will use Salmon to simultaneously align and quantify siRNA reads against known target sequences to estimate siRNA abundance per target. For the remainder of the tutorial we will be focusing on siRNAs, but similar approaches can be done for piRNAs, miRNAs, or any other subclass of small, non-coding RNAs.

siRNA abundance estimation

We want to identify which siRNAs are differentially abundance between the blank and Symplekin RNAi conditions. To do this we will implement a counting approach using Salmon to quantify siRNAs per RNA feature. Specifically, we will analyzing mRNA and TE elements by counting siRNAs against a FASTA file of transcript sequences, not the reference genome. This approach is especially useful in this case because we are interested in TEs, which occur at many copies (100s - 1000s) in the genome, and thus will result in multiple alignments if we align to the genome with a tool like HISAT2. Further, we will be counting abundance of siRNAs that align sense or antisense to an RNA feature independently, as opposite sense alignments are correlated with unique downstream silencing effects. Then, we will provide this information to DESeq2 to generate normalized counts and significance testing for differential abundance of siRNAs per feature.

hands_on Hands-on: siRNA abundance estimation

1. Salmon tool: Run Salmon on each collection of siRNA reads (20-22nt) to quantify the abundance of antisense siRNAs at relevant targets. We will focus on abundance of siRNAs on mRNAs and transposable elements using the following parameters:
• Select a reference transcriptome from your history or use a built-in index?: Use one from the history
• Select the reference transcriptome: Select the reference mRNA and TE fasta file
• The size should be odd number: 19
• FASTQ/FASTA file: Click the “Dataset collection” tab and then select the Blank RNAi siRNA (20-22nt) reads
• Specify the strandedness of the reads: read 1 (or single-end read) comes from the reverse strand (SR)
• Additional Options: Click to expand options
• Incompatible Prior: 0

2. Repeat Salmon for the Symplekin RNAi siRNAs (20-22nt) reads dataset collection
3. Salmon tool: Repeat step 1 on each collection of siRNA reads (20-22nt) to quantify the abundance of sense siRNAs at relevant targets by changing the following parameters:
• Specify the strandedness of the reads: read 1 (or single-end read) comes from the forward strand (SF)
4. Repeat Salmon for the Symplekin RNAi siRNAs (20-22nt) reads dataset collection

The output of Salmon includes a table of RNA features, estimated counts, transcripts per million, and feature length. We will be using the TPM quantification column as input to DESeq2 in the next section.

siRNA differential abundance testing

DESeq2 is a great tool for differential expression analysis, but we also employ it here for estimation of abundance of reads targeting each of our RNA features. As input, DESeq2 can take transcripts per million (TPM) counts produced by Salmon for each feature. TPMs are estimates of the relative abundance of a given transcript in units, and analogously can be used here to quantify siRNAs that align either sense or antisense to transcripts.

hands_on Hands-on: siRNA differential abundance testing

1. DESeq2 tool: Run DESeq2 to test for differential abundance of antisense siRNAs at mRNA and TE features using the following parameters:
• Specify a factor name: Enter: RNAi
• Under “1: Factor level”: Specify a factor level: Enter: Symplekin
• Under “1: Factor level”: Counts file(s): Click the “Dataset collection” tab and then select the Symplekin RNAi siRNA counts from reverse strand (SR)
• Under “2: Factor level”: Specify a factor level: Enter: RNAi
• Under “2: Factor level”: Counts file(s): Click the “Dataset collection” tab and then select the Symplekin RNAi siRNA counts from reverse strand (SR)
• Choice of Input data: Select “TPM values (e.g. from sailfish and salmon)
• Tabular file with Transcript - Gene mapping: Select reference tx2g (transcript to gene) file (available through zenodo link)
• Output normalized counts table: Yes

2. Repeat DESeq2 for the sense siRNAs at mRNA and TE features changing the following parameters:
• Under “1: Factor level”: Counts file(s): Click the “Dataset collection” tab and then select the Symplekin RNAi siRNA counts from forward strand (SF)
• Under “2: Factor level”: Counts file(s): Click the “Dataset collection” tab and then select the Symplekin RNAi siRNA counts from forward strand (SF)
3. Filter tool: Run Filter to extract features with a significantly different antisense siRNA abundance (adjusted p-value less than 0.05).
• Filter: Select the DESeq2 result file from testing antisense (SR) siRNA abundances
• With following condition: Enter: c7<0.05

4. Repeat Filter for the sense siRNAs at mRNA and TE features changing the following parameters:
• Filter: Select the DESeq2 result file from testing sense (SF) siRNA abundances

question Question

How many features have a significant difference in antisense and sense siRNA abundance in the Symplekin RNAi condition?

solution Solution

There are 87 antisense and 15 sense features that have significanlty different siRNA abundances.

For more information about DESeq2 and its outputs, have a look at DESeq2 documentation.

Coming soon!

Conclusion

Analysis of small RNAs is a complicated and intricate process due to the diversity in characteristics, functionality, and nuances of small RNA subclasses. The goal of this tutorial is to introduce you to a common small RNA workflow that specifically identifies changes in endogenous siRNA abundances that target protein-coding mRNAs and transposable elements. The steps presented here can be rearranged and modified based on small RNA features of specific systems and the needs of the user.

Small RNA analysis (general) pipeline

keypoints Key points

• Analysis of small RNAs is complex due to the diversity of small RNA subclasses.
• Both alignment to references and transcript quantification approaches are useful for small RNA-seq analyses.

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.