Long non-coding RNAs (lncRNAs) annotation with FEELnc

Overview
Questions:
  • How to annotate lncRNAs with FEELnc?

  • How to classify lncRNAs according to their localisation and direction of transcription of proximal RNA transcripts?

  • How to update genome annotation with these annotated lncRNAs?

Objectives:
  • Load data (genome assembly, annotation and mapped RNASeq) into Galaxy

  • Perform a transcriptome assembly with StringTie

  • Annotate lncRNAs with FEELnc

  • Classify lncRNAs according to their location

  • Update genome annotation with lncRNAs

Requirements:
Time estimation: 2 hours
Level: Intermediate
Supporting Materials:
Published: Sep 23, 2022
Last modification: Nov 3, 2023
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00177
Revision: 9

Messenger RNAs (mRNAs) are not the only type of RNA present in organisms such as mammals, insects or plants, and they represent only a small fraction of the transcripts. A vast repertoire of small RNAs (miRNAs, snRNAs) and long non-coding RNAs (lncRNAs) is also present. lncRNAs are generally defined as transcripts longer than 200 nucleotides that are not translated into functional proteins. They are important because of their major roles in the cellular machinery and because of their large numbers: they are notably involved in the regulation of gene expression, the control of translation and imprinting. Statistics from the GENCODE project reveal that the human genome contains more than 19,095 lncRNA genes, almost as many as the 19,370 protein-coding genes.

Using RNASeq data, we can reconstruct assembled transcripts (with or without a reference genome), which can then be annotated and identified individually as mRNAs or lncRNAs.

In this tutorial, we will use a software tool called StringTie (“StringTie enables improved reconstruction of a transcriptome from RNA-seq reads” 2015) to assemble the transcripts and then FEELnc (“FEELnc: a tool for long non-coding RNA annotation and its application to the dog transcriptome” 2017) to annotate the assembled transcripts of a small eukaryote: Mucor mucedo (a fungal plant pathogen).

StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts.

FEELnc (FlExible Extraction of Long non-coding RNA) is a pipeline to annotate lncRNAs from RNASeq assembled transcripts. It is composed of 3 modules:

  • FEELnc_filter: Extract and filter candidate transcripts.
  • FEELnc_codpot: Compute the coding potential of candidate transcripts.
  • FEELnc_classifier: Classify lncRNAs based on their genomic localisation with respect to other transcripts.
Agenda

In this tutorial, we will cover:

  1. Data upload
  2. Transcripts assembly with StringTie
  3. lncRNAs annotation with FEELnc
  4. Conclusion

Data upload

To assemble the transcriptome with StringTie and annotate lncRNAs with FEELnc, we will use the following files:

  • The genome sequence in FASTA format. For this tutorial, we will use the genome assembled in the Flye assembly tutorial.
  • The genome annotation in GFF3 format. We will use the genome annotation obtained in the Funannotate tutorial.
  • Aligned RNASeq data in BAM format. Here, we will use RNASeq reads that were mapped with STAR.
Hands-on: Data upload
  1. Create a new history for this tutorial

    Click the new history icon at the top of the history panel.

  2. Import the files from Zenodo or from the shared data library (GTN - Material -> genome-annotation -> Long non-coding RNAs (lncRNAs) annotation with FEELnc):

    https://zenodo.org/record/7107050/files/genome_assembly.fasta
    https://zenodo.org/record/7107050/files/genome_annotation.gff3
    https://zenodo.org/record/7107050/files/all_RNA_mapped.bam
    
    • Copy the link location
    • Click Upload Data at the top of the tool panel

    • Select Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    1. Go into Shared data (top panel) then Data libraries
    2. Navigate to the correct folder as indicated by your instructor.
      • On most Galaxy servers, tutorial data is provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.
    3. Select the desired files
    4. Click on Add to History near the top and select as Datasets from the dropdown menu
    5. In the pop-up window, choose

      • “Select history”: the history you want to import the data to (or create a new one)
    6. Click on Import
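
For readers who want the same three files outside Galaxy, the download can be sketched in Python. This is only a minimal sketch using the standard library: the function names and the `data` target directory are hypothetical, and the URLs are the ones listed in the hands-on above.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

# URLs as listed in the hands-on above
ZENODO_URLS = [
    "https://zenodo.org/record/7107050/files/genome_assembly.fasta",
    "https://zenodo.org/record/7107050/files/genome_annotation.gff3",
    "https://zenodo.org/record/7107050/files/all_RNA_mapped.bam",
]


def local_name(url):
    """Return the file name component of a URL."""
    return Path(urlparse(url).path).name


def download_all(urls, target_dir="data"):
    """Download each URL into target_dir (hypothetical location).

    Files that already exist locally are not downloaded again.
    Returns the list of local paths.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in urls:
        destination = target / local_name(url)
        if not destination.exists():
            urlretrieve(url, str(destination))
        paths.append(destination)
    return paths


# Usage (performs network access, so it is left commented out):
# download_all(ZENODO_URLS)
```

Inside Galaxy none of this is needed, since Paste/Fetch Data performs the download on the server side.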

Transcripts assembly with StringTie

StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. StringTie takes as input a SAM, BAM or CRAM file sorted by coordinate (genomic location). This file should contain spliced RNA-seq read alignments such as the ones produced by TopHat, HISAT2 or STAR. The TopHat output is already sorted, but the SAM output from other aligners should be sorted using the samtools program.
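
Whether an alignment file declares itself as coordinate-sorted can be read from the `SO` tag of the `@HD` header line defined by the SAM format. The sketch below checks that tag on header text, which could be obtained for instance with `samtools view -H`. The function names and the example header are illustrative only, and the tag reflects what the file declares, not a verification of the actual record order.

```python
def sam_sort_order(header_text):
    """Return the SO value from the @HD line of a SAM header, or 'unknown'."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return "unknown"


def is_coordinate_sorted(header_text):
    """True when the header declares a coordinate sort order."""
    return sam_sort_order(header_text) == "coordinate"


# Invented header, for illustration only
example_header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:contig_1\tLN:1000\n"
print(is_coordinate_sorted(example_header))  # True
```

If the check fails, the file should be sorted by coordinate before it is given to StringTie.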

A reference annotation file in GTF or GFF3 format can be provided to StringTie. Its transcripts are then used as ‘guides’ for the assembly process, which helps improve the recovery of the structure of those transcripts.

Hands-on: Transcripts assembly

StringTie (Galaxy version 2.1.7+galaxy1) with the following parameters:

  • “Input options”: Short reads
  • “Input short mapped reads”: all_RNA_mapped.bam
  • “Specify strand information”: Unstranded
  • “Use a reference file to guide assembly?”: Use reference GTF/GFF3
  • “Reference file”: Use a file from history
    • “GTF/GFF3 dataset to guide assembly”: genome_annotation.gff3
  • “Use Reference transcripts only?”: No
  • “Output files for differential expression?”: No additional output
  • “Output coverage file”: No
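
For orientation, the Galaxy parameters above roughly correspond to a reference-guided StringTie run on the command line. The sketch below only builds such a command. The output file name `stringtie_assembly.gtf` and the helper name are hypothetical, and the exact command assembled by the Galaxy wrapper may differ.

```python
import shlex


def stringtie_command(bam, guide_annotation, output_gtf, threads=1):
    """Build a StringTie command line for a reference-guided assembly.

    No strand flag is added (unstranded data) and -e is not used,
    so novel transcripts are assembled as well as reference ones.
    """
    return [
        "stringtie", bam,
        "-G", guide_annotation,  # reference annotation used as a guide
        "-o", output_gtf,        # assembled transcripts in GTF format
        "-p", str(threads),      # number of processing threads
    ]


cmd = stringtie_command(
    "all_RNA_mapped.bam", "genome_annotation.gff3", "stringtie_assembly.gtf"
)
print(shlex.join(cmd))
# stringtie all_RNA_mapped.bam -G genome_annotation.gff3 -o stringtie_assembly.gtf -p 1
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` on a machine where StringTie is installed.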

We obtain an annotation file (GTF format) which contains all the assembled transcripts present in the RNASeq data.

After this step, the transcriptome is assembled and ready for lncRNAs annotation.

Question

How many transcripts are assembled?

Specific features can be extracted from the GTF file using, for example, Extract features from GFF data. By selecting transcript under From column 3 / Feature, we keep only the transcript elements present in this annotation file. The assembly contains 14,877 transcripts (corresponding to the number of lines in the filtered GTF file).
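
The same count can be reproduced outside Galaxy by reading the third column of the GTF file. This is a minimal sketch: the function name and the example lines are invented for illustration, and only the column layout of the GTF format is assumed.

```python
def count_features(gtf_lines, feature="transcript"):
    """Count GTF records whose third column equals the given feature."""
    count = 0
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) >= 3 and columns[2] == feature:
            count += 1
    return count


# Invented records, for illustration only
example = [
    "# example header line\n",
    'contig_1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\tgene_id "STRG.1"; transcript_id "STRG.1.1";\n',
    'contig_1\tStringTie\texon\t100\t400\t1000\t+\t.\tgene_id "STRG.1"; transcript_id "STRG.1.1";\n',
    'contig_1\tStringTie\texon\t600\t900\t1000\t+\t.\tgene_id "STRG.1"; transcript_id "STRG.1.1";\n',
    'contig_2\tStringTie\ttranscript\t50\t300\t1000\t-\t.\tgene_id "STRG.2"; transcript_id "STRG.2.1";\n',
    'contig_2\tStringTie\texon\t50\t300\t1000\t-\t.\tgene_id "STRG.2"; transcript_id "STRG.2.1";\n',
]
print(count_features(example))  # 2
```

On a real file, an open file handle can be passed in place of the list, since the function only iterates over lines.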

lncRNAs annotation with FEELnc

FEELnc is a pipeline composed of 3 steps, which are all run automatically when FEELnc is launched within Galaxy. The first step (FEELnc_filter) consists of filtering out unwanted or spurious transcripts, as well as transcripts that overlap, in sense, exons of the reference annotation. Protein-coding exons are especially targeted, because overlapping transcripts more probably correspond to new mRNA isoforms.
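
The idea of a sense overlap with reference exons can be illustrated with a small interval test. This is a simplified sketch and not the FEELnc_filter implementation: the `Exon` type, the function names and the coordinates are invented, and the other filtering criteria applied by FEELnc are not modelled.

```python
from typing import NamedTuple


class Exon(NamedTuple):
    seqid: str
    start: int   # 1-based, inclusive (GTF convention)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def overlaps_in_sense(a, b):
    """True when two exons share at least one base on the same strand."""
    return (
        a.seqid == b.seqid
        and a.strand == b.strand
        and a.start <= b.end
        and b.start <= a.end
    )


def filter_candidates(candidates, reference_exons):
    """Keep candidate transcripts with no exon overlapping a reference exon in sense.

    candidates maps a transcript identifier to its list of exons.
    """
    kept = {}
    for transcript_id, exons in candidates.items():
        if not any(overlaps_in_sense(e, r) for e in exons for r in reference_exons):
            kept[transcript_id] = exons
    return kept


# Invented coordinates, for illustration only
reference = [Exon("contig_1", 100, 500, "+")]
candidates = {
    "cand_sense": [Exon("contig_1", 450, 800, "+")],      # overlaps in sense
    "cand_antisense": [Exon("contig_1", 450, 800, "-")],  # overlaps, opposite strand
    "cand_far": [Exon("contig_1", 600, 900, "+")],        # no overlap
}
print(sorted(filter_candidates(candidates, reference)))
# ['cand_antisense', 'cand_far']
```

The pairwise comparison is quadratic and only suitable for a toy example. A real tool would index the reference exons per sequence and strand.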

To use FEELnc, we need a reference annotation file in GTF format that contains the protein-coding gene annotation. So far, we only have the reference annotation in GFF3 format (genome_annotation.gff3). To convert it from GFF3 to GTF format, we will use gffread.

Hands-on: FEELnc
  1. gffread (Galaxy version 2.2.1.3+galaxy0) with the following parameters:
    • “Input BED, GFF3 or GTF feature file”: genome_annotation.gff3
    • “Feature File Output”: GTF
  2. FEELnc (Galaxy version 0.2) with the following parameters:
    • “Transcripts assembly”: Assembled transcript (output of StringTie tool)
    • “Reference annotation”: genome_annotation.gtf (output of gffread tool)
    • “Genome sequence”: genome_assembly.fasta
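
Outside Galaxy, the GFF3 to GTF conversion of step 1 can be done with the gffread command-line tool, whose `-T` option selects GTF output. The sketch below only builds the command. The helper name is hypothetical, and the Galaxy wrapper may add further options.

```python
import shlex


def gffread_to_gtf_command(gff3_path, gtf_path):
    """Build a gffread command converting a GFF3 file to GTF."""
    return [
        "gffread", gff3_path,
        "-T",            # write GTF instead of GFF3
        "-o", gtf_path,  # output file
    ]


cmd = gffread_to_gtf_command("genome_annotation.gff3", "genome_annotation.gtf")
print(shlex.join(cmd))
# gffread genome_annotation.gff3 -T -o genome_annotation.gtf
```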

FEELnc provides 3 output files:

  • lncRNA annotation file: annotation file in GTF format which contains the final set of lncRNAs
  • mRNA annotation file: annotation file in GTF format which contains the final set of mRNAs
  • Classifier output file: table containing the classification of lncRNAs based on their genomic localisation with respect to other transcripts (direction: sense or antisense; type: genic if the lncRNA gene overlaps an RNA gene from the reference annotation file, or intergenic (lincRNA) if not).

FEELnc also provides a summary in its standard output (stdout).

Question

How many RNAs does this annotation contain? How many interactions between lncRNAs and mRNAs have been identified? Can you describe the different types of lncRNAs?

The summary file indicates that 104 lncRNAs and 0 new mRNAs were annotated by FEELnc. The initial annotation contains 13,795 mRNAs. Therefore, a total of 13,899 RNAs (13,795 + 104) are currently annotated.

The summary file indicates 652 interactions between lncRNAs and mRNAs. These interactions are described in the Classifier output file.

The different types of lncRNAs (intergenic, in sense or antisense; genic, in sense) are described in the Classifier output file. We observe that the majority of the lncRNAs are intergenic. Each of these lncRNAs can have interactions with several mRNAs. Only 7 lncRNAs are genic; each of them has a single interaction, with the mRNA that contains it.
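
Such counts can be recomputed from the Classifier output table. The sketch below assumes a tab-separated table with a header line containing at least the columns `lncRNA_transcript` and `type`. These column names, like the example rows, are assumptions to be checked against the actual file.

```python
import csv
import io


def summarise_classes(handle):
    """Return (number of interactions, number of distinct lncRNAs per type)."""
    reader = csv.DictReader(handle, delimiter="\t")
    interactions = 0
    transcripts_per_type = {}
    for row in reader:
        interactions += 1  # one row per lncRNA / partner RNA interaction
        transcripts_per_type.setdefault(row["type"], set()).add(
            row["lncRNA_transcript"]
        )
    return interactions, {t: len(ids) for t, ids in transcripts_per_type.items()}


# Invented table, for illustration only
rows = [
    ["lncRNA_transcript", "partnerRNA_transcript", "direction", "type"],
    ["lnc1.1", "mRNA_a", "sense", "intergenic"],
    ["lnc1.1", "mRNA_b", "antisense", "intergenic"],
    ["lnc2.1", "mRNA_c", "sense", "genic"],
]
example_table = io.StringIO("\n".join("\t".join(r) for r in rows) + "\n")
print(summarise_classes(example_table))
# (3, {'intergenic': 1, 'genic': 1})
```

Counting distinct transcripts per type separates the number of lncRNAs from the number of interactions, which is larger whenever an lncRNA has several partner mRNAs.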

For future analyses, it would be interesting to use an updated annotation containing both the mRNA and the lncRNA annotations. Thus, we will merge the reference annotation with the lncRNA annotation obtained with FEELnc.

Hands-on: Merge the annotations

concatenate (Galaxy version 0.1.1) with the following parameters:

  • “Datasets to concatenate”: genome_annotation.gtf
  • Insert Dataset
  • “Dataset”: lncRNA annotation with FEELnc
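
Conceptually, this step appends the lncRNA records after the reference records in a single GTF file. A minimal sketch of that operation is given below. The function name and the file names in the usage comment are hypothetical, and the Galaxy tool itself may behave differently in details such as header handling.

```python
def concatenate_annotations(input_paths, output_path):
    """Write the lines of each input file, in order, to output_path.

    A missing final newline is added so that the last record of one
    file is never merged with the first record of the next.
    Returns the number of lines written.
    """
    written = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path) as handle:
                for line in handle:
                    if not line.endswith("\n"):
                        line += "\n"
                    out.write(line)
                    written += 1
    return written


# Usage with hypothetical file names:
# concatenate_annotations(
#     ["genome_annotation.gtf", "lncRNA_annotation.gtf"],
#     "updated_annotation.gtf",
# )
```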

Conclusion

Congratulations on reaching the end of this tutorial! You now know how to annotate lncRNAs using RNASeq data.