Long non-coding RNAs (lncRNAs) annotation with FEELnc

Author(s)	Stéphanie Robin
Editor(s)	Anthony Bretaudeau

Overview
Questions:

How to annotate lncRNAs with FEELnc?

How to classify lncRNAs according to their localisation and direction of transcription of proximal RNA transcripts?

How to update genome annotation with these annotated lncRNAs?

Objectives:

Load data (genome assembly, annotation and mapped RNASeq) into Galaxy

Perform a transcriptome assembly with StringTie

Annotate lncRNAs with FEELnc

Classify lncRNAs according to their location

Update genome annotation with lncRNAs

Requirements:

Introduction to Galaxy Analyses

Hands-on: Genome annotation with Funannotate

Time estimation: 2 hours

Level: Intermediate

Supporting Materials:

Datasets

Workflows

FAQs


Published: Sep 23, 2022

Last modification: Sep 18, 2024

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

PURL: https://gxy.io/GTN:T00177

Rating: 5.0 (2 recent ratings, 7 all time)

Revision: 11

Messenger RNAs (mRNAs) are not the only type of RNA present in organisms (such as mammals, insects or plants), and they represent only a small fraction of the transcripts. A vast repertoire of small (miRNAs, snRNAs) and long non-coding RNAs (lncRNAs) is also present. lncRNAs are generally defined as transcripts longer than 200 nucleotides that are not translated into functional proteins. They are important because of their major roles in the cellular machinery and their large numbers: they are notably involved in gene expression regulation, control of translation and imprinting. Statistics from the GENCODE project reveal that the human genome contains more than 19,095 lncRNA genes, almost as many as the 19,370 protein-coding genes.

Using RNASeq data, we can reconstruct assembled transcripts (with or without a reference genome), which can then be annotated and identified individually as mRNAs or lncRNAs.

In this tutorial, we will use a software tool called StringTie (“StringTie enables improved reconstruction of a transcriptome from RNA-seq reads” 2015) to assemble the transcripts and then FEELnc (“FEELnc: a tool for long non-coding RNA annotation and its application to the dog transcriptome” 2017) to annotate the assembled transcripts of a small eukaryote: Mucor mucedo (a fungal plant pathogen).

StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts.

FEELnc (FlExible Extraction of Long non-coding RNA) is a pipeline to annotate lncRNAs from RNASeq assembled transcripts. It is composed of 3 modules:

FEELnc_filter: Extract and filter candidate transcripts.
FEELnc_codpot: Compute the coding potential of candidate transcripts.
FEELnc_classifier: Classify lncRNAs based on their genomic localisation with respect to other transcripts.

Agenda

In this tutorial, we will cover:

Data upload

Transcripts assembly with StringTie

lncRNAs annotation with FEELnc

Conclusion

Data upload

To assemble the transcriptome with StringTie and annotate lncRNAs with FEELnc, we will use the following files:

The genome sequence in FASTA format. For this tutorial, we will use the genome assembled in the Flye assembly tutorial.
The genome annotation in GFF3 format. We will use the genome annotation obtained in the Funannotate tutorial.
Some aligned RNASeq data in BAM format. Here, we will use mapped RNASeq data for which the mapping was done with STAR.

Hands On: Data upload
Create a new history for this tutorial

To create a new history, simply click the new-history icon at the top of the history panel.
Import the files from Zenodo or from the shared data library (GTN - Material -> genome-annotation -> Long non-coding RNAs (lncRNAs) annotation with FEELnc):
https://zenodo.org/records/11367439/files/genome_assembly.fasta
https://zenodo.org/records/11367439/files/genome_annotation.gff3
https://zenodo.org/records/11367439/files/SRR8534859_RNASeq_mapped.bam
Copy the link location

Click Upload Data at the top of the tool panel

Select Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window
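
Outside Galaxy, the same three files could be fetched with a few lines of Python. This is only a minimal sketch: the URLs are the ones listed above, while the function names and the `data` destination directory are assumptions made for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

# URLs listed in the data upload step of this tutorial.
ZENODO_URLS = [
    "https://zenodo.org/records/11367439/files/genome_assembly.fasta",
    "https://zenodo.org/records/11367439/files/genome_annotation.gff3",
    "https://zenodo.org/records/11367439/files/SRR8534859_RNASeq_mapped.bam",
]


def local_name(url: str) -> str:
    """Return the file name component of a URL."""
    return Path(urlparse(url).path).name


def download_all(urls, dest_dir="data"):
    """Download each URL into dest_dir (hypothetical directory name).

    Files that already exist locally are not downloaded again.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in urls:
        target = dest / local_name(url)
        if not target.exists():
            urlretrieve(url, str(target))
        paths.append(target)
    return paths
```

Calling `download_all(ZENODO_URLS)` would then return the local paths of the FASTA, GFF3 and BAM files.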

As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

Go into Libraries (left panel)

Navigate to the correct folder as indicated by your instructor.

On most Galaxies, tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.

Select the desired files

Click on Add to History near the top and select as Datasets from the dropdown menu

In the pop-up window, choose

“Select history”: the history you want to import the data to (or create a new one)

Click on Import

Transcripts assembly with StringTie

StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. StringTie takes as input a SAM, BAM or CRAM file sorted by coordinate (genomic location). This file should contain spliced RNA-seq read alignments such as the ones produced by TopHat, HISAT2 or STAR. The TopHat output is already sorted, but the SAM output from other aligners should be sorted using the samtools program.
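
Whether an alignment file is declared as coordinate-sorted can be read from the `SO` field of the `@HD` line in its SAM header. The sketch below assumes the header text has already been obtained (for example with `samtools view -H`); the function names are illustrative. Note that it only checks what the header declares, not the actual order of the records.

```python
def sam_sort_order(header_text: str) -> str:
    """Return the SO value of the @HD header line, or "unknown" if absent."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            # Header fields are tab-separated TAG:VALUE pairs.
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return "unknown"


def is_coordinate_sorted(header_text: str) -> bool:
    """True when the header declares coordinate sorting."""
    return sam_sort_order(header_text) == "coordinate"


# Made-up header used only to illustrate the check.
example_header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:contig_1\tLN:1000"
print(is_coordinate_sorted(example_header))
```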

A reference annotation file in GTF or GFF3 format can be provided to StringTie. Its transcripts are then used as ‘guides’ for the assembly process, which helps improve the recovery of transcript structures for those transcripts.

Hands On: Transcripts assembly

StringTie (Galaxy version 2.2.3+galaxy1) with the following parameters:

“Input options”: Short reads

“Input short mapped reads”: SRR8534859_RNASeq_mapped.bam

“Specify strand information”: Unstranded

“Use a reference file to guide assembly?”: Use reference GTF/GFF3

“Reference file”: Use a file from history

“GTF/GFF3 dataset to guide assembly”: genome_annotation.gff3

“Use Reference transcripts only?”: No

“Output files for differential expression?”: No additional output

“Output coverage file”: No
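
For readers working outside Galaxy, these parameters roughly correspond to a reference-guided StringTie call on the command line. The sketch below only builds the argument list: `-G` (guide annotation) and `-o` (output GTF) are StringTie options, while the output file name and the helper function are assumptions. The exact command run by the Galaxy wrapper may include further options.

```python
import shlex


def stringtie_command(bam, guide_annotation, output_gtf):
    """Build an argument list for a reference-guided StringTie run.

    -G passes the guide annotation and -o names the output GTF.
    No strand option is added, which matches unstranded data.
    """
    return ["stringtie", bam, "-G", guide_annotation, "-o", output_gtf]


cmd = stringtie_command(
    "SRR8534859_RNASeq_mapped.bam",
    "genome_annotation.gff3",
    "stringtie_assembly.gtf",  # hypothetical output name
)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where StringTie is installed.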

We obtain an annotation file (GTF format) which contains all assembled transcripts present in the RNASeq data.

After this step, the transcriptome is assembled and ready for lncRNAs annotation.

Question

How many transcripts are assembled?

Specific features can be extracted from the GTF file using, for example, Extract features from GFF data. By selecting transcript in column 3 (Feature), we keep only the transcript elements present in this annotation file. The assembly contains 17,653 transcripts (corresponding to the number of lines in the filtered GTF file).
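
The same count can be reproduced by tallying the values of column 3 of the GTF file. The following is a minimal sketch with made-up records; the function name and the example lines are illustrative and are not taken from the tutorial data.

```python
from collections import Counter


def count_features(gtf_lines):
    """Count the feature types found in column 3 of GTF lines."""
    counts = Counter()
    for line in gtf_lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


# Made-up StringTie-style records: 2 transcripts and 3 exons.
example = [
    "# example records",
    'contig_1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\tgene_id "STRG.1"; transcript_id "STRG.1.1";',
    'contig_1\tStringTie\texon\t100\t400\t1000\t+\t.\tgene_id "STRG.1"; transcript_id "STRG.1.1";',
    'contig_1\tStringTie\texon\t600\t900\t1000\t+\t.\tgene_id "STRG.1"; transcript_id "STRG.1.1";',
    'contig_2\tStringTie\ttranscript\t50\t300\t1000\t-\t.\tgene_id "STRG.2"; transcript_id "STRG.2.1";',
    'contig_2\tStringTie\texon\t50\t300\t1000\t-\t.\tgene_id "STRG.2"; transcript_id "STRG.2.1";',
]
print(count_features(example)["transcript"])
```

On a real file, the lines can be supplied by iterating over the open file handle.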

lncRNAs annotation with FEELnc

FEELnc is a pipeline composed of 3 steps, which are run automatically when running FEELnc within Galaxy. The first step (FEELnc_filter) filters out unwanted or spurious transcripts and transcripts that overlap (in sense) exons of the reference annotation, especially protein-coding exons, as these more probably correspond to new mRNA isoforms.

To use FEELnc, we need a reference annotation file in GTF format that contains the protein-coding gene annotation. So far, we have only the reference annotation file in GFF3 format (genome_annotation.gff3). To convert it from GFF3 to GTF format, we will use gffread.
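
GFF3 and GTF differ mainly in their ninth column: GFF3 uses `key=value` pairs separated by semicolons, whereas GTF uses `key "value";` pairs and requires `gene_id` and `transcript_id`. The sketch below illustrates that difference for a single mRNA line. It is a simplified illustration under the assumption that `ID` holds the transcript identifier and `Parent` the gene identifier; it is not how gffread is implemented, and the function names are made up.

```python
def parse_gff3_attributes(column9: str) -> dict:
    """Parse GFF3 attributes such as 'ID=mrna1;Parent=gene1'."""
    attributes = {}
    for item in column9.strip().strip(";").split(";"):
        if item:
            key, _, value = item.partition("=")
            attributes[key.strip()] = value.strip()
    return attributes


def format_gtf_attributes(gene_id: str, transcript_id: str) -> str:
    """Format the two attributes that every GTF line must carry."""
    return f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'


def mrna_attributes_to_gtf(column9: str) -> str:
    """Assumption: on an mRNA line, ID is the transcript and Parent the gene."""
    attrs = parse_gff3_attributes(column9)
    return format_gtf_attributes(attrs["Parent"], attrs["ID"])


print(mrna_attributes_to_gtf("ID=mrna1;Parent=gene1"))
```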

Hands On: FEELnc

gffread (Galaxy version 2.2.1.4+galaxy0) with the following parameters:

“Input BED, GFF3 or GTF feature file”: genome_annotation.gff3

“Feature File Output”: GTF

FEELnc (Galaxy version 0.2.1) with the following parameters:

“Transcripts assembly”: Assembled transcript (output of StringTie tool)

“Reference annotation”: genome_annotation.gtf (output of gffread tool)

“Genome sequence”: genome_assembly.fasta

FEELnc provides 3 output files:

lncRNA annotation file: annotation file in GTF format which contains the final set of lncRNAs
mRNA annotation file: annotation file in GTF format which contains the final set of mRNAs
Classifier output file: table containing the classification of lncRNAs based on their genomic localisation with respect to other transcripts (direction: sense or antisense; type: genic if the lncRNA gene overlaps an RNA gene from the reference annotation file, or intergenic (lincRNA) if not).

FEELnc also writes a summary to stdout.
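
The two labels reported in the Classifier output file can be illustrated with a small sketch: type depends on whether the lncRNA overlaps the partner gene, and direction on whether both lie on the same strand. This is a deliberately simplified illustration, not the FEELnc_classifier algorithm, which reports more detail per interaction; the `Feature` type and the coordinates are made up.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    seqname: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def overlaps(a: Feature, b: Feature) -> bool:
    """True when both features share a sequence and at least one base."""
    return a.seqname == b.seqname and a.start <= b.end and b.start <= a.end


def classify(lncrna: Feature, partner: Feature):
    """Return (type, direction) of a lncRNA relative to a partner gene."""
    kind = "genic" if overlaps(lncrna, partner) else "intergenic"
    direction = "sense" if lncrna.strand == partner.strand else "antisense"
    return kind, direction


gene = Feature("contig_1", 100, 1000, "+")
print(classify(Feature("contig_1", 500, 800, "+"), gene))
print(classify(Feature("contig_1", 5000, 6000, "-"), gene))
```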

Question

How many RNAs does this annotation contain? How many interactions between lncRNAs and mRNAs have been identified? Can you describe the different types of lncRNAs?

The summary file indicates 268 lncRNAs and 0 new mRNAs were annotated by FEELnc. The initial annotation contains 13,795 mRNAs annotated. Therefore, a total of 14,063 RNAs are currently annotated.

The summary file indicates 772 interactions between lncRNAs and mRNAs. These interactions are described in the Classifier output file.

The different types of lncRNAs (intergenic (sense and antisense) and genic (sense)) are described in the Classifier output file. The majority of the lncRNAs are intergenic, and each of these can have interactions with several mRNAs. Only 5 lncRNAs are genic; each of these has a single interaction, with the mRNA that contains it.

For future analyses, it is useful to have an updated annotation containing both mRNA and lncRNA annotations. We will therefore merge the reference annotation with the lncRNA annotation obtained with FEELnc.

Hands On: Merge the annotations

concatenate (Galaxy version 1.0.0) with the following parameters:

“Datasets to concatenate”: genome_annotation.gtf (output of gffread tool)

Insert Dataset

“Dataset”: lncRNA annotation file (output of FEELnc tool)
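
Since both inputs are GTF files, merging them amounts to appending one file to the other. The sketch below shows this with temporary files and made-up records; the function name and the file names are assumptions, and it adds a missing final newline so that records from the two files never end up on the same line.

```python
import tempfile
from pathlib import Path


def concatenate(inputs, output):
    """Write the input files to output, one after the other."""
    with open(output, "w", encoding="utf-8") as out:
        for path in inputs:
            text = Path(path).read_text(encoding="utf-8")
            out.write(text)
            # Guard against a missing final newline in an input file.
            if text and not text.endswith("\n"):
                out.write("\n")
    return Path(output)


with tempfile.TemporaryDirectory() as tmp:
    ref = Path(tmp) / "genome_annotation.gtf"
    lnc = Path(tmp) / "lncRNA_annotation.gtf"  # hypothetical file name
    # Made-up records; the first file deliberately lacks a final newline.
    ref.write_text('contig_1\tref\ttranscript\t100\t900\t.\t+\t.\tgene_id "g1"; transcript_id "t1";')
    lnc.write_text('contig_1\tFEELnc\ttranscript\t5000\t6000\t.\t-\t.\tgene_id "l1"; transcript_id "l1.1";\n')
    merged = concatenate([ref, lnc], Path(tmp) / "merged_annotation.gtf")
    merged_lines = merged.read_text(encoding="utf-8").splitlines()

print(len(merged_lines))
```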

Conclusion

Congratulations on reaching the end of this tutorial! You now know how to annotate lncRNAs using RNASeq data.

You've Finished the Tutorial

Key points

StringTie performs a transcriptome assembly from mapped RNASeq data and provides an annotation file describing the transcripts.

The FEELnc pipeline annotates long non-coding RNAs (lncRNAs).

Annotation is based on transcripts reconstructed from RNA-seq data (either with or without a reference genome).

Annotation can be performed without any training set of non-coding RNAs.

FEELnc provides the localisation and the direction of transcription of the RNA transcripts proximal to lncRNAs.


References

StringTie enables improved reconstruction of a transcriptome from RNA-seq reads, 2015 Nature Biotechnology 33: 290–295. 10.1038/nbt.3122
FEELnc: a tool for long non-coding RNA annotation and its application to the dog transcriptome, 2017 Nucleic Acids Research 45: e57. 10.1093/nar/gkw1306

Glossary

lncRNAs: long non-coding RNAs
mRNAs: Messenger RNAs


Citing this Tutorial

Stéphanie Robin, Long non-coding RNAs (lncRNAs) annotation with FEELnc (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/lncrna/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{genome-annotation-lncrna,
	author = "Stéphanie Robin",
	title = "Long non-coding RNAs (lncRNAs) annotation with FEELnc (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/lncrna/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}


You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/genome-annotation/tutorials/lncrna/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: text_processing
  owner: bgruening
  revisions: d698c222f354
  tool_panel_section_label: Text Manipulation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: gffread
  owner: devteam
  revisions: 154d00cbbf2d
  tool_panel_section_label: Convert Formats
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: gffread
  owner: devteam
  revisions: 3e436657dcd0
  tool_panel_section_label: Convert Formats
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: feelnc
  owner: iuc
  revisions: 67af24676bd6
  tool_panel_section_label: RNA Analysis
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: stringtie
  owner: iuc
  revisions: eba36e001f45
  tool_panel_section_label: RNA Analysis
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
