Functional annotation of protein sequences

Overview
Questions:
  • How to perform functional annotation on protein sequences?

Objectives:
  • Perform functional annotation using EggNOG-mapper and InterProScan

Requirements:
Time estimation: 1 hour
Level: Introductory
Supporting Materials:
Published: Jul 20, 2022
Last modification: Jan 8, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00173
Revision: 17

When performing the structural annotation of a genome sequence, you get the position of each gene, but you don’t have information about their name or their function. Providing that information is the goal of functional annotation.

In this short tutorial, we will run the most commonly used tools to perform functional annotation, starting from the predicted protein sequences of a few example genes.

For a more complete view of how this step integrates into a whole genome sequencing and annotation process, you can have a look at the Funannotate tutorial.

Agenda

In this tutorial, we will cover:

  1. Data upload
  2. Functional annotation
    1. EggNOG Mapper
    2. InterProScan
  3. Conclusion

Data upload

We will annotate a small set of protein sequences. These sequences were predicted from the gene structures obtained in the Funannotate tutorial. Though these sequences come from a fungal species, you can run the same tools on proteins from any organism, including prokaryotes.

Hands-on: Data upload
  1. Create a new history for this tutorial

    Click the new-history icon at the top of the history panel.

  2. Import the files from Zenodo or from the shared data library (GTN - Material -> genome-annotation -> Functional annotation of protein sequences):

    https://zenodo.org/record/6861851/files/proteins.fasta
    
    • Copy the link location
    • Click Upload Data at the top of the tool panel
    • Select Paste/Fetch Data
    • Paste the link(s) into the text field
    • Press Start
    • Close the window

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    1. Go into Shared data (top panel) then Data libraries
    2. Navigate to the correct folder as indicated by your instructor.
      • On most Galaxies, tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.
    3. Select the desired files
    4. Click on Add to History near the top and select as Datasets from the dropdown menu
    5. In the pop-up window, choose
      • “Select history”: the history you want to import the data to (or create a new one)
    6. Click on Import
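
Before launching the annotation tools, it can be useful to check how many proteins the FASTA file contains and how long they are. The following is a minimal sketch in Python, outside of Galaxy, assuming a standard FASTA layout (a header line starting with ">" followed by one or more sequence lines). The function name read_fasta and the two example proteins are hypothetical and only stand in for the real proteins.fasta.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {sequence identifier: sequence}."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first whitespace-delimited word of the header
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are concatenated until the next header
            records[current_id] += line
    return records


# Hypothetical two-protein example standing in for proteins.fasta
example = """>protA some description
MKTAYIAKQR
QISFVKSHFS
>protB
MSDNE*
""".splitlines()

proteins = read_fasta(example)
for name, sequence in proteins.items():
    print(name, len(sequence))
```

To run the same check on the downloaded file, open it and pass the file handle to read_fasta instead of the example lines.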

Functional annotation

EggNOG Mapper

EggNOG Mapper compares each protein sequence of the annotation to a huge set of ortholog groups from the EggNOG database. In this database, each ortholog group is associated with functional annotation like Gene Ontology (GO) terms or KEGG pathways. When the protein sequence of a new gene is found to be very similar to the sequences in one of these ortholog groups, the corresponding functional annotation is transferred to this new gene.

Hands-on
  1. eggNOG Mapper (Galaxy version 2.1.8+galaxy3) with the following parameters:
    • “Fasta sequences to annotate”: proteins.fasta (Input dataset)
    • “Version of eggNOG Database”: select the latest version available
    • In “Output Options”:
      • “Exclude header lines and stats from output files”: No

The output of this tool is a tabular file, where each line represents a gene from our annotation, with the functional annotation that was found by EggNOG-mapper. It includes a predicted protein name, GO terms, EC numbers, KEGG identifiers, …

Display the file and explore which kind of identifiers were found by EggNOG Mapper.
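
If you download the annotation table, a short script can summarise which kinds of identifiers were found. The sketch below is written under assumptions that are not stated in this tutorial: comment lines start with "##", the column header line starts with "#query", columns are tab-separated, and a missing value is written as "-". The column names used here (Preferred_name, GOs, EC, KEGG_ko) and all the example rows are illustrative, so check them against the header of your own file.

```python
import csv
from typing import Dict, Iterable, List

# Values treated as "no annotation" (assumption: "-" marks a missing value)
MISSING = {"", "-"}


def count_annotated(lines: Iterable[str], columns: List[str]) -> Dict[str, int]:
    """Count, for each requested column, how many rows carry a value."""
    # Drop blank lines and "##" comment lines; the "#query" header line is kept
    table = [line for line in lines if line.strip() and not line.startswith("##")]
    reader = csv.DictReader(table, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = {column: 0 for column in columns}
    for row in reader:
        for column in columns:
            value = row.get(column)
            if value is not None and value not in MISSING:
                counts[column] += 1
    return counts


# Hypothetical rows; real output has more columns and real identifiers
example = [
    "## hypothetical header comment",
    "\t".join(["#query", "Preferred_name", "GOs", "EC", "KEGG_ko"]),
    "\t".join(["gene1", "NAME1", "GO:0000001,GO:0000002", "1.1.1.1", "ko:K00001"]),
    "\t".join(["gene2", "-", "-", "-", "-"]),
    "\t".join(["gene3", "-", "GO:0000003", "-", "-"]),
    "## hypothetical footer comment",
]

counts = count_annotated(example, ["Preferred_name", "GOs", "EC", "KEGG_ko"])
for column, n in counts.items():
    print(f"{column}: {n} annotated proteins")
```

Counting per column rather than per protein makes it easy to see at a glance which annotation types are well covered and which are sparse.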

InterProScan

InterPro is a huge integrated database of protein families. Each family is characterized by one or multiple signatures (i.e. sequence motifs) that are specific to the protein family, and by corresponding functional annotation like protein names or Gene Ontology (GO) terms. A good proportion of the signatures are manually curated, which means they are of very good quality.

InterProScan is a tool that analyses each protein sequence from our annotation to determine whether it contains one or several of the signatures from InterPro. When a protein contains a known signature, the corresponding functional annotation will be assigned to it by InterProScan.

InterProScan itself runs multiple applications to search for the signatures in the protein sequences. It is possible to select exactly which ones we want to use when launching the analysis (by default all will be run).

Hands-on
  1. InterProScan (Galaxy version 5.59-91.0+galaxy3) with the following parameters:
    • “Protein FASTA File”: proteins.fasta (Input dataset)
    • “InterProScan database”: select the latest version available
    • “Use applications with restricted license, only for non-commercial use?”: Yes (set it to No if you run InterProScan for commercial use)
    • “Output format”: Tab-separated values format (TSV) and XML
Comment

To speed up the processing by InterProScan during this tutorial, you can disable the Pfam and PANTHER applications. When analysing real data, it is advised to keep them enabled.

When some applications are disabled, you will of course miss the corresponding results in the output of InterProScan.

The output of this tool is both a tabular file and an XML file. Both contain the same information, but the tabular one is more readable for a human: each line represents a domain or motif that InterProScan found in one of the proteins from our annotation.

If you display the TSV file you should see something like this:

InterProScan TSV output

Each line corresponds to a motif found in one of the annotated proteins. The most interesting columns are:

  • Column 1: the protein identifier
  • Column 4: the databank where this signature comes from (InterProScan groups several motif databanks)
  • Column 5: the identifier of the signature that was found in the protein sequence
  • Column 6: the human readable description of the motif
  • Columns 7 and 8: the position where the motif was found
  • Column 9: a score for the match (if available)
  • Columns 12 and 13: identifier and description of the InterPro entry into which the signature is integrated (if available). Have a look at an example webpage for IPR036859 on InterPro.
  • The following columns contain various identifiers that were assigned to the protein based on the match with the signature (Gene Ontology terms, Reactome, …)
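
The column layout described above can be turned into a small parser that groups the matches by protein. This is only a sketch: it assumes tab-separated lines without a header, uses the column positions listed above, and treats an empty or "-" value in column 12 as "no InterPro entry". The function name summarise_interproscan and every identifier in the example rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def summarise_interproscan(lines: Iterable[str]) -> Dict[str, List[dict]]:
    """Group InterProScan TSV matches by protein identifier (column 1)."""
    by_protein = defaultdict(list)
    for raw in lines:
        fields = raw.rstrip("\n").split("\t")
        if len(fields) < 8:
            # Not a complete match line; skip it
            continue
        interpro = None
        # Column 12 (index 11) may be absent, empty or "-" when the
        # signature is not integrated in InterPro (assumption)
        if len(fields) > 11 and fields[11] not in ("", "-"):
            interpro = fields[11]
        by_protein[fields[0]].append(
            {
                "database": fields[3],     # column 4
                "signature": fields[4],    # column 5
                "description": fields[5],  # column 6
                "start": int(fields[6]),   # column 7
                "stop": int(fields[7]),    # column 8
                "interpro": interpro,      # column 12
            }
        )
    return dict(by_protein)


# Hypothetical rows; none of these identifiers are real results
example = [
    "\t".join(["prot1", "md5a", "300", "Pfam", "PF00000", "Hypothetical domain",
               "10", "120", "1.2E-20", "T", "01-01-2024", "IPR000000",
               "Hypothetical entry"]),
    "\t".join(["prot1", "md5a", "300", "Gene3D", "G3DSA:0.0.0.0",
               "Hypothetical fold", "5", "150", "-", "T", "01-01-2024", "-", "-"]),
    "\t".join(["prot2", "md5b", "90", "Pfam", "PF00001", "Another domain",
               "1", "80", "3.0E-5", "T", "01-01-2024"]),
]

summary = summarise_interproscan(example)
for protein, matches in summary.items():
    entries = sorted({m["interpro"] for m in matches if m["interpro"]})
    print(protein, len(matches), "matches; InterPro entries:", entries or "none")
```

Grouping by protein makes it easy to see which proteins received at least one InterPro entry and which only matched signatures that are not integrated.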

The XML output file contains the same information in a computer-friendly format; we will use it in the next step.

Conclusion

Congratulations on reaching the end of this tutorial! Now you know how to perform the functional annotation of a set of protein sequences, using EggNOG mapper and InterProScan.

If you want to collect more functional annotation, you can try to run the NCBI BLAST+ blastp (Galaxy version 2.10.1+galaxy2) or Diamond (Galaxy version 2.0.15+galaxy0) tools against the UniProt or NR databases (Diamond runs much faster on big datasets). These tools will search for similarities between your protein sequences and the ones already described in big international databases.

Also note that many other more specialised tools exist to collect even more functional annotation, in particular for certain species (prokaryotes for example) or for specific enzyme or protein families.