Indexing and profiling microbes with MetaSBT
Overview
Questions:
- What are Sequence Bloom Trees and how does MetaSBT use them for genomics?
- How do you build a custom viral database from a set of reference genomes?
- How can an existing MetaSBT database be updated with new viruses?
- How do you profile a query viral genome against a database to determine its identity and novelty?
- How can we interpret the viral genome profiling results?
Objectives:
- Understand the fundamental concepts of MetaSBT for efficient genomic indexing.
- Learn to use the metasbt_index tool to create a custom viral database.
- Learn to use the metasbt_index tool to update an existing database.
- Learn to use the metasbt_profile tool to identify and characterize a query viral genome.
- Be able to interpret the similarity reports generated by the profiling tool.
Time estimation: 30 minutes
Published: Aug 25, 2025
Last modification: Aug 25, 2025
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.
Revision: 7
Introduction
The ever-increasing volume of sequenced viral genomes, from global surveillance efforts and metagenomic studies, presents a significant computational challenge. How can we efficiently catalog all known viral genomes and then quickly determine the identity of a newly sequenced one?
MetaSBT is a computational method designed to address this challenge. It uses a data structure called a Sequence Bloom Tree (SBT), a tree-like index in which each node holds a Bloom filter: leaves represent individual genomes, and internal nodes summarize the filters below them. A Bloom filter is a space-efficient probabilistic data structure that allows rapid checking of whether an element (in this case, a k-mer from a genome) is part of a set. It can report false positives but never false negatives. By organizing these filters into a tree, MetaSBT can quickly search across thousands of genomes at once.
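To make the idea of a k-mer Bloom filter concrete, here is a minimal sketch in Python. It is an illustration only and not how MetaSBT is implemented: the `BloomFilter` class, the SHA-256-based hashing, the choice of two hash functions, and the toy sequence are all assumptions made for this example.

```python
import hashlib


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


class BloomFilter:
    """Toy Bloom filter; a Python int serves as the bit array."""

    def __init__(self, size, num_hashes=2):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Hypothetical hashing scheme: one SHA-256 digest per seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


genome = "ATGCGTACGTTAGCATGCGT"  # made-up sequence, 20 bases
bloom = BloomFilter(size=10000)
for kmer in kmers(genome, 9):
    bloom.add(kmer)

print("ATGCGTACG" in bloom)  # a k-mer that was inserted is always found
```

Every inserted k-mer is reported as present. A k-mer that was never inserted is usually reported as absent, but it can occasionally collide with set bits, which is the false-positive behaviour that makes the structure probabilistic.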
This approach gives MetaSBT several key advantages:
- Speed: It can query massive databases much faster than traditional alignment-based methods.
- Low Memory Footprint: It uses significantly less RAM than other methods, making it accessible on standard hardware.
- Novelty Detection: It can distinguish k-mers from a query genome that are present in the database from those that are new or unknown.
In this tutorial, we will learn how to use the MetaSBT suite in Galaxy to perform three key tasks with viral genomes:
- Create a new MetaSBT database from a small set of reference viruses.
- Update this database with new viruses.
- Profile a query virus against our database to determine its identity and assess its novelty.
Let’s get started!
Hands On: Prepare the History
- Create a new history for this tutorial and give it a name like “MetaSBT Viral Profiling”.
- Import the 29 viral reference genomes for this tutorial, which belong to 5 different viral species. Open the Galaxy Upload Manager and choose Paste/Fetch data. Paste the following URLs into the text box:
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/882/815/GCA_000882815.1_ViralProj36615/GCA_000882815.1_ViralProj36615_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/366/285/GCA_002366285.1_ViralProj411812/GCA_002366285.1_ViralProj411812_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/004/787/195/GCA_004787195.1_ASM478719v1/GCA_004787195.1_ASM478719v1_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/004/787/735/GCA_004787735.1_ASM478773v1/GCA_004787735.1_ASM478773v1_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/004/788/155/GCA_004788155.1_ASM478815v1/GCA_004788155.1_ASM478815v1_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/864/885/GCA_000864885.1_ViralProj15500/GCA_000864885.1_ViralProj15500_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/858/895/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/937/885/GCA_009937885.1_ASM993788v1/GCA_009937885.1_ASM993788v1_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/937/895/GCA_009937895.1_ASM993789v1/GCA_009937895.1_ASM993789v1_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/937/905/GCA_009937905.1_ASM993790v1/GCA_009937905.1_ASM993790v1_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/857/045/GCA_000857045.1_ViralProj15142/GCA_000857045.1_ViralProj15142_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/458/325/GCA_006458325.1_ASM645832v1/GCA_006458325.1_ASM645832v1_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/458/385/GCA_006458385.1_ASM645838v1/GCA_006458385.1_ASM645838v1_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/458/425/GCA_006458425.1_ASM645842v1/GCA_006458425.1_ASM645842v1_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/458/665/GCA_006458665.1_ASM645866v1/GCA_006458665.1_ASM645866v1_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/851/145/GCA_000851145.1_ViralMultiSegProj14892/GCA_000851145.1_ViralMultiSegProj14892_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/864/105/GCA_000864105.1_ViralMultiSegProj15617/GCA_000864105.1_ViralMultiSegProj15617_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/865/085/GCA_000865085.1_ViralMultiSegProj15622/GCA_000865085.1_ViralMultiSegProj15622_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/865/725/GCA_000865725.1_ViralMultiSegProj15521/GCA_000865725.1_ViralMultiSegProj15521_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/866/645/GCA_000866645.1_ViralMultiSegProj15620/GCA_000866645.1_ViralMultiSegProj15620_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/862/125/GCA_000862125.1_ViralProj15306/GCA_000862125.1_ViralProj15306_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/865/065/GCA_000865065.1_ViralProj15599/GCA_000865065.1_ViralProj15599_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/866/625/GCA_000866625.1_ViralProj15598/GCA_000866625.1_ViralProj15598_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/871/845/GCA_000871845.1_ViralProj20183/GCA_000871845.1_ViralProj20183_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/004/786/575/GCA_004786575.1_ASM478657v1/GCA_004786575.1_ASM478657v1_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/029/745/045/GCA_029745045.1_ASM2974504v1/GCA_029745045.1_ASM2974504v1_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/029/745/055/GCA_029745055.1_ASM2974505v1/GCA_029745055.1_ASM2974505v1_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/029/745/065/GCA_029745065.1_ASM2974506v1/GCA_029745065.1_ASM2974506v1_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/029/744/035/GCA_029744035.1_ASM2974403v1/GCA_029744035.1_ASM2974403v1_genomic.fna.gz
Also import the table from the Zenodo record:
https://zenodo.org/record/15882806/files/taxonomies.tsv
- Click Start and then Close the upload manager.
- Once uploaded, you will have 30 items in your history: the 29 genomes and the table. The table maps the name of each genome to its complete taxonomic label. Four genomes are deliberately missing from it. We will use them later to update our database and to demonstrate how to profile new genomes. The table contains the following entries:
GCA_000882815.1_ViralProj36615_genomic -> k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales|f__Flaviviridae|g__Flavivirus|s__Zika_virus
GCA_002366285.1_ViralProj411812_genomic -> k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales|f__Flaviviridae|g__Flavivirus|s__Zika_virus
GCA_004787195.1_ASM478719v1_genomic -> k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales|f__Flaviviridae|g__Flavivirus|s__Zika_virus
GCA_004787735.1_ASM478773v1_genomic -> k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales|f__Flaviviridae|g__Flavivirus|s__Zika_virus
GCA_004788155.1_ASM478815v1_genomic -> k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales|f__Flaviviridae|g__Flavivirus|s__Zika_virus
GCA_000864885.1_ViralProj15500_genomic -> k__Viruses|p__Pisuviricota|c__Pisoniviricetes|o__Nidovirales|f__Coronaviridae|g__Betacoronavirus|s__Severe_acute_respiratory_syndrome_related_coronavirus
GCA_009858895.3_ASM985889v3_genomic -> k__Viruses|p__Pisuviricota|c__Pisoniviricetes|o__Nidovirales|f__Coronaviridae|g__Betacoronavirus|s__Severe_acute_respiratory_syndrome_related_coronavirus
GCA_009937885.1_ASM993788v1_genomic -> k__Viruses|p__Pisuviricota|c__Pisoniviricetes|o__Nidovirales|f__Coronaviridae|g__Betacoronavirus|s__Severe_acute_respiratory_syndrome_related_coronavirus
GCA_009937895.1_ASM993789v1_genomic -> k__Viruses|p__Pisuviricota|c__Pisoniviricetes|o__Nidovirales|f__Coronaviridae|g__Betacoronavirus|s__Severe_acute_respiratory_syndrome_related_coronavirus
GCA_009937905.1_ASM993790v1_genomic -> k__Viruses|p__Pisuviricota|c__Pisoniviricetes|o__Nidovirales|f__Coronaviridae|g__Betacoronavirus|s__Severe_acute_respiratory_syndrome_related_coronavirus
GCA_000857045.1_ViralProj15142_genomic -> k__Viruses|p__Nucleocytoviricota|c__Pokkesviricetes|o__Chitovirales|f__Poxviridae|g__Orthopoxvirus|s__Monkeypox_virus
GCA_006458325.1_ASM645832v1_genomic -> k__Viruses|p__Nucleocytoviricota|c__Pokkesviricetes|o__Chitovirales|f__Poxviridae|g__Orthopoxvirus|s__Monkeypox_virus
GCA_006458385.1_ASM645838v1_genomic -> k__Viruses|p__Nucleocytoviricota|c__Pokkesviricetes|o__Chitovirales|f__Poxviridae|g__Orthopoxvirus|s__Monkeypox_virus
GCA_006458425.1_ASM645842v1_genomic -> k__Viruses|p__Nucleocytoviricota|c__Pokkesviricetes|o__Chitovirales|f__Poxviridae|g__Orthopoxvirus|s__Monkeypox_virus
GCA_006458665.1_ASM645866v1_genomic -> k__Viruses|p__Nucleocytoviricota|c__Pokkesviricetes|o__Chitovirales|f__Poxviridae|g__Orthopoxvirus|s__Monkeypox_virus
GCA_000851145.1_ViralMultiSegProj14892_genomic -> k__Viruses|p__Negarnaviricota|c__Insthoviricetes|o__Articulavirales|f__Orthomyxoviridae|g__Alphainfluenzavirus|s__Influenza_A_virus
GCA_000864105.1_ViralMultiSegProj15617_genomic -> k__Viruses|p__Negarnaviricota|c__Insthoviricetes|o__Articulavirales|f__Orthomyxoviridae|g__Alphainfluenzavirus|s__Influenza_A_virus
GCA_000865085.1_ViralMultiSegProj15622_genomic -> k__Viruses|p__Negarnaviricota|c__Insthoviricetes|o__Articulavirales|f__Orthomyxoviridae|g__Alphainfluenzavirus|s__Influenza_A_virus
GCA_000865725.1_ViralMultiSegProj15521_genomic -> k__Viruses|p__Negarnaviricota|c__Insthoviricetes|o__Articulavirales|f__Orthomyxoviridae|g__Alphainfluenzavirus|s__Influenza_A_virus
GCA_000866645.1_ViralMultiSegProj15620_genomic -> k__Viruses|p__Negarnaviricota|c__Insthoviricetes|o__Articulavirales|f__Orthomyxoviridae|g__Alphainfluenzavirus|s__Influenza_A_virus
GCA_000862125.1_ViralProj15306_genomic -> k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales|f__Flaviviridae|g__Flavivirus|s__Dengue_virus
GCA_000865065.1_ViralProj15599_genomic -> k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales|f__Flaviviridae|g__Flavivirus|s__Dengue_virus
GCA_000866625.1_ViralProj15598_genomic -> k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales|f__Flaviviridae|g__Flavivirus|s__Dengue_virus
GCA_000871845.1_ViralProj20183_genomic -> k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales|f__Flaviviridae|g__Flavivirus|s__Dengue_virus
GCA_004786575.1_ASM478657v1_genomic -> k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales|f__Flaviviridae|g__Flavivirus|s__Dengue_virus
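The taxonomic labels above follow a simple pattern: seven levels separated by `|`, each with a one-letter prefix and a double underscore. The following Python sketch shows one way to split such a label into named ranks and to spot genomes that have no label. The helper names `parse_lineage` and `genomes_without_labels` are hypothetical and not part of MetaSBT.

```python
RANKS = {
    "k": "kingdom",
    "p": "phylum",
    "c": "class",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
}


def parse_lineage(lineage):
    """Split 'k__X|p__Y|...' into a {rank name: taxon name} dictionary."""
    levels = {}
    for token in lineage.split("|"):
        prefix, _, name = token.partition("__")
        levels[RANKS[prefix]] = name
    return levels


def genomes_without_labels(genome_names, labelled_names):
    """Return the genomes that do not appear in the taxonomy table."""
    return sorted(set(genome_names) - set(labelled_names))


zika = parse_lineage(
    "k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales"
    "|f__Flaviviridae|g__Flavivirus|s__Zika_virus"
)
print(zika["genus"], zika["species"])

# Two genomes from the history, only one of which is in the table.
in_history = [
    "GCA_000882815.1_ViralProj36615_genomic",
    "GCA_029744035.1_ASM2974403v1_genomic",
]
in_table = ["GCA_000882815.1_ViralProj36615_genomic"]
print(genomes_without_labels(in_history, in_table))
```

Running the same comparison over all 29 genome names and the 25 table entries would list the four unlabelled genomes.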
Building a comprehensive database from all NCBI viral reference genomes can be time-consuming. For many applications, you can directly use a pre-built database. Many Galaxy servers provide access to these via a CVMFS (CernVM File System) data directory.
For example, you might find pre-built MetaSBT databases at a path like:
/cvmfs/data.galaxyproject.org/byhand/MetaSBT/
You can find more information about how Galaxy uses CVMFS for reference data.
Part 1: Creating a MetaSBT Database from Scratch
Our first step is to build a small, custom database using our reference viruses.
Hands On: Build the Initial Index
- Run the metasbt_index tool with the following parameters:
  - “Input genomes”: Select all the reference genomes previously imported into the history (*.fna.gz).
  - “Input table with taxonomic labels”: Select the table with the mapping between the name of the genomes and their complete taxonomic labels.
  - Under the “Advanced options” section, “MetaSBT database” is automatically set to “Build your own MetaSBT database from scratch”. This gives access to advanced options for estimating the best k-mer length and a suitable Bloom filter size for the specific set of input genomes, plus a quality control filter that discards genomes failing completeness and contamination thresholds. For this tutorial, set the following options:
    - “K-mer length” > “Set a k-mer length”: 9
    - “Bloom filter size” > “Set a bloom filter size”: 10000
- Click Execute. The tool will produce three output files:
  - A clusters table with the list of clusters defined by MetaSBT according to the taxonomic organization of the input genomes, together with the number of genomes in each cluster and the density of the Bloom filters.
  - A genomes table with the list of genomes in the MetaSBT clusters and their assigned taxonomic labels.
  - A database compressed tarball representing the actual MetaSBT database.
Question
- What does the k-mer size parameter represent?
- Why did we compress the output database tarball?
- There are four genomes missing in the database. Why?
Solution
- The k-mer size is the length of the short DNA/RNA sequences (k-mers) that are extracted from the genomes and stored in the Bloom filters.
- The MetaSBT database is actually a directory containing multiple files. Compressing it into a single .tar.gz file makes it much easier to manage, pass to other tools, or share with other users within Galaxy.
- Although we selected the whole set of genomes in our history (*.fna.gz), the input table that maps genome names to their complete taxonomic labels has no entry for four of them, so they are automatically excluded. This was done on purpose for this tutorial. We will now use three of these four missing genomes to update the database; the fourth will serve as the query genome in Part 3.
We now have a database containing the genomic information for a specific selection of viral species.
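The k-mer length of 9 and the filter size of 10000 chosen above interact: there are 4^9 = 262144 possible 9-mers, and the more distinct k-mers a genome contributes, the fuller its filter becomes. The sketch below uses the standard Bloom filter approximation for the expected fraction of set bits. Treat it as a rough guide only: the use of a single hash function and the k-mer counts in the loop are assumptions made for illustration, and the exact way MetaSBT computes the density it reports in the clusters table may differ.

```python
import math


def expected_density(num_kmers, filter_size, num_hashes=1):
    """Expected fraction of bits set after inserting num_kmers distinct items."""
    return 1.0 - math.exp(-num_hashes * num_kmers / filter_size)


def false_positive_rate(num_kmers, filter_size, num_hashes=1):
    """Chance that an absent k-mer is reported as present."""
    return expected_density(num_kmers, filter_size, num_hashes) ** num_hashes


FILTER_SIZE = 10000           # value used in this tutorial
POSSIBLE_9_MERS = 4 ** 9      # upper bound on distinct 9-mers

# Hypothetical numbers of distinct k-mers, chosen only for illustration.
for count in (1000, 10000, 100000):
    density = expected_density(count, FILTER_SIZE)
    print(f"{count:>7} distinct k-mers -> density {density:.3f}")
```

A density close to 1 means almost every query k-mer looks present, so the filter no longer discriminates between genomes. This is why the tool offers to estimate a suitable filter size for a given set of genomes.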
Part 2: Updating an Existing MetaSBT Database
Viral databases are rarely static. New strains and species are constantly being discovered. Instead of rebuilding our index from scratch every time, we can simply update it.
We will now add three more Monkeypox virus genomes to the database we just created.
Hands On: Update the Database
- Run the metasbt_index tool again with these parameters:
  - “Input genomes”: Select GCA_029745045.1_ASM2974504v1_genomic.fna.gz, GCA_029745055.1_ASM2974505v1_genomic.fna.gz, and GCA_029745065.1_ASM2974506v1_genomic.fna.gz.
  - Under the “Advanced options” section, “MetaSBT database”: “Update a MetaSBT database”.
  - “Will you use a MetaSBT database from your history or a public database?”: “Use a database from the history”.
  - “Select a MetaSBT database”: database.
- Click Execute. This produces a new MetaSBT database, so three new files (clusters, genomes, and database) will appear in your history. The three new genomes have been profiled against our database and assigned to the closest species cluster, which is Monkeypox virus in this case.
Question
- Why is updating faster than rebuilding the database from scratch?
- Why didn’t we specify the k-mer length or the Bloom filter size?
Solution
- When updating, MetaSBT only needs to process the new genomes and insert them into the existing tree structure. It doesn’t need to re-process the genomes that are already in the index. For very large databases, this can save hours or even days of computation time.
- For an update, the Bloom filter representation of the new genomes must be built in the same way as the existing database. These settings are therefore inherited from the selected MetaSBT database.
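The reason an update touches so little of the index can be illustrated with a toy tree. In this sketch each node stores a bit pattern (a Python int standing in for a Bloom filter), and a parent holds the bitwise OR of its children. Adding a genome then only requires updating the nodes on the path from the new leaf to the root. The `Node` class, the `attach` and `query` helpers, the genome names, and the 4-bit patterns are all made up for illustration; real Sequence Bloom Tree queries also typically accept a fraction of matching k-mers rather than requiring all of them.

```python
class Node:
    """Tree node whose bits are the union of everything beneath it."""

    def __init__(self, name, bits=0):
        self.name = name
        self.bits = bits
        self.parent = None
        self.children = []


def attach(parent, child):
    """Add child under parent and propagate its bits up to the root."""
    child.parent = parent
    parent.children.append(child)
    node = parent
    while node is not None:
        node.bits |= child.bits  # only ancestors are modified
        node = node.parent


def query(node, bits):
    """Return leaves containing all query bits, pruning subtrees early."""
    if node.bits & bits != bits:
        return []
    if not node.children:
        return [node.name]
    hits = []
    for child in node.children:
        hits.extend(query(child, bits))
    return hits


root = Node("root")
monkeypox = Node("Monkeypox_virus")
zika = Node("Zika_virus")
attach(root, monkeypox)
attach(root, zika)
attach(monkeypox, Node("genome_A", 0b0011))
attach(zika, Node("genome_B", 0b1100))

# The update step: one new leaf, and only monkeypox and root change.
attach(monkeypox, Node("genome_new", 0b0110))

print(query(root, 0b0110))
```

After the update, the Zika subtree is untouched, and a query for the new pattern skips it entirely because its union filter lacks one of the required bits.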
Part 3: Profiling a Query Virus Against the Database
Now we have a database containing five known viral species, and we have a new “query” genome whose identity we want to determine. Let’s use MetaSBT to find out what our query virus is and how it relates to the viruses in our database.
Our query genome is GCA_029744035.1_ASM2974403v1_genomic.fna.gz. We will profile it against our database.
Hands On: Profile the Query Virus
- Run the metasbt_profile tool with the following parameters:
  - “Input genomes”: Select GCA_029744035.1_ASM2974403v1_genomic.fna.gz.
  - “Database source”: “Use a database from the history”.
  - “Select a MetaSBT database tarball”: Select the second database file (the one from the update step).
- Click Execute. The tool will generate a collection containing a profiling report in tabular format for each input query genome.
- Click the eye icon on the generated report to view its contents.
Question
- The output report has several rows. Why are there multiple matches under the same taxonomic level?
Solution
- MetaSBT may report multiple taxonomic units under the same taxonomic level if their distance from the input query genome is below a specific threshold. This threshold is derived from the distance to the closest taxonomic unit minus 20% of that value (by default). This percentage is called the uncertainty percentage and can be changed under the “Advanced options” section.
Analyzing the Results
When you view one of the profiles in the output collection, you should see a report where each row represents a taxonomic level and its corresponding Average Nucleotide Identity (ANI) value. ANI is a measure of genomic similarity between two genomes. Here, the ANI is expressed as a distance measure, so the lower it is, the closer a specific taxonomic unit is to the input genome.
The report is structured into three columns:
- level: This column indicates the taxonomic rank (i.e., kingdom, phylum, class, order, family, genus, and species).
- closest: This column displays the lineage of the closest match found in the database at that specific taxonomic level.
- ani: This column shows the ANI distance for the best match found. A lower ANI distance suggests a closer relationship with the input query genome.
This output helps you understand the genomic relatedness of your query to the genomes present in the database at various taxonomic resolutions. You can observe how the ANI changes as you move down the taxonomic hierarchy, providing insights into the closest classification of your genome.
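If you download a report, you can inspect it programmatically. The sketch below assumes the report is tab-separated with a header row naming the three columns described above; that layout is an assumption, so check it against your own file. The distance values in `EXAMPLE_REPORT` are invented purely to make the example runnable and are not real MetaSBT output.

```python
import csv
import io

# Invented example content; real reports will contain different values.
EXAMPLE_REPORT = (
    "level\tclosest\tani\n"
    "family\tf__Poxviridae\t0.30\n"
    "genus\tg__Orthopoxvirus\t0.12\n"
    "species\ts__Monkeypox_virus\t0.02\n"
)


def read_profile(text):
    """Parse a tab-separated report into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def best_match(rows):
    """Return the row with the smallest ANI distance (closest match)."""
    return min(rows, key=lambda row: float(row["ani"]))


rows = read_profile(EXAMPLE_REPORT)
closest = best_match(rows)
print(closest["level"], closest["closest"], closest["ani"])
```

Because the ANI is reported as a distance, the minimum identifies the closest match. With a similarity score the comparison would be reversed.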
Conclusion
In this tutorial, you have learned the complete workflow for using MetaSBT in Galaxy. You have successfully:
- Built a custom database from a selection of viral reference genomes;
- Efficiently updated that database with new viruses;
- Profiled an unknown virus against the database to determine its identity (or lack thereof);
- Interpreted the similarity reports to identify a query virus.
You are now equipped to use MetaSBT for your own research. You can create custom databases for curated sets of genomes, or leverage large, pre-built databases to rapidly identify and characterize newly sequenced genomes from clinical or environmental samples.