Text-mining with the SimText toolset

Author(s)	Marie Gramm Dennis Lal group Daniel Blankenberg
Reviewers

Overview
Questions:

How can I automatically collect PubMed data for a set of biomedical entities such as genes?

How can I analyze similarities among biomedical entities based on PubMed data on large-scale?

Objectives:

Learn how to use the SimText toolset

Upload a table with biomedical entities in Galaxy

Retrieve PubMed data for each of the biomedical entities

Extract biomedical terms from the PubMed data for each biomedical entity

Analyze the similarity among the biomedical entities based on the extracted data in an interactive app

Time estimation: 1 hour

Supporting Materials:

Datasets

Workflows

FAQs

Published: Apr 5, 2021

Last modification: May 7, 2025

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

purl PURL: https://gxy.io/GTN:T00272

version Revision: 8

Literature exploration in PubMed on a large number of biomedical entities (e.g., genes, diseases, or experiments) can be time-consuming and challenging, especially when assessing associations between entities. Here, we use SimText, a toolset for literature research that allows you to collect text from PubMed for any given set of biomedical entities, extract associated terms, and analyze similarities among them and their key characteristics in an interactive tool.

This tutorial is based on a proof-of-concept example given in Gramm et al. 2020. We are going to analyze similarities among 95 genes based on their associated biomedical terms in the literature, and compare their pre-existing disorder categories to their grouping based on the literature.

The workflow combines 3 main steps, starting with the retrieval of PubMed data for each of the genes. We then use the PubMed data from each gene to extract related scientific terms that are all combined in one large binary matrix. Finally, we explore the generated data in an interactive tool that performs different unsupervised machine-learning algorithms to analyze the similarities/ the grouping among the genes based on their extracted terms from the literature.

Tutorial overview. — **Figure 1**: Schematic presentation of the workflow.

Agenda

In this tutorial, we will cover:

Input data

Retrieval of PubMed data

Extraction of biomedical terms from PubMed abstracts

Exploration of data in interactive tool

Conclusion

Input data

The input data is a simple table with the genes we want to analyze as well as their pre-existing grouping (the grouping is required later on to compare it to our text-based gene grouping). In order for the tools to recognize the column with the biomedical entities of interest, our 95 genes, the column name should start with “ID_”, and for the grouping variable with “GROUPING_”.

Hands On: Data upload
Create a new history for this tutorial

To create a new history simply click the new-history icon at the top of the history panel:
Import the input file from Zenodo
https://zenodo.org/records/4638516/files/clingen_data
Copy the link location

Click galaxy-upload Upload Data at the top of the tool panel

Select galaxy-wf-edit Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window

Retrieval of PubMed data

In the first step we collect PubMed data for each of the genes. The genes are used as search queries to download a defined number of PMIDs, here up to 500, from PubMed. The PMIDs are saved in additional columns of our input data.

To speed up the the download of PubMed data users can obtain an API key from the settings page of their NCBI account (to create an account, visit http://www.ncbi.nlm.nih.gov/account/) and add it to the Galaxy user-preferences (User → Preferences → Manage Information).

Hands On: Step 1: PubMed query tool

Run PubMed query ( Galaxy version 0.0.2) with the following parameters:

param-file “Input file with query terms”: Input dataset

“Number of PMIDs (or abstracts) to save per ID”: 500

Comment

The tool is also able to save the abstracts as text instead of their PMIDs. This feature is used for another type of analysis (see Gramm et al. 2020), or can be used if the user wants to use the tool independent from a workflow to retrieve many abstract texts at once. For the next step of our example in this tutorial, only PMIDs are required.

Extraction of biomedical terms from PubMed abstracts

Next, we extract the 100 most frequent ‘Disease’ and ‘Gene’ terms (PubTator annotations) from the PubMed data. All genes with their 100 associated terms are then combined in one large binary matrix. Each row represents a gene and each column one of the extracted terms. This matrix is later used to find similar genes, i.e. genes that have many common terms associated with them.

Hands On: Extraction of PubTator annotations

Run PMIDs to PubTator ( Galaxy version 0.0.2) with the following parameters:

param-file “Input file with PMID IDs”: output of PubMed query tool

“categories”: Genes Diseases

“Number of most frequent terms/IDs to extract.”: 100

Comment: PubTator

PubTator annotates terms of the following categories: Gene, Disease, Mutation, Species and Chemical. In this example we chose to only extract gene and disease terms but you can also select other categories if you are interested in those.

Exploration of data in interactive tool

After generating the large binary matrix, we can explore the similarities/ the grouping among the genes in the interactive SimText tool. The following features are generated:

Word clouds for each gene
Dimension reduction and hierarchical clustering of the binary matrix
Calculation of the adjusted rand index (similarity between text-based grouping and the pre-existing disorder categories)
Table with terms and their frequency among the genes

Hands On: Explore data interactively

Run interactive_tool_simtext_app with the following parameters:

param-file “Input file”: initial input file with genes and pre-existing grouping

param-file “Matrix file”: output of PMIDs to PubTator tool

Open interactive tool

Go to User > Active InteractiveTools

Wait for the SimText app to be running (Job Info)

Click on SimText app

Wordcloud. — **Figure 2**: Left: For all genes a wordcloud with their associated terms cen be generated. Right: In the dimension reduction plot the similarities among the genes can be explored. The colouring corresponds to the pre-existing disorder categories.

Clustering. — **Figure 3**: Hierarchical clustering of the matrix. Different clustering methods can be selected and the adjusted rand index be calculated.

Conclusion

You've Finished the Tutorial

Key points

The SimText toolset allows large-scale literature analysis of a set of biomedical entities such as genes or diseases

The similarities among the biomedical entities can be explored interactively

The litertaure based grouping can be compared to an existing grouping to discover similarities/ relationships hidden in the literature

Frequently Asked Questions

Have questions about this tutorial? Have a look at the available FAQ pages and support channels

References

Gramm, M., E. Perez-Palma, S. Schumacher-Bass, J. Dalton, C. Leu et al., 2020 SimText: A text mining framework for interactive analysis and visualization of similarities among biomedical entities. BioRxiv. 10.1101/2020.07.06.190629

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Marie Gramm, Dennis Lal group, Daniel Blankenberg, Text-mining with the SimText toolset (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/statistics/tutorials/text-mining_simtext/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{statistics-text-mining_simtext,
author = "Marie Gramm and Dennis Lal group and Daniel Blankenberg",
	title = "Text-mining with the SimText toolset (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/statistics/tutorials/text-mining_simtext/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/statistics/tutorials/text-mining_simtext/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: pmids_to_pubtator_matrix
  owner: iuc
  revisions: 69714f06f18b
  tool_panel_section_label: Other Tools
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: pubmed_by_queries
  owner: iuc
  revisions: 02e46a96e98a
  tool_panel_section_label: Other Tools
  tool_shed_url: https://toolshed.g2.bx.psu.edu/

No feedback has been recieved yet for this training. Be the first one by filling in the feedback form.