Virtual screening of the SARS-CoV-2 main protease with rxDock and pose scoring

Author(s)	Simon Bray
Reviewers

Overview
Questions:

How can candidate ligands be generated and docked to a protein in Galaxy?

How can the poses of the docked ligands be evaluated?

How can a workflow for drug virtual screening be constructed in Galaxy?

Objectives:

Understand how Galaxy was used to perform docking and pose scoring on the SARS-CoV-2 main protease (MPro).

Replicate the study on a (very) small scale

Gain familiarity with the docking and scoring techniques involved.

Requirements:

Introduction to Galaxy Analyses

tutorial Hands-on: Protein-ligand docking

Time estimation: 2 hours

Level: Intermediate Intermediate

Supporting Materials:

Datasets

Workflows

FAQs

Published: Mar 27, 2020

Last modification: Nov 9, 2023

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

purl PURL: https://gxy.io/GTN:T00049

rating Rating: 3.6 (0 recent ratings, 5 all time)

version Revision: 10

This tutorial provides a companion to the work performed in March 2020 by InformaticsMatters, the Diamond Light Source, and the European Galaxy Team to perform virtual screening on candidate ligands for the SARS-CoV-2 main protease (MPro). This work is described in our dedicated site.

In this tutorial, you will perform protein-ligand docking to MPro using rxDock (Ruiz-Carmona et al. 2014), a version of the popular rDock software, and score the results using two different methods. The same tools will be used as in the original study, but with a smaller dataset.

Agenda

In this tutorial, we will cover:

Background

Get data

Preparation for docking

Charge enumeration

Generate three-dimensional conformations

Splitting the SD-file into a collection

Active site preparation

Convert protein structure to MOL2 format

Generate Frankenstein ligand

Generate active site definition

Docking and scoring

Docking with rxDock

Collapse collection to a single file

Pose scoring with TransFS

Pose scoring with SuCOS

Compound selection

Conclusion

Background

Early in March 2020, the Diamond Light Source completed a successful fragment screen on MPro, which provided 55 fragment hits (see their press release here). In an effort to identify candidate molecules for binding, InformaticsMatters, the XChem group and the European Galaxy team joined forces to construct and execute a Galaxy workflow for performing and evaluating molecular docking on a massive scale.

An initial list of 41,000 candidate molecules was assembled by using the Fragalysis fragment network to elaborate from the initial fragment hits, as described in their documentation. These were used as inputs for the docking and scoring workflow. The workflow consists of the following steps, each of which was carried out using tools installed on the European Galaxy server:

Charge enumeration of the 41,000 candidate molecules selected based on the fragment hits.
Generation of 3D conformations based on SMILES strings of the candidate molecules.
Docking of molecules into each of the MPro structures using rxDock.
Evaluation of the docking poses using a TransFS, a deep learning approach (Scantlebury 2019) developed by the XChem group and collaborators, and SuCOS scoring (Leong 2019), which compares the poses with the structures of the original fragment hits.

The original study required almost 20 years of CPU time, not counting GPU resources consumed. This is obviously not reproducible as a tutorial. Therefore, we will repeat the workflow with a small library of just 100 molecules, on a single MPro fragment structure. Links will be provided to original Galaxy histories, with notes to explain where and why things were done differently to the tutorial.

MPro structure, with a fragment bound. — **Figure 1**: Structure of MPro, with a fragment bound. Click to view in NGL. (Rose *et al.* 2018)

Get data

We require three datasets for the simulation and analysis:

A list of 100 ligand candidates. These are the molecules which will be docking into the protein binding site.
A PDB file of the receptor MPro protein (without ligand or solvent).
A list of fragment hits (17 in total) in SDF format.

Of the initial 55 fragment hits, 17 were chosen for further study. From these, 41,587 compounds were generated using the Fragalysis fragment network for further study. The 100 compounds used in the tutorial are taken from this list.

Starting data is available from this Galaxy history: https://usegalaxy.eu/u/sbray/h/mpro-raw-data.

This history contains 103 files. One of these (Initial candidates for docking) contains the 41k candidate compounds in SMILES format. The remaining 102 files (all with names beginning with Mpro-x...) provide structural information on the fragment hits - 6 files per hit (hence 17 x 6 = 102).

The identity of the files is as follows:

the *_0.mol files contain the fragment structure in mol format.

the *_0.pdb files contain the fragment structure in pdb format.

the *_0_apo.pdb files contain the protein with solvent, but without ligand

the *_0_apo-desolv.pdb files contain the protein without either solvent or ligand

the *_0_apo-solv.pdb files contain only solvent

the *_0_bound.pdb file contain everything (protein, ligand and solvent)

The PDB file of the receptor that we are using is Mpro-x0195_0_apo-desolv.pdb. In other words, the structure is derived from just one fragment hit. In the original study, however, all compounds were docked against all of the fragment hit structures.

Hands On: Data upload
Create a new history for this tutorial
Import the files from Zenodo:
https://zenodo.org/record/3730474/files/candidates.smi
https://zenodo.org/record/3730474/files/Mpro-x0195_0_apo-desolv.pdb
https://zenodo.org/record/3730474/files/hits.sdf
Copy the link location

Click galaxy-upload Upload Data at the top of the tool panel

Select galaxy-wf-edit Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window
Rename the datasets Candidates SMILES, Receptor PDB and Hits SDF respectively.

Check that the datatypes (smi, pdb, and sdf respectively) are correct. In particularly, check the Candidates SMILES file, as the SMILES datatype is not detected automatically by Galaxy.

Click on the galaxy-pencil pencil icon for the dataset to edit its attributes

In the central panel, click galaxy-chart-select-data Datatypes tab on the top

In the galaxy-chart-select-data Assign Datatype, select datatypes from “New Type” dropdown

Tip: you can start typing the datatype into the field to filter the dropdown menu

Click the Save button

Preparation for docking

Before docking, the candidate ligands need to be prepared for docking with the following steps: 1) charge enumeration, 2) generation of three-dimensional structures, and 3) splitting the SD-file into a collection.

This stage is carried out as described here, except of course with the full set of 42,000 compounds. See the docking prep workflow for more details.

Charge enumeration

Many of the compounds may contain functional groups which can exist in multiple charge states, and this will affect the quality of binding to the receptor dramatically. Therefore, we perform ‘charge enumeration’, which means that we generate all charge forms of the compounds within a certain pH range.

Hands On: Charge enumeration

Enumerate charges ( Galaxy version 2020.03.4+galaxy0) with the following parameters:

“Input molecule data”: Candidate SMILES

“Minimum pH”: 4.4

“Maximum pH”: 10.4

Rename the output file Enumerated candidates SMILES.

Comment

The Enumerate charges tool tool is based on the Dimorphite-DL program. (Ropp et al. 2019)

The output is another SMILES file, with several hundred entries.

Generate three-dimensional conformations

So far our list of enumerated candidate compounds is still in SMILES format; we need to produce three-dimensional structures in SDF format for docking. This can be done with the Compound conversion tool tool.

If you are not familiar with SMILES and SDF formats, consult the introductory protein-ligand docking tutorial for more details.

Hands On: Convert to SDF format

Compound conversion ( Galaxy version 3.1.1+galaxy0) with the following parameters:

“Molecular input file”: Enumerated candidates dataset.

“Output format”: MDL MOL format (sdf, mol)

“Generate 3D coordinates”: Yes

Rename the output file Enumerated candidates SDF.

Comment

The Compound conversion tool tool is based on the OpenBabel toolkit. (O’Boyle et al. 2011)

Splitting the SD-file into a collection

The next stage is to split the SD-file with the candidate ligands into a set of smaller SD-files.

Question

Why is splitting the file necessary?

The rxDock tool performs one docking at a time (more technically: the task is not parallelized, as it uses only a single CPU). Therefore, splitting the large SD-file into many small files allows the work to be carried out by multiple Galaxy jobs in parallel, so it completes faster.

In the original study, this kind of parallelization was absolutely essential because of the enormous dataset; at some points, there were 5,000 docking jobs running concurrently on the European Galaxy server. Even on the much smaller scale of this tutorial, we can speed things up considerably using this trick.

Hands On: Split the SD-files

Split file to dataset collection ( Galaxy version 0.5.0) with the following parameters:

“Select the file type to split”: SD-files

“SD-file to split”: Enumerated candidates SDF

“Specify number of output files or number of records per file?”: Number of output files

“Number of new files”: 10

“Method to allocate records to new files”: Alternate output files

Active site preparation

The active site also needs to be prepared for docking, using the following steps: 1) conversion to MOL2 format, and 2) generation of the active site using the rbcavity tool tool.

This stage was carried out as described here. However, it was repeated for each of the fragment hit structures, not just the Mpro-x0195_0_apo-desolv.pdb file used here. See the active site prep workflow for more details.

Convert protein structure to MOL2 format

The receptor file we are using is in PDB format, but the rxDock tool we use for docking requires an input in MOL2 format. Therefore, we first convert the file.

Hands On: Conversion to MOL2 format

Compound conversion ( Galaxy version 3.1.1+galaxy0) with the following parameters:

“Molecular input file”: Receptor PDB dataset.

“Output format”: Sybyl Mol2 format (mol2)

Rename the output file Receptor MOL2.

Generate Frankenstein ligand

For docking with rxDock, a file needs to be created defining the active site. This requires two input files - one for the protein and one for the ligand. We want an active site generation that takes into account the features of all 17 fragments, and therefore need to generate a ‘Frankenstein ligand’ which possesses the properties of all the fragments. A very simple Galaxy tool is available for this.

Question

What is a ‘Frankenstein ligand’ and why do we need it?

The Frankenstein ligand combines the atoms of all the fragments and therefore occupies the entire space of the binding site. Therefore, it is the best choice for cavity definition. See the information provided by InformaticsMatters for more details.

Hands On: Generate Frankenstein ligand

Create Frankenstein ligand ( Galaxy version 2013.1-0+galaxy0) with the following parameters:

“Input file”: Hits SDF

Rename the file to Frankstein SDF.

Generate active site definition

The active site can now be generated using the rbcavity tool tool, which requires the receptor in MOL2 format as input as well as a single reference ligand in Mol/SDF format. We use the Frankenstein ligand as the reference.

Hands On: Active site preparation

rxDock cavity definition ( Galaxy version 0.1.5) with the following parameters:

“Receptor”: Receptor MOL2

“Reference ligand”: Frankenstein SDF

“Mapper sphere radius”: 3.0

“Mapper small sphere radius”: 1.0

“Mapper minimum volume”: 100

“Mapper volume increment”: 0

“Mapper grid step”: 0.5

“Cavity weight”: 1.0

Rename the output file Active site.

Comment

The meanings of these parameters are too complex to go into in this tutorial. If you are interested, see the rDock documentation for more details.

Docking and scoring

Docking and scoring are now performed, using the following steps: 1) docking using rxDock, 2) recombining the results into a single SDF file, 3) TransFS scoring, and 4) SuCOS scoring.

This section in the original study differed from this tutorial in the following ways:

Docking was performed on over 100,000 enumerated candidates, rather than the 300 used here.

25 different poses were generated per candidate, rather than 5, as in this tutorial.

Because of the large number of poses to score (more than a million), the scoring steps were parallelized by splitting into collections. This is skipped in the tutorial.

The entire process was repeated 17 times, using a different fragment hit as the receptor structure each time. See the Docking for more details. A full list of Galaxy histories generated is listed the histories page.

Docking with rxDock

Hands On: Docking

rxDock docking ( Galaxy version 0.1.5) with the following parameters:

“Receptor”: Receptor MOL2

“Active site”: Active site

“Ligands”: Split file collection

“Number of dockings”: 5

“Number of best poses”: 5

Comment

For more information about docking, check out the introductory tutorial. It uses a different tool, AutoDock Vina, rather than rxDock, but the general principles are the same.

Collapse collection to a single file

Having created a collection to parallelize the docking procedure, we can now recombine the results to a single file.

Hands On: Collapse collection

Collapse Collection ( Galaxy version 4.2) with the following parameters:

“Collection of files to collapse into single dataset”: Output of docking step

Rename to Docked poses SDF.

The output file should contain around 1,900 docked poses in SDF format.

Pose scoring with TransFS

In this step, we carry out scoring of the poses using TransFS. This is a deep learning approach developed at the University of Oxford, employing augmentation of training data with incorrectly docked ligands to prompt the model to learn from protein-ligand interactions. (Scantlebury 2019)

The TransFS scoring returns a value (saved as <TransFSScore> in the SDF file) between 0 (poor) and 1 (good).

Hands On: TransFS scoring

XChem TransFS pose scoring ( Galaxy version 0.4.0) with the following parameters:

“Receptor”: Receptor PDB

“Ligands”: Docked poses SDF

“Distance to waters”: 2

Pose scoring with SuCOS

This step involves scoring of the poses from each molecule against the original fragment screening hit ligands using the SuCOS MAX shape and feature overlay algorithm. (Leong 2019) The conformation and position of the poses are compared to known structures (i.e. the fragment hits) to determine a score.

SuCOS scoring returns a value (saved as <Max_SuCOS_Score> in the SDF file) between 0 (poor) and 1 (good).

Hands On: SuCOS scoring

Max SuCOS score ( Galaxy version 2020.03.4+galaxy0) with the following parameters:

“Ligands to be scored”: Output of the TransFS step

“Set of clusters to score against”: Hits SDF

Rename the output file to Scored poses.

Compound selection

The aim of the original study was to select 500 candidate molecules for synthesis and experimental study. In order to do this, the data for all fragment hits had to be combined (i.e. so that each compound was assigned the lowest score from all the fragments). The resulting table was then compared with a list of compounds available from Enamine and Chemspace and the 500 highest scoring matching compounds were selected for purchase.

This step is skipped in the tutorial, as only 100 compounds were tested, using only a single fragment hit structure. If you want, though, check out the history and workflow used.

Conclusion

This tutorial guided you through docking and scoring of a small set of compounds to the MPro protein. Hopefully, you have a better understanding of how docking can be practically used, as well as how the original study was performed.

You've Finished the Tutorial

Key points

Galaxy can support large, rapid studies in computational chemistry

Protein-ligand docking contributes to the discovery of new drugs

Frequently Asked Questions

Have questions about this tutorial? Have a look at the available FAQ pages and support channels

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

References

O’Boyle, N. M., M. Banck, C. A. James, C. Morley, T. Vandermeersch et al., 2011 Open Babel: An open chemical toolbox. Journal of Cheminformatics 3: 10.1186/1758-2946-3-33
Ruiz-Carmona, S., D. Alvarez-Garcia, N. Foloppe, A. B. Garmendia-Doval, S. Juhos et al., 2014 rDock: A Fast, Versatile and Open Source Program for Docking Ligands to Proteins and Nucleic Acids. PLOS Computational Biology 10: 1–7. 10.1371/journal.pcbi.1003571
Rose, A. S., A. R. Bradley, Y. Valasatava, J. M. Duarte, A. Prlić et al., 2018 NGL viewer: web-based molecular graphics for large complexes. Bioinformatics 34: 3755–3758. 10.1093/bioinformatics/bty419
Leong, S. et al., 2019 SuCOS is Better than RMSD for Evaluating Fragment Elaboration and Docking Poses. 10.26434/chemrxiv.8100203.v1 https://chemrxiv.org/articles/SuCOS_is_Better_than_RMSD_for_Evaluating_Fragment_Elaboration_and_Docking_Poses/8100203/1
Ropp, P. J., J. C. Kaminsky, S. Yablonski, and J. D. Durrant, 2019 Dimorphite-DL: an open-source program for enumerating the ionization states of drug-like small molecules. Journal of Cheminformatics 11: 10.1186/s13321-019-0336-9
Scantlebury, J. et al., 2019 Dataset Augmentation Allows Deep Learning-Based Virtual Screening To Better Generalise To Unseen Target Classes, And Highlight Important Binding Interactions. https://www.biorxiv.org/content/10.1101/2020.03.06.979625v1

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Simon Bray, Virtual screening of the SARS-CoV-2 main protease with rxDock and pose scoring (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/computational-chemistry/tutorials/covid19-docking/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{computational-chemistry-covid19-docking,
author = "Simon Bray",
	title = "Virtual screening of the SARS-CoV-2 main protease with rxDock and pose scoring (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/computational-chemistry/tutorials/covid19-docking/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/computational-chemistry/tutorials/covid19-docking/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: ctb_frankenstein_ligand
  owner: bgruening
  revisions: 8e214e52e461
  tool_panel_section_label: ChemicalToolBox
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: ctb_frankenstein_ligand
  owner: bgruening
  revisions: 7255688c77f3
  tool_panel_section_label: ChemicalToolBox
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: enumerate_charges
  owner: bgruening
  revisions: 2a868592ebcb
  tool_panel_section_label: ChemicalToolBox
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: enumerate_charges
  owner: bgruening
  revisions: 67ee76f0e497
  tool_panel_section_label: ChemicalToolBox
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: openbabel_compound_convert
  owner: bgruening
  revisions: b59c91adeac1
  tool_panel_section_label: ChemicalToolBox
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: openbabel_compound_convert
  owner: bgruening
  revisions: e2c36f62e22f
  tool_panel_section_label: ChemicalToolBox
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: rdock_rbcavity
  owner: bgruening
  revisions: 744a777e9f90
  tool_panel_section_label: ChemicalToolBox
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: rdock_rbdock
  owner: bgruening
  revisions: a22969b08177
  tool_panel_section_label: ChemicalToolBox
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: rxdock_rbcavity
  owner: bgruening
  revisions: 7fac8e7a12d5
  tool_panel_section_label: ChemicalToolBox
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: rxdock_rbdock
  owner: bgruening
  revisions: 6fe6b0e10ed6
  tool_panel_section_label: ChemicalToolBox
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: split_file_to_collection
  owner: bgruening
  revisions: '0046692724f9'
  tool_panel_section_label: Collection Operations
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: split_file_to_collection
  owner: bgruening
  revisions: 6cbe2f30c2d7
  tool_panel_section_label: Collection Operations
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: sucos_max_score
  owner: bgruening
  revisions: d4c67ced6abc
  tool_panel_section_label: ChemicalToolBox
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: sucos_max_score
  owner: bgruening
  revisions: 9b48456a96fe
  tool_panel_section_label: ChemicalToolBox
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: xchem_transfs_scoring
  owner: bgruening
  revisions: f6f9b47d02b6
  tool_panel_section_label: ChemicalToolBox
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: collapse_collections
  owner: nml
  revisions: 830961c48e42
  tool_panel_section_label: Collection Operations
  tool_shed_url: https://toolshed.g2.bx.psu.edu/

t{ hist[0] | to_stars }} 2

t{ hist[0] | to_stars }} 1

t{ hist[0] | to_stars }} 1

t{ hist[0] | to_stars }} 1

September 2023

4 stars: Liked: Very clear. Most of it worked. Disliked: "Pose scoring with TransFS" crashed for me. Error message said "No score found for record xxxx" for every molecule