Saturday, October 5, 2019

A new "Lessel and Briem like" dataset

This is another one that ends up being a bit beyond what Blogger can handle. So the post is just a summary; full details are in the Jupyter notebook in GitHub, here's the nbviewer link.

Background and Intro

Back in the dark ages before we had access to ChEMBL (it's hard to believe that it was only 10 years ago!) there weren't a lot of datasets available for testing/validating new computational methods. One that became something of a standard for the validation of similarity-based methods was published by Uta Lessel and Hans Briem back in 2000: Lessel, U. F. & Briem, H. Flexsim-X:  A Method for the Detection of Molecules with Similar Biological Activity. J. Chem. Inf. Comput. Sci. 40, 246–253 (2000). There's also a followup paper from later that year that uses the dataset again Briem, H. & Lessel, U. F. In vitro and in silico affinity fingerprints: Finding similarities beyond structural classes. Perspectives in Drug Discovery and Design 20, 231–244 (2000). This dataset was derived from the MDDR ("MACCS Drug Data Report") and included five sets of actives (ligands for particular targets) and a set of assumed inactives (compounds drawn randomly from the MDDR).
At the RDKit UGM last week Roger Sayle - another member of the RDKit community who remembers and bears the scars from the pre-ChEMBL dark ages - mentioned the Lessel and Briem dataset and used it in his presentation about a new clustering algorithm implementation. I had some questions about Roger's presentation and wanted to play with the data a little bit, but I encountered a problem: licensing restrictions on the MDDR prevented Lessel and Briem from publishing the structures of the compounds they used. They did include the MDDR IDs in the supplementary material of the paper, so anyone with a copy of the MDDR could reproduce their work. (An aside: this seems pretty normal these days, but back then it wasn't all that common.) However, I don't have an MDDR license and Roger can't legally send me the data, so there's a problem. Since we do have ChEMBL now, I decided to create a similar type of dataset that I could use to answer some questions about similarity and activity. As you might expect, this led me off on a big yak-shaving expedition, but I did get a few potentially interesting datasets along with ideas/material for some blog posts. This is the first of those.
First, a word on why I didn't just use the ChEMBL datasets from either our benchmarking paper (https://doi.org/10.1186/1758-2946-5-26) or the model-fusion paper (https://doi.org/10.1021/ci400466r). The dataset from the first paper (and "dataset 1" from the second one) is derived from a publication from Heikamp and Bajorath (https://doi.org/10.1021/ci200199u) and includes "actives" for a set of targets along with assumed inactives randomly picked from ChEMBL. For those sets "active" was defined to be a potency < 10uM against the target. For these new datasets I wanted to use a much stricter definition of "active".

I will do another post with a bit more detail about how I constructed the dataset (along with code), but here's a quick summary. For the purposes of this dataset I worked with ChEMBL25 and only considered Ki values (so that I can compare results between assays on the same target), and an "active" is defined to be a compound with a Ki value <= 1nM (ignoring qualified values). I then picked targets that have at least 200 "actives". This leaves me with 35 targets with between 205 and 934 "actives", listed below. For later use I also defined "inactive" for these targets to be compounds with a Ki value >= 100nM (again, ignoring qualified values). The exported dataset includes SMILES, ChEMBL IDs, active/inactive labels, and measured Ki values.
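Just to make those rules concrete, here's a minimal sketch of the labelling logic (the actual extraction code will be in the follow-up post; the DataFrame ki_df and its column names are hypothetical stand-ins for a ChEMBL25 Ki export, not the real schema I queried):

    import pandas as pd

    def label_ki_data(ki_df, active_cutoff=1.0, inactive_cutoff=100.0, min_actives=200):
        # keep only unqualified Ki values reported in nM
        df = ki_df[(ki_df.standard_type == 'Ki') &
                   (ki_df.standard_relation == '=') &
                   (ki_df.standard_units == 'nM')].copy()
        # label: "active" if Ki <= 1nM, "inactive" if Ki >= 100nM
        df['label'] = 'other'
        df.loc[df.standard_value <= active_cutoff, 'label'] = 'active'
        df.loc[df.standard_value >= inactive_cutoff, 'label'] = 'inactive'
        # only keep targets that have at least 200 "actives"
        active_counts = df[df.label == 'active'].groupby('tid').size()
        keep_tids = active_counts[active_counts >= min_actives].index
        return df[df.tid.isin(keep_tids)]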
I also needed a set of decoys that we will assume are inactive. For this dataset I picked molecules from ChEMBL25 that have a molecular weight between 250 and 800. Since they're assumed inactives, I figured I should avoid picking anything that has actually been seen to be active in any assay, so I further filtered the set to compounds that have only been tested in a single assay. I required that the single assay be one that measures IC50 and that the measured IC50 be > 1uM. I randomly picked about 4000 molecules satisfying these criteria.
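Again purely as an illustration of the selection rules described above (the column names here are assumptions, not the actual ChEMBL schema), the decoy picking boils down to something like this:

    import pandas as pd

    def pick_decoys(activity_df, n_decoys=4000, seed=42):
        # compounds that have only been tested in a single assay
        n_assays = activity_df.groupby('molecule_chembl_id')['assay_chembl_id'].nunique()
        singletons = n_assays[n_assays == 1].index
        df = activity_df[activity_df.molecule_chembl_id.isin(singletons)]
        # molecular weight between 250 and 800
        df = df[(df.mol_weight >= 250) & (df.mol_weight <= 800)]
        # the single assay must measure an IC50 and that IC50 must be > 1uM (1000 nM)
        df = df[(df.standard_type == 'IC50') & (df.standard_value > 1000)]
        # randomly pick ~4000 of what's left
        return df.sample(n=min(n_decoys, len(df)), random_state=seed)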
It's worth mentioning at least one other dataset of the Lessel and Briem type (it's a blog, I'm not doing a full literature survey here): a larger (11 targets instead of 5), also MDDR-based dataset, for the validation of similarity-based virtual screening methods was published by the Sheffield and Novartis groups in 2004: Hert, J. et al. Comparison of Fingerprint-Based Methods for Virtual Screening Using Multiple Bioactive Reference Structures. Journal of Chemical Information and Modeling 44, 1177–1185 (2004). That dataset was also interesting, and the paper is an absolute classic, but we're not talking about that here.

Looking at the Lessel and Briem datasets

The goal of this exercise is to come up with something analogous to the Lessel and Briem dataset for at least some similarity comparison tests. Let's start by characterizing that data.
As I mentioned above, I don't have access to the data, but Roger agreed to run some comparisons and share the results with me. Here are the images along with a brief description. The code to generate the corresponding images for the new dataset is below.
Let's start by creating plots similar to what Roger showed. To generate these, at each Tanimoto threshold point we look at the neighbor list for every active compound and calculate the fraction of neighbors that are from the same activity class. A "neighbor" in this context is any compound that has a similarity value to the active compound that is >= the threshold value. For the purposes of this plot, each compound is included in its own neighbor list.
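For reference, here's a minimal sketch of that calculation (the actual code, including the plotting, is in the linked notebook; the function name and the 2048-bit MFP2 setup are my own choices for this illustration):

    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Chem import rdMolDescriptors

    def same_class_neighbor_fractions(smiles, classes, thresholds, decoy_class=0):
        # classes: one integer label per compound; class 0 is the decoy set
        mols = [Chem.MolFromSmiles(s) for s in smiles]
        fps = [rdMolDescriptors.GetMorganFingerprintAsBitVect(m, 2, 2048) for m in mols]
        classes = np.asarray(classes)
        curves = {}
        for cls in sorted(set(classes) - {decoy_class}):
            idxs = np.where(classes == cls)[0]
            # similarity of each active in this class to every compound in the set
            sims = np.array([DataStructs.BulkTanimotoSimilarity(fps[i], fps) for i in idxs])
            fracs = []
            for t in thresholds:
                nbrs = sims >= t  # each compound is a neighbor of itself (self-similarity = 1.0)
                same_cls = (classes == cls)[None, :] & nbrs
                fracs.append(np.mean(same_cls.sum(axis=1) / nbrs.sum(axis=1)))
            curves[cls] = fracs
        return curves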
Here's the plot showing the curve for each of the five targets in the B&L sets using the MFP2 fingerprint (Morgan radius 2 with default parameters) to generate similarity. The vertical black line indicates the "noise threshold" for the MFP2 fingerprint that I calculated in an earlier blog post. The next plot shows all five curves for the RDKit fingerprint (default parameters).
To make comparing the fingerprints easier, here are the averages of the two sets of curves:
Another plot worth looking at is the in-class similarity for each class of actives, first for MFP2 and then for the RDKit fingerprint. In each of these plots class 0 corresponds to the in-class similarity for the decoy set.
Unsurprisingly, the curves for the RDKit fingerprint are shifted towards higher similarity values than those for MFP2. The RDKit fingerprint tends to have more bits set for a given molecule (I should do a blog post on that too) and its "noise threshold" is significantly higher (0.51 instead of 0.31).
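Here's a quick, purely illustrative way to see that bit-count difference for a single molecule (aspirin is just an arbitrary example):

    from rdkit import Chem
    from rdkit.Chem import rdMolDescriptors

    mol = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')  # aspirin, just as an example
    mfp2 = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, 2, 2048)
    rdkfp = Chem.RDKFingerprint(mol, fpSize=2048)
    print('MFP2 bits set:    ', mfp2.GetNumOnBits())
    print('RDKit FP bits set:', rdkfp.GetNumOnBits())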
What we see here more or less matches what Roger has in slides 4 and 5 of his presentation, with the addition of the RDKit fingerprints.
I'm going to skip a huge amount of detail here since blogger isn't cooperating.

Let's jump ahead to the end. Apologies that things get even uglier than normal from here on down:

Look at the cumulative distributions
That looks a lot better. Let's try the other plots again:
These are still shifted quite a bit further to the left than the plots for the original data. This is very likely due to the differences in the way decoys are used in the two experiments.
For now, let's take a look at the targets that have survived the analysis:
     | tid    | chembl_id     | pref_name                                     | organism          | nActives | nInactives
  1  | 11     | CHEMBL204     | Thrombin                                      | Homo sapiens      | 452      | 2098
  2  | 15     | CHEMBL205     | Carbonic anhydrase II                         | Homo sapiens      | 283      | 3094
  4  | 10280  | CHEMBL264     | Histamine H3 receptor                         | Homo sapiens      | 492      | 527
  5  | 51     | CHEMBL214     | Serotonin 1a (5-HT1a) receptor                | Homo sapiens      | 229      | 1079
  6  | 72     | CHEMBL217     | Dopamine D2 receptor                          | Homo sapiens      | 273      | 3502
  7  | 10841  | CHEMBL4552    | Peripheral-type benzodiazepine receptor       | Rattus norvegicus | 205      | 188
  9  | 107    | CHEMBL224     | Serotonin 2a (5-HT2a) receptor                | Homo sapiens      | 290      | 981
 10  | 125    | CHEMBL229     | Alpha-1a adrenergic receptor                  | Homo sapiens      | 250      | 154
 12  | 130    | CHEMBL234     | Dopamine D3 receptor                          | Homo sapiens      | 373      | 1369
 21  | 194    | CHEMBL244     | Coagulation factor X                          | Homo sapiens      | 637      | 1752
 22  | 14037  | CHEMBL339     | Dopamine D2 receptor                          | Rattus norvegicus | 260      | 1422
 24  | 252    | CHEMBL251     | Adenosine A2a receptor                        | Homo sapiens      | 238      | 2449
 25  | 259    | CHEMBL253     | Cannabinoid CB2 receptor                      | Homo sapiens      | 277      | 1660
 28  | 10529  | CHEMBL270     | Mu opioid receptor                            | Rattus norvegicus | 580      | 940
 29  | 11085  | CHEMBL1946    | Melatonin receptor 1B                         | Homo sapiens      | 265      | 78
 30  | 10576  | CHEMBL273     | Serotonin 1a (5-HT1a) receptor                | Rattus norvegicus | 250      | 1223
 31  | 104290 | CHEMBL1907596 | Neuronal acetylcholine receptor; alpha4/beta2 | Rattus norvegicus | 529      | 800
 32  | 10627  | CHEMBL3371    | Serotonin 6 (5-HT6) receptor                  | Homo sapiens      | 256      | 726
 33  | 12198  | CHEMBL313     | Serotonin transporter                         | Rattus norvegicus | 220      | 1272
And let's write the actives from those sets out to a file so that it's easy for others to use them later. The decoys are in the decoys file ../data/BLSets_singleic50_decoys.txt.
The data, as well as this notebook, are all in the github repo for this blog: https://github.com/greglandrum/rdkit_blog

The data for this is all in github:
  • The final set of actives picked: https://github.com/greglandrum/rdkit_blog/blob/master/data/BLSets_selected_actives.txt
  • The decoys: https://github.com/greglandrum/rdkit_blog/blob/master/data/BLSets_singleic50_decoys.txt
  • The full collection of Ki data: https://github.com/greglandrum/rdkit_blog/blob/master/data/BLSets_actives.txt
  • Metadata about the targets: https://github.com/greglandrum/rdkit_blog/blob/master/data/BLSets_actives_target_key.txt

Many thanks to Roger for the inspiration for this, generating the data on the original dataset, and for an email discussion as I was putting this together.
