Wednesday, October 30, 2019

Sharing conda environments

This isn't directly an RDKit topic, but it came up in a couple of recent conversations and certainly is relevant to RDKit users so I figured I'd do a short blog post.
I always encourage people to use the Anaconda Python distribution since conda does such a great job of managing binary dependencies and handling separate environments (well, and because then they won't have to build the RDKit themselves... that makes a huge difference for most folks). By this point I think/hope most people know this. What's perhaps less well known is how straightforward it is to share a conda environment with someone else. That's what this post is about.
Suppose you've created a script or Jupyter notebook that you would like to share with others. Perhaps you've written a paper and want to allow people to reproduce what you did or you've uploaded a package to GitHub and you want to make it straightforward for others to work with it. As you probably know already, manually capturing and communicating the dependencies for your code can be "a bit" of a pain. If you work inside of a conda environment it's easy.
Here's an example. I start by creating an environment named rdkit_demo that has a set of packages that I know I'm going to be using for my project:
(base) glandrum@otter:~/RDKit_blog$ conda create -n rdkit_demo python=3.7 jupyter psycopg2 matplotlib pandas scikit-learn rdkit::rdkit
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

   ...snip...

#
# To activate this environment, use
#
#     $ conda activate rdkit_demo
#
# To deactivate an active environment, use
#
#     $ conda deactivate
A quick note on that command: I specified that the rdkit package should be installed from the rdkit channel (that's the rdkit::rdkit syntax) so that I can be sure that I'm getting the rdkit build I did instead of the conda-forge build. (There's nothing wrong with the conda-forge build! I'm just showing how to control what gets installed.)
I work with this environment for a while and then realize that I also need dask, so I add that to my environment (notice from the prompt that I'm working in the rdkit_demo environment):
(rdkit_demo) glandrum@otter:~/RDKit_blog$ conda install dask
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

   ...snip...

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(rdkit_demo) glandrum@otter:~/RDKit_blog$
After some more work I am ready to share the notebook I was working on with others. I need to capture what I have installed so that they can also set up an environment that works.
The easiest and least error-prone way for me to do this is to export the contents of the environment:
(rdkit_demo) glandrum@otter:~/RDKit_blog$ conda env export --from-history > rdkit_demo_env.yml
(rdkit_demo) glandrum@otter:~/RDKit_blog$ cat rdkit_demo_env.yml 
name: rdkit_demo
channels:
  - defaults
  - https://conda.anaconda.org/rdkit
  - conda-forge
dependencies:
  - scikit-learn
  - jupyter
  - rdkit::rdkit
  - pandas
  - python=3.7
  - psycopg2
  - matplotlib
  - dask
prefix: /home/glandrum/anaconda3/envs/rdkit_demo

(rdkit_demo) glandrum@otter:~/RDKit_blog$ 
I can create a new environment from the rdkit_demo_env.yml file like this:
(base) glandrum@otter:~/RDKit_blog$ conda env create -f rdkit_demo_env.yml 
Collecting package metadata (repodata.json): done
Solving environment: done
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate rdkit_demo
#
# To deactivate an active environment, use
#
#     $ conda deactivate

(base) glandrum@otter:~/RDKit_blog$ conda activate rdkit_demo
(rdkit_demo) glandrum@otter:~/RDKit_blog$ conda list | grep rdkit
# packages in environment at /home/glandrum/anaconda3/envs/rdkit_demo:
rdkit                     2019.09.1.0      py37hc20afe1_1    rdkit
Note that the environment file I created only includes the packages that I explicitly installed and only mentions package versions that I explicitly requested (in this case only the Python version was specified). If I want to include more detail, I have at least two options.
The first option is to edit the yml file and include version information for some/all of the packages I am installing. This is easy to do following the pattern seen above for the Python version. For example, if I want to specify that the most recent patch release from the previous RDKit release cycle should be used I would include this in the yml file:
  - rdkit::rdkit=2019.03.*
The second option is to have conda list full version information about all installed packages, including the dependencies of the manually installed packages:
(rdkit_demo) glandrum@otter:~/RDKit_blog$ conda env export
name: rdkit_demo
channels:
  - rdkit
  - defaults
  - https://conda.anaconda.org/rdkit
  - conda-forge
dependencies:
  - _libgcc_mutex=0.1=main
  - attrs=19.3.0=py_0
  - backcall=0.1.0=py37_0
  - blas=1.0=mkl
  - bleach=3.1.0=py37_0
  - bokeh=1.3.4=py37_0
  - bzip2=1.0.8=h7b6447c_0
  - ca-certificates=2019.10.16=0
  - cairo=1.14.12=h8948797_3
  - certifi=2019.9.11=py37_0

   ...snip...

  - rdkit=2019.09.1.0=py37hc20afe1_1
  - readline=7.0=h7b6447c_5
  - scikit-learn=0.21.3=py37hd81dba3_0

   ...snip...

  - zlib=1.2.11=h7b6447c_3
  - zstd=1.3.7=h0b5b093_0
prefix: /home/glandrum/anaconda3/envs/rdkit_demo

(rdkit_demo) glandrum@otter:~/RDKit_blog$
This includes full version and build information, which is helpful if you need to reproduce exactly the same environment. One significant limitation of this approach is that the environment is now almost certainly tied to the operating system on which it was created.
You can find more information on managing conda environments in the conda documentation: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html
As an aside: If I'm updating or installing on a new machine, I almost always install miniconda instead of the full Anaconda distribution. Anaconda is useful if you want to have most everything installed at once, but you pay the price of a huge install. Miniconda is, as the name indicates, minimal - you get just the stuff you need to set up an environment and can then manually install the packages you are interested in using. There's more on how to decide between the two here.

Saturday, October 5, 2019

A new "Lessel and Briem like" dataset

This is another one that ends up being a bit beyond what Blogger can handle, so this post is just a summary; full details are in the Jupyter notebook on GitHub (here's the nbviewer link).

Background and Intro

Back in the dark ages before we had access to ChEMBL (it's hard to believe that it was only 10 years ago!) there weren't a lot of datasets available for testing/validating new computational methods. One that became something of a standard for the validation of similarity-based methods was published by Uta Lessel and Hans Briem back in 2000: Lessel, U. F. & Briem, H. Flexsim-X: A Method for the Detection of Molecules with Similar Biological Activity. J. Chem. Inf. Comput. Sci. 40, 246–253 (2000). There's also a follow-up paper from later that year that uses the dataset again: Briem, H. & Lessel, U. F. In vitro and in silico affinity fingerprints: Finding similarities beyond structural classes. Perspectives in Drug Discovery and Design 20, 231–244 (2000). This dataset was derived from the MDDR (the "MDL Drug Data Report") and included five sets of actives (ligands for particular targets) and a set of assumed inactives (compounds drawn randomly from the MDDR).
At the RDKit UGM last week Roger Sayle - another member of the RDKit community who remembers and bears the scars from the pre-ChEMBL dark ages - mentioned the Briem and Lessel dataset and used it in his presentation about a new clustering algorithm implementation. I had some questions about Roger's presentation and wanted to play with the data a little bit, but I encountered a problem: licensing restrictions on the MDDR prevented Lessel and Briem from publishing the structures of the compounds they used. They did include the MDDR IDs in the supplementary material of the paper, so anyone with a copy of the MDDR could reproduce their work. (An aside: this seems pretty normal these days, but back then it wasn't all that common.) However, I don't have an MDDR license and Roger can't legally send me the data, so I was stuck. Since we do have ChEMBL now, I decided to create a similar type of dataset that I could use to answer some questions about similarity and activity. As you might expect, this led me off on a big yak-shaving expedition, but I did get a few potentially interesting datasets along with ideas/material for some blog posts. This is the first of those.
First a word on why I didn't just use the ChEMBL datasets from either our benchmarking paper (https://doi.org/10.1186/1758-2946-5-26) or the model-fusion paper (https://doi.org/10.1021/ci400466r). The dataset from the first paper (as well as "datasets 1" from the second one) is derived from a publication by Heikamp and Bajorath (https://doi.org/10.1021/ci200199u) and includes "actives" for a set of targets along with assumed inactives randomly picked from ChEMBL. For those sets "active" was defined to be a potency < 10uM against the target. For these new datasets I wanted to use a much stricter definition of "active".

I will do another post with a bit more detail about how I constructed the dataset (along with code), but here's a quick summary. For the purposes of this dataset I worked with ChEMBL25 and only considered Ki values (so that I can compare results between assays on the same target), and an "active" is defined to be a compound with a Ki value <= 1nM (ignoring qualified values). I then picked targets that have at least 200 "actives". This leaves me with 35 targets with between 205 and 934 "actives", listed below. For later use I also defined "inactive" for these targets to be compounds with a Ki value >= 100nM (again, ignoring qualified values). The exported dataset includes SMILES, ChEMBL IDs, active/inactive labels, and measured Ki values.
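To make those definitions concrete, here's a rough sketch of the selection logic in pandas. This is just my illustration of the rules above, not the actual extraction code; it assumes a DataFrame named ki_data containing ChEMBL25 Ki measurements with columns molecule_chembl_id, target_chembl_id, standard_relation, and standard_value (in nM):

# keep only unqualified Ki values (i.e. standard_relation is '=')
unqualified = ki_data[ki_data.standard_relation == '=']

# "active": Ki <= 1 nM; "inactive": Ki >= 100 nM
actives = unqualified[unqualified.standard_value <= 1]
inactives = unqualified[unqualified.standard_value >= 100]

# only keep targets that have at least 200 actives
counts = actives.groupby('target_chembl_id').molecule_chembl_id.nunique()
keep = counts[counts >= 200].index
actives = actives[actives.target_chembl_id.isin(keep)]
inactives = inactives[inactives.target_chembl_id.isin(keep)]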
I also needed a set of decoys that we will assume are inactive. For this dataset I picked molecules from ChEMBL25 that have a molecular weight between 250 and 800. Since they're assumed inactives, I figured I should avoid picking anything that has actually been seen to be active in any assay, so I further filtered the set to compounds that have only been tested in a single assay. I required that the single assay be one that measures IC50 and that the measured IC50 be > 1uM. I randomly picked about 4000 molecules satisfying these criteria.
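And a similarly rough sketch of those decoy filters (again just an illustration of the criteria above; assay_data is an assumed DataFrame with one row per ChEMBL25 activity measurement and columns molecule_chembl_id, assay_id, standard_type, standard_value in nM, and mw):

# compounds that have only been tested in a single assay
n_assays = assay_data.groupby('molecule_chembl_id').assay_id.nunique()
single = assay_data[assay_data.molecule_chembl_id.isin(n_assays[n_assays == 1].index)]

# that assay must report an IC50 and the measured value must be > 1 uM (1000 nM)
candidates = single[(single.standard_type == 'IC50') & (single.standard_value > 1000)]

# molecular weight between 250 and 800, then randomly pick about 4000 compounds
candidates = candidates[(candidates.mw >= 250) & (candidates.mw <= 800)]
decoys = candidates.drop_duplicates('molecule_chembl_id').sample(n=4000, random_state=42)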
It's worth mentioning at least one other dataset of the Lessel and Briem type (it's a blog, I'm not doing a full literature survey here): a larger (11 targets instead of 5), also MDDR-based dataset, for the validation of similarity-based virtual screening methods was published by the Sheffield and Novartis groups in 2004: Hert, J. et al. Comparison of Fingerprint-Based Methods for Virtual Screening Using Multiple Bioactive Reference Structures. Journal of Chemical Information and Modeling 44, 1177–1185 (2004). That dataset was also interesting, and the paper is an absolute classic, but we're not talking about that here.

Looking at the Lessel and Briem datasets

The goal of this exercise is to come up with something analogous to the Lessel and Briem dataset for at least some similarity comparison tests. Let's start by characterizing that data.
As I mentioned above, I don't have access to the data, but Roger agreed to run some comparisons and share the results with me. Here are the images along with a brief bit of description. The code to generate the corresponding images for the new dataset is below.
Let's start by creating plots similar to what Roger showed. To generate these, at each Tanimoto threshold point we look at the neighbor list for every active compound and calculate the fraction of neighbors that are from the same activity class. A "neighbor" in this context is any compound that has a similarity value to the active compound that is >= the threshold value. For the purposes of this plot, each compound is included in its own neighbor list.
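Here's a minimal sketch of that calculation. It's my own reconstruction rather than the code from the notebook, and it assumes you already have a list of fingerprints (fps) and a parallel list of class labels, with the decoys treated as their own class:

import numpy as np
from rdkit import DataStructs

def neighbor_fraction_curve(fps, labels, target_label, thresholds):
    # For the compounds in class target_label, compute the average fraction of
    # their neighbors (Tanimoto similarity >= threshold) that come from the same
    # class, at each threshold. Each compound is included in its own neighbor list.
    labels = np.asarray(labels)
    same_class = labels == target_label
    member_idx = np.where(same_class)[0]
    totals = np.zeros(len(thresholds))
    for i in member_idx:
        sims = np.array(DataStructs.BulkTanimotoSimilarity(fps[i], fps))
        for j, t in enumerate(thresholds):
            nbrs = sims >= t  # never empty: self-similarity is 1.0
            totals[j] += same_class[nbrs].mean()
    return totals / len(member_idx)

# e.g. fps built with rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, 2) for MFP2
# thresholds = np.arange(0.0, 1.001, 0.05)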
Here's the plot showing the curve for each of the five targets in the B&L sets using the MFP2 fingerprint (Morgan radius 2 with default parameters) to generate similarity. The vertical black line indicates the "noise threshold" for the MFP2 fingerprint that I calculated in an earlier blog post.
[Figure: neighbor-fraction curves for the five B&L activity classes, MFP2 fingerprint]
And here are all five curves for the RDKit fingerprint (default parameters):
[Figure: neighbor-fraction curves for the five B&L activity classes, RDKit fingerprint]
To make comparing the fingerprints easier, here are the averages of the two sets of curves:
[Figure: average neighbor-fraction curves for the two fingerprints]
Another plot worth looking at is the in-class similarity for each class of actives:
[Figure: in-class similarity distributions, MFP2 fingerprint]
and here's the same thing for the RDKit fingerprint:
[Figure: in-class similarity distributions, RDKit fingerprint]
In each of these plots class 0 corresponds to the in-class similarity for the decoy set.
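For reference, the in-class similarity values behind a plot like this can be collected with something along these lines (again a sketch of mine, assuming a list of fingerprints for a single class):

from rdkit import DataStructs

def in_class_similarities(fps):
    # all pairwise Tanimoto similarities among the members of one class
    sims = []
    for i in range(1, len(fps)):
        sims.extend(DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i]))
    return sims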
Unsurprisingly the RDKit fingerprint similarities are shifted towards higher values than the MFP2 similarities. The RDKit fingerprint tends to have more bits set for a given molecule (I should do a blog post on that too) and its "noise threshold" is significantly higher (0.51 instead of 0.31).
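That difference in the number of bits set is easy to check. Here's a quick illustration (the molecule is just an arbitrary example):

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

mol = Chem.MolFromSmiles('CC(=O)Nc1ccc(O)cc1')
mfp2 = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
rdkfp = Chem.RDKFingerprint(mol)  # default parameters, also 2048 bits
print(mfp2.GetNumOnBits(), rdkfp.GetNumOnBits())
# the RDKit fingerprint typically sets considerably more bits than MFP2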
What we see here more or less matches what Roger has in slides 4 and 5 of his presentation, with the addition of the RDKit fingerprints.
I'm going to skip a huge amount of detail here since blogger isn't cooperating.

Let's jump ahead to the end. Apologies that things get even uglier than normal from here on down:

Look at the cumulative distributions
[code and cumulative-distribution plots omitted; see the notebook]
That looks a lot better. Let's try the other plots again:
[code and plots omitted; see the notebook]
These are still shifted quite a bit further to the left than the plots for the original data. This is very likely due to the differences in the way decoys are used in the two experiments.
For now, let's take a look at the targets that have survived the analysis:
     tid     chembl_id      pref_name                                      organism           nActives  nInactives
 1   11      CHEMBL204      Thrombin                                       Homo sapiens            452        2098
 2   15      CHEMBL205      Carbonic anhydrase II                          Homo sapiens            283        3094
 4   10280   CHEMBL264      Histamine H3 receptor                          Homo sapiens            492         527
 5   51      CHEMBL214      Serotonin 1a (5-HT1a) receptor                 Homo sapiens            229        1079
 6   72      CHEMBL217      Dopamine D2 receptor                           Homo sapiens            273        3502
 7   10841   CHEMBL4552     Peripheral-type benzodiazepine receptor        Rattus norvegicus       205         188
 9   107     CHEMBL224      Serotonin 2a (5-HT2a) receptor                 Homo sapiens            290         981
10   125     CHEMBL229      Alpha-1a adrenergic receptor                   Homo sapiens            250         154
12   130     CHEMBL234      Dopamine D3 receptor                           Homo sapiens            373        1369
21   194     CHEMBL244      Coagulation factor X                           Homo sapiens            637        1752
22   14037   CHEMBL339      Dopamine D2 receptor                           Rattus norvegicus       260        1422
24   252     CHEMBL251      Adenosine A2a receptor                         Homo sapiens            238        2449
25   259     CHEMBL253      Cannabinoid CB2 receptor                       Homo sapiens            277        1660
28   10529   CHEMBL270      Mu opioid receptor                             Rattus norvegicus       580         940
29   11085   CHEMBL1946     Melatonin receptor 1B                          Homo sapiens            265          78
30   10576   CHEMBL273      Serotonin 1a (5-HT1a) receptor                 Rattus norvegicus       250        1223
31   104290  CHEMBL1907596  Neuronal acetylcholine receptor; alpha4/beta2  Rattus norvegicus       529         800
32   10627   CHEMBL3371     Serotonin 6 (5-HT6) receptor                   Homo sapiens            256         726
33   12198   CHEMBL313      Serotonin transporter                          Rattus norvegicus       220        1272
And let's write the actives from those sets out to a file so that it's easy for others to use them later. The decoys are in the decoys file ../data/BLSets_singleic50_decoys.txt.
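The write-out itself is just a one-liner along these lines (a sketch: the DataFrame name and the space separator are my assumptions; the exact code is in the notebook):

# the exported columns are SMILES, ChEMBL IDs, active/inactive labels, and Ki values
selected_actives.to_csv('../data/BLSets_selected_actives.txt', sep=' ', index=False)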
The data, as well as this notebook, are all in the github repo for this blog: https://github.com/greglandrum/rdkit_blog

The data for this is all in github:
  • The final set of actives picked: https://github.com/greglandrum/rdkit_blog/blob/master/data/BLSets_selected_actives.txt
  • The decoys: https://github.com/greglandrum/rdkit_blog/blob/master/data/BLSets_singleic50_decoys.txt
  • The full collection of Ki data: https://github.com/greglandrum/rdkit_blog/blob/master/data/BLSets_actives.txt
  • Metadata about the targets: https://github.com/greglandrum/rdkit_blog/blob/master/data/BLSets_actives_target_key.txt

Many thanks to Roger for the inspiration for this, generating the data on the original dataset, and for an email discussion as I was putting this together.