Intro

After reading http://www.cybertec.at/2016/02/postgresql-on-hardware-vs-postgresql-on-virtualbox/ from Hans-Jürgen Schönig about performance differences between PostgreSQL running natively and PostgreSQL running in a VM, I got curious about the impact of virtualization on the RDKit. This is a brief exploration of that topic.

Test setup

Some technical details about the experiments first:

Code version: this experiment was done using the 2015.09.2 version of the RDKit available from https://anaconda.org/rdkit/rdkit.
Testing code: I used the code in $RDBASE/Regress/Scripts/new_timings.py. This is a more time-consuming version of the standard RDKit benchmarking tests.
Python: 3.5.1, from Anaconda Python
Test machine: a Dell XPS desktop with a 3.6GHz i7-4790 CPU and 16GB of RAM.
Test OS (physical machine): Ubuntu 15.10
Vagrant configuration: Ubuntu 14.04 (Trusty) running in a 4GB VirtualBox VM
Docker configuration: Debian Jessie using Docker 1.10 (the post about setting this up is coming, but see the bottom of the post for the very minimal Dockerfile I used)

Details of the tests:

The test set is 50K molecules pulled from ZNP (a subset that no longer exists) a few years ago.

The individual tests, numbered to match the T1 to T14 columns in the results below, are:

1. Building the molecules from SMILES
2. Generating canonical SMILES
3. Reading 10K molecules from SDF
4. Constructing 823 queries from SMARTS
5. Running HasSubstructMatch() for the 50K molecules and 100 of the SMARTS (reproducibly randomly selected)
6. Running GetSubstructMatches() for the 50K molecules and 100 of the SMARTS (reproducibly randomly selected)
7. Reading the 428 SMARTS from $RDBASE/Data/SmartsLib/RLewis_smarts.txt
8. Running HasSubstructMatch() for the 50K molecules and the 428 SMARTS
9. Running GetSubstructMatches() for the 50K molecules and the 428 SMARTS
10. Generating 50K mol blocks
11. Calling Chem.BRICS.BreakBRICSBonds() on the 50K molecules
12. Generating 2D coordinates for the 50K molecules
13. Generating the RDKit fingerprint for the 50K molecules
14. Generating Morgan fingerprints for the 50K molecules

Note that none of these need to do much in the way of I/O.
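To give a feel for what these tests exercise, here is a minimal sketch of a timing harness applied to a few of the operations listed above. This is not the code in new_timings.py; the helper names `time_step` and `run_rdkit_steps` are my own for illustration, and the RDKit import is deferred into the function so the harness itself only needs the standard library.

```python
import time


def time_step(label, func, items):
    """Apply func to every item and return (label, elapsed_seconds, results)."""
    start = time.perf_counter()
    results = [func(item) for item in items]
    elapsed = time.perf_counter() - start
    return label, elapsed, results


def run_rdkit_steps(smiles_list, smarts_list):
    """Time a handful of the listed operations. Requires the RDKit."""
    # Imported here so that time_step() stays usable without the RDKit.
    from rdkit import Chem
    from rdkit.Chem import AllChem, BRICS

    timings = []

    label, elapsed, mols = time_step("mols from SMILES", Chem.MolFromSmiles, smiles_list)
    timings.append((label, elapsed))
    mols = [m for m in mols if m is not None]  # drop anything that failed to parse

    label, elapsed, queries = time_step("queries from SMARTS", Chem.MolFromSmarts, smarts_list)
    timings.append((label, elapsed))
    queries = [q for q in queries if q is not None]

    def count_hits(mol):
        # Number of queries that match this molecule at least once.
        return sum(mol.HasSubstructMatch(q) for q in queries)

    steps = [
        ("canonical SMILES", Chem.MolToSmiles),
        ("HasSubstructMatch", count_hits),
        ("mol blocks", Chem.MolToMolBlock),
        ("BRICS bond breaking", BRICS.BreakBRICSBonds),
        ("2D coordinates", AllChem.Compute2DCoords),
        ("RDKit fingerprints", Chem.RDKFingerprint),
    ]
    for label, func in steps:
        label, elapsed, _ = time_step(label, func, mols)
        timings.append((label, elapsed))
    return timings
```

With the RDKit installed, something like `run_rdkit_steps(["CCO", "c1ccccc1O"], ["[OH]", "c1ccccc1"])` returns a list of (label, seconds) pairs. Everything here runs in memory, which is consistent with the point that these tests are CPU-bound.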

Results

Env	T1	T2	T3	T4	T5	T6	T8	T9	T10	T11	T12	T13	T14
Physical	12.6	6.1	5.0	0.0	56.3	60.7	163.6	168.7	18.5	44.6	15.8	64.8	5.0
Vagrant	12.9	6.5	5.0	0.1	56.0	61.4	164.2	168.5	19.3	45.5	16.1	68.5	5.1
Docker	12.6	6.2	4.9	0.0	54.5	59.8	161.5	162.6	18.4	43.8	15.4	67.9	5.0
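One way to summarize the table is to compare each environment with the physical machine. The sketch below does that, using only the numbers from the table (T7 is not reported there). The variable and function names are mine, and since the table gives a single measurement per test, small differences may well be within run-to-run noise.

```python
# Timings copied from the results table; T7 is not reported there.
tests = ["T1", "T2", "T3", "T4", "T5", "T6", "T8", "T9",
         "T10", "T11", "T12", "T13", "T14"]
results = {
    "Physical": [12.6, 6.1, 5.0, 0.0, 56.3, 60.7, 163.6, 168.7, 18.5, 44.6, 15.8, 64.8, 5.0],
    "Vagrant": [12.9, 6.5, 5.0, 0.1, 56.0, 61.4, 164.2, 168.5, 19.3, 45.5, 16.1, 68.5, 5.1],
    "Docker": [12.6, 6.2, 4.9, 0.0, 54.5, 59.8, 161.5, 162.6, 18.4, 43.8, 15.4, 67.9, 5.0],
}


def percent_change(value, baseline):
    """Relative difference from the baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


# Total time per environment, and its change relative to the physical machine.
totals = {env: sum(times) for env, times in results.items()}
total_change = {
    env: percent_change(total, totals["Physical"])
    for env, total in totals.items()
    if env != "Physical"
}

# Per-test change relative to the physical machine.
per_test_change = {}
for env, times in results.items():
    if env == "Physical":
        continue
    per_test_change[env] = {
        test: percent_change(t, base)
        for test, t, base in zip(tests, times, results["Physical"])
        if base > 0  # T4 is reported as 0.0 on the physical machine, so skip it
    }

for env, change in total_change.items():
    print("{}: total {:.1f} ({:+.1f}% vs. physical)".format(env, totals[env], change))
```

With the numbers above the totals are 621.7 (physical), 629.1 (Vagrant) and 612.6 (Docker), so the Vagrant total is about 1.2% higher than the physical one and the Docker total about 1.5% lower.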

Conclusions

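Comfortingly, running the code in a virtual machine or container doesn't have much, if any, impact on performance for this CPU-intensive test: apart from T4, which is too short to compare, every timing in the VM or container is within about 7% of the physical machine, and most are within about 3%.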

Technical details

Using Docker

Since ContinuumIO makes Docker images with miniconda preconfigured available, this turns out to be really simple. Here's the Dockerfile I used:

FROM continuumio/miniconda3
MAINTAINER Greg Landrum <greg.landrum@gmail.com>

ENV PATH /opt/conda/bin:$PATH
ENV LANG C

# install the RDKit:
RUN conda config --add channels  https://conda.anaconda.org/rdkit
RUN conda install -y rdkit

You can put that in an empty directory and then build a local image with the RDKit installed by running:

docker build -t basic_conda .

I wanted to mirror my local RDKit checkout into the image when I ran it so that I had access to the Regress directory. This is easy to do:

docker run -i -t -v /scratch/RDKit_git:/opt/RDKit_git basic_conda /bin/bash

And then I ran the benchmark with:

cd /opt/RDKit_git/Regress/Scripts
python new_timings.py

Sunday, March 6, 2016

The performance impact of virtualization on some basic RDKit tasks