After reading http://www.cybertec.at/2016/02/postgresql-on-hardware-vs-postgresql-on-virtualbox/ from Hans-Jürgen Schönig about performance differences between PostgreSQL running natively and PostgreSQL running in a VM, I got curious about the impact of virtualization on the RDKit. This is a brief exploration of that topic.
Some technical details about the experiments first:
- Code version: this experiment was done using the 2015.09.2 version of the RDKit available from https://anaconda.org/rdkit/rdkit.
- Testing code: I used $RDBASE/Regress/Scripts/new_timings.py. This is a more time-consuming version of the standard RDKit benchmarking tests.
- Python version: 3.5.1, from Anaconda Python
- Test machine: a Dell XPS desktop with a 3.6GHz i7-4790 CPU and 16GB of RAM.
- Test OS (physical machine): Ubuntu 15.10
- Vagrant configuration: Ubuntu 14.04 (Trusty) running in a 4GB VirtualBox VM
- Docker configuration: Debian Jessie using Docker 1.10 (the post about setting this up is coming, but see the bottom of the post for the very minimal Dockerfile I used)
Details of the tests:
The test set is 50K molecules pulled from ZNP (a subset that no longer exists) a few years ago. The benchmark runs the following steps:
- Building the molecules from SMILES
- Generating canonical SMILES
- Reading 10K molecules from SDF
- Constructing 823 queries from SMARTS
- HasSubstructMatch() for the 50K molecules and 100 of the SMARTS (reproducibly randomly selected)
- GetSubstructMatches() for the 50K molecules and 100 of the SMARTS (reproducibly randomly selected)
- Reading the 428 SMARTS from
- HasSubstructMatch() for the 50K molecules and the 428 SMARTS
- GetSubstructMatches() for the 50K molecules and the 428 SMARTS
- Generating 50K mol blocks
- Chem.BRICS.BreakBRICSBonds() on the 50K molecules
- Generating 2D coordinates for the 50K molecules
- Generating the RDKit fingerprint for the 50K molecules
- Generating Morgan fingerprints for the 50K molecules
Note that none of these need to do much in the way of I/O.
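To make the shape of such a benchmark concrete, here is a minimal sketch of how a few of the steps above could be timed. This is not the code in new_timings.py: the helper names (pick_reproducibly, time_steps, rdkit_steps) and the seed value are hypothetical and chosen only for illustration. The RDKit calls used (Chem.MolFromSmiles, Chem.MolToSmiles, Chem.MolFromSmarts, HasSubstructMatch) are standard ones.

```python
import random
import time


def pick_reproducibly(items, k, seed=23):
    """Return k items chosen at random, identically on every run.

    A private Random instance with a fixed seed keeps the selection
    stable without touching the global random state. The seed value is
    arbitrary (an assumption, not taken from new_timings.py).
    """
    return random.Random(seed).sample(list(items), k)


def time_steps(steps):
    """Run (label, callable) pairs in order; return [(label, seconds)]."""
    timings = []
    for label, func in steps:
        start = time.perf_counter()
        func()
        timings.append((label, time.perf_counter() - start))
    return timings


def rdkit_steps(smiles, smarts, n_queries=100):
    """Build a few benchmark-like steps (hypothetical helper; needs the RDKit).

    The steps share state, so they must be run in the order returned.
    """
    # Imported lazily so the two helpers above stay standard-library only.
    from rdkit import Chem

    mols = []
    queries = []

    def build_mols():
        # Skip SMILES that fail to parse (MolFromSmiles returns None).
        mols.extend(m for m in map(Chem.MolFromSmiles, smiles) if m is not None)

    def canonical_smiles():
        for m in mols:
            Chem.MolToSmiles(m)

    def build_queries():
        picked = pick_reproducibly(smarts, min(n_queries, len(smarts)))
        queries.extend(q for q in map(Chem.MolFromSmarts, picked) if q is not None)

    def has_substruct_match():
        for m in mols:
            for q in queries:
                m.HasSubstructMatch(q)

    return [
        ("build molecules from SMILES", build_mols),
        ("generate canonical SMILES", canonical_smiles),
        ("build SMARTS queries", build_queries),
        ("HasSubstructMatch", has_substruct_match),
    ]
```

With the RDKit installed, calling time_steps(rdkit_steps(smiles_list, smarts_list)) on two lists of strings returns one (label, seconds) pair per step. Everything happens in memory once the inputs are loaded, which is why a CPU-bound run like this is a reasonable probe of virtualization overhead.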
Comfortingly, running the code in a virtual environment or container doesn't have much, if any, impact on performance for this CPU-intensive test.
Since ContinuumIO makes Docker images with miniconda preconfigured available, building a Docker image with the RDKit installed turns out to be really simple. Here's the Dockerfile I used:
FROM continuumio/miniconda3
MAINTAINER Greg Landrum <firstname.lastname@example.org>
ENV PATH /opt/conda/bin:$PATH
ENV LANG C
# install the RDKit:
RUN conda config --add channels https://conda.anaconda.org/rdkit
RUN conda install -y rdkit
You can put that in an empty directory and then build a local image with the RDKit installed by running:
docker build -t basic_conda .
I wanted to mirror my local RDKit checkout into the image when I ran it so that I had access to the Regress directory. This is easy to do:
docker run -i -t -v /scratch/RDKit_git:/opt/RDKit_git basic_conda /bin/bash
And then I ran the benchmark with:
cd /opt/RDKit_git/Regress/Scripts
python new_timings.py