Sunday, March 6, 2016

The performance impact of virtualization on some basic RDKit tasks

Intro

After reading http://www.cybertec.at/2016/02/postgresql-on-hardware-vs-postgresql-on-virtualbox/ from Hans-Jürgen Schönig about performance differences between PostgreSQL running natively and PostgreSQL running in a VM, I got curious about the impact of virtualization on the RDKit. This is a brief exploration of that topic.

Test setup

Some technical details about the experiments first:

  • Code version: this experiment was done using the 2015.09.2 version of the RDKit, available from https://anaconda.org/rdkit.
  • Testing code: I used the script $RDBASE/Regress/Scripts/new_timings.py. This is a more time-consuming version of the standard RDKit benchmarking tests (a minimal timing sketch is shown after this list).
  • Python: 3.5.1, from Anaconda Python
  • Test machine: a Dell XPS desktop with a 3.6GHz i7-4790 CPU and 16GB of RAM.
  • Test OS (physical machine): Ubuntu 15.10
  • Vagrant configuration: Ubuntu 14.04 (Trusty) running in a 4GB VirtualBox VM
  • Docker configuration: Debian Jessie using Docker 1.10 (the post about setting this up is coming, but see the bottom of the post for the very minimal Dockerfile I used)
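
The numbers reported below come straight from new_timings.py. As a rough illustration of how per-step timings like these can be collected (a sketch only, not the actual benchmark code; the time_step and build_molecules_from_smiles names here are hypothetical):

import time

def time_step(label, fn, *args):
    # run one benchmark step and report the elapsed wall-clock time
    t1 = time.time()
    result = fn(*args)
    t2 = time.time()
    print("%s: %.1f seconds" % (label, t2 - t1))
    return result

# hypothetical usage for the first test in the list below:
# mols = time_step("T1: build from SMILES", build_molecules_from_smiles, smiles_list)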

Details of the tests:

The test set is 50K molecules pulled a few years ago from ZNP (a subset that no longer exists).

  1. Building the molecules from SMILES
  2. Generating canonical SMILES
  3. Reading 10K molecules from SDF
  4. Constructing 823 queries from SMARTS
  5. Running HasSubstructMatch() for the 50K molecules and 100 of the SMARTS (reproducibly randomly selected)
  6. Running GetSubstructMatches() for the 50K molecules and 100 of the SMARTS (reproducibly randomly selected)
  7. Reading the 428 SMARTS from $RDBASE/Data/SmartsLib/RLewis_smarts.txt
  8. Running HasSubstructMatch() for the 50K molecules and the 428 SMARTS
  9. Running GetSubstructMatches() for the 50K molecules and the 428 SMARTS
  10. Generating 50K mol blocks
  11. Calling Chem.BRICS.BreakBRICSBonds() on the 50K molecules
  12. Generating 2D coordinates for the 50K molecules
  13. Generating the RDKit fingerprint for the 50K molecules
  14. Generating Morgan fingerprints for the 50K molecules

Note that none of these need to do much in the way of I/O.
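
To give a sense of what these tests exercise, here is a single-molecule sketch of the corresponding RDKit calls (illustrative only, not the benchmark script; the SMILES and SMARTS are arbitrary examples):

from rdkit import Chem
from rdkit.Chem import AllChem, BRICS

# tests 1 and 2: build a molecule from SMILES, write a canonical SMILES
m = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')
canonical = Chem.MolToSmiles(m)

# tests 4-9: build a query from SMARTS and do substructure searches
patt = Chem.MolFromSmarts('c1ccccc1C(=O)[OH]')
has_match = m.HasSubstructMatch(patt)
matches = m.GetSubstructMatches(patt)

# test 10: generate a mol block
mb = Chem.MolToMolBlock(m)

# test 11: break the BRICS bonds
frags = BRICS.BreakBRICSBonds(m)

# test 12: generate 2D coordinates
AllChem.Compute2DCoords(m)

# tests 13 and 14: RDKit and Morgan fingerprints
rdk_fp = Chem.RDKFingerprint(m)
morgan_fp = AllChem.GetMorganFingerprintAsBitVect(m, 2)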

Results

All times are in seconds; column Tn corresponds to test n in the list above.

Env       T1     T2     T3     T4     T5     T6     T7     T8     T9     T10    T11    T12    T13    T14
Physical  12.6   6.1    5.0    0.0    56.3   60.7   0.0    163.6  168.7  18.5   44.6   15.8   64.8   5.0
Vagrant   12.9   6.5    5.0    0.1    56.0   61.4   0.0    164.2  168.5  19.3   45.5   16.1   68.5   5.1
Docker    12.6   6.2    4.9    0.0    54.5   59.8   0.0    161.5  162.6  18.4   43.8   15.4   67.9   5.0
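
To put those numbers in relative terms, here's a small helper (not part of the original benchmark) that computes the percent difference of each virtualized run vs. the physical machine, skipping the tests whose physical time rounds to 0.0:

physical = [12.6, 6.1, 5.0, 0.0, 56.3, 60.7, 0.0, 163.6, 168.7, 18.5, 44.6, 15.8, 64.8, 5.0]
vagrant  = [12.9, 6.5, 5.0, 0.1, 56.0, 61.4, 0.0, 164.2, 168.5, 19.3, 45.5, 16.1, 68.5, 5.1]
docker   = [12.6, 6.2, 4.9, 0.0, 54.5, 59.8, 0.0, 161.5, 162.6, 18.4, 43.8, 15.4, 67.9, 5.0]

for name, times in (("Vagrant", vagrant), ("Docker", docker)):
    diffs = [100.0 * (t - p) / p for t, p in zip(times, physical) if p > 0]
    print("%s: %+.1f%% to %+.1f%% vs. physical" % (name, min(diffs), max(diffs)))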

Conclusions

Comfortingly, running the code in a virtual machine or container doesn't have much, if any, impact on performance for this CPU-intensive test: most of the timings agree to within a few percent, the largest slowdowns (seen with Vagrant on a couple of the tests) are around 6-7%, and the Docker runs are actually marginally faster than the physical machine on several of the tests.

Technical details

Using Docker

Since ContinuumIO makes Docker images available with miniconda preconfigured, this turns out to be really simple. Here's the Dockerfile I used:

FROM continuumio/miniconda3
MAINTAINER Greg Landrum <greg.landrum@gmail.com>

ENV PATH /opt/conda/bin:$PATH
ENV LANG C

# install the RDKit:
RUN conda config --add channels  https://conda.anaconda.org/rdkit
RUN conda install -y rdkit

You can put that in an empty directory and then build a local image with the RDKit installed by running:

docker build -t basic_conda .

I wanted to mount my local RDKit checkout into the container when I ran it so that I had access to the Regress directory. This is easy to do with a volume:

docker run -i -t -v /scratch/RDKit_git:/opt/RDKit_git basic_conda /bin/bash
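
Once inside the container, a quick sanity check (optional, not part of the original workflow) is to confirm that the expected RDKit build is importable and has the right version:

python -c 'from rdkit import rdBase; print(rdBase.rdkitVersion)'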

And then I ran the benchmark with:

cd /opt/RDKit_git/Regress/Scripts
python new_timings.py
