Intro¶
After reading http://www.cybertec.at/2016/02/postgresql-on-hardware-vs-postgresql-on-virtualbox/ by Hans-Jürgen Schönig about the performance differences between PostgreSQL running natively and PostgreSQL running in a VM, I got curious about the impact of virtualization on the RDKit. This is a brief exploration of that topic.
Test setup¶
Some technical details about the experiments first:
- Code version: this experiment was done using the 2015.09.2 version of the RDKit, available from https://anaconda.org/rdkit/rdkit.
- Testing code: I used `$RDBASE/Regress/Scripts/new_timings.py`. This is a more time-consuming version of the standard RDKit benchmarking tests.
- Python: 3.5.1, from anaconda python
- Test machine: a Dell XPS desktop with a 3.6GHz i7-4790 CPU and 16GB of RAM.
- Test OS (physical machine): Ubuntu 15.10
- Vagrant configuration: Ubuntu 14.04 (Trusty) running in a 4GB VirtualBox VM
- Docker configuration: Debian Jessie using Docker 1.10 (the post about setting this up is coming, but see the bottom of the post for the very minimal Dockerfile I used)
Details of the tests:¶
The test set is 50K molecules pulled a few years ago from ZNP (a subset that no longer exists).
- Building the molecules from SMILES
- Generating canonical SMILES
- Reading 10K molecules from SDF
- Constructing 823 queries from SMARTS
- Running `HasSubstructMatch()` for the 50K molecules and 100 of the SMARTS (reproducibly randomly selected)
- Running `GetSubstructMatches()` for the 50K molecules and 100 of the SMARTS (reproducibly randomly selected)
- Reading the 428 SMARTS from `$RDBASE/Data/SmartsLib/RLewis_smarts.txt`
- Running `HasSubstructMatch()` for the 50K molecules and the 428 SMARTS
- Running `GetSubstructMatches()` for the 50K molecules and the 428 SMARTS
- Generating 50K mol blocks
- Calling `Chem.BRICS.BreakBRICSBonds()` on the 50K molecules
- Generating 2D coordinates for the 50K molecules
- Generating the RDKit fingerprint for the 50K molecules
- Generating Morgan fingerprints for the 50K molecules
Note that none of these need to do much in the way of I/O.
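The "reproducibly randomly selected" subsets of SMARTS can be produced with a fixed-seed random number generator. Here's a minimal sketch of that idea; the seed value, function name, and placeholder query list are my own assumptions, not taken from the benchmark script:

```python
import random

def pick_queries(queries, n, seed=23):
    """Reproducibly select n queries: the same seed always yields
    the same subset, so timings stay comparable between runs."""
    rng = random.Random(seed)  # local RNG; doesn't touch global state
    return rng.sample(queries, n)

# stand-ins for the 823 SMARTS queries used in the benchmark
queries = [f"query_{i}" for i in range(823)]
subset = pick_queries(queries, 100)
assert subset == pick_queries(queries, 100)  # same seed -> same 100 queries
```

Using a local `random.Random` instance rather than the module-level functions keeps the selection independent of anything else in the script that might consume random numbers.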
Results¶
The columns T1-T14 correspond, in order, to the fourteen tests listed above.

Env | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 | T10 | T11 | T12 | T13 | T14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Physical | 12.6 | 6.1 | 5.0 | 0.0 | 56.3 | 60.7 | 0.0 | 163.6 | 168.7 | 18.5 | 44.6 | 15.8 | 64.8 | 5.0 |
Vagrant | 12.9 | 6.5 | 5.0 | 0.1 | 56.0 | 61.4 | 0.0 | 164.2 | 168.5 | 19.3 | 45.5 | 16.1 | 68.5 | 5.1 |
Docker | 12.6 | 6.2 | 4.9 | 0.0 | 54.5 | 59.8 | 0.0 | 161.5 | 162.6 | 18.4 | 43.8 | 15.4 | 67.9 | 5.0 |
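Summing each row gives a quick feel for the overall overhead. The totals below are my own arithmetic on the numbers in the table, not figures from the benchmark output:

```python
# per-test times (T1..T14) copied from the table above
times = {
    "Physical": [12.6, 6.1, 5.0, 0.0, 56.3, 60.7, 0.0, 163.6, 168.7, 18.5, 44.6, 15.8, 64.8, 5.0],
    "Vagrant":  [12.9, 6.5, 5.0, 0.1, 56.0, 61.4, 0.0, 164.2, 168.5, 19.3, 45.5, 16.1, 68.5, 5.1],
    "Docker":   [12.6, 6.2, 4.9, 0.0, 54.5, 59.8, 0.0, 161.5, 162.6, 18.4, 43.8, 15.4, 67.9, 5.0],
}
totals = {env: round(sum(ts), 1) for env, ts in times.items()}
baseline = totals["Physical"]
for env, total in totals.items():
    # overhead relative to running on the physical machine
    print(f"{env}: {total} ({100 * (total - baseline) / baseline:+.1f}%)")
```

Vagrant comes out about 1% slower in total, and Docker's total is actually slightly below the physical machine's, i.e. the differences are within run-to-run noise.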
Conclusions¶
Comfortingly, running the code in a virtual environment or container doesn't have much, if any, impact on performance for this CPU-intensive test.
Technical details¶
Using Docker¶
Since ContinuumIO makes Docker images with miniconda preconfigured, this turns out to be really simple. Here's the Dockerfile I used:
FROM continuumio/miniconda3
MAINTAINER Greg Landrum <greg.landrum@gmail.com>
ENV PATH /opt/conda/bin:$PATH
ENV LANG C
# install the RDKit:
RUN conda config --add channels https://conda.anaconda.org/rdkit
RUN conda install -y rdkit
You can put that in an empty directory and then build a local image with the RDKit installed by running:
docker build -t basic_conda .
I wanted to mirror my local RDKit checkout into the image when I ran it so that I had access to the `Regress` directory. This is easy to do:
docker run -i -t -v /scratch/RDKit_git:/opt/RDKit_git basic_conda /bin/bash
And then I ran the benchmark with:
cd /opt/RDKit_git/Regress/Scripts
python new_timings.py