Friday, November 15, 2019

Why the RDKit isn't available on PyPI

This keeps coming up and is an important question with a not-particularly-satisfying answer, so I want to capture the answer and explanation in one easy-to-point-to place...

Let me start with an important point: it would be awesome if we could distribute the RDKit via PyPI so that people could do pip install rdkit and have things just work. Unfortunately, I don't think this is possible.

The core problem is that the RDKit is not a pure Python package; it's a mix of Python and some compiled extension modules (shared libraries). On Mac and Linux those compiled extension modules have a dependency on another set of shared libraries containing the core RDKit functionality. On all platforms the extension modules also have a dependency on some shared libraries containing functionality from the Boost libraries. All of these libraries, in the proper versions, need to be installed in the appropriate places in order for from rdkit import Chem to work from Python. It's complex and not something that you would want to have to deal with on your own. It is, unfortunately, also not something that pip was designed to handle. Thus, you can't just "pip install rdkit".
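To make the "pure Python vs. compiled extension module" distinction concrete, here is a minimal sketch that uses only the standard library to report how a given module is implemented. The helper name module_kind is hypothetical (it is not part of the RDKit or of pip), and the use of rdkit.Chem.rdchem in the usage line assumes an RDKit install that exposes that extension module.

```python
import importlib.machinery
import importlib.util


def module_kind(name):
    """Classify a module as 'pure-python', 'extension', 'builtin',
    'namespace', or 'missing' (hypothetical helper, for illustration only)."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # The parent package is not importable, or the module has no spec.
        return "missing"
    if spec is None:
        return "missing"
    origin = spec.origin
    if origin is None:
        # Namespace packages have no single file of origin.
        return "namespace"
    if origin in ("built-in", "frozen"):
        # Compiled into (or frozen into) the interpreter itself.
        return "builtin"
    # Compiled extension modules end in a platform-specific suffix
    # such as '.so' or '.pyd'.
    if any(origin.endswith(s) for s in importlib.machinery.EXTENSION_SUFFIXES):
        return "extension"
    return "pure-python"


if __name__ == "__main__":
    # The last name assumes the RDKit is installed; otherwise it is 'missing'.
    for mod in ("json", "sys", "rdkit.Chem.rdchem"):
        print(mod, "->", module_kind(mod))
```

Anything reported as an extension is a shared library, and it can only be loaded if the shared libraries it links against can also be found at import time. That is the part that pip was not designed to manage.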

Fortunately, we do have a solution to this whole mess: use conda to install the RDKit. conda was designed to solve exactly the cross-platform dependency management problem described above. And it does it pretty well. If you have misgivings about conda, please take a look at this post to make sure that your concerns aren't based on one of the common myths or misconceptions about conda: https://jakevdp.github.io/blog/2016/08/25/conda-myths-and-misconceptions/
The post is long, but it's an easy read. Jake even points out how you can install conda inside of a standard Python install (pip install conda) and then create conda environments there. So you don't even need to install either the Anaconda or Miniconda distributions yourself.

I would be really happy if it turns out that I'm wrong about this and it is, in fact, possible to get things set up so that the RDKit can be made "pip installable" and distributed through PyPI. But I've spent a fair amount of time on this and have not found anything that seems stable and supportable over the long term on all three supported operating systems (note that the "easy" solution of bundling all the RDKit and Boost binaries into one giant wheel and just distributing that is neither stable nor supportable).

Answers to some hypothetical questions:

Can't you just statically link the RDKit code in the extension modules?

Theoretically, but it would increase the size of the packages pretty dramatically (the statically linked Windows conda packages are about twice as big as those for Linux or the Mac). Plus we would still have to worry about the Boost dependencies.

Can't you just statically link the Boost libraries?

All of the libraries except Boost::Python could be statically linked. But Boost::Python needs to be dynamically linked if you want to share types across extension modules. And we definitely want to be able to do that. Here's a reference for this.

Can't you just switch to using one extension module?

Theoretically, but that module would be huge and, I assume, take a very long time to load. This is a price that you would pay every time you do import rdkit for the first time in a Python interpreter. It's also horrible. And we'd have to rewrite a bunch of wrapper code.
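If you want to see what "the first import in an interpreter" costs for any module, a rough sketch like the following works. The helper name time_import is hypothetical, and json is used only as a stand-in that is always available; swapping in rdkit assumes the RDKit is installed. The numbers depend heavily on the machine and on disk caching, so treat them as indicative only.

```python
import importlib
import time


def time_import(name):
    """Return the wall-clock seconds spent in importlib.import_module(name)."""
    start = time.perf_counter()
    importlib.import_module(name)
    return time.perf_counter() - start


if __name__ == "__main__":
    # The first call pays the full load cost unless the module was already
    # imported; the second call is served from the sys.modules cache.
    first = time_import("json")
    cached = time_import("json")
    print(f"first import:  {first:.6f} s")
    print(f"cached import: {cached:.6f} s")
```

For a per-module breakdown, the interpreter's own -X importtime option (python -X importtime -c "import json") gives more detail than this kind of stopwatch measurement.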

How about just not using Boost?

I'm pretty conservative about adding mandatory external dependencies to the RDKit, but when it comes to solving difficult problems like doing good C++-Python wrappers, I'm a huge fan of using high-quality libraries that other people write and support.
There's some chance that we could use pybind11 as a replacement for Boost::Python - the project looks great and seems quite active - but even doing a test to see whether or not that's possible is a seriously non-trivial amount of work, so it's hard to even find the time to get started with that.

