Friday, April 28, 2017

Using custom fingerprints in PostgreSQL

A question recently came up on the mailing list about how to use custom fingerprints with the RDKit PostgreSQL cartridge without having to edit the cartridge code itself. Since the answer isn't trivial but may be useful to others, I'm writing it up as a blog post.

We'll start the usual way, with a bunch of imports. In the interests of being maximally explicit and having this whole notebook be normal Python code, I'm handling the database connection with the usual psycopg2 connector to PostgreSQL.

In [1]:
from rdkit import Chem
from rdkit import DataStructs
from rdkit import rdBase
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
import psycopg2,time
print(time.asctime(),"rdkit=",rdBase.rdkitVersion)
Wed Apr 26 13:22:19 2017 rdkit= 2017.03.1

Read in a set of molecule data that we've used before:

In [2]:
with open('../data/chembl16_2010-2012.smi') as inf:
    data = [x.strip().split() for x in inf]
len(data)
Out[2]:
234681

Establish a connection to the database we'll use and insert that data:

In [3]:
conn = psycopg2.connect(database='rdkit_blog_demo')
curs = conn.cursor()
curs.execute('create extension if not exists rdkit')
curs.execute('drop table if exists raw_data')
curs.execute('create table raw_data (smiles text,molregno int)')
curs.executemany('insert into raw_data values (%s,%s)',data)
conn.commit()

Create a molecule table with the first 10K rows (we don't need all the data for the purposes of this post):

In [4]:
curs.execute('drop table if exists mols')
curs.execute('select molregno,mol_from_smiles(smiles::cstring) as m into mols from raw_data limit 10000')
conn.commit()

Now generate our custom fingerprints for those molecules. Here we're using one of the Atom Pair/Topological Torsion variants that the RDKit provides:

In [5]:
from rdkit.Chem.AtomPairs import Sheridan
from rdkit.Chem import rdMolDescriptors
fps = []
# grab the molecules; we pull them out in their pickled (binary) form:
curs.execute('select molregno,mol_send(m) from mols limit 10000')
for molregno,pkl in curs.fetchall():
    if pkl is None: continue
    # construct a molecule
    m = Chem.Mol(pkl.tobytes())
    # now do our fingerprint. We're using Topological Torsions with Sheridan's binding properties
    # to define atom types
    fp = Sheridan.GetBTFingerprint(m,fpfn=rdMolDescriptors.GetHashedTopologicalTorsionFingerprintAsBitVect)
    fps.append((molregno,fp))

Now insert the fingerprints into the database. We do this by sending a byte string to the database using DataStructs.BitVectToBinaryText() on the Python side and then converting that into a bit vector on the database side with the function bfp_from_binary_text():

In [7]:
curs.execute('drop table if exists fps')
curs.execute('create table fps (molregno int, btfp bfp)')
curs.executemany('insert into fps values (%s,bfp_from_binary_text(%s))',
                 [(x,DataStructs.BitVectToBinaryText(y)) for x,y in fps])
conn.commit()

If this were a larger database I'd also create an index on the fingerprint column to speed up similarity searches, but since this one's small we'll skip that.
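For reference, here's roughly what building that index would look like; the cartridge supports GiST indexes on bfp columns (the index name is arbitrary):

curs.execute('create index fps_btfp_idx on fps using gist(btfp)')
conn.commit()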

And here's how you do a query. Note that % is the cartridge's Tanimoto-similarity operator (its cutoff is controlled by the rdkit.tanimoto_threshold setting); it appears as %% in the query string because psycopg2 uses % for parameter substitution:

In [8]:
fp = fps[-1][-1]
curs.execute('set rdkit.tanimoto_threshold=0.6')
curs.execute('select molregno,m from mols join fps using (molregno) where btfp%%bfp_from_binary_text(%s)',
             (DataStructs.BitVectToBinaryText(fp),))
res = curs.fetchall()
len(res)
Out[8]:
3

It's always good to check that we're getting the right answers, so let's verify by repeating the same search in Python:

In [9]:
len([x for (x,y) in fps if DataStructs.TanimotoSimilarity(y,fp)>=0.6])
Out[9]:
3

Pull back some similarity values:

In [10]:
curs.execute('select molregno,tanimoto_sml(btfp,bfp_from_binary_text(%s)) from fps limit 10',
             (DataStructs.BitVectToBinaryText(fp),))
curs.fetchall()
Out[10]:
[(23, 0.159090909090909),
 (97, 0.223880597014925),
 (115, 0.256410256410256),
 (146, 0.228571428571429),
 (147, 0.160714285714286),
 (148, 0.102941176470588),
 (173, 0.226415094339623),
 (194, 0.160714285714286),
 (213, 0.333333333333333),
 (205, 0.203125)]

And again, as a check, make sure that we get the same values in Python:

In [11]:
[(x,DataStructs.TanimotoSimilarity(y,fp)) for (x,y) in fps][:10]
Out[11]:
[(23, 0.1590909090909091),
 (97, 0.22388059701492538),
 (115, 0.2564102564102564),
 (146, 0.22857142857142856),
 (147, 0.16071428571428573),
 (148, 0.10294117647058823),
 (173, 0.22641509433962265),
 (194, 0.16071428571428573),
 (213, 0.3333333333333333),
 (205, 0.203125)]

That's it. Not particularly complicated, but still a useful thing to know how to do. Following this approach, any bit vector fingerprint (even ones generated outside of the RDKit) can be inserted into PostgreSQL tables and then searched using the RDKit cartridge.
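As a quick sketch of that last point (the bit list, table name, and molregno here are made up for illustration), an externally generated fingerprint just needs to end up in an ExplicitBitVect so that it can be serialized with DataStructs.BitVectToBinaryText():

# hypothetical 2048-bit fingerprint from an external tool, given as a list of set bits:
external_bits = [1, 17, 43, 128, 907]
bv = DataStructs.ExplicitBitVect(2048)
bv.SetBitsFromList(external_bits)
curs.execute('create table if not exists external_fps (molregno int, fp bfp)')
curs.execute('insert into external_fps values (%s,bfp_from_binary_text(%s))',
             (1, DataStructs.BitVectToBinaryText(bv)))
conn.commit()

Once the fingerprints are stored as bfp values, the % operator, tanimoto_sml(), and the GiST index all work exactly as they did above.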

Thursday, January 19, 2017

A great-looking option for doing easy parallel/distributed computation in python

[Updated 20.01.2017 to use parallel file reading]

A few years ago I blogged about using an early version of Blaze to do some interesting data manipulations. Since then I've been paying attention to the progress of the various pieces of the Blaze ecosystem and experimenting with them. I have some really interesting results from using Dask to work with large amounts of chemical data, but those are taking me a while to write up, so I haven't been blogging about this.

A few years ago I blogged about using an early version of Blaze to do some interesting data manipulations. Since then I've been paying attention to the progress of the various pieces of the Blaze ecosystem and experimenting with them. I have some really interesting results from using Dask to work with large amounts of chemical data, but those are taking me a while to write up, so I haven't blogged about them yet.
Today I managed to make a bit of time to do some experimentation with using Dask to do parallel computation. As usual, I want to be able to do that with molecules, but fortunately Dask's bag data structure makes that pretty easy.

Here's a short demo of that, without a lot of extra text:

Start by importing dask and the RDKit:
In [4]: import dask.bag as db
In [5]: from rdkit import Chem
Set up a file reader to pull in one of the ZINC15 tranches; the blocksize argument enables parallel file reading:
In [6]: text = db.read_text('zinc15_CD.rdk.smi', blocksize=int(1e7))
Check how many lines we have; this is the first time we actually do any work:
In [7]: text.count().compute()
Out[7]: 6008877
Create the substructure query we will use: 
In [10]: p = Chem.MolFromSmarts('c1nc(C)nc(O)c1')
Now create a bag containing the molecules:
In [12]: mbag = text.map(lambda x: Chem.MolFromSmiles(x.split()[0], sanitize=False))
And then filter the bag to only contain the molecules that match our substructure query:
In [13]: mms = mbag.filter(lambda x: x is not None).filter(lambda x,y=p: x.HasSubstructMatch(y))
Finally, pull out the results and look at how long it takes:
In [14]: import time
In [15]: t1 = time.time(); ms = mms.compute(); print(" runtime: %.2fs"%(time.time()-t1))
runtime: 49.92s
In [16]: len(ms)
Out[16]: 5739

This got all the cores on my desktop machine happily working away.

~50 seconds to load 6 million molecules from SMILES and do a substructure filter across them without doing any precomputation seems pretty good to me.

A warning to anyone who wants to play at home: you need to be careful about what you call compute() on. Doing mbag.compute() will cause dask to happily go off and try to load 6 million molecules at once. Unless you have a lot more RAM than I do, this is not going to make you happy.
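If you just want to peek at a few elements, bag.take() is the safer option, since it only computes as many partitions as it needs. A minimal sketch:

# pull the first three parsed molecules without materializing the whole bag:
some_mols = mbag.take(3)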

There's some really cool stuff going on in the Python data science world.