I had a couple of online conversations this week about working with the PostgreSQL cartridge from Python and sending molecules back and forth between Python and the database. Here's a quick blog post on that topic.
from rdkit import Chem
import gzip
import time
import psycopg2
import json
The dataset we'll use here is a set of molecules + activity data from ChEMBL26: all the measured Ki values that are less than 1 nM.
# here's how I constructed the dataset
conn2 = psycopg2.connect("dbname=chembl_26 host=localhost")
curs2 = conn2.cursor()
curs2.execute('''select cid1.chembl_id as compound_chembl_id,cid2.chembl_id as assay_chembl_id,
target_dictionary.chembl_id as target_chembl_id,target_dictionary.pref_name as pref_name,
standard_relation,standard_value,standard_units,standard_type,molfile
from activities acts
join assays using (assay_id)
join compound_structures using (molregno)
join chembl_id_lookup cid1 on (molregno=entity_id and entity_type='COMPOUND')
join chembl_id_lookup cid2 on (assay_id=cid2.entity_id and cid2.entity_type='ASSAY')
join target_dictionary using (tid)
where standard_type='Ki' and standard_units='nM' and standard_value is not null
and standard_relation='=' and standard_value<1''')
data = curs2.fetchall()
cnames = [x.name for x in curs2.description]
w = Chem.SDWriter(gzip.open('/home/glandrum/RDKit_blog/data/chembl26_very_active.sdf.gz','wt+'))
for row in data:
    m = Chem.MolFromMolBlock(row[-1])
    for i in range(len(cnames)-1):
        m.SetProp(cnames[i],str(row[i]))
    w.write(m)
w = None
Let's start by assembling a benchmarking set of 30K molblocks:
molblocks = []
nms = []
with gzip.open('../data/chembl26_very_active.sdf.gz','r') as inf:
    suppl = Chem.ForwardSDMolSupplier(inf)
    while len(molblocks)<30000:
        m = next(suppl)
        if not m:
            continue
        nms.append(m.GetProp('compound_chembl_id'))
        molblocks.append(Chem.MolToMolBlock(m))
How long does it take to parse all the mol blocks on the Python side?
t1 = time.time()
ms = [Chem.MolFromMolBlock(mb) for mb in molblocks]
t2 = time.time()
print(f" that took {t2-t1 :.2f} seconds")
What about doing the same work in the database?
conn = psycopg2.connect("dbname=demodb host=localhost")
curs = conn.cursor()
curs.execute('drop table if exists molbs')
curs.execute('drop table if exists mols')
curs.execute('create table molbs (chembl_id text,molb text)')
curs.executemany('insert into molbs values (%s,%s)',[(x,y) for x,y in zip(nms,molblocks)])
t1 = time.time()
curs.execute('select chembl_id,mol_from_ctab(molb::cstring) m into mols from molbs')
conn.commit()
t2 = time.time()
print(f" that took {t2-t1 :.2f} seconds")
Notice that we also had to transfer the mol blocks to the database. I didn't include that in the timing results because we're just looking at processing time.
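If you're curious what that transfer costs, here's a minimal sketch that times just the insert step (it reuses the connection from above and re-fills the molbs table):
# time only the transfer of the mol blocks to the database
curs.execute('delete from molbs')
t1 = time.time()
curs.executemany('insert into molbs values (%s,%s)',list(zip(nms,molblocks)))
conn.commit()
t2 = time.time()
print(f" that took {t2-t1 :.2f} seconds")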
Sending binary molecules to the database
It seems silly to do the work of processing the mol blocks in the database a second time. Fortunately, we can add the molecules to the database in RDKit's binary form:
conn = psycopg2.connect("dbname=demodb host=localhost")
curs = conn.cursor()
curs.execute('drop table if exists mols')
curs.execute('create table mols (chembl_id text,m mol)')
t1 = time.time()
curs.executemany('insert into mols values (%s,mol_from_pkl(%s))',[(x,y.ToBinary(),) for x,y in zip(nms,ms)])
conn.commit()
t2 = time.time()
print(f" that took {t2-t1 :.2f} seconds")
Retrieving binary molecules from the database
What about going the other way: we have binary molecules in the database and want to pull them back into Python to work with them?
conn = psycopg2.connect("dbname=demodb host=localhost")
curs = conn.cursor()
t1 = time.time()
curs.execute('select chembl_id,mol_to_pkl(m) from mols')
tms = [Chem.Mol(x[1].tobytes()) for x in curs.fetchall()]
t2 = time.time()
print(f" that took {t2-t1 :.2f} seconds")
We can, of course, do searches in the database and then pull just the molecules from the search results into Python:
conn = psycopg2.connect("dbname=demodb host=localhost")
curs = conn.cursor()
t1 = time.time()
curs.execute('select chembl_id,mol_to_pkl(m) from mols where m@>mol_from_smarts(%s)',('c1ncn[o,n]1',))
tms = [Chem.Mol(x[1].tobytes()) for x in curs.fetchall()]
t2 = time.time()
print(f" that took {t2-t1 :.2f} seconds and returned {len(tms)} results")
Example 1: adding descriptors to the database
The cartridge can calculate a number of molecular descriptors directly, but there are more available in Python. Let's calculate some of those and add them to the database.
We'll pull the binary molecules from the database, calculate BCUT2D descriptors for them, and then add the descriptors to a new database table.
from rdkit.Chem import Descriptors
from rdkit.Chem import rdMolDescriptors
nBCuts = len(rdMolDescriptors.BCUT2D(Chem.MolFromSmiles('c1ccccc1')))
descrdefn = ','.join(f'bcut_{i+1} float' for i in range(nBCuts))
descrholder = ','.join(['%s']*nBCuts)
conn = psycopg2.connect("dbname=demodb host=localhost")
curs = conn.cursor()
curs.execute('drop table if exists bcuts')
curs.execute(f'create table bcuts (chembl_id text,{descrdefn})')
curs.execute('select chembl_id,mol_to_pkl(m) from mols')
bcut_data = []
for row in curs.fetchall():
trow = [row[0]]
mol = Chem.Mol(row[1].tobytes())
try:
descrs = rdMolDescriptors.BCUT2D(mol)
except ValueError:
continue
trow.extend(descrs)
bcut_data.append(trow)
cmd = f'insert into bcuts values (%s,{descrholder})'
curs.executemany(cmd,bcut_data)
conn.commit()
Since the BCUT descriptors use partial charges and we don't have charge parameters for all atom types, some molecules don't have values:
curs.execute('select count(*) from bcuts')
curs.fetchone()
curs.execute('select count(*) from mols')
curs.fetchone()
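Now that the descriptors are in their own table, we can use them in queries. Here's a sketch that joins bcuts back to the mols table and filters on the first BCUT value; the cutoff of 13 is arbitrary, just for illustration:
# use the new descriptor table in a query against the molecules
curs.execute('''select distinct chembl_id from mols
  join bcuts using (chembl_id)
  where bcut_1 > 13 limit 5''')
curs.fetchall()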
Example 2: Loading SDF data into the database
We loaded the molecules from an SDF, but ignored the data fields in that SDF. Now let's load those into the database too.
We'll take advantage of PostgreSQL's jsonb type to store the properties on each molecule in a dictionary-like object.
Let's start by loading the data:
conn = psycopg2.connect("dbname=demodb host=localhost")
curs = conn.cursor()
curs.execute('drop table if exists mols')
curs.execute('create table mols (chembl_id text,m mol,sdf_data jsonb)')
rows = []
with gzip.open('../data/chembl26_very_active.sdf.gz','r') as inf:
    suppl = Chem.ForwardSDMolSupplier(inf)
    while len(rows)<30000:
        m = next(suppl)
        if not m:
            continue
        nm = m.GetProp('compound_chembl_id')
        props = m.GetPropsAsDict()
        rows.append((nm,m.ToBinary(),json.dumps(props)))
curs.executemany('insert into mols values (%s,mol_from_pkl(%s),%s)',rows)
conn.commit()
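Before querying, let's peek at what ended up in the sdf_data column for one row; psycopg2 deserializes the jsonb value into a Python dict for us:
# look at the property names stored for the first molecule
curs.execute('select chembl_id,sdf_data from mols limit 1')
row = curs.fetchone()
print(row[0],sorted(row[1].keys()))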
Here's how to do a string query:
curs.execute("select count(*) from mols where sdf_data->>'pref_name' = 'Human immunodeficiency virus type 1 protease'")
curs.fetchone()
And here's a query on a floating-point value: we're counting the number of rows where the Ki value is less than 10 picomolar (the standard_value column is in nM, so the cutoff is 0.01).
curs.execute("select count(*) from mols where (sdf_data->>'standard_value')::float < 0.01")
curs.fetchone()
Get all the rows with measurements against HIV-1 protease where Ki < 10 picomolar:
curs.execute("select chembl_id,m from mols where sdf_data->>'pref_name' = 'Human immunodeficiency virus type 1 protease'\
and (sdf_data->>'standard_value')::float < 0.01")
d = curs.fetchall()
d[:2]
Notice that we get CXSMILES back from the database. We loaded the compounds from an SDF, so the molecules have coordinates, and those coordinates are included in the CXSMILES.
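Since recent RDKit versions parse CXSMILES extensions by default, we can get those coordinates back on the Python side too. A quick sketch using the first result:
tm = Chem.MolFromSmiles(d[0][1])   # the CXSMILES coordinates become a conformer
print(tm.GetNumConformers())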
Let's do one last query to look at the composition of the dataset:
curs.execute("select sdf_data->>'pref_name',count(distinct(chembl_id)) cnt \
from mols group by (sdf_data->>'pref_name') order by cnt desc limit 20")
curs.fetchall()
I think using JSONB this way, which basically lets us combine a relational database with a document store, opens up a bunch of interesting possibilities. I'll try to do another blog post on that in the near(ish) future.