I got a copy of Thing Explainer a while ago and I think it's pure genius (not that surprising from Randall Munroe). With my thinking infected by reading the book, the other day I wondered what the ten hundred most common "chemical words" are.
After a bit of extremely deep thinking I decided that I would be willing to accept that a "chemical word" is the kind of fragment that comes out of something like BRICS (or RECAP) decomposition of a molecule.
This post takes that simple idea and applies it to a big database of molecules: the almost 110 million molecules that are in the excellent ZINC15.
NOTE: The HTML for this post was too much for blogger to handle, so this is a somewhat truncated version. The original notebook is accessible here.
from rdkit import Chem
from rdkit.Chem import BRICS
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
%pylab inline
print(rdBase.rdkitVersion)
import os,time,glob
print( time.asctime())
Background on BRICS
Before diving into the analysis, a bit of explanation about BRICS.
Let's start with the reference: Degen, J., Wegscheid-Gerlach, C., Zaliani, A. & Rarey, M. "On the Art of Compiling and Using ‘Drug-Like’ Chemical Fragment Spaces". ChemMedChem 3:1503–07 (2008).
The idea is to fragment molecules based on synthetically accessible bonds. Here's an example starting from a random ZINC molecule, ZINC00002139:
m = Chem.MolFromSmiles('CC[NH+](CC)CCOc1ccc(cc1)Cc2ccccc2')
m
And here are the fragments from doing a BRICS analysis:
pieces = [Chem.MolFromSmiles(x) for x in BRICS.BRICSDecompose(m)]
Draw.MolsToGridImage(pieces,molsPerRow=4)
The isotope labels on the molecules provide information about the type of bond that was broken and can be used later (using BRICS.BRICSBuild()) to build new molecules. For example, the rules allow a bond to be formed between an atom connected to a [4*] and an atom connected to a [5*].
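Just to show the recombination in action, here's a minimal sketch (the variable names are mine) that feeds the pieces from the decomposition above back into the builder and draws the first couple of results:
import itertools
build_gen = BRICS.BRICSBuild(pieces)
prods = list(itertools.islice(build_gen,2)) # take up to two products
for p in prods:
    Chem.SanitizeMol(p) # BRICSBuild yields unsanitized molecules
Draw.MolsToGridImage(prods,molsPerRow=2)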
One important characteristic (I hesitate to use "feature" in this context) of the current RDKit BRICS implementation that impacts this analysis is that only unique fragments are returned. So we only get one [4*]CC fragment for that molecule above, even though it appears in the molecule twice.
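It's easy to see this in the raw output; [4*]CC shows up just once even though the molecule has two N-ethyl groups:
sorted(BRICS.BRICSDecompose(m))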
So the analysis below is of the most common chemical words, but we only count frequency based on the number of molecules they appear in, not how often they appear in those molecules.
The computation
We're going to use some parallel processing, so bring that stuff in:
from ipyparallel import Client
rc = Client()
dview = rc[:]
with dview.sync_imports():
    from rdkit import Chem
    from rdkit.Chem import BRICS
    import dbm,struct,os
The input is the SMILES files for the individual ZINC tranches:
infs = glob.glob('/scratch/RDKit_git/Data/Zinc15/split_files/*.smi')
print(len(infs))
infs[:5]
def process_file(fname):
    """
    This runs through the molecules in a SMILES file and keeps track of the BRICS
    fragments that they produce and the counts of those fragments.
    The results are stored in a dbm database.
    """
    outfname = os.path.splitext(fname)[0] + '.dbm'
    res = 0
    with open(fname,'r') as inf:
        inf.readline() # header
        nDone=0
        with dbm.open(outfname,flag='n') as db:
            for inl in inf:
                m = Chem.MolFromSmiles(inl.split()[0])
                nDone += 1
                if m is None or m.GetNumHeavyAtoms()>60: continue
                s = BRICS.BRICSDecompose(m)
                for entry in s:
                    # counts are stored as packed unsigned ints; missing keys start at zero
                    cnt = struct.unpack('I',db.get(entry,b'\0\0\0\0'))[0]+1
                    db[entry] = struct.pack('I',cnt)
            res = len(db.keys())
    return res
Let's get a sense of how long this might take by running one of the really small files:
import time
t1 = time.time();process_file('/scratch/RDKit_git/Data/Zinc15/split_files/zinc15_AB.smi');print("%.2f"%(time.time()-t1))
Tranche AB only has 177404 molecules (and they are small molecules), but that already takes >5 minutes. Scaled up to almost 110 million molecules, doing the full dataset is going to take a long time...
We run the whole thing like this:
counts = dview.map_sync(process_file,infs)
However, this is a big dataset (about 110 million molecules) and that takes a couple of days to run on my machine (using up to 6 threads). So this isn't really interactive. I started it and then came back in a couple of days to work up the data.
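This isn't what I actually ran, but if you'd rather poll progress than block for days, ipyparallel's async map supports that (a sketch):
amr = dview.map_async(process_file,infs)
amr.wait_interactive() # prints progress as tasks complete
counts = amr.get()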
from collections import defaultdict
all_counts = defaultdict(int)
for i,fname in enumerate(infs):
    print(i+1,fname,len(all_counts.keys()))
    dbname = os.path.splitext(fname)[0] + '.dbm'
    with dbm.open(dbname,flag='r') as db:
        for key,val in db.items():
            all_counts[key]+=struct.unpack('I',val)[0]
import pickle
pickle.dump(all_counts,open('../data/chemical_words_all_counts.pkl','wb+'))
itms = sorted(((y,x) for x,y in all_counts.items()),reverse=True)
len(itms)
So we got 1.6 million unique fragments from the dataset.
Let's look at the most and least frequent fragments:
itms[:10]
itms[-10:]
It's nicer to actually look at the molecules:
tc,ts = zip(*itms[:10])
Draw.MolsToGridImage([Chem.MolFromSmiles(x.decode('UTF-8')) for x in ts],molsPerRow=4,legends=[str(x) for x in tc])
tc,ts = zip(*itms[-10:])
Draw.MolsToGridImage([Chem.MolFromSmiles(x.decode('UTF-8')) for x in ts],molsPerRow=4,legends=[str(x) for x in tc])
Ew... those aren't particularly nice. Good thing we won't be doing much with them.
Look at the number of times each frequency appears:
import math
def summarize(pts,maxCount=1e7):
    # bin fragments by the log10 of their frequency
    bins=[0]*int(math.ceil(math.log10(maxCount)))
    for cnt,smi in pts:
        if cnt>maxCount:
            bins[-1]+=1
        else:
            bins[int(math.floor(math.log10(cnt)))]+=1
    return bins
bins=summarize(itms)
scatter(list(range(len(bins))),bins,lw=0)
_=yscale('log')
_=xlabel('log10(freq)')
_=ylabel('count')
Wikipedia says I'm not supposed to call that a power-law distribution, so I'll go with "power-law-like". Is that a thing?
Let's clean up the data a little bit: remove fragments without attachment points and then get rid of the isotopic labels that describe what type of bond was broken:
import re
expr = re.compile(r'[0-9]+\*')
clean_counts = defaultdict(int)
nRejected=0
for k,v in all_counts.items():
    k = k.decode('UTF-8')
    if k.find('*')<0:
        nRejected +=1
        continue
    k = expr.sub('*',k)
    clean_counts[k]+=v
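As a quick sanity check, here's what that substitution does to a labeled fragment:
expr.sub('*','[4*]CC') # -> '[*]CC'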
clean_itms = sorted([(v,k) for k,v in clean_counts.items()],reverse=True)
print(len(clean_itms))
clean_itms[:10]
Most frequent fragments:
tc,ts = zip(*clean_itms[:10])
Draw.MolsToGridImage([Chem.MolFromSmiles(x) for x in ts],molsPerRow=4,legends=[str(x) for x in tc])
Least frequent:
clean_itms[-10:]
tc,ts = zip(*clean_itms[-10:])
Draw.MolsToGridImage([Chem.MolFromSmiles(x) for x in ts],molsPerRow=4,legends=[str(x) for x in tc])
Ok, so what about the 1000 most common BRICS fragments?
thousand = clean_itms[:1000]
scatter(range(1000),[x[0] for x in thousand],lw=0)
_=yscale('log')
_=ylabel('freq')
That's a steep drop off, but even the least common "word" appears >31000 times in our dataset:
thousand[-1]
How many attachment points do those fragments have?
_=hist([y.count('*') for x,y in thousand])
Who's got 4?
tmp = [x for x in thousand if x[1].count('*')==4]
tc,ts = zip(*tmp)
Draw.MolsToGridImage([Chem.MolFromSmiles(x) for x in ts],molsPerRow=4,legends=[str(x) for x in tc])
What about 3?
SNIP

There are some examples in there that look like duplicates. That's just happening because the removal of the isotopic labels has made some stereocenters look like they are no longer stereocenters when the SMILES is processed:
[x for x in thousand if x[1].count('*')==3][10:12]
We can do at least a little bit about that by sanitizing the molecules differently:
# tmp is the set of three-attachment-point fragments (defined in the cell snipped above):
tmp = [x for x in thousand if x[1].count('*')==3]
tc,ts = zip(*tmp)
tms = [Chem.MolFromSmiles(x,sanitize=False) for x in ts]
[Chem.SanitizeMol(x) for x in tms]
Draw.MolsToGridImage(tms,molsPerRow=4,legends=[str(x) for x in tc])
Still, this does highlight a minor problem with the simple string-manipulation approach to removing the isotopes from the fragments. Let's fix that problem by actually re-building the molecules. This time we'll remove stereochemistry that is no longer present when the isotope labels are removed from the dummies.
We've got a bunch of these, so this takes a while:
clean_counts2 = defaultdict(int)
nRejected=0
for k,v in all_counts.items():
k = k.decode('UTF-8')
if k.find('*')<0:
nRejected +=1
continue
k = Chem.MolToSmiles(Chem.MolFromSmiles(expr.sub('*',k)),True)
clean_counts2[k]+=v
import pickle
pickle.dump(clean_counts2,open('../data/chemical_words_clean_counts2.pkl','wb+'))
clean_itms2 = sorted([(v,k) for k,v in clean_counts2.items()],reverse=True)
print(len(clean_itms2),len(clean_itms))
thousand = clean_itms2[:1000]
So we did remove a few.
Let's look at what's left:
tmp = [x for x in thousand if x[1].count('*')==3]
tc,ts = zip(*tmp)
Draw.MolsToGridImage([Chem.MolFromSmiles(x) for x in ts],molsPerRow=4,legends=[str(x) for x in tc])
Write out our thousand common chemical words; this file will be in the github repo:
outf=open('../data/thousand_words.no_iso.smi','w+')
for tc,ts in thousand:
    print(ts,tc,file=outf)
I'm going to want to use the ones that are still isotopically labeled later, so write those out too. This file will also be in the github repo:
t_itms = [(v,k.decode('UTF-8')) for k,v in all_counts.items()]
iso_itms = sorted([(v,k) for v,k in t_itms if k.count('*')>0],reverse=True)
print(len(iso_itms))
thousand_iso = iso_itms[:1000]
outf=open('../data/thousand_words.iso.smi','w+')
for tc,ts in thousand_iso:
    print(ts,tc,file=outf)
As a teaser, let's make some new molecules:
import random
thousand_iso_frags = [Chem.MolFromSmiles(y) for x,y in thousand_iso]
new_mols = []
# the building algorithm tends to explore around a given point,
# so let's pick a number of starting points and enumerate a few molecules
# around each of them:
for i in range(5):
    random.seed(0x1234+i)
    brics_gen = BRICS.BRICSBuild(thousand_iso_frags)
    for tm in [next(brics_gen) for x in range(4)]:
        try:
            Chem.SanitizeMol(tm)
        except:
            continue
        new_mols.append(tm)
Draw.MolsToGridImage(new_mols,molsPerRow=4)
That's enough for now...