Saturday, January 31, 2015

Chemical Reaction Notes I

A question came up on the mailing list about using the RDKit ChemicalReaction functionality to replace Hs on an aromatic ring with fluorines. Here's an expanded version of my answer, which may be generally useful.

There's also a (very) brief introduction to recursive SMARTS at the bottom.

In [1]:
from __future__ import print_function
from rdkit import rdBase,Chem
from rdkit.Chem import AllChem,Draw
from rdkit.Chem.Draw import IPythonConsole
import time
print(rdBase.rdkitVersion)
print(time.asctime())
2015.03.1pre
Sat Jan 31 10:46:36 2015

Here's the reaction from the original question and an example of using it:

In [2]:
rxn = AllChem.ReactionFromSmarts('[c:1][c:2][c:3]>>[c:1][c:2]([F])[c:3]')
In [3]:
m = Chem.MolFromSmiles('c1ccccc1')
ps = rxn.RunReactants((m,))
print("Num products:",len(ps))
Num products: 12

We can get SMILES for the uniq products using a set:

In [4]:
uniq = set([Chem.MolToSmiles(x[0],isomericSmiles=True) for x in ps])
print("Unique products:",uniq)
Unique products: {'Fc1ccccc1'}

This reaction doesn't do exactly what we want though - it's just encoding addition of an F to an aromatic carbon. We can see the problem if we run the reaction on toluene:

In [5]:
m = Chem.MolFromSmiles('Cc1ccccc1')
ps = rxn.RunReactants((m,))
print("Num products:",len(ps))
uniq = set([Chem.MolToSmiles(x[0],isomericSmiles=True) for x in ps])
print("Unique products:",sorted(uniq))
Num products: 12
Unique products: ['Cc1(F)ccccc1', 'Cc1ccc(F)cc1', 'Cc1cccc(F)c1', 'Cc1ccccc1F']

That first product is certainly not what we intended.

Technical detail: I've used sorted(uniq) above because the order of the contents of a set is not guaranteed to be consistent in Python3.

We can fix the problem by specifying that the reacting atom needs to have an H on it:

In [6]:
rxn = AllChem.ReactionFromSmarts('[c:1][cH:2][c:3]>>[c:1][c:2]([F])[c:3]')

m = Chem.MolFromSmiles('Cc1ccccc1')
ps = rxn.RunReactants((m,))
print("Num products:",len(ps))
uniq = set([Chem.MolToSmiles(x[0],isomericSmiles=True) for x in ps])
print("Unique products:",sorted(uniq))
Num products: 10
Unique products: ['Cc1ccc(F)cc1', 'Cc1cccc(F)c1', 'Cc1ccccc1F']

That's much better, but why are we getting so many products?

The answer to this is the symmetry of the query defining the reactant. Let's demonstrate this with a somewhat simpler example:

In [7]:
m = Chem.MolFromSmiles('c1c(C)c(C)c(C)c(C)c1C')
ps = rxn.RunReactants((m,))
print("Num products:",len(ps))
uniq = set([Chem.MolToSmiles(x[0],isomericSmiles=True) for x in ps])
print("Unique products:",sorted(uniq))
Num products: 2
Unique products: ['Cc1c(C)c(C)c(F)c(C)c1C']

Now there's only one carbon where the reaction can occur.

The reacant query matches this carbon twice - once "forward" and once "backward":

In [8]:
list(m.GetSubstructMatches(rxn.GetReactantTemplate(0),uniquify=False))
Out[8]:
[(1, 0, 9), (9, 0, 1)]

Taking a second look at the reaction, we can see that atoms [c:1] and [c:3] don't actually participate in the reaction; they are just there to define the environment around [c:2]. Realizing this, we can rewrite the reaction to only include a single reactant atom whose environment is specified using a recursive SMARTS:

In [9]:
rxn = AllChem.ReactionFromSmarts('[cH&$(c(c)c):2]>>[c:2][F]')

m = Chem.MolFromSmiles('c1c(C)c(C)c(C)c(C)c1C')
ps = rxn.RunReactants((m,))
print("Num products:",len(ps))
uniq = set([Chem.MolToSmiles(x[0],isomericSmiles=True) for x in ps])
print("Unique products:",sorted(uniq))
Num products: 1
Unique products: ['Cc1c(C)c(C)c(F)c(C)c1C']

That's what we were looking for.

At this point anyone not pretty familiar with SMARTS is probably scratching their head and thinking that [cH&$(c(c)c):2] looks more like line noise than a sensible query.

Since recursive SMARTS are one of the most powerful, and useful, features of SMARTS, I'll close with a brief description of how it works.

A (very) brief introduction to recursive SMARTS

A normal SMARTS pattern is made up of a set of atoms, each of which must match an atom in the target molecule.

So a three atom query like this one:

In [10]:
q3 = Chem.MolFromSmarts('OC=C')

Matches three atoms in a target molecule:

In [11]:
list(Chem.MolFromSmiles('OC=CCC(C=COC)CO').GetSubstructMatches(q3))
Out[11]:
[(0, 1, 2), (7, 6, 5)]

Aside about the meaning of those results: the query matches twice, each match has two atoms.

A single atom query matches one atom:

In [12]:
q1 = Chem.MolFromSmarts('O')
list(Chem.MolFromSmiles('OC=CCC(C=COC)CO').GetSubstructMatches(q1))
Out[12]:
[(0,), (7,), (10,)]

A recursive SMARTS provides the ability to define an atom environment. So we can define a query that matches a single atom but that constrains the environment of that atom. This is done by including a full SMARTS pattern inside $().

Here's an example using the queries above, each of the matches contains a single atom:

In [13]:
qr = Chem.MolFromSmarts('[$(OC=C)]')
list(Chem.MolFromSmiles('OC=CCC(C=COC)CO').GetSubstructMatches(qr))
Out[13]:
[(0,), (7,)]

The recursive query can be combined with other query features to further constrain the match:

In [14]:
qr = Chem.MolFromSmarts('[OH&$(OC=C)]')
list(Chem.MolFromSmiles('OC=CCC(C=COC)CO').GetSubstructMatches(qr))
Out[14]:
[(0,)]

It is important to realize that the recursive SMARTS matches the first atom in the recursive query, so re-ordering it leads to different results:

In [15]:
qr = Chem.MolFromSmarts('[$(C=CO)]')
list(Chem.MolFromSmiles('OC=CCC(C=COC)CO').GetSubstructMatches(qr))
Out[15]:
[(2,), (5,)]

That one matches the carbon atoms.

Here we get no matches, because we ask for an atom that's both an [OH] and a [$(C=CO)]:

In [16]:
qr = Chem.MolFromSmarts('[OH&$(C=CO)]')
list(Chem.MolFromSmiles('OC=CCC(C=COC)CO').GetSubstructMatches(qr))
Out[16]:
[]

Many thanks to Matthew Lardy for the question that inspired this.