Thursday, March 3, 2016

Capturing error information

When the RDKit has problems processing a molecule, it outputs information to the error console about what those problems were. Here's an example:

In [23]: m = Chem.MolFromSmiles('CO(C)C')
[06:18:04] Explicit valence for atom # 1 O, 3, is greater than permitted

It's sometimes useful to have programmatic access to this information for later use in reporting.

It also would be great if these types of messages were visible in the jupyter notebook.

Brian Kelley recently added functionality to the RDKit to enable both of these things. For anyone interested, the two pull requests for those changes are: #736 and #739.

This is a short note on how to take advantage of that.

A couple of things to note:

  • This is currently in git and will be available in the 2016.03 release.
  • This post was written using Python3, some adaptation would be required for Python2.

Let's start by showing the standard state of affairs:

In [2]:
from rdkit import Chem
from rdkit import rdBase
print(rdBase.rdkitVersion)
2016.03.1.dev1
In [3]:
m = Chem.MolFromSmiles('CO(C)C')
m

At this point there is an error message in the console I launched Jupyter from, but it sure would be nice if it were visible here.

We can enable that by just importing the usual RDKit Jupyter integration code:

In [4]:
from rdkit.Chem.Draw import IPythonConsole
In [6]:
m = Chem.MolFromSmiles('CO(C)C')
RDKit ERROR: [06:46:02] Explicit valence for atom # 1 O, 3, is greater than permitted
In [7]:
Chem.MolFromSmiles('c1cc1')
RDKit ERROR: [06:46:11] Can't kekulize mol 
RDKit ERROR: 
In [8]:
Chem.MolFromSmiles('c1')
RDKit ERROR: [06:46:18] SMILES Parse Error: unclosed ring for input: 'c1'
In [9]:
Chem.MolFromSmiles('Ch')
RDKit ERROR: [06:46:41] SMILES Parse Error: syntax error for input: 'Ch'

So far so good. What if I want to have access to the error messages as strings in Python?

In [13]:
from io import StringIO
import sys
Chem.WrapLogs()
In [15]:
sio = sys.stderr = StringIO()
Chem.MolFromSmiles('Ch')
print("error message:",sio.getvalue())
error message: RDKit ERROR: [06:49:14] SMILES Parse Error: syntax error for input: 'Ch'

I can use this to write a bit of code that processes all of the molecules in an SDF and captures the errors:

In [16]:
def readmols(suppl):
    ok=[]
    failures=[]
    sio = sys.stderr = StringIO()
    for i,m in enumerate(suppl):
        if m is None:
            failures.append((i,sio.getvalue()))
            sio = sys.stderr = StringIO() # reset the error logger
        else:
            ok.append((i,m))
    return ok,failures
In [18]:
import gzip,os
from rdkit import RDConfig
inf = gzip.open(os.path.join(RDConfig.RDDataDir,'PubChem','Compound_000200001_000225000.sdf.gz'))
suppl = Chem.ForwardSDMolSupplier(inf)
ok,failures = readmols(suppl)
In [19]:
for i,fail in failures:
    print(i,fail)
2035 RDKit ERROR: [07:31:28] Explicit valence for atom # 0 Br, 5, is greater than permitted
RDKit ERROR: [07:31:28] ERROR: Could not sanitize molecule ending on line 404864

11460 RDKit ERROR: [07:31:28] ERROR: Explicit valence for atom # 0 Br, 5, is greater than permitted
RDKit ERROR: [07:31:32] Explicit valence for atom # 2 Te, 4, is greater than permitted
RDKit ERROR: [07:31:32] ERROR: Could not sanitize molecule ending on line 2344967

17016 RDKit ERROR: [07:31:32] ERROR: Explicit valence for atom # 2 Te, 4, is greater than permitted
RDKit ERROR: [07:31:34] Explicit valence for atom # 1 Br, 5, is greater than permitted
RDKit ERROR: [07:31:34] ERROR: Could not sanitize molecule ending on line 3489884

In [ ]:
 

2 comments:

Paul Emsley said...

This is just what I wanted to find! (Parsing CCD SMILES).

freedom said...

This feature is very helpful for me.