Monday, April 27, 2020

Possible Google Season of Docs Projects

We're planning to apply to be a mentoring organization for Google Season of Docs this year.

The primary instigator/organizer of this is Vincent Scalfani. For people who may not have "met" him yet, Vin is the one who wrote the fantastic new version of the RDKit Cookbook and has made a number of other contributions to improve the RDKit documentation.

Here's Vin's anouncement about our planned participation:
https://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg09805.html

And here are the ideas we're currently thinking of. I will try and keep this post up to date as our thinking evolves. If you have other ideas or suggestions, please get in touch!

Project Name: Expansion of the RDKit Book to Outline Additional Methods Implemented




Description: The RDKit Book serves to describe and outline supported features available within RDKit [1]. The Book offers a high level overview of supported features for users. Some examples include supported molecular file formats, molecular descriptor calculation methods, and chemical reaction support. There are many supported methods that are not yet documented in the RDKit Book such as additional file format reading support, available molecular descriptor/fingerprint calculations (e.g., Morse atom fingerprints, Coulomb Matrices, QED descriptor), and chemical validation/standardization (MolVS) integration. This project would inventory the available features mentioned in the RDKit Release notes [2] that are not yet described in the RDKit Book. These additional features would then be added to the RDKit book with a description, example, and links to the original GitHub pull request that added the feature. Moreover, related scientific literature references and a link to the API docs describing the module can also be added. Further, it would be useful to update existing feature methods currently in the RDKit Book with links to the GitHub code, related literature, and API docs, where possible.
Related Material: 
[2] RDKit Release Notes: https://github.com/rdkit/rdkit/releases


Project Name:  A Guide for Reading and Making the Most of the RDKit Python API Docs

Description: The RDKit Python API Documentation serves as the comprehensive reference guideto the available Python modules for accessing RDKit functionality with Python [1]. This reference work can be intimidating and confusing to new users. A highly useful contribution to the RDKit documentation would be to create a guide on how to read and use the RDKit Python API documentation. This could include:
  1. A summary overview of the different packages/subpackages along with an explanation of the syntax and instructions for importing the appropriate function in a Python script (see the SciPy API docs as an example [2])
  2. A graphical depiction of the API structure
  3. A worked example of using a particular module and demonstrating how to import the module, use a particular class, and specify options
  4. An explanation of what the C++ signatures mean and why these can be useful within the context of the Python API.
  5. Addition of links to the source code (see for example the Deepchem API docs [3])
Related Material: 
[1] RDKit Python API Reference: https://www.rdkit.org/docs/api-docs.html
[3] DeepChem API Docs: https://www.deepchem.io/docs/deepchem.html 



Project Name: RDKit and Pandas Book

Description: RDKit integrates with the Python Pandas data analysis and data structures tools [1]. This integration allows users to store and analyze molecules in convenient data frames [2]. There is not yet a detailed formal documentation guide to using RDKit with Pandas. However, there are individual code examples scattered throughout Stack Overflow [3], the RDKit mailing lists [4], and GitHub [5]. It would be useful to compile and harmonize these examples into an in-depth RDKit and Pandas Documentation Book. Similarly to other RDKit Documentation, the RDKit and Pandas Book would be written in reStructuredText with Sphinx doctest to allow testing of code snippets, where possible.

Related Material:
[1] https://pandas.pydata.org/
[2] https://www.rdkit.org/docs/source/rdkit.Chem.PandasTools.html;
[3] https://stackoverflow.com/search?q=rdkit+pandas;
[4] https://www.mail-archive.com/search?q=pandas&l=rdkit-discuss%40lists.sourceforge.net;
[5] https://github.com/rdkit/rdkit-tutorials/blob/master/notebooks/004_RDKit_pandas_support.ipynb

Sunday, April 26, 2020

New drawing options in the 2020.03 release

The 2020.03 release includes a set of significant improvements to the RDKit molecule drawing code. If you're interested in the history, the original issue is here.

The work for this was done by Dave Cosgrove and it was funded by Medchemica (the changes tracked in that github issue), and T5 Informatics (atom and bond annotations). Many thanks to Al and Ed at Medchemica for the funding and to Dave for doing the work!

As an aside to people in companies that are using the RDKit: this is a good model for how to get specific improvements/additions made to the RDKit. You may not have developers working in your organization, but you can fund a third party to do the work. This approach is most effective when there's some communication with the RDKit development team before starting (that's what Dave did with the ticket above).

In [1]:
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdMolDraw2D
from rdkit.Chem import rdDepictor
rdDepictor.SetPreferCoordGen(True)
from rdkit.Chem.Draw import IPythonConsole
from IPython.display import SVG
import rdkit
print(rdkit.__version__)
2020.03.1

Let's start by drawing a molecule using a MolDraw2DSVG() object:

In [2]:
diclofenac = Chem.MolFromSmiles('O=C(O)Cc1ccccc1Nc1c(Cl)cccc1Cl')
In [3]:
d2d = rdMolDraw2D.MolDraw2DSVG(350,300)
d2d.DrawMolecule(diclofenac)
d2d.FinishDrawing()
SVG(d2d.GetDrawingText())
Out[3]:
O HO HN Cl Cl

Atom and bond indices

We can easily add atom indices to the drawing:

In [4]:
d2d = rdMolDraw2D.MolDraw2DSVG(350,300)
d2d.drawOptions().addAtomIndices=True
d2d.DrawMolecule(diclofenac)
d2d.FinishDrawing()
SVG(d2d.GetDrawingText())
Out[4]:
O HO HN Cl Cl 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

or bond indices:

In [5]:
d2d = rdMolDraw2D.MolDraw2DSVG(350,300)
d2d.drawOptions().addBondIndices=True
d2d.DrawMolecule(diclofenac)
d2d.FinishDrawing()
SVG(d2d.GetDrawingText())
Out[5]:
O HO HN Cl Cl 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

It's also possible to show both atom and bond indices, but this tends to be a bit crowded.

Atom and bond annotations

It's also possible to add user-provided annotations to either atoms or bonds by setting atom/bond properties:

In [6]:
cp = Chem.Mol(diclofenac)
cp.GetAtomWithIdx(2).SetProp("atomNote","pKa=4.0")
d2d = rdMolDraw2D.MolDraw2DSVG(350,300)
d2d.drawOptions().setHighlightColour((0.8,0.8,0.8))
d2d.DrawMolecule(cp,highlightAtoms=[2])
d2d.FinishDrawing()
SVG(d2d.GetDrawingText())
Out[6]:
O HO HN Cl Cl pKa=4.0
In [7]:
cp = Chem.Mol(diclofenac)
cp.GetBondWithIdx(0).SetProp("bondNote","note")
cp.GetBondWithIdx(1).SetProp("bondNote","note 2")
d2d = rdMolDraw2D.MolDraw2DSVG(350,300)
d2d.drawOptions().setHighlightColour((0.8,0.8,0.8))
d2d.DrawMolecule(cp,highlightAtoms=[],highlightBonds=[0,1])
d2d.FinishDrawing()
SVG(d2d.GetDrawingText())
Out[7]:
O HO HN Cl Cl note note 2

Stereochemistry

There's a convenience option to show stereochemistry labels.

Let's start with the two enantiomers of threonine:

In [8]:
l_threonine = Chem.MolFromSmiles('C[C@@H](O)[C@H](N)C(O)=O')
d_threonine = Chem.MolFromSmiles('C[C@H](O)[C@@H](N)C(O)=O')

d2d = rdMolDraw2D.MolDraw2DSVG(600,280,300,280)
d2d.DrawMolecules([l_threonine,d_threonine])
d2d.FinishDrawing()
SVG(d2d.GetDrawingText())
Out[8]:
OH H2N OH O OH H2N OH O

It can also be nice to show the absolute stereo labels to the drawing:

In [9]:
d2d = rdMolDraw2D.MolDraw2DSVG(600,280,300,280)
d2d.drawOptions().addStereoAnnotation=True
d2d.DrawMolecules([l_threonine,d_threonine])
d2d.FinishDrawing()
SVG(d2d.GetDrawingText())
Out[9]:
OH H2N OH O (R) (S) OH H2N OH O (S) (R)

Note: the RDKit code that does CIP assignments (i.e. R/S and E/Z assignment) is not a faithful implementation of the CIP rules, which are complicated. The labels are reliable (in the sense that you'll always get the same answer for the same molecule, no matter how it's drawn), but the assignment of atom priorities is only an approximation to the actual CIP rules.

We hope to have a better implementation of true CIP labels in a future release, but definitely take the current labels with a giant grain of salt.

Now let's look allothreonine, the diasteromers of threonine:

In [10]:
l_allothreonine = Chem.MolFromSmiles('C[C@H](O)[C@H](N)C(O)=O')
d_allothreonine = Chem.MolFromSmiles('C[C@@H](O)[C@@H](N)C(O)=O')
d2d = rdMolDraw2D.MolDraw2DSVG(600,280,300,280)
d2d.drawOptions().addStereoAnnotation=True
d2d.DrawMolecules([l_allothreonine,d_allothreonine])
d2d.FinishDrawing()
SVG(d2d.GetDrawingText())
Out[10]:
OH H2N OH O (S) (S) OH H2N OH O (R) (R)

adaptations to different drawing sizes:

I've been using large drawings to make things super clear, but one of the other big improvements is that font scaling now works better:

In [11]:
d2d = rdMolDraw2D.MolDraw2DSVG(300,140,150,140)
d2d.drawOptions().addStereoAnnotation=True
d2d.DrawMolecules([l_allothreonine,d_allothreonine])
d2d.FinishDrawing()
SVG(d2d.GetDrawingText())
Out[11]:
OH H2N OH O (S) (S) OH H2N OH O (R) (R)
In [12]:
d2d = rdMolDraw2D.MolDraw2DSVG(200,80,100,80)
d2d.drawOptions().addStereoAnnotation=True
d2d.DrawMolecules([l_allothreonine,d_allothreonine])
d2d.FinishDrawing()
SVG(d2d.GetDrawingText())
Out[12]:
OH H2N OH O (S) (S) OH H2N OH O (R) (R)

If we want the molecule scaling to be insensitive to the size of the drawing surface, we can use the fixedBondLength option:

In [13]:
d2d = rdMolDraw2D.MolDraw2DSVG(200,100)
d2d.drawOptions().addStereoAnnotation=True
d2d.drawOptions().fixedBondLength=30

d2d.DrawMolecule(l_allothreonine)
d2d.FinishDrawing()
SVG(d2d.GetDrawingText())
Out[13]:
OH H2N OH O (S) (S)

If I increase the size of the drawing surface but keep the same fixedBondLength value, the molecule is drawn the same size:

In [14]:
d2d = rdMolDraw2D.MolDraw2DSVG(400,200)
d2d.drawOptions().addStereoAnnotation=True
d2d.drawOptions().fixedBondLength=30

d2d.DrawMolecule(l_allothreonine)
d2d.FinishDrawing()
SVG(d2d.GetDrawingText())
Out[14]:
OH H2N OH O (S) (S)

Note that if the canvas is too small to fit the molecule, the bond size will be overridden.

In [15]:
d2d = rdMolDraw2D.MolDraw2DSVG(100,75)
d2d.drawOptions().addStereoAnnotation=True
d2d.drawOptions().fixedBondLength=30

d2d.DrawMolecule(l_allothreonine)
d2d.FinishDrawing()
SVG(d2d.GetDrawingText())
Out[15]:
OH H2N OH O (S) (S)

What about enhanced stereo?

To display enhanced stereo information we currently need to use an extra function that adds the atom annotations:

In [16]:
def addEnhancedStereoAnnotations(m):
    gpLookup = {Chem.StereoGroupType.STEREO_OR:"or",
                Chem.StereoGroupType.STEREO_AND:"&",
                Chem.StereoGroupType.STEREO_ABSOLUTE:"abs",
               }
    sgs = m.GetStereoGroups()
    for i,sg in enumerate(sgs):
        typ = gpLookup[sg.GetGroupType()]
        for at in sg.GetAtoms():
            nt = ""
            if at.HasProp("atomNote"):
                nt += at.GetProp("atomNote")+","
            nt += f"{typ}{i+1}"
            at.SetProp("atomNote",nt)
In [17]:
threonine_and = Chem.MolFromSmiles('C[C@@H](O)[C@H](N)C(O)=O |&1:1,3|')
threonine_or = Chem.MolFromSmiles('C[C@@H](O)[C@H](N)C(O)=O |o1:1,3|')
addEnhancedStereoAnnotations(threonine_and)
addEnhancedStereoAnnotations(threonine_or)
d2d = rdMolDraw2D.MolDraw2DSVG(600,280,300,280)
d2d.DrawMolecules([threonine_and,threonine_or])
d2d.FinishDrawing()
SVG(d2d.GetDrawingText())
Out[17]:
OH H2N OH O &1 &1 OH H2N OH O or1 or1

It should be easier to get this information displayed, so we'll add a new drawing option for that in the next release.

Advanced highlighting

Let's highlight pharmacophore features on a molecule. The tricky bit here is that one atom can be in multiple features. Fortunately there is new code to support this.

In [18]:
atenolol = Chem.MolFromSmiles('CC(C)NCC(O)COc1ccc(CC(N)=O)cc1')
atenolol
Out[18]:

It's worth pointing out how much nicer this drawing looks than it used to since the H is now above the amine N (instead of next to it).

Start by getting the feature factory that we'll use to find the chemical features.

The default feature definition file that the RDKit includes isn't so great, but fortunately some years ago Markus Kossner contributed a better one that lives in the Contrib directory. We'll use that

In [19]:
from rdkit.Chem import ChemicalFeatures
from rdkit import RDConfig
import requests
# grab a feature definition file from the RDKit's contrib dir in github:
res = requests.get('https://raw.githubusercontent.com/rdkit/rdkit/master/Contrib/M_Kossner/BaseFeatures_DIP2_NoMicrospecies.fdef')
fdef = res.text
ffact = ChemicalFeatures.BuildFeatureFactoryFromString(fdef)
In [20]:
from collections import defaultdict
feats = ffact.GetFeaturesForMol(atenolol)
colors = {'SingleAtomDonor':(1,0,0),
          'SingleAtomAcceptor':(0,0,1),
          'BasicGroup':(0,1,1),
          'Arom6':(1,0.8,0.5),
          'Hphobe':(0.7,0.7,0.3)}
atomHighlights = defaultdict(list)
highlightRads = {}
for feat in feats:
    if feat.GetType() in colors:
        clr = colors[feat.GetType()]
        for aid in feat.GetAtomIds():
            atomHighlights[aid].append(clr)
            highlightRads[aid] = 0.5

d2d = rdMolDraw2D.MolDraw2DSVG(600,280,300,280)
d2d.DrawMoleculeWithHighlights(atenolol,"atenelol",dict(atomHighlights),{},highlightRads,{})
d2d.FinishDrawing()
SVG(d2d.GetDrawingText())
Out[20]:
N H OH O NH2 O atenelol

The atom highlights that we add are drawng as ellipses so that they can cover the full atom label, but it's possible to force them to be circles. We can also draw them just as outlines:

In [21]:
d2d = rdMolDraw2D.MolDraw2DSVG(600,280,300,280)
dos = d2d.drawOptions()
dos.atomHighlightsAreCircles = True
dos.fillHighlights=False
d2d.DrawMoleculeWithHighlights(atenolol,"atenelol",dict(atomHighlights),{},highlightRads,{})
d2d.FinishDrawing()
SVG(d2d.GetDrawingText())
Out[21]:
N H OH O NH2 O atenelol

We can also have multiple highlights on bonds. Here's an example showing the ring systems in a molecule from ChEMBL:

In [22]:
chembl2314369=Chem.MolFromSmiles('N#CC(C#N)=C1C(=O)c2cccc3cccc1c23')
rings = chembl2314369.GetRingInfo()

colors = [(0.8,0.0,0.8),(0.8,0.8,0),(0,0.8,0.8),(0,0,0.8)]

athighlights = defaultdict(list)
arads = {}
for i,rng in enumerate(rings.AtomRings()):
    for aid in rng:
        athighlights[aid].append(colors[i])
        arads[aid] = 0.3

bndhighlights = defaultdict(list)
for i,rng in enumerate(rings.BondRings()):
    for bid in rng:
        bndhighlights[bid].append(colors[i])
    
d2d = rdMolDraw2D.MolDraw2DSVG(400,400)
d2d.DrawMoleculeWithHighlights(chembl2314369,'CHEMBL2314369',dict(athighlights),dict(bndhighlights),arads,{})
d2d.FinishDrawing()
SVG(d2d.GetDrawingText())
Out[22]:
N N O CHEMBL2314369

There are some other improvements under the hood, but this covers the major new stuff.