Building a Multi-Molecule Mol2 reader for RDKit.

In this mini-tools entry, I want to introduce a simple function to read multi-molecule mol2 files with RDKit. As you may know, reading multi-molecule sdf (or sd) files with is very easy using the module:

Chem.SDMolSupplier ()

However, at this moment, RDKit doesn´t have something like Chem.Mol2MolSupplier (). But, for some of us read a multi-molecule mol2 files could be useful and we don´t want to convert our mol2 files to sdf.

Importing the libraries

In [1]:
import os
import pandas as pd

from rdkit import Chem
from rdkit.Chem import Draw,AllChem
from rdkit.Chem.Draw import IPythonConsole

The next cell contains the function to read each molecule inside the multi-molecule mol2 file. Basically what it does is read each block of text containing the molecules inside of the mol2 file. Then, it converts each text block into an RDKit molecule type molecule by using the Chem.MolFromMol2Block() of RDKit.

It´s very rudimentary and simple, but it works!!!

In [2]:
def Mol2MolSupplier (file=None,sanitize=True):
    mols=[]
    with open(file, 'r') as f:
        line =f.readline()
        while not f.tell() == os.fstat(f.fileno()).st_size:
            if line.startswith("@<TRIPOS>MOLECULE"):
                mol = []
                mol.append(line)
                line = f.readline()
                while not line.startswith("@<TRIPOS>MOLECULE"):
                    mol.append(line)
                    line = f.readline()
                    if f.tell() == os.fstat(f.fileno()).st_size:
                        mol.append(line)
                        break
                mol[-1] = mol[-1].rstrip() # removes blank line at file end
                block = ",".join(mol).replace(',','')
                m=Chem.MolFromMol2Block(block,sanitize=sanitize)
            mols.append(m)
    return(mols)

I will use the following multi-molecule mol2 file to show how the function works. The file contains 179 different molecules from ZINC.

In [3]:
filePath ='for-sale+in-man+fda+named+endogenous.mol2'
In [4]:
database=Mol2MolSupplier(filePath,sanitize=True)
RDKit WARNING: [21:45:13] ZINC000003801919: warning - O.co2 with non C.2 or S.o2 neighbor.

Because we are using RDKit to convert mol2 block texts to RDKit molecules, we can use sanitization or not. Moreover, we can see the warning for sanitization problems. If a molecule is not valid we will get a 'None' element.

In [5]:
database[:10] #The first 10 elements in the list 
Out[5]:
[None,
 <rdkit.Chem.rdchem.Mol at 0x118863cb0>,
 <rdkit.Chem.rdchem.Mol at 0x118873300>,
 <rdkit.Chem.rdchem.Mol at 0x118873210>,
 None,
 None,
 <rdkit.Chem.rdchem.Mol at 0x1188733a0>,
 <rdkit.Chem.rdchem.Mol at 0x1188733f0>,
 <rdkit.Chem.rdchem.Mol at 0x118873440>,
 <rdkit.Chem.rdchem.Mol at 0x118873490>]

Once molecules loaded. We can perform any calculation available in RDKit or converting the molecules to other formats (e.g. SDF). For instance, creating a pandas table with some useful molecular information.

In [6]:
table=pd.DataFrame()
index=0
for mol in database:
    if mol:
        table.loc[index,'Name']=mol.GetProp('_Name')
        table.loc[index,'NumAtoms']=mol.GetNumAtoms()
        table.loc[index,'SMILES']=Chem.MolToSmiles(mol)
        index=index+1
In [7]:
table.head(10) #The first 10 non None elements in the list 
Out[7]:
Name NumAtoms SMILES
0 ZINC000000005878 9.0 NC(=O)c1cccnc1
1 ZINC000006661227 11.0 [NH3+][C@@H](Cc1c[nH]cn1)C(=O)[O-]
2 ZINC000006661227 11.0 [NH3+][C@@H](Cc1cnc[nH]1)C(=O)[O-]
3 ZINC000100009278 21.0 C[N@@H+]1[C@H]2CC[C@@H]1C[C@H](OC(=O)[C@@H](CO...
4 ZINC000100009278 21.0 C[N@H+]1[C@H]2CC[C@@H]1C[C@H](OC(=O)[C@@H](CO)...
5 ZINC000100009280 21.0 C[N@@H+]1[C@H]2CC[C@@H]1C[C@H](OC(=O)[C@H](CO)...
6 ZINC000100009280 21.0 C[N@H+]1[C@H]2CC[C@@H]1C[C@H](OC(=O)[C@H](CO)c...
7 ZINC000003830215 25.0 CC1(C)S[C@@H]2[C@H](NC(=O)[C@H](N)c3ccc(O)cc3)...
8 ZINC000005133329 7.0 Oc1ccccc1
9 ZINC000118912393 21.0 C[C@]12CC[C@H]3[C@@H](CCC4=CC(=O)CC[C@@]43C)[C...

Drawing non None molecules keeping the 3D coordinates from the mol2 file.

In [8]:
no_none=[mol for mol in database if mol] # None element can´t be drawn, this loop keep only valid entries
[Chem.SanitizeMol(mol) for mol in no_none]
Draw.MolsToGridImage(no_none,molsPerRow=7,subImgSize=(150,150),legends=[mol.GetProp('_Name') for mol in no_none],maxMols=100)
Out[8]:
In [9]:
# Drawing 3 random molecules of non None list
Draw.IPythonConsole.drawMol3D(no_none[4])
Draw.IPythonConsole.drawMol3D(no_none[16])
Draw.IPythonConsole.drawMol3D(no_none[95])

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol

There are many strategies to fix None entries. But this mini-tool is just an idea to use your multi-molecule mol2 files with RDKit. Hopefully you find this function useful for your work.


Comments

comments powered by Disqus