[Dock-fans] DB2 gen Pipeline - failed molecules

Corey Taylor corey.taylor at uni-marburg.de
Thu Mar 1 07:53:50 PST 2018

Dear fellow fans of DOCK,

I've been using the pipeline for db2 file generation that comes with 
DOCK 3.7 and in general it does a great job of generating molecules. 
However, when trying to parameterise the KEGG dataset as downloaded from 
here (http://zinc.docking.org/catalogs/keggviapc), there seem to be 
quite a few molecules which not only end up in the 'failed' folder (not 
in and of itself a problem) but fail so hard that the pipeline stops 
running entirely.

The following command:

$DOCKBASE/ligand/generate/build_database_ligand.sh -s KEGG_test.smi

Seems to create protomers okay:

Precomputing protomers for all compounds (pH: 7.4 6.4 8.4)
ph 7.4: 1 protomers created
ph 6.4: 1 protomers created
ph 8.4: 1 protomers created
Coalesing and merging protomers
1 protomers generated for 1 compounds

But then stops upon running AMSOL:

Refusing to build conformations with > 5 rotatable hydrogens
Conformer generation failed
Skipping ZINC08214483 0

Logs in /failed/ZINC08214483 seem to be generated, so presumably AMSOL 
itself runs okay. The molecules which cause these problems (~1% of the 
SMILES from the above link) tend to be very large, with lots of 
stereocentres (10 or more) and/or probably quite unusual protonation 
states, as you'd expect in a dataset like KEGG. Here are a couple of 
examples:

C1[C@@H]([C@H]([C@@H]([C@H]([C@@H]1NC(=O)[C@H](CC[NH3+])O)O[C@@H]1[C@@H]([C@H]([C@@H]([C@H](O1)CO)O)N)O)O)O[C@@H]1[C@@H]([C@H]([C@@H]([C@H](O1)C[NH3+])O)O)O)[NH3+]
Cc1c([nH+]c[nH]1)CSCCN/C(=N/C#N)/NCC#C    ZINC11616902

So my questions are:

- Is it generally the case that these molecules will fail? i.e. are 
there no tweaks, parameters or options in the pipeline that will lead to 
db2 files for weird molecules like these? This isn't a big problem per 
se, as we probably wouldn't DOCK a lot of molecules of this nature anyway.
- Is there any obvious reason why these molecules would stop the scripts 
in the pipeline dead or, better, any way to avoid this? Or do we just 
have to live with some crashes? Although AMSOL runs okay, perhaps if all 
attempts at protomer generation fail, a downstream script ends up with 
an empty variable/container to handle and crashes? You can imagine it 
gets frustrating when parameterising 10K molecules, if every 100th 
molecule fails and crashes the pipeline...
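
As a stopgap for the second question, here is a minimal sketch of running 
the pipeline once per molecule, so that one hard failure cannot take the 
rest of the batch down with it. The function name run_isolated and the 
one-directory-per-molecule layout are my own invention, and it assumes the 
build script exits with a non-zero status when it dies, which I have not 
verified:

```python
import subprocess
from pathlib import Path


def run_isolated(smi_lines, command, workdir):
    """Run `command <one-molecule .smi>` per input line.

    Returns the names of molecules whose run failed. Hypothetical helper,
    not part of the DOCK pipeline.
    """
    workdir = Path(workdir).resolve()
    failed = []
    for i, line in enumerate(smi_lines):
        line = line.strip()
        if not line:
            continue
        fields = line.split()
        # Second column is taken as the molecule name; otherwise use a label.
        name = fields[1] if len(fields) > 1 else f"mol_{i}"
        mol_dir = workdir / name
        mol_dir.mkdir(parents=True, exist_ok=True)
        smi_file = mol_dir / "input.smi"
        smi_file.write_text(line + "\n")
        try:
            # Assumption: a crash shows up as a non-zero exit status.
            result = subprocess.run(command + [str(smi_file)], cwd=mol_dir)
            ok = result.returncode == 0
        except OSError:
            # The command itself could not be started.
            ok = False
        if not ok:
            failed.append(name)
    return failed
```

For the command above, command would be something like 
[os.path.expandvars("$DOCKBASE/ligand/generate/build_database_ligand.sh"), "-s"] 
and smi_lines the lines of KEGG_test.smi. This trades speed for 
robustness, since every molecule pays the start-up cost of the script.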

Of course, if the software used in the pipeline will simply fail for any 
molecules with > 5 stereocentres, phosphorus, etc., then I can just 
write a script to omit these. Just curious if I'm missing something.
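
For what it's worth, the kind of omit-script I have in mind would look 
roughly like the sketch below. It is a crude text-level heuristic rather 
than real cheminformatics: it counts bracket atoms carrying an @ or @@ 
chirality mark and looks for a capital P not followed by a lowercase 
letter. The names split_smi and count_stereo_marks and the default 
threshold of 5 are assumptions for illustration, not anything the 
pipeline defines:

```python
import re

# Bracket atoms carrying a chirality mark, e.g. [C@H] or [C@@H].
CHIRAL_ATOM = re.compile(r"\[[^\]]*@[^\]]*\]")
# Capital P not followed by a lowercase letter, so [Pt] or [Pd] do not match.
# Aromatic phosphorus (lowercase p) is deliberately not handled.
PHOSPHORUS = re.compile(r"P(?![a-z])")


def count_stereo_marks(smiles):
    """Count bracket atoms with an explicit tetrahedral chirality mark."""
    return len(CHIRAL_ATOM.findall(smiles))


def split_smi(lines, max_stereo=5, skip_phosphorus=True):
    """Split .smi lines into (keep, skip) using the heuristics above."""
    keep, skip = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        smiles = line.split()[0]
        too_chiral = count_stereo_marks(smiles) > max_stereo
        has_p = skip_phosphorus and PHOSPHORUS.search(smiles) is not None
        (skip if too_chiral or has_p else keep).append(line)
    return keep, skip
```

Note that a filter like this would catch the first example above but not 
ZINC11616902, which has no marked stereocentres and no phosphorus, so it 
cannot be the whole story.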

Cheers guys,

Corey Taylor
Kolb Lab
Institute of Pharmaceutical Chemistry
Philipps-University Marburg
Marbacher Weg 6
35032 Marburg

Mailto: corey.taylor at uni-marburg.de
Ph: +49 6421 28 21351
Web: http://www.kolblab.org/taylor.html
