by momeara 2689 days ago
Co-author here (AMA). A large-scale docking screen of 116M molecules takes ~1100 CPU-days on our cluster, which works out to about 1 mol/sec and is very fast for virtual screening. What this doesn't account for is the roughly 30 minutes per compound needed to precompute information (conformations, partial charges, etc.). So it works out to ~6M CPU-hours to prepare the library for screening, which is a substantial amount of computation. We're loading about 1M molecules a day and have a 2-3 year backlog of compounds to load from Enamine.
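
As a rough check on the per-molecule numbers above, here is a minimal sketch that uses only the figures quoted in this comment (116M molecules, ~1100 CPU-days of docking, ~30 minutes of preparation per compound). The function and constant names are illustrative, and the figures are taken at face value.

```python
# Back-of-the-envelope arithmetic using only figures quoted in the comment.
LIBRARY_SIZE = 116e6        # molecules docked in the screen
DOCKING_CPU_DAYS = 1100     # total docking cost reported for the screen
PREP_MINUTES_PER_MOL = 30   # reported precompute cost per compound

SECONDS_PER_DAY = 86_400


def docking_seconds_per_molecule(cpu_days: float, n_molecules: float) -> float:
    """Average CPU-seconds spent docking one molecule."""
    return cpu_days * SECONDS_PER_DAY / n_molecules


def prep_to_docking_ratio(prep_minutes: float, dock_seconds: float) -> float:
    """How many times more expensive preparation is than docking, per molecule."""
    return prep_minutes * 60 / dock_seconds


dock_s = docking_seconds_per_molecule(DOCKING_CPU_DAYS, LIBRARY_SIZE)
ratio = prep_to_docking_ratio(PREP_MINUTES_PER_MOL, dock_s)

print(f"docking: {dock_s:.2f} CPU-seconds per molecule")
print(f"preparation costs roughly {ratio:,.0f}x as much as docking, per molecule")
```

Under these inputs docking comes out a little under one CPU-second per molecule, consistent with the "about 1 mol/sec" figure, while preparation is three orders of magnitude more expensive per compound.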

The good news is that once the library is prepared, it is quick to screen against more targets, and we make the pre-computed library available at zinc15.docking.org.

Interestingly, as the library grows, a limiting factor is storing it on disk. It is now ~20 TB. We've set up several mirrors around the world for groups that are actively using it. An interesting problem will be to see whether preparing compounds for screening on the fly (e.g. with machine learning models) can overcome this limitation and keep up with library growth.

A big question for us is what the return on investment in screening larger and larger libraries will be. One of the takeaways from this work is that if docking has moderate enrichment, then screening larger libraries not only gives more hits but can actually increase the hit rate for the top-scoring compounds.

3 comments

This is exactly why I typically apply all sorts of prior rejection criteria to trim down my (relatively humble) set of 17 million trial molecules. I don't have a proper cluster like you all do. So 'beggars' like me need to do the easy ADME rejections early (e.g. >12 rotatable bonds? 11 H-bond donors? partition coefficient of 7.3? Then no further consideration is needed) before an expensive docking run.
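
A minimal sketch of that kind of cheap early rejection, assuming the descriptors have already been computed by some cheminformatics toolkit. The `Descriptors` class and the function names are hypothetical. Only the 12-rotatable-bond limit comes from the comment; the donor and logP cutoffs are assumed, Lipinski-style values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed per-molecule descriptors (hypothetical container)."""
    name: str
    rotatable_bonds: int
    hbond_donors: int
    logp: float  # octanol-water partition coefficient


MAX_ROTATABLE_BONDS = 12  # limit mentioned in the comment
MAX_HBOND_DONORS = 5      # assumed cutoff, not from the comment
MAX_LOGP = 5.0            # assumed cutoff, not from the comment


def passes_prefilter(mol: Descriptors) -> bool:
    """Return True only if the molecule survives every cheap rejection rule."""
    return (
        mol.rotatable_bonds <= MAX_ROTATABLE_BONDS
        and mol.hbond_donors <= MAX_HBOND_DONORS
        and mol.logp <= MAX_LOGP
    )


def prefilter(mols: list[Descriptors]) -> list[Descriptors]:
    """Keep only the molecules worth sending on to expensive docking."""
    return [m for m in mols if passes_prefilter(m)]


candidates = [
    Descriptors("drug_like", rotatable_bonds=4, hbond_donors=2, logp=2.1),
    Descriptors("too_flexible", rotatable_bonds=13, hbond_donors=2, logp=2.1),
    Descriptors("too_many_donors", rotatable_bonds=4, hbond_donors=11, logp=2.1),
    Descriptors("too_greasy", rotatable_bonds=4, hbond_donors=2, logp=7.3),
]

survivors = prefilter(candidates)
print([m.name for m in survivors])
```

Because each rule is a simple comparison on a precomputed number, a filter like this costs almost nothing per molecule compared with docking.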
I know that docking on GPUs is about an order of magnitude faster than on CPUs (see today's Schrodinger 2019-1 release notes, https://youtu.be/K4AYdBvuOe4?t=90). Is there a way of doing GPU-accelerated precomputation, though?
Hey Chris, right now we're using a mix of commercial and open-source software like Omega, Corina, AMSOL, and Mol2DB. Probably the slowest step is generating the partial charges for each conformer with a reasonably high quality semi-empirical forcefield. I'm not sure whether there are GPU-based methods that are competitive in terms of quality, but if there were methods that were ~1000 times faster, as can be the case for GPU-based methods, they would definitely speed up the pre-computation or make on-the-fly prep feasible. Do you have any ideas of where we should look?
I'm curious why 20 terabytes on disk is a challenge. That sounds like a "bucket in S3" problem.