by aheifets 4128 days ago
I’m the cofounder and CEO of Atomwise and, since this is Hacker News, I thought I’d cover the technical details a bit more: We run deep neural networks on one of the biggest supercomputers (#74 in the world, http://www.top500.org/site/50424) to predict whether a molecule will stick to a disease target (its “binding affinity”). Understanding the binding affinity of a molecule is one of the essential questions in finding new medicines; it comes up over and over in the drug discovery pipeline, including in hit discovery, toxicity prediction, and personalized medicine.

Our goal is to bring to medicine discovery the same kind of incredible efficiency gains that computation gave us in aerospace and mechanical engineering design. Today, people have to physically synthesize and physically test molecules to figure out how they’re going to behave. That’s incredibly laborious, expensive, and time consuming. We’re able to get the same results, but in days instead of months or years.

Given all of the new and re-emerging diseases we’re encountering (such as Ebola, measles, malaria, and drug-resistant infections, to name a few that we’ve worked on), I think our species needs all of the help we can get in finding new medicines. I’m happy to answer questions about what we’re doing, or the challenges we encounter when we take deep learning algorithms out beyond image classification.

4 comments

Thanks for answering questions! I'd be curious to know where you got your (presumably massive) data from to train a NN to spit out what seems to be binding affinity between two candidates (drug and target). Do you guys use a NN for each target? I know you may not be able to answer these questions :)

I hope your team succeeds, keep up the hard work!

Thank you for the kind wishes!

Over the past few years, there's been a huge increase in the amount of data available for this kind of machine learning. We curate our data from a number of private and public sources. For example, as part of my doctoral work (http://en.wikipedia.org/wiki/SCRIPDB), I learned how to parse chemical information out of U.S. Patent data, which is public domain. That said, if you're interested in working on something like this and need a quick million data points, I'd point you to PubChem as a first step: https://pubchem.ncbi.nlm.nih.gov/
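
For anyone who wants to poke at PubChem programmatically, here is a minimal sketch of a single-compound lookup through its PUG REST interface. The helper names (`property_url`, `fetch_properties`) and the choice of properties are illustrative assumptions, not anything Atomwise has described, and the response layout noted in the comment is assumed as well. For a million data points the bulk downloads would be the more sensible route. This only shows what one record request looks like and does not touch the bioassay measurements themselves.

```python
import json
import urllib.request

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cid, properties=("MolecularFormula", "MolecularWeight")):
    """Build a PUG REST URL asking for a few properties of one compound ID."""
    props = ",".join(properties)
    return f"{PUG_REST}/compound/cid/{int(cid)}/property/{props}/JSON"


def fetch_properties(cid, properties=("MolecularFormula", "MolecularWeight"), timeout=30):
    """Download and decode the property record for one compound (needs network access)."""
    with urllib.request.urlopen(property_url(cid, properties), timeout=timeout) as resp:
        payload = json.load(resp)
    # Assumed response shape: {"PropertyTable": {"Properties": [{...}]}}
    return payload["PropertyTable"]["Properties"][0]
```

Calling `fetch_properties(2244)` would request the record for CID 2244 (aspirin). Please be gentle with request rates if you loop over many compounds.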

Abraham,

This is the kind of work that is essential for our future --so thank you. I'm quite (positively) surprised Sam decided to fund this; as a civilization, we definitely need to put more effort and resources into work like this.

I'll hopefully have a lot more to say in the future, and will definitely be reaching out in a more substantive way... But in the meantime, a quick word of advice: This will sound strange, since Hinton's work is quite powerful as it is, but my guess is that good-old boosting / \ell_1-regularized ensemble learning methods would work much better for this particular problem domain --so please run some experiments and look into it, if you haven't already. It's hard to find good and up-to-date literature on this (nowadays) less fashionable work (a good rule of thumb: if it mentions 'random forests', it is not well-informed enough), but Freund and Schapire's recent book [1] is self-contained and a jewel to read cover to cover. Best of luck.

[1] http://mitpress.mit.edu/books/boosting
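
For readers who haven't met boosting before, here is a minimal sketch of AdaBoost with one-dimensional decision stumps as base learners. It is a toy illustration of the general technique only: the data set, the function names, and the choice of three rounds are all assumptions made for the example, and it says nothing about how well boosting would do on real binding-affinity data.

```python
import math


def stump_predict(x, threshold, polarity):
    """A decision stump: answer `polarity` at or above the threshold, else the opposite."""
    return polarity if x >= threshold else -polarity


def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) for the best stump on weighted data."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(
                w for w, x, y in zip(weights, xs, ys)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    """Fit `rounds` weighted stumps; labels must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)  # vote strength of this stump
        ensemble.append((alpha, threshold, polarity))
        # Up-weight the points this stump got wrong, down-weight the rest.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy data: positives sit in an interval, so no single stump fits it.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [-1, -1, -1, 1, 1, 1, 1, -1, -1, -1]
ensemble = adaboost(xs, ys, rounds=3)
predictions = [predict(ensemble, x) for x in xs]
```

On this toy set the best single stump misclassifies three of the ten points, while the three-stump ensemble classifies all ten correctly. With noisy labels the reweighting step is exactly where trouble starts, since mislabeled points keep gaining weight.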

Thank you! Personally, I find it very exciting to be working on these problems.

With respect to boosting, we have more investigation to do, of course; the tricky issue with the biological domain is that we know the underlying data is incredibly noisy. How to walk the line of extracting maximum predictive performance without overfitting is the challenge, since we know that a lot of the raw data points are unreliable. Any algorithm we use has to be able to handle this scenario deftly.

Absolutely. There has been some work specifically on boosting in the presence of noise --see for instance [1], and Sec. 12.3.3 of Schapire's book-- using branching programs/BDDs as base learners. It's definitely worth taking a look.

[1] http://research.microsoft.com/en-us/um/people/adum/publicati...

As a techie who is not involved in machine learning, I see these AI articles coming up and I wonder whether it's just me or whether AI is indeed on the rise after such a long stagnation?

Is this newfound excitement and interest in AI fuelled by actual, objective advances?

EDIT: Keep up the good work, I sincerely hope that the work you do will make the world a better place. I wish I could help somehow.

New medical discoveries aside, we're seeing self-driving cars and speech recognition that runs on a cell phone. I grew up reading about those kinds of things in Asimov, so I personally find the progress pretty exciting.

Hi Abraham, could you describe some of the differences between Atomwise's approach and D.E. Shaw research's approach to computational identification of novel pharmaceuticals? Thanks!

As you might expect, there are trade-offs, and it's a question of picking the right tool for the job.

My understanding of D.E. Shaw's approach is that they're doing molecular dynamics, i.e. simulation. You get to watch the motion of every atom in the system. That allows for a close investigation of a given protein's movements, which is great especially if you're trying to learn about its biology. Unfortunately, it is rather computationally expensive; while I don't know DESRES's latest stats, I've seen reports on large parallel MD simulations completing about once per day.

In contrast, we've posed the question of binding as a machine learning problem. Neural networks are computationally expensive to train, but make predictions quickly. Our system can assess millions of protein-drug pairs per day, since we're not simulating the motion of every atom. You don't get to watch what each atom is doing, but you get insight into the behavior of lots of potential medicines.