by momeara 3530 days ago
Molecular dynamics simulations can be used to answer a range of structural biology questions, but abstractly many of them can be phrased as evaluating the difference in free energy between different conformational states. In molecular dynamics this is done by thermodynamically integrating the energy over the state-space volume of each of the conformational states.
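To make the "free energy difference between states" idea concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption for illustration: the 1D double-well potential, the state boundaries, and the temperature. It also integrates Boltzmann weights directly over each state's region, which only works because the toy is one-dimensional; it is not thermodynamic integration along a coupling parameter, and real simulations estimate these integrals by sampling.

```python
import math

KT = 0.593  # kcal/mol, roughly 298 K (assumed temperature)

def double_well(x):
    """Hypothetical 1D potential (kcal/mol): wells near x = -1 and x = +1,
    tilted so that the right-hand well is lower in energy."""
    return 2.0 * (x * x - 1.0) ** 2 - 0.5 * x

def state_partition(energy, lo, hi, n=10000):
    """Midpoint-rule integral of exp(-E(x)/kT) over the interval [lo, hi]."""
    dx = (hi - lo) / n
    return sum(math.exp(-energy(lo + (i + 0.5) * dx) / KT) for i in range(n)) * dx

def free_energy_difference(energy, state_a, state_b):
    """dG(A -> B) = -kT * ln(Z_B / Z_A), with each state given as (lo, hi)."""
    z_a = state_partition(energy, *state_a)
    z_b = state_partition(energy, *state_b)
    return -KT * math.log(z_b / z_a)

# State A is the left well, state B is the right well (boundaries are assumed).
dg = free_energy_difference(double_well, (-2.0, 0.0), (0.0, 2.0))
print(f"dG(left -> right) = {dg:.2f} kcal/mol")
```

A negative value means the right-hand state is more populated at equilibrium. In a real protein the integral runs over thousands of coupled coordinates, which is exactly why sampling becomes the bottleneck.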

An alternative approach is to directly map conformational states to their free energy. This leads to a problem of searching for candidate conformational states (e.g. the folded state, transition states etc.) and scoring them. Usually, for a given computational budget, there is a trade-off between better conformational sampling and higher-accuracy energy scoring.
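As a rough illustration of the search/score split, here is a minimal sketch of a Metropolis Monte Carlo search in which the scoring function is a pluggable argument and the budget is the number of score evaluations. The function names, the toy score, and the parameter values are all hypothetical; real conformational search works over many coupled degrees of freedom and uses far more elaborate move sets.

```python
import math
import random

def metropolis_search(score, x0, n_steps, step_size=0.5, kt=0.593, seed=0):
    """Search a 1D coordinate with Metropolis moves; returns the best (x, score) seen.

    `score` is called once per proposed move, so n_steps is the scoring budget:
    a more expensive score means fewer steps for the same wall-clock time.
    """
    rng = random.Random(seed)
    x, e = x0, score(x0)
    best_x, best_e = x, e
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step_size, step_size)   # search: propose a move
        e_new = score(x_new)                             # score: evaluate it
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / kt):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

def toy_score(x):
    """Hypothetical cheap score with two wells, standing in for an energy function."""
    return 2.0 * (x * x - 1.0) ** 2 - 0.5 * x

# Start in the higher-energy well and see what a fixed budget finds.
best_x, best_e = metropolis_search(toy_score, x0=-1.0, n_steps=2000)
print(f"best x = {best_x:.3f}, best score = {best_e:.3f}")
```

Swapping `toy_score` for a slower, more accurate function while cutting `n_steps` to keep the cost constant is the trade-off in miniature.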

Historically, searching and scoring methods have been designed separately. For example, [1] improves sampling while [2] improves energetics. This is done because they have historically involved different aspects of the simulation and each is a lot of work. But searching and scoring are not really separable: the deeper one samples, the more challenging the scoring function's task of discriminating stable from unstable conformations becomes.

Another application that can be thought of as searching and scoring is the game of GO. My impression is that one of the major breakthroughs with AlphaGo is that they were able to integrate models for searching and scoring together and learn the models simultaneously. It would be awesome if similar architectures could be applied to molecular modeling.

A remaining challenge in applying GO models to molecular biology is that while the representation and scoring rules for GO are fixed and quite easy, the ground truth for molecular simulations comes from heterogeneous experimental data (X-ray crystal structures, small molecule activities, directed evolution antibody screens etc.) and from QM simulations at higher levels of theory, which have their own challenges. However, I think the principles carry over: complicated scoring functions (e.g. free energy) over large state spaces (e.g. protein conformation space or chemical space) can be learned by combining models for searching and scoring. I think deep learning is poised to tackle these problems.

[1] (Conway, et al., 2013, DOI: 10.1002/pro.2389) Relaxation of backbone bond geometry improves protein energy landscape modeling

[2] (Park, 2016, PMID: 27766851) Simultaneous optimization of biomolecular energy function on features from small molecules and macromolecules.

1 comments

I'm in this field so I'm quite familiar with the search/score situation, but thanks for clearly mapping out the challenge and identifying where you think neural networks will be most beneficial. I just think the particulars make this an immensely different story than GO, and not just the point you describe as the "remaining challenge".

The search space is vast in GO, but it inevitably shrinks over the game, whereas in MD simulations it does not shrink: proteins can fold and unfold. There is a fixed number of legal positions to play in GO, but the set of legal moves for a protein conformation fluctuates wildly, is governed by physics (which you would need to relearn), and is likely to be much larger than in GO since it's continuous. In simulations, you care about successive moves, whereas AlphaGo does not care about time-dependent properties (there are also kinetic observables, like folding rates, that seem non-intuitive to evaluate without simulations). Even if you sampled enough conformations on some pathway, perhaps some sort of allosteric change, how would you know how fast it happens? In GO, you always play the same game, but in simulations, you often play different games, i.e., you don't want to be unfolding your protein when you are studying ligand binding. In a similar vein, imagine a single-point mutation that causes protein misfolding. It seems to me that you'd need to retrain your search/score algorithm for each new protein sequence, which doesn't seem to save much time/complexity. There is also a huge problem of scale. We're talking about proteins varying from hundreds to hundreds of thousands of atoms/dihedrals/contacts, not to mention sampling water in the active sites of druggable proteins.

I think it could work in principle, but a physics-based approach sure seems elegant by comparison.

Hi Chris,

You bring up very good issues and perhaps I'm being too optimistic. I definitely agree that there isn't going to be one single mapping of sequence --> energy landscape any time soon or even ever.

But I think there are subproblems that are easier because the search space is more limited[1] or the chemistry is easier (e.g. avoiding chemical reactions or interactions with high energy fields). I think often the major modeling challenge is identifying when it is feasible to take advantage of problem constraints or when lower levels of theory can be used. For example, there are a range of "enhanced sampling methods" for molecular dynamics that e.g. constrain the simulation to a reaction coordinate or assume Markov transitions between states so they can be computed on a distributed cluster.
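To show what "assume Markov transitions between states" buys you, here is a minimal sketch of the idea: many short, independent trajectories (which could come from different machines) are pooled into one transition matrix, and its stationary distribution gives relative free energies. The function names and the example trajectories are made up for illustration. Real Markov state model pipelines also need to cluster MD frames into states, choose a lag time, and usually enforce detailed balance, none of which is shown here.

```python
import math

KT = 0.593  # kcal/mol, roughly 298 K (assumed temperature)

def transition_matrix(trajs, n_states, lag=1):
    """Row-normalized transition counts pooled over independent trajectories.

    Each trajectory is a list of integer state labels; lag must be >= 1.
    """
    counts = [[0.0] * n_states for _ in range(n_states)]
    for traj in trajs:
        for a, b in zip(traj[:-lag], traj[lag:]):
            counts[a][b] += 1.0
    matrix = []
    for i, row in enumerate(counts):
        total = sum(row)
        if total > 0:
            matrix.append([c / total for c in row])
        else:
            # Never-visited state: treat as absorbing so the row is still stochastic.
            matrix.append([1.0 if j == i else 0.0 for j in range(n_states)])
    return matrix

def stationary_distribution(matrix, n_iter=10000, tol=1e-12):
    """Power iteration for pi satisfying pi = pi T, starting from uniform."""
    n = len(matrix)
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        new = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi

# Three short made-up trajectories over two states (0 and 1).
trajs = [[0, 0, 0, 1, 1, 1, 1, 0, 0, 1], [1, 1, 1, 0, 0, 1, 1, 1], [0, 1, 1, 1, 1, 0, 0, 0]]
T = transition_matrix(trajs, n_states=2)
pi = stationary_distribution(T)
free_energies = [-KT * math.log(p) for p in pi]  # assumes every state was visited
print(T, pi, free_energies)
```

The same matrix also carries kinetic information (transition probabilities per lag time), which is part of why this representation is attractive when you care about rates and not just populations.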

Taking advantage of these opportunities often requires a fair amount of engineering to build appropriate representations. I wonder to what extent these representations can be learned?