|
I'm in this field, so I'm quite familiar with the search/score situation, but thanks for clearly mapping out the challenge and identifying where you think neural networks will be most beneficial. I just think the particulars make this an immensely different story from Go, and not just the point you describe as the "remaining challenge". The search space in Go is vast, but it inevitably shrinks over the course of a game, whereas in MD simulations it does not shrink: proteins can fold and unfold. There is a fixed number of possible legal positions to play in Go, but the set of legal moves for a protein conformation fluctuates wildly, is governed by physics (which you would need to relearn), and is likely much larger than in Go since it's continuous. In simulations you care about successive moves, whereas AlphaGo does not care about time-dependent properties (there are also kinetic observables, like folding rates, that seem non-intuitive to evaluate without simulations). Even if you sampled enough conformations on some pathway, perhaps some sort of allosteric change, how would you know how fast it happens? In Go you always play the same game, but in simulations you often play different games, i.e., you don't want to be unfolding your protein when you are studying ligand binding. In a similar vein, imagine a single-point mutation that causes protein misfolding. It seems to me that you'd need to retrain your search/score algorithm for each new protein sequence, which doesn't seem like it would save much time or complexity. There is also a huge problem of scale. We're talking about proteins varying from hundreds to hundreds of thousands of atoms/dihedrals/contacts, not to mention sampling water in the active sites of druggable proteins. I think it could work in principle, but a physics-based approach sure seems elegant by comparison. |
You bring up very good issues and perhaps I'm being too optimistic. I definitely agree that there isn't going to be one single mapping of sequence --> energy landscape any time soon or even ever.
But I think there are subproblems that are easier because the search space is more limited[1] or the chemistry is easier (e.g. avoiding chemical reactions or interactions with high-energy fields). I think the major modeling challenge is often identifying when it is feasible to take advantage of problem constraints or when lower levels of theory can be used. For example, there are a range of "enhanced sampling methods" for molecular dynamics that, e.g., constrain the simulation to a reaction coordinate, or assume Markov transitions between states so that they can be computed on a distributed cluster.
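To make the Markov-transition idea a bit more concrete, here is a toy sketch in Python/NumPy. It is my own illustration, not any particular package's API: the function names (`count_transitions`, `transition_matrix`, `implied_timescales`) are made up, and it assumes the trajectories have already been discretized into integer state labels, which is the hard representation-building step. The point is only that transition counts from independent short trajectories simply add up, which is why the sampling can be farmed out to a cluster, and that the eigenvalues of the resulting matrix give rough kinetic timescales.

```python
import numpy as np

def count_transitions(dtrajs, n_states, lag=1):
    """Sum transition counts at a fixed lag over independent discrete trajectories."""
    counts = np.zeros((n_states, n_states))
    for traj in dtrajs:
        traj = np.asarray(traj)
        # Pair each frame with the frame `lag` steps later and tally the pair.
        np.add.at(counts, (traj[:-lag], traj[lag:]), 1)
    return counts

def transition_matrix(counts):
    """Row-normalize counts into probabilities (naive estimate, no reversibility constraint)."""
    row_sums = counts.sum(axis=1, keepdims=True)
    if np.any(row_sums == 0):
        raise ValueError("every state needs at least one outgoing transition")
    return counts / row_sums

def implied_timescales(T, lag=1):
    """Timescales t_i = -lag / ln|lambda_i| for the non-stationary eigenvalues of T."""
    magnitudes = np.sort(np.abs(np.linalg.eigvals(T)))[::-1]
    # Skip the leading eigenvalue (1, the stationary process) and any zero eigenvalues.
    slow = magnitudes[1:]
    slow = slow[slow > 0]
    return -lag / np.log(slow)

# Two short, independent toy trajectories over two hypothetical states (0 and 1).
dtrajs = [[0, 0, 1, 1, 0], [1, 1, 0, 0, 0]]
T = transition_matrix(count_transitions(dtrajs, n_states=2))
timescales = implied_timescales(T)  # in units of the trajectory time step
```

A real analysis would also check that the timescales are stable as the lag is varied and would use a proper estimator, so treat this as a sketch of the bookkeeping only.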
Taking advantage of these opportunities often requires a fair amount of engineering to build appropriate representations. I wonder to what extent these representations can be learned?