by jinto36 1265 days ago

AlphaFold is a big improvement, but the structure of a single protein in isolation isn't representative of how these things exist in vivo. Binding substrates can change a protein's shape, and proteins often function in complexes, which can form some pretty intricate arrangements where positioning is critical to function.

I think training set bias is an issue to some extent, even with single-protein prediction. For example, I've been looking at a family of transcription factors, and most of the resolved crystal structures are of just the DNA-binding domain, crystallized with the substrate (DNA) bound. AlphaFold predictions for homologous proteins that haven't been experimentally resolved but share a decent amount of sequence similarity thus have high confidence for the DNA-binding domain but lower confidence in other parts of the protein, even if they're "ordered" regions (e.g. helices and sheets rather than floppy loops), and all the predictions for the DNA-binding domain look like the bound-to-DNA conformation. So we don't yet have a good way to predict the different "modes" of a protein that has interaction-dependent conformations.

Technically, if you wanted to model a protein that had similar structures experimentally resolved both with and without substrates bound, and you were interested in sampling just one of those states, you could customize your sequence database to include only one or the other, which would mostly be manual curation.
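To put a number on the "high confidence in one domain, lower elsewhere" pattern, here is a minimal sketch that averages per-residue confidence over residue ranges. It relies on the fact that AlphaFold's PDB output stores pLDDT in the B-factor column; the function name, the default chain ID, and any region boundaries you pass in are illustrative assumptions, not part of AlphaFold itself.

```python
def mean_plddt_by_region(pdb_text, regions, chain="A"):
    """Average per-residue pLDDT over named residue ranges.

    `regions` maps a label to an inclusive (start, end) residue range.
    Only C-alpha atoms are read, so each residue is counted once.
    The chain ID default "A" is an assumption about the input file.
    """
    plddt = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        # Fixed-width PDB columns: atom name 13-16, chain ID 22
        if line[12:16].strip() != "CA" or line[21] != chain:
            continue
        resnum = int(line[22:26])            # residue number, columns 23-26
        plddt[resnum] = float(line[60:66])   # B-factor column holds pLDDT

    result = {}
    for label, (start, end) in regions.items():
        values = [plddt[r] for r in range(start, end + 1) if r in plddt]
        # None marks a region with no residues in the file
        result[label] = sum(values) / len(values) if values else None
    return result
```

Called with something like `{"dna_binding": (1, 90), "rest": (91, 300)}` (made-up boundaries) on the text of a predicted model, a large gap between the two averages would be consistent with the training-coverage effect described above, though it can't by itself distinguish that from genuine disorder.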

I've been testing out the multimer (protein complex) mode of AlphaFold recently, to see if it could predict interactions for a family of proteins where some members are known to form complexes, but others were previously found not to form complexes, at least when expressed in vitro rather than in vivo. So far I've found that if you throw two completely unrelated proteins together, they won't be modeled with any contacts, but for the ones in the family I'm interested in, there's always at least one (of the five models per run) that has them interacting such that there's something that looks like a real DNA-binding domain. For the latter case, it's presently hard to know from AlphaFold output alone whether it's a structure that could actually form or whether it's just due to bias in the training data, with the rest of the structured regions of the protein perhaps being modeled in unrealistic conformations because there's less training information for those parts.
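One rough way to check "modeled with any contacts" across the five models is to count residue pairs from different chains that sit close together. The sketch below is only an illustration: the 8 Å C-alpha cutoff is a commonly used convention and not anything AlphaFold defines, and the function names and the chain IDs "A" and "B" are assumptions about the input.

```python
import math


def ca_coords(pdb_text, chain):
    """Collect C-alpha (x, y, z) coordinates for one chain of a PDB file."""
    coords = []
    for line in pdb_text.splitlines():
        if (line.startswith("ATOM")
                and line[12:16].strip() == "CA"
                and line[21] == chain):
            # Fixed-width PDB columns: x 31-38, y 39-46, z 47-54
            coords.append((float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])))
    return coords


def count_interchain_contacts(pdb_text, chain_a="A", chain_b="B", cutoff=8.0):
    """Count C-alpha pairs, one from each chain, within `cutoff` angstroms.

    The cutoff and chain IDs are assumptions; adjust them to your models.
    """
    a = ca_coords(pdb_text, chain_a)
    b = ca_coords(pdb_text, chain_b)
    return sum(1 for p in a for q in b if math.dist(p, q) <= cutoff)
```

Running this on each of the five models from a run gives a quick contact count per model. A nonzero count only says the chains were placed next to each other; it doesn't say whether that interface could actually form.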

TL;DR: AlphaFold results are biased by existing experimentally resolved structures and are not based on simulating physics, so proteins, or parts of proteins, that don't have good coverage in existing experimental data are not going to be predicted with high confidence.