AlphaFold very explicitly (unless something has changed) removes NMR structures as references because they are not accurate enough. I have a PhD in NMR biomolecular structure and I wouldn't trust the structures for anything.
Hmm, I would say it's always worth sharing knowledge. Could you paste some links or maybe type a few keywords for anyone willing to research the topic further on their own?
Looking at the supplementary material (section 2.5.4) for the AlphaFold 3 paper it reads to me like they still use NMR structures for training, but not for evaluating performance of the model.
I think it's implicit in their description of filtering the training set, where they say they only include structures with a resolution below 9 Å. NMR structures don't really have a resolution; that's more specific to crystallography. However, I can't actually verify that no NMR structures were included without directly inspecting their list of selected structures.
I think it is very plausible that they don't use NMR structures here, but I was looking for a specific statement on it in the paper. Your guess is reasonable, but I don't think the paper is clear enough here to be sure about this interpretation.
Yes, thanks for calling that out. While verifying my statement I actually got confused, because you can see they filter NMR out of the eval set (saying so explicitly) but don't say that in the training set section. IMHO they should be required to publish the actual selection script so we can inspect the results.
> Input mmCIFs are restricted to have resolution less than 9 Å. This is not a very restrictive filter and only removes around 0.2% of structures
NMR structures make up more than 0.2% of the PDB, so that doesn't fit the assumption that they implicitly remove NMR structures here. But if I filter by resolution on the PDB homepage, it does remove essentially all NMR structures. I'm really not sure what to think here; the description is too vague to know exactly what they did.
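To make the ambiguity concrete, here is a minimal sketch of why the same "resolution less than 9 Å" rule can give opposite results for NMR entries. It is not AlphaFold's selection code. The `Entry` records, their IDs, and the `keep_missing` flag are all made up for illustration, and it assumes NMR entries simply carry no resolution value.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_RESOLUTION_A = 9.0  # threshold quoted from the supplement


@dataclass
class Entry:
    pdb_id: str                  # hypothetical identifier, not a real PDB ID
    method: str                  # experimental method label
    resolution: Optional[float]  # None when the entry reports no resolution


def passes_filter(entry: Entry, keep_missing: bool) -> bool:
    """Apply 'resolution < 9 Å'; keep_missing decides entries with no value."""
    if entry.resolution is None:
        return keep_missing
    return entry.resolution < MAX_RESOLUTION_A


def kept_ids(entries: List[Entry], keep_missing: bool) -> List[str]:
    return [e.pdb_id for e in entries if passes_filter(e, keep_missing)]


# Toy data: one crystal structure, one very low-resolution EM map, one NMR entry.
entries = [
    Entry("xtal1", "X-RAY DIFFRACTION", 2.1),
    Entry("cryo1", "ELECTRON MICROSCOPY", 12.0),
    Entry("nmr1", "SOLUTION NMR", None),
]

print(kept_ids(entries, keep_missing=True))   # NMR entry survives the filter
print(kept_ids(entries, keep_missing=False))  # NMR entry is dropped
```

If missing values pass, only the rare very low-resolution entries are removed, which would be consistent with a figure like 0.2%. If missing values fail, NMR goes too, which matches what a resolution filter on the PDB website appears to do. The paper's wording doesn't say which behaviour applies.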
interesting observation and experience. must have made thesis development complex, assuming the realization dawned on you during the phd.
what do you trust more than NMR?
AF's dependence on MSAs also seems sub-optimal; curious to hear your thoughts?
that said, it's understandable why they used MSAs, even if it seems to hint at winning CASP more than developing a generalizable model.
arguably, MSA-dependence is the wise choice for early prediction models as demonstrated by widespread accolades and adoption, i.e., it's an MVP with known limitations as they build toward sophisticated approaches.
My realizations happened after my PhD. When I was writing my PhD I still believed we would solve the protein folding and structure prediction problems using classical empirical force fields.
It wasn't until I started my postdocs, where I started learning about protein evolutionary relationships (and competing in CASP), that I changed my mind.
I wouldn't frame it so much as "multiple sequence alignments"; those are just tools to express protein relationships in a structured way.
If AlphaFold now, or in the future, requires no evolutionary relationships based on sequence (UniProt) and can work entirely by training on just the proteins in PDB (many of which are evolutionarily related) and still be able to predict novel folds, it will be very interesting times. The one thing I have learned is that evolutionary knowledge makes many hard problems really easy, because you're taking advantage of billions of years of nature and an easy readout.