by jostmey 2605 days ago
And predicting protein function is not that hard either. The ground truth labels are often determined by sequence alignment similarity, not by experiment. So the results are far from profound.
1 comment

Doing it right is quite hard. Doing it usefully is even harder [1]. Getting a good training set without too many biases is the really hard part. Generating a ground truth that is actually a truth is very expensive.

I have to read the paper carefully again, but for the contact point prediction I think the training set will cover most of the data used in the validation. This is due to the way PDB "sequences" are distributed over UniParc, as well as how PDB 3D structures are generated experimentally: there are 120,000 PDB-related sequences in UniParc, but they cover only 45,000 entries in UniProtKB, because PDB-derived sequences are rarely full length, often mutated, and highly duplicative in coverage.
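
To make the overlap concern concrete, here is a rough Python sketch of how one might check for it. Everything in it is an assumption for illustration: the chain IDs, the accession labels and the mapping are made up, and I have not checked how the paper actually built its split. The idea is to group PDB-derived chains by the UniProtKB entry they map to, then split by entry rather than by chain.

```python
import hashlib
from collections import defaultdict

# Hypothetical mapping from PDB-derived chains to UniProtKB entries.
# Several chains (fragments, mutants, duplicates) can map to one entry.
chain_to_accession = {
    "chain1": "ACC_A", "chain2": "ACC_A", "chain3": "ACC_A",
    "chain4": "ACC_B", "chain5": "ACC_B",
    "chain6": "ACC_C",
}

def leaked_accessions(train, val, mapping):
    """Entries that are represented in both the training and validation sets."""
    return {mapping[c] for c in train} & {mapping[c] for c in val}

def split_by_accession(mapping, val_fraction=0.2):
    """Assign whole entries (not single chains) to train or validation."""
    groups = defaultdict(list)
    for chain, accession in mapping.items():
        groups[accession].append(chain)
    train, val = [], []
    for accession, chains in sorted(groups.items()):
        # Deterministic bucket in [0, 1) derived from the accession string.
        bucket = hashlib.sha256(accession.encode()).digest()[0] / 256
        (val if bucket < val_fraction else train).extend(chains)
    return train, val

# Naive split by chain: every other chain goes to validation.
chains = sorted(chain_to_accession)
naive_train, naive_val = chains[::2], chains[1::2]
naive_leak = leaked_accessions(naive_train, naive_val, chain_to_accession)

# Grouped split: no entry can end up on both sides.
train, val = split_by_accession(chain_to_accession)
grouped_leak = leaked_accessions(train, val, chain_to_accession)
```

Note that this only catches chains that map to the same entry. Homologous proteins with different accessions would still leak, so clustering by sequence identity before splitting would be the stricter check.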

[1] Predicting the root GO terms will give you an insane TP/FP rate but is completely useless.
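
A toy Python illustration of footnote [1], as a sketch only. The three root IDs are the real roots of the GO aspects, but the protein names and their annotations are made up, and the scoring is a plain term-level count rather than any official benchmark metric.

```python
# Roots of the three GO aspects:
# biological_process, molecular_function, cellular_component.
ROOTS = {"GO:0008150", "GO:0003674", "GO:0005575"}

# Hypothetical annotations, already propagated up to the roots.
truth = {
    "protA": ROOTS | {"GO:0016301"},  # kinase activity
    "protB": ROOTS | {"GO:0003677"},  # DNA binding
    "protC": ROOTS | {"GO:0005634"},  # nucleus
}

def predict_roots(protein):
    """A 'predictor' that ignores the protein and always returns the roots."""
    return set(ROOTS)

def score(predict, truth):
    """Count term-level true positives, false positives and false negatives."""
    tp = fp = fn = 0
    for protein, terms in truth.items():
        predicted = predict(protein)
        tp += len(predicted & terms)
        fp += len(predicted - terms)
        fn += len(terms - predicted)
    return tp, fp, fn

tp, fp, fn = score(predict_roots, truth)
precision = tp / (tp + fp)

# Correct predictions that say anything beyond "this protein has a function".
informative_hits = sum(
    len((predict_roots(p) & terms) - ROOTS) for p, terms in truth.items()
)
```

On this toy data the always-root predictor has no false positives at all, yet `informative_hits` is zero: it never gets a single specific term right.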