After decades of effort, 17% of the total residues in human protein sequences are covered by an experimentally determined structure [1]. Here we dramatically expand structural coverage by applying the state-of-the-art machine learning method, AlphaFold2, at scale to almost the entire human proteome (98.5% of human proteins). The resulting dataset covers 58% of residues with a confident prediction, of which a subset (36% of all residues) have very high confidence.
The metric they use (residues) is a bit unusual; I would have counted proteins instead. I assume they wanted to account for partial coverage, such as proteins where only part of the structure has been solved.
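To make the difference between the two metrics concrete, here is a minimal sketch with made-up numbers. The protein names, lengths, and covered-residue counts are hypothetical and not taken from the paper. It only illustrates how counting residues and counting proteins can diverge when structures are partial.

```python
# Hypothetical toy proteome: name -> (length in residues, residues covered by a structure).
# These values are invented for illustration only.
proteins = {
    "A": (100, 100),  # fully covered
    "B": (400, 40),   # only one small domain solved
    "C": (500, 0),    # no structure at all
}


def residue_coverage(proteome):
    """Fraction of all residues that fall inside some structure."""
    total = sum(length for length, _ in proteome.values())
    covered = sum(cov for _, cov in proteome.values())
    return covered / total


def protein_coverage(proteome):
    """Fraction of proteins with at least one covered residue."""
    hits = sum(1 for _, cov in proteome.values() if cov > 0)
    return hits / len(proteome)


print(f"residue-level coverage: {residue_coverage(proteins):.1%}")
print(f"protein-level coverage: {protein_coverage(proteins):.1%}")
```

In this toy case, two of three proteins count as "covered" at the protein level, while only 140 of 1000 residues are covered, which is the kind of gap a residue-based metric exposes.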
One of the reasons we don't have them all is that individual genes can encode multiple protein isoforms through alternative splicing, and AlphaFold was only run on one isoform per gene. Beyond that, there are many important biochemical and biophysical processes that affect structure, as cells are only about 50% protein by dry weight.
---
https://www.nature.com/articles/s41586-021-03828-1