by lysozyme
1182 days ago
It’s a fun time to be interested in AI for proteins. Every new ML model type is inevitably tried out on proteins. As the functional molecules of life, proteins are uniquely important and fundamental to every process in biology. As the targets for every drug, the tools for every cellular job, and the squishy, wiggly, moving and alive parts of living things, proteins present both exciting possibilities and deep technical challenges for those who design them.

A protein can be understood simply as a string of letters, about 300 long, using the alphabet ACDEFGHIKLMNPQRSTVWY. This turns out to be a great representation for sequence models like transformers. One big public database, UniProt, has 200 million protein sequences you can train your model on. The very largest plain transformer models trained on protein sequences (analogous to plain text) are about 15B parameters (I am thinking of Meta AI’s ESM-2 [1]). These can do for protein sequences what LLMs do for text: they can “fill in the blank” to design variations, generate new proteins that look like their training data, and tell you how likely it is that a given sequence exists.

Some cool variations of transformers have applications for protein design, like the now-famous SE(3) equivariant transformer used in the structure prediction module of AlphaFold [2], now appearing in the research paper [3] accompanying TFA, as well as the message passing model ProteinMPNN [4], which builds on a neighbor graph-structured transformer [5].

1. https://github.com/facebookresearch/esm
2. https://github.com/deepmind/alphafold
3. https://www.biorxiv.org/content/10.1101/2022.12.09.519842v2
4. https://github.com/dauparas/ProteinMPNN
5. https://github.com/jingraham/neurips19-graph-protein-design
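To make the “string of letters” view concrete, here is a minimal sketch in plain Python of how a sequence might be turned into tokens and how one position might be blanked out for a model to fill in. None of this is how ESM-2 or any of the linked projects actually tokenize: the `<mask>` token name, the helper functions, and the example peptide are all made up for illustration, and the uniform "model" is only a placeholder for a trained one.

```python
import math

# The 20 standard amino acids, in the order given in the comment above.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
MASK = "<mask>"  # hypothetical mask token name, chosen for illustration

# Map each residue letter to an integer id; the mask token gets the next id.
TOKEN_TO_ID = {aa: i for i, aa in enumerate(ALPHABET)}
TOKEN_TO_ID[MASK] = len(ALPHABET)


def encode(sequence):
    """Turn a protein string into a list of integer token ids."""
    return [TOKEN_TO_ID[aa] for aa in sequence]


def mask_position(token_ids, position):
    """Return a copy of token_ids with one position replaced by the mask id."""
    masked = list(token_ids)
    masked[position] = TOKEN_TO_ID[MASK]
    return masked


def uniform_log_likelihood(sequence):
    """Log-likelihood under a placeholder model where all 20 residues are equally likely.

    A trained model would replace this with learned per-position probabilities.
    """
    return len(sequence) * math.log(1.0 / len(ALPHABET))


# Made-up short peptide, not a real protein.
peptide = "MKTAYIAKQR"
ids = encode(peptide)
masked_ids = mask_position(ids, 3)  # blank out the fourth residue

print(ids)
print(masked_ids)
print(uniform_log_likelihood(peptide))
```

A real protein language model would take something like `masked_ids` as input and return a probability for each of the 20 letters at the blanked position, which is where both the “fill in the blank” designs and the sequence likelihoods come from.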