by throwaway24124 700 days ago
Are there any good resources for understanding models like this? Specifically a "protein language model". I have a basic grasp of how LLMs tokenize and encode natural language, but what does a protein language actually look like? An LLM can produce results that look correct but are actually incorrect, so how are proteins produced by this model validated? Are the outputs run through some other software to determine whether the proteins are valid?
4 comments

Proteins are linear molecules consisting of sequences of (mostly) 20 amino acids. You can see the list of amino acids here: https://en.wikipedia.org/wiki/Amino_acid#Table_of_standard_a.... There is a standard encoding of amino acids using single letters: A for alanine, and so on. Earlier versions of ESM (I haven't read the ESM3 paper yet) use one token per amino acid, plus a few control tokens (beginning of sequence, end of sequence, class token, mask, etc.). Those earlier versions were BERT-style models focused on understanding, not GPT-style generative models.
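To make the one-token-per-amino-acid idea concrete, here is a minimal sketch in Python. The control token names and the id ordering are assumptions chosen for illustration; they are not taken from ESM or any other real model's vocabulary.

```python
# The 20 standard amino acids in single-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical control tokens; real models define their own names and ids.
SPECIAL_TOKENS = ["<cls>", "<pad>", "<eos>", "<unk>", "<mask>"]

# Map every token to an integer id: control tokens first, then residues.
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}


def encode(sequence):
    """Turn a protein string into ids, wrapped in start and end tokens."""
    ids = [VOCAB["<cls>"]]
    for residue in sequence.upper():
        # Anything outside the 20 standard letters becomes <unk>.
        ids.append(VOCAB.get(residue, VOCAB["<unk>"]))
    ids.append(VOCAB["<eos>"])
    return ids


def mask_position(ids, position):
    """Return a copy with one position hidden, as in BERT-style training."""
    masked = list(ids)
    masked[position] = VOCAB["<mask>"]
    return masked


ids = encode("MKTAYIAK")
masked = mask_position(ids, 3)
```

A BERT-style model would be trained to recover the residue hidden behind the mask token from the rest of the sequence, which is different from predicting the next token left to right.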
Agreed, would be interested if someone with more knowledge could comment.

My layman's understanding of LLMs is that they are essentially "fancy autocomplete". That is, you take a whole corpus of text, then train the model to learn the statistical relationships between those words (more accurately, tokens), so that given a list of N tokens, the LLM will find the most likely token for position N + 1. To generate whole sentences or paragraphs, you just repeat this process, feeding each new token back in as input.
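The "fancy autocomplete" loop can be sketched with a toy model. This is a deliberately simplified stand-in: a bigram counter that looks only at the last token, whereas a real LLM conditions on the whole context with a neural network. The function names and the tiny corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for tokens in corpus:
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, prompt, max_new_tokens):
    """Repeatedly append the most likely next token (greedy decoding)."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        followers = counts.get(tokens[-1])
        if not followers:
            break  # nothing ever followed this token in the corpus
        tokens.append(followers.most_common(1)[0][0])
    return tokens


corpus = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "cat", "sat"],
]
model = train_bigram(corpus)
completion = generate(model, ["the"], max_new_tokens=5)
```

The same loop works unchanged if the tokens are amino acid letters instead of words. Real systems usually sample from the predicted distribution rather than always taking the single most likely token.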

I certainly understand encoding proteins as just a linear sequence of tokens representing their amino acids, but how does that then map to a human-language description of the function of those proteins?

Most protein language models are not able to understand human-language descriptions of proteins. Mostly they just predict the next amino acid in a sequence and sometimes they can understand certain structured metadata tags.
Can they understand the functional impact of different protein chains, or are they just predicting what amino acid would come next based on the training set with no concern for how the protein would function?
The way you would use a protein language model is different from how you would use a regular LLM like ChatGPT. Normally, you aren't looking for one correct answer to your query; rather, you want thousands of ideas to try out in the lab. Biologists have techniques for trying out thousands or tens of thousands of proteins in a lab and filtering them down to the single candidate that's the best solution to whatever they are trying to achieve.
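As a rough sketch of that generate-many-then-filter workflow: the code below is purely illustrative. The sampler draws uniformly random residues instead of sampling from a trained model, and the score is an arbitrary placeholder (fraction of hydrophobic residues) standing in for a real lab measurement. Both are assumptions, not anything a specific tool does.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sample_candidates(n, length, seed=0):
    """Stand-in for model sampling: n uniformly random sequences."""
    rng = random.Random(seed)
    return [
        "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        for _ in range(n)
    ]


def placeholder_score(sequence):
    """Hypothetical stand-in for an assay: fraction of hydrophobic residues."""
    return sum(residue in "AILMFVW" for residue in sequence) / len(sequence)


def screen(candidates, score, keep):
    """Rank candidates by score and keep the top few."""
    return sorted(candidates, key=score, reverse=True)[:keep]


candidates = sample_candidates(n=1000, length=30)
shortlist = screen(candidates, placeholder_score, keep=10)
best = shortlist[0]
```

The point is the shape of the pipeline: the model only has to propose plausible candidates, and the screening step, not the model, decides which one actually works.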
I recently saw this about AlphaFold: https://elanapearl.github.io/blog/2024/the-illustrated-alpha.... I don't think it's going to answer all your questions, but it might still help!