| HN Mirror

	by throwaway24124 748 days ago
	Are there any good resources for understanding models like this, specifically a "protein language model"? I have a basic grasp of how LLMs tokenize and encode natural language, but what does a protein language actually look like? An LLM can produce results that look correct but are actually incorrect, so how are proteins produced by this model validated? Are the outputs run through some other software to determine whether the proteins are valid?

4 comments

cottonseed 748 days ago

Proteins are linear molecules consisting of sequences of (mostly) 20 amino acids. You can see the list of amino acids here: https://en.wikipedia.org/wiki/Amino_acid#Table_of_standard_a.... There is a standard encoding of amino acids using single letters: A for alanine, etc. Earlier versions of ESM (I haven't read the ESM3 paper yet) use one token per amino acid, plus a few control tokens (beginning of sequence, end of sequence, class token, mask, etc.). Those earlier versions were BERT-style models focused on understanding, not GPT-style generative models.
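To make the "one token per amino acid plus control tokens" idea concrete, here is a minimal sketch of such a tokenizer. The control token names, their order, and the id assignment are assumptions chosen for illustration, not the actual ESM vocabulary; only the 20 one-letter amino acid codes are standard.

```python
# Toy per-residue tokenizer: one token per amino acid plus a few control tokens.
# Token names and id assignment are illustrative assumptions,
# not the real ESM vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes
AMINO_ACID_SET = set(AMINO_ACIDS)
CONTROL_TOKENS = ["<cls>", "<pad>", "<eos>", "<unk>", "<mask>"]

VOCAB = {tok: i for i, tok in enumerate(CONTROL_TOKENS + list(AMINO_ACIDS))}
ID_TO_TOKEN = {i: tok for tok, i in VOCAB.items()}


def encode(sequence):
    """Map a protein string to token ids, wrapped in <cls> ... <eos>."""
    ids = [VOCAB["<cls>"]]
    for residue in sequence.upper():
        # Anything outside the 20 standard letters becomes <unk>.
        ids.append(VOCAB.get(residue, VOCAB["<unk>"]))
    ids.append(VOCAB["<eos>"])
    return ids


def decode(ids):
    """Turn token ids back into a residue string, dropping control tokens."""
    return "".join(ID_TO_TOKEN[i] for i in ids if ID_TO_TOKEN[i] in AMINO_ACID_SET)


def mask_position(ids, position):
    """BERT-style masking: hide one token so a model could be asked to fill it in."""
    masked = list(ids)
    masked[position] = VOCAB["<mask>"]
    return masked


if __name__ == "__main__":
    ids = encode("MKTAYIAK")
    print(ids)
    print(mask_position(ids, 3))
    print(decode(ids))
```

A BERT-style model would be trained to recover the residue hidden behind `<mask>`, which is why the vocabulary stays so small compared with a natural-language tokenizer.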


hn_throwaway_99 748 days ago

Agreed, would be interested if someone with more knowledge could comment.

My layman's understanding of LLMs is that they are essentially "fancy autocomplete". That is, you take a whole corpus of text and train the model to learn the statistical relationships between its words (more accurately, tokens), so that given a list of N tokens, the LLM predicts the most likely token at position N + 1. To generate whole sentences or paragraphs, you just repeat this process, feeding each predicted token back in as input.
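As a rough illustration of that "fancy autocomplete" loop, here is a toy sketch that counts which token follows which in a tiny corpus and then generates greedily. It is a deliberately simplified stand-in: a real LLM conditions on the whole preceding context with a neural network rather than on the last token with a count table, and the function names here are made up for the example.

```python
from collections import Counter, defaultdict


def train_bigram_counts(corpus):
    """Count how often each token is followed by each other token."""
    counts = defaultdict(Counter)
    for sequence in corpus:
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return counts


def generate(counts, prompt, max_new_tokens):
    """Greedy generation: repeatedly append the most frequent next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        if not tokens:
            break
        candidates = counts.get(tokens[-1])
        if not candidates:
            break  # the last token was never seen with a successor
        next_token, _count = candidates.most_common(1)[0]
        tokens.append(next_token)
    return tokens


if __name__ == "__main__":
    corpus = [
        ["the", "cat", "sat"],
        ["the", "cat", "ran"],
        ["the", "cat", "sat"],
    ]
    counts = train_bigram_counts(corpus)
    print(generate(counts, ["the"], max_new_tokens=5))
```

Because the functions only iterate over sequences, the same sketch works if each sequence is a string of one-letter amino acid codes instead of a list of words.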

I certainly understand encoding proteins as just a linear sequence of tokens representing their amino acids, but how does that then map to a human-language description of the function of those proteins?


changoplatanero 748 days ago

Most protein language models are not able to understand human-language descriptions of proteins. Mostly they just predict the next amino acid in a sequence and sometimes they can understand certain structured metadata tags.


_heimdall 747 days ago

Can they understand the functional impact of different protein chains, or are they just predicting what amino acid would come next based on the training set with no concern for how the protein would function?


changoplatanero 748 days ago

The way you would use a protein language model is different from how you would use a regular LLM like ChatGPT. Normally, you aren't looking for one correct answer to your query; rather, you would like thousands of ideas to try out in the lab. Biologists have techniques for trying out thousands or tens of thousands of proteins in a lab and filtering them down to the single candidate that's the best solution to whatever they are trying to achieve.


sapsan 748 days ago

I recently saw this about AlphaFold: https://elanapearl.github.io/blog/2024/the-illustrated-alpha.... I don't think it's going to answer all your questions, but it might still help!
