by mxwsn 784 days ago
It's not an English LLM, but a "protein" language model, where tokens represent amino acids or nucleotides. Training a transformer language model on such data simply learns a distribution over sequences of tokens. Conceptually it's a fine approach, in many ways the "right" or most elegant method, and not a stretch at all.
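To make "a distribution over sequences of tokens" concrete, here is a minimal sketch. It swaps the transformer for a plain bigram model, which is far weaker but has the same interface: one token per amino acid, and a probability for each next token given what came before. The training sequences, the `<`/`>` boundary markers, and the smoothing constant are all made up for illustration.

```python
import math
from collections import Counter

# The 20 standard amino acids, one token per residue (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BOS, EOS = "<", ">"  # hypothetical start/end-of-sequence markers


def train_bigram(sequences, alpha=1.0):
    """Estimate P(next token | previous token) with add-alpha smoothing."""
    vocab = list(AMINO_ACIDS) + [EOS]
    counts = {prev: Counter() for prev in list(AMINO_ACIDS) + [BOS]}
    for seq in sequences:
        tokens = [BOS] + list(seq) + [EOS]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev, ctr in counts.items():
        total = sum(ctr.values()) + alpha * len(vocab)
        probs[prev] = {tok: (ctr[tok] + alpha) / total for tok in vocab}
    return probs


def log_prob(probs, seq):
    """Log-probability of a whole sequence under the chain rule."""
    tokens = [BOS] + list(seq) + [EOS]
    return sum(math.log(probs[p][n]) for p, n in zip(tokens, tokens[1:]))


# Toy training set (invented sequences, not real proteins).
model = train_bigram(["MKTAYIAKQR", "MKTLLVAGAR", "MKVAYIAKQL"])
seen_like = log_prob(model, "MKTAYIAKQR")
unlikely = log_prob(model, "WWWWWWWWWW")
```

A transformer replaces the lookup table with a network that conditions on the whole prefix rather than just the previous token, but the quantity being learned is the same kind of object.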
3 comments

I enjoyed the feeling when I made this connection talking with a startup doing this a while back. It's just a different "language" and although it's not a given that LLMs can operate in it, it's a reasonable thing to try, and it turns out they can.
Personally I think it was obvious that LLMs were going to be useful for protein modelling since the previous generation used HMMs very successfully. Pfam (a library of HMMs for classifying proteins into preexisting known families) is one of the most important resources we have because of the power of HMMs to model sequential language.
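For anyone who hasn't met HMMs, a rough sketch of how one scores a sequence follows. This is a plain two-state HMM with the forward algorithm, not the profile HMMs Pfam actually uses, and every state name and probability here is invented for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STATES = ("helix_like", "loop_like")  # hypothetical hidden states

# Made-up start and transition probabilities.
START = {"helix_like": 0.5, "loop_like": 0.5}
TRANS = {
    "helix_like": {"helix_like": 0.9, "loop_like": 0.1},
    "loop_like": {"helix_like": 0.2, "loop_like": 0.8},
}


def make_emissions(favoured, boost=3.0):
    """Emission distribution that up-weights a set of favoured residues."""
    weights = {aa: (boost if aa in favoured else 1.0) for aa in AMINO_ACIDS}
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}


# Illustrative residue preferences only, not fitted to any data.
EMIT = {
    "helix_like": make_emissions("AELMQKR"),
    "loop_like": make_emissions("GPNDS"),
}


def forward_log_likelihood(seq):
    """log P(seq), summed over all hidden-state paths (scaled forward pass)."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    alpha = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    log_like = 0.0
    for residue in seq[1:]:
        # Rescale at each step to avoid underflow on long sequences.
        scale = sum(alpha.values())
        log_like += math.log(scale)
        alpha = {s: a / scale for s, a in alpha.items()}
        alpha = {
            s: EMIT[s][residue] * sum(alpha[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return log_like + math.log(sum(alpha.values()))
```

Classifying a protein into a family then amounts to scoring it under each family's model and picking the best, which is roughly the role an LLM's sequence likelihood can play as well.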

I suspect we will need to move from sequential modelling to graphical modelling to level-up again, though.

> I suspect we will need to move from sequential modelling to graphical modelling to level-up again, though.

Out of curiosity, would you mind elaborating on this?

I don't work in the field so I'm probably just repeating something Hinton already said, but it seems to me that attempting to model things in reality that have graph-like structures (like interacting pairs of residues in a 3D protein structure) using sequences with finite context lengths is ultimately going to be less efficient than modelling graphs. My guess is this work roughly describes what I'm thinking of: https://www.cis.upenn.edu/~mkearns/papers/barbados/jordan-tu...

It could also be that I completely misunderstand context in sequential models, and that what I'm describing is already being used, or has been evaluated and found unsuccessful.
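A small sketch of what "graph-like" means here, assuming one 3D coordinate per residue and a distance cutoff for calling two residues "in contact". The 8 Å cutoff, the minimum sequence separation, and the toy hairpin coordinates are all assumptions for illustration, not taken from any real structure.

```python
import math


def contact_edges(coords, cutoff=8.0, min_separation=3):
    """Return pairs (i, j) of residues closer than `cutoff` in space,
    skipping pairs that are trivially close along the chain."""
    edges = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                edges.append((i, j))
    return edges


# Toy hairpin: residues 0..3 run out along x, residues 4..7 run back
# alongside them, 5 units away in y (made-up coordinates).
coords = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
    (11.4, 5.0, 0.0), (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0),
]
edges = contact_edges(coords)

# Largest gap along the chain between any two residues that touch in space.
longest_range = max(j - i for i, j in edges)
```

In this toy fold the first and last residues end up as neighbours in the graph even though they sit at opposite ends of the sequence, which is the kind of relationship a sequence model has to recover through its context window rather than read off directly.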

> Learning a transformer language model on such data simply learns a distribution over sequences of tokens.

If statistical distributions can model higher-level polypeptide structure, then it could be useful.

Yeah, but your training is bottlenecked by the lack of ground truth. Some things were (I presume) or will be easy to do with LLMs, like protein structure, because every part of every protein is source data (and there are millions of known structures). But suppose you want to estimate clearance, or LD50. For how many proteins do we know the serum clearance? 1,000? 10,000, maybe?