Hacker News
by miraculixx 784 days ago
Reading their blog post, I wonder if an LLM is really the best way to do this. If I got it right, they used the LLM to enumerate potential protein DNA sequences. Does that really need an LLM? Enumeration is not novel, nor are LLMs particularly good at it. If you want to computationally parallelize the search in a large enumeration space, it would be much easier to simply, well, do that instead of taking a detour via a statistical parrot.

In a nutshell this sounds more like a case of "we wanted something with AI in the title".

7 comments

It's not an English LLM, but a "protein" language model, where tokens represent amino acids or nucleotides. Learning a transformer language model on such data simply learns a distribution over sequences of tokens. It's a fine approach conceptually that in many ways is the "right" way or most elegant method, and not a stretch at all.
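To make "a distribution over sequences of tokens" concrete, here is a minimal sketch. It is a smoothed bigram model, not a transformer, and the training strings are made up for illustration; the only point is that the model assigns a probability to any amino-acid string, and that sequences resembling the training data score higher.

```python
import math
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one token each
START, END = "^", "$"                 # hypothetical boundary tokens


def train_bigram(sequences, alpha=1.0):
    """Estimate P(next token | previous token) with add-alpha smoothing."""
    counts = defaultdict(Counter)
    for seq in sequences:
        tokens = [START] + list(seq) + [END]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    vocab = list(AMINO_ACIDS) + [END]
    model = {}
    for prev in [START] + list(AMINO_ACIDS):
        total = sum(counts[prev].values()) + alpha * len(vocab)
        model[prev] = {t: (counts[prev][t] + alpha) / total for t in vocab}
    return model


def log_prob(model, seq):
    """Log-probability of a whole sequence under the bigram model."""
    tokens = [START] + list(seq) + [END]
    return sum(math.log(model[p][n]) for p, n in zip(tokens, tokens[1:]))


# Made-up "family" of similar sequences, purely for illustration.
toy_family = ["MKTAYIAK", "MKTAYLAK", "MKSAYIAK"]
model = train_bigram(toy_family)

print(log_prob(model, "MKTAYIAK"))  # resembles the training data
print(log_prob(model, "WWWWWWWW"))  # does not, so it scores lower
```

A transformer replaces the bigram table with a learned function of the whole preceding context, but the object being learned is the same kind of thing.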
I enjoyed the feeling when I made this connection talking with a startup doing this a while back. It's just a different "language" and although it's not a given that LLMs can operate in it, it's a reasonable thing to try, and it turns out they can.
Personally I think it was obvious that LLMs were going to be useful for protein modelling since the previous generation used HMMs very successfully. Pfam (a library of HMMs for classifying proteins into preexisting known families) is one of the most important resources we have because of the power of HMMs to model sequential language.

I suspect we will need to move from sequential modelling to graphical modelling to level-up again, though.

> I suspect we will need to move from sequential modelling to graphical modelling to level-up again, though.

Out of curiosity, would you mind elaborating on this?

I don't work in the field so I'm probably just repeating something Hinton already said, but it seems to me that attempting to model things in reality that have graph-like structures (like interacting pairs of residues in a 3D protein structure) using sequences with finite context lengths is ultimately going to be less efficient than modelling graphs. My guess is this work roughly describes what I'm thinking of: https://www.cis.upenn.edu/~mkearns/papers/barbados/jordan-tu...

It could also be that I completely misunderstand context in sequential models and what I'm describing is already being used, or has been evaluated and has been unsuccessful.

> Learning a transformer language model on such data simply learns a distribution over sequences of tokens.

If statistical distributions can model higher level polypeptide structure, then it could be useful.

Yeah, but your training is bottlenecked by the lack of ground truth. Some things were (I presume) or will be easy to do with LLMs, like protein structure, because every part of every protein is source data (and there are millions of known structures). But suppose you want to estimate clearance, or LD50. For how many proteins do we know the serum clearance? 1000? 10000 maybe?
I don't have a direct answer to your question. My guess is that LLMs are too limited to make truly great solutions in biology but sequential modelling is a key component that will not be replaced any time soon. For example, transformers were key to AlphaFold's success, but they still needed many other steps to make accurate predictions.

I worked on a predecessor to LLMs: HMMs for protein modelling. They were, and for most people still are, the best way to model protein sequences. It's usually done as prediction rather than generation (i.e., you use the model to classify an unknown sequence into a known category, rather than asking the model to generate new instances of a category). HMMs for proteins are a bit stuffy, and they model local changes well, but struggle with long-range interactions that LLMs seem to excel at (for example, an HMM will do a good job of letting you stuff a few more residues into a protein in a localized region such as a hinge, but is not so great at modelling groups of residues that are located far apart in sequence space but close in protein space).
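For anyone who hasn't seen the "prediction rather than generation" use, here is a rough sketch of classifying a sequence by asking which family's HMM gives it the higher likelihood (the forward algorithm). This is a plain two-state HMM over a made-up two-letter alphabet (H = hydrophobic, P = polar), not a Pfam-style profile HMM; the family names, state names, and every probability are invented for illustration.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(seq, hmm):
    """Log P(seq | hmm), summed over all hidden-state paths."""
    start, trans, emit = hmm["start"], hmm["trans"], hmm["emit"]
    states = list(start)
    # alpha[s] = log P(symbols so far, current hidden state = s)
    alpha = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    for symbol in seq[1:]:
        alpha = {
            s: log_sum_exp([alpha[p] + math.log(trans[p][s]) for p in states])
            + math.log(emit[s][symbol])
            for s in states
        }
    return log_sum_exp(list(alpha.values()))


# Both toy families share emissions: "core" prefers H, "loop" prefers P.
EMIT = {"core": {"H": 0.9, "P": 0.1}, "loop": {"H": 0.3, "P": 0.7}}

FAMILIES = {
    # family_A stays mostly in the hydrophobic core state
    "family_A": {
        "start": {"core": 0.8, "loop": 0.2},
        "trans": {"core": {"core": 0.9, "loop": 0.1},
                  "loop": {"core": 0.3, "loop": 0.7}},
        "emit": EMIT,
    },
    # family_B stays mostly in the polar loop state
    "family_B": {
        "start": {"core": 0.2, "loop": 0.8},
        "trans": {"core": {"core": 0.5, "loop": 0.5},
                  "loop": {"core": 0.1, "loop": 0.9}},
        "emit": EMIT,
    },
}


def classify(seq):
    """Return the name of the family whose HMM best explains seq."""
    return max(FAMILIES, key=lambda name: forward_log_likelihood(seq, FAMILIES[name]))


print(classify("HHHHHH"))
print(classify("PPPPPP"))
```

Note that each state only sees the previous state, which is the structural reason these models handle local insertions well and long-range residue pairs poorly.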

One detail of the bitter lesson is, imho, that statistical parrots are better than they "should" be, probably for the same reason that mathematics is unexpectedly proficient in modelling physics: to some degree, the models recapitulate the true latent space of the underlying system well enough to generalize outside the original observation space.

I think your intuition is off here. The number of sequences to enumerate is much greater than the number of atoms in the universe. You need a smart way to enumerate these and that's what the LLM is for. The statistical parrot is not a detour; it's a shortcut.
The real power of LLMs is they can model anything as a “language” given the right sequence training data.

Warning: the following is my opinion.

In the same way that MLP “neurons” are universal approximators, it seems that LLMs are universal mappers.

They have the potential to help us organize and translate the immense quantity of data being generated by modern methods in all respective disciplines. We might create a model that translates English to protein synthesis, and vice versa, which would be pretty useful given my lay understanding of biochem.

To your point - this probably is NOT the best way to do this in an objective sense. But to my mind we are hitting upper limits as finite beings and need things like this, which utilize native language constructs, to move forward.

First, the search space is way too large for brute-force enumeration. We’re talking like 10^300 combinations. Also, the hard part isn’t just listing amino acid sequences, it’s finding ones that do what you want them to. The only way to figure that out is by testing them, which is difficult and expensive. So you need an algorithm that is good at only listing sequences that are likely to work. That’s precisely what LLMs are good at: finding patterns and sequences that are correlated in a useful way.
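The arithmetic behind those numbers is easy to check. A back-of-the-envelope sketch, assuming 20 standard amino acids per position and the commonly quoted order-of-magnitude estimate of 10^80 atoms in the observable universe (both are my assumptions, not figures from the blog post):

```python
import math

ATOMS_IN_UNIVERSE_LOG10 = 80  # rough, commonly cited estimate (assumption)


def log10_sequence_count(length, alphabet_size=20):
    """log10 of the number of distinct sequences of the given length."""
    return length * math.log10(alphabet_size)


def shortest_length_exceeding(log10_target, alphabet_size=20):
    """Smallest length whose sequence count exceeds 10**log10_target."""
    length = 1
    while log10_sequence_count(length, alphabet_size) <= log10_target:
        length += 1
    return length


for length in (50, 100, 231):
    print(f"length {length}: about 10^{log10_sequence_count(length):.1f} sequences")

print(shortest_length_exceeding(ATOMS_IN_UNIVERSE_LOG10))  # 62
print(shortest_length_exceeding(300))                      # 231
```

So a chain of only 62 residues already has more possible sequences than the atom estimate, and 10^300 corresponds to a protein of roughly 231 residues, which is not an unusually long one.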
Well hopefully it's trained on genetic DNA sequences and not Reddit threads. If so, it should do pretty well predicting the next sequence given previous sequences. There are probably all sorts of undiscovered patterns.
To be fair, having AI in the title landed it on the front page of HN, so...