| HN Mirror


	by hashta 309 days ago
	One caveat that’s easy to miss: the "simple" model here didn’t just learn folding from raw experimental structures. Most of its training data comes from AlphaFold-style predictions: millions of protein structures that were themselves generated by big, highly engineered MSA-based models. It’s not like we can throw away all the inductive biases and MSA machinery; someone upstream still had to build and run those models to create the training corpus.

4 comments

aDyslecticCrow 309 days ago

What I take away is the simplicity and scaling behavior. The ML field often sees an increase in module complexity to reach higher scores, and then a breakthrough where a simple model performs on par with the most complex. That such a "simple" architecture works this well on its own means we can potentially add back the complexity again to reach further. Can we add back MSA now? Where will that take us?

My rough understanding of the field is that a "rough" generative model makes a bunch of decent guesses, and more formal "verifiers" ensure they abide by the laws of physics and geometry. The AI reduces the unfathomably large search space so the expensive simulation doesn't waste so much work on dead ends. If the guessing network improves, then the whole process speeds up.

- I'm recalling the increasingly complex transfer functions in recurrent networks,

- The deep pre-processing chains before skip forward layers.

- The complex normalization objectives before ReLU.

- The convoluted multi-objective GAN networks before diffusion.

- The complex multi-pass models before full-convolution networks.

So basically, I'm very excited by this. Not because this itself is an optimal architecture, but precisely because it isn't!

nextos 309 days ago

> Can we add back MSA now?

Using MSAs might be a local optimum. ESM showed good performance on some protein problems without MSAs. MSAs offer a nice inductive bias and better average performance. However, the cost is doing poorly on proteins where MSAs are not accurate. These include B and T cell receptors, which are clinically very relevant.

Isomorphic Labs, Oxford, MRC, and others have started the OpenBind Consortium (https://openbind.uk) to generate large-scale structure and affinity data. I believe that once more data is available, MSAs will be less relevant as model inputs. They are "too linear".

godelski 309 days ago

Is this so unusual? Almost everything that is simple was once considered complex. That's the thing about emergence, you have to go through all the complexities first to find the generalized and simpler formulations. It should be obvious that things in nature run off of relatively simple rulesets, but it's like looking at a Game of Life and trying to reverse engineer those rules AND the starting parameters. Anyone telling you such a task is easy is full of themselves. But then again, who seriously believes that P=NP?
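For anyone who hasn't seen it, the entire Game of Life ruleset fits in a few lines. A minimal Python sketch (the `step` function and the set-of-cells representation are just my choices for illustration): a dead cell with exactly 3 live neighbors is born, a live cell with 2 or 3 live neighbors survives, and everything else dies.

```python
from collections import Counter

def step(live):
    """Advance one generation; `live` is a set of (row, col) live cells."""
    # Count, for every cell adjacent to a live cell, how many live neighbors it has.
    neighbor_counts = Counter(
        (r + dr, c + dc)
        for (r, c) in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Birth on exactly 3 neighbors; survival on 2 or 3.
    return {
        cell
        for cell, n in neighbor_counts.items()
        if n == 3 or (n == 2 and cell in live)
    }

# A glider: five cells that reappear shifted one step diagonally every 4 generations.
glider = {(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)}
state = glider
for _ in range(4):
    state = step(state)

print(sorted(state))
```

Going forward from rules plus a starting pattern is trivial. Going backward, from an observed pattern to the rules and the starting cells, is the hard direction.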

hashta 309 days ago

To people outside the field, the title/abstract can make it sound like folding is just inherently simple now, but this model wouldn’t exist without the large synthetic dataset produced by the more complex AF. The "simple" architecture is still using the complex model indirectly through distillation. We didn’t really extract new tricks to design a simpler model from scratch, we shifted the complexity from the model space into the data space (think GPT-5 => GPT-5-mini, there’s no GPT-5-mini without GPT-5)
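To make the "complexity moved into the data" point concrete, here is a toy sketch. It is not SimpleFold's training setup, and every name in it is hypothetical: `ground_truth` stands in for nature, `teacher` for an expensive model like AF (slightly biased on purpose), and the "student" is a trivial nearest-neighbor lookup. The only thing that changes between the two students is how much labelled data they get.

```python
import math

def ground_truth(x: float) -> float:
    """Hypothetical 'nature': the quantity we actually want to predict."""
    return math.sin(3.0 * x) + 0.5 * x

def teacher(x: float) -> float:
    """Hypothetical complex model: accurate, but with a small systematic bias."""
    return ground_truth(x) + 0.005

def nearest_neighbor_student(train_x, train_y):
    """A deliberately trivial model: return the label of the closest training point."""
    def predict(x: float) -> float:
        idx = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        return train_y[idx]
    return predict

def mean_abs_error(predict, xs):
    """Average absolute error against the ground truth, not against the teacher."""
    return sum(abs(predict(x) - ground_truth(x)) for x in xs) / len(xs)

# Scarce "experimental" data: five exact measurements.
sparse_x = [0.0, 0.5, 1.0, 1.5, 2.0]
sparse_student = nearest_neighbor_student(sparse_x, [ground_truth(x) for x in sparse_x])

# Abundant synthetic data: 2001 points labelled by the teacher.
dense_x = [i * 0.001 for i in range(2001)]
distilled_student = nearest_neighbor_student(dense_x, [teacher(x) for x in dense_x])

# Evaluate both students on held-out points.
test_x = [0.0005 + 0.01 * k for k in range(200)]
sparse_err = mean_abs_error(sparse_student, test_x)
distilled_err = mean_abs_error(distilled_student, test_x)
print(f"sparse: {sparse_err:.4f}  distilled: {distilled_err:.4f}")
```

Same trivial student in both cases. The distilled one does far better only because the teacher did the hard part when it labelled the data, and it inherits the teacher's bias along the way.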

godelski 309 days ago

  > To people outside the field

So what?

It's a research paper. That's not how you communicate to a general audience. Just because the paper is accessible in terms of literal access doesn't mean you're the intended audience. Papers are how scientists communicate to other scientists. More specifically, it is how communication happens between peers. They shouldn't even be writing for just other scientists. They shouldn't be writing for even the full set of machine learning researchers nor the full set of biologists. Their intended audience is people researching computational systems that solve protein folding problems.

I'm sorry, but where do you want scientists to be able to talk directly to their peers? Behind closed doors? I just honestly don't understand these types of arguments.

Besides, anyone conflating "Simpler than You Think" with "Simple" is far from qualified to read such a paper. They'll misread whatever the authors say. Conflating those two is something we'd expect from an elementary-school-level reader who is unable to process comparative statements.

I don't think we should be making that the bar...

hashta 309 days ago

It’s literally called "SimpleFold". But that’s not really my point. From your earlier comment (".. go through all the complexities first to find the generalized and simpler formulations"), I got the impression you thought the simplicity came purely from architectural insights. My point was just that to compare apples to apples, a model claiming "simpler but just as good" should ideally train on the same kind of data as AF, or at least acknowledge very clearly that a substantial amount of its training data comes from AF.

I’m not trying to knock the work. I think it’s genuinely cool and a great engineering result. I just wanted to flag that nuance for readers who might not have the time or background to spot it, and I get that part of the "simple/simpler" messaging is also about attracting attention, which clearly worked!

godelski 309 days ago

  > I got the impression you thought the simplicity came purely from architectural insights.

I'm unsure where I indicated that, but I apologize for the confusion. I was initially pushing back against your original criticism of something like AlphaFold having needed to be built first.

Like you suggest, simple can mean many things. I think it's clear that in this context they mean "simple" (not from an absolute sense) in terms of the architectural design. I think the abstract is more than sufficient to convey this.

  > My point was just that to compare apples to apples

As an ML researcher who does a lot of work on architecture and efficiency, I think they are. Consider this from the end of the abstract:

  | SimpleFold shows efficiency in deployment and inference on consumer-level hardware.

To me they are clearly stating that their goal isn't to get the top score on a benchmark. Their appendix shows that the 100M-param model is apples to apples with AlphaFold2 by size but not by compute. Even their 3B model uses less compute than AlphaFold2.

So being someone in a neighboring niche, I don't understand your claim. There's no easy way to make your comparisons "apples to apples" because we shouldn't be evaluating on a single metric. Sure, AlphaFold2 gives better results on the benchmarks, but does that mean people wouldn't sacrifice performance for a 450x reduction in compute? (20x for their largest model. But note that's compute, not memory.)

  >  messaging is also about attracting attention

Yeah, this is an unfortunate thing and I'm incredibly frustrated with this in academia and especially in ML. But it's also why I'm pushing against you. The problem stems from needing to get people to read your paper. There's a perverse incentive because you could have a paper that is groundbreaking but ends up having little to no impact because it didn't get read. A common occurrence is that less innovative papers get orders of magnitude more citations by using similar methods but scaling them up and beating benchmarks. So unfortunately, as long as we use citation metrics as a significant measure of our research impact, marketing will be necessary. A catchy title is a good way to get more eyeballs. But I think you're being too nitpicky here and there are far more egregious/problematic examples. I'm not going to pick my fight with a title when the abstract is sufficiently clear. Could it be more clear? Certainly. But if the title is all that's wrong then it's a pretty petty problem. Especially if it's only confusing people who are significantly outside the target audience.

Seriously, what's the alternative? That researchers write to the general public? To the general technical public? I'm sorry, I don't think that's a good solution. It's already difficult to communicate to people in the same domain (but not niche) within the page limit. It's hard for them to read everything as it is. I'd rather papers be written strongly for the niche peers, with enough generalization that domain experts can get through them with effort. For the general public, that's what science communicators are for.

stavros 309 days ago

But this is just a detail, right? If we went and painstakingly catalogued millions of proteins, we'd be able to use the simple model without needing a complex model to generate data, no?

connorbrinton 308 days ago

Technically yes. But it can take months to years to experimentally obtain the structure for a single protein, and that assumes that it's possible to crystallize (X-ray), prepare grids (cryo-EM) or highly concentrate (NMR) the protein at all.

On the other hand, validating a predicted protein structure to a good level of accuracy is much easier (solvent accessibility, mutagenesis, etc.). So having a complex model that can be trained on a small dataset drastically expands the set of accurate protein structure samples available to future models, both through direct predictions and validated protein structures.

So technically yes, this dataset could have been collected solely experimentally, but in practice, AlphaFold is now part of the experimental process. Without it, the world would have less protein structure data, in terms of both directly predicted and experimentally verified protein structures.

stavros 308 days ago

I agree, I guess I'm saying that it's more of a quantitative improvement, rather than a qualitative one.

littlestymaar 309 days ago

> but this model wouldn’t exist without the large synthetic dataset produced by the more complex AF

This model could also have existed from natural data if we had access to enough of it.

inkysigma 308 days ago

Maybe, but then this seems more like an exercise in distillation than solving the original problem, which is what the title "Folding proteins is simpler..." suggested, to me at least. Part of the problem with any ML task is that data is usually limited, and presumably far more limited than the amount of synthetic data you can generate.

slashdave 309 days ago

> It should be obvious that things in nature run off of relatively simple rulesets

Only if you are willing to call a billion years of evolutionary selection a "simple ruleset".

TeMPOraL 309 days ago

Evolution is a dumb, greedy search that can only work in extremely tiny increments, and every step has to result in a viable organism that's also at least as fit as it was before.

That means whatever evolution created, whether it's wings or brains, however complex it looks now, must be fundamentally simple enough it could be reached by iterating in tiny steps that were useful in isolation. It constrains the space of designs reachable by evolution considerably.
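A toy sketch of the simplified picture I'm describing, and definitely not a model of real evolution: a greedy search that only takes tiny random steps and only keeps a step if it is at least as fit. The `fitness` landscape, step size, and seed are all made up for illustration.

```python
import random

def fitness(x: float) -> float:
    """Hypothetical toy fitness landscape with a single peak at x = 2."""
    return -(x - 2.0) ** 2

def greedy_search(start: float, steps: int, step_size: float, rng: random.Random):
    """Take small random mutations, keeping one only if it is at least as fit."""
    current = start
    history = [fitness(current)]
    for _ in range(steps):
        candidate = current + rng.uniform(-step_size, step_size)
        if fitness(candidate) >= fitness(current):
            current = candidate  # accepted: no worse than before
        history.append(fitness(current))
    return current, history

final_x, history = greedy_search(
    start=-5.0, steps=2000, step_size=0.05, rng=random.Random(0)
)
print(f"final x = {final_x:.3f}, fitness = {history[-1]:.5f}")
```

On a single-peak landscape like this one the search climbs to the top. Anything it cannot reach through a chain of small, non-worsening steps is simply out of reach for it, which is the constraint I mean.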

slashdave 308 days ago

> every step has to result in a viable organism that's also at least as fit as it was before.

Not true. Learn some genomics before trying to explain evolution.

TeMPOraL 308 days ago

I did. Sure, I'm glossing over some detail - in fact, in the passage you quoted, half of the words stand for something that would take a paragraph each to expand on - but that doesn't conflict with the zoomed-out perspective. Can you tell me where you think I'm wrong about the gist of it?

godelski 309 days ago

Run a game of life for a billion years and tell me if your answer is the same. You can accelerate that so I'll wait.

Does the time matter? A ruleset doesn't change with time.

If you're still unconvinced, get a degree in physics. I'm not sure how you could get through that and still not believe that complexity rises from simplicity and how you end up getting drops in that complexity, which we call emergence, before becoming more complex than before.

slashdave 308 days ago

And now you compare biology to the game of life...

godelski 308 days ago

I want you to read what you wrote again...

But you really do seem to be trying hard to miss the point entirely. Life actually has nothing to do with what I said, does it? And I can assure you, by nature of being one, that physicists are certain that nature follows simple rules, even if we don't know them.

We are also absolutely confident that complexity rises out of simplicity. Go look at anything like fractals, chaos theory, perturbation theory, or you should have run into at least bifurcation diagrams in your differential equations course. If you haven't taken diff eq, then well.... perhaps the problem is that your confidence in your result is stronger than your expertise. If not, well... make a real argument because I'm not going to hold your hand through this any longer.

slashdave 308 days ago

Except I have a PhD in physics...

The thing is, biology is anything but simple.

mapmeld 309 days ago

And AlphaFold was validated with experimental observation of folded proteins using X-rays.

slashdave 309 days ago

Correct. For those that might not follow, the MSA is used to generalize from known PDB structures to new sequences. If you train on AlphaFold2 results, those results include that generalization, so that your model no longer needs that capability (you can rely on rote memorization). This simple conclusion seems to have escaped the authors.