

by lysozyme 717 days ago

The flip side of this is that progress in ML for biology is always going to be _slower_ than progress in ML for natural languages and images [1].

Humans are natural machines capable of sensing and verifying the correctness of a piece of text or an image in milliseconds. So if you have a model that generates text or images, it’s trivial to see if they’re any good. Whereas for biology, the time to validate a model’s output is measured more in weeks. If you generate a new backbone with RFDiffusion, and then generate some protein sequences with LigandMPNN, and then want to see if they fold correctly … that takes a week. Every time. Use ML to solve _that_ problem and you’ll be rich.
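To put rough numbers on that gap, here is a back-of-the-envelope sketch. The one-week wet lab turnaround comes from the example above; the 500 ms figure for a human judging a text or image is an assumption chosen only for illustration, and `cycles_per_year` is a hypothetical helper name.

```python
# Rough comparison of feedback cycles per year for one serial validation loop.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def cycles_per_year(seconds_per_check: float) -> float:
    """Number of back-to-back validation rounds that fit in one year."""
    return SECONDS_PER_YEAR / seconds_per_check


# Assumption: a person needs about half a second to judge a text or image.
human_checks = cycles_per_year(0.5)

# From the comment: checking whether designed proteins fold takes about a week.
wet_lab_checks = cycles_per_year(SECONDS_PER_WEEK)

print(f"human judgment: {human_checks:,.0f} checks per year")
print(f"wet lab:        {wet_lab_checks:,.0f} checks per year")
print(f"ratio:          {human_checks / wet_lab_checks:,.0f}x")
```

This counts a single serial loop only. Plates and parallel assays raise the number of designs tested per round, but they do not shorten the week you wait for each round of feedback.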

TFA mentions the difficulty of performing biological assays at scale, and there are numerous other challenges, such as the number of different kinds of assays required to get the multimodal data needed to train the latest models like ESM-3 (multimodal in this context meaning primary sequence, secondary structure, tertiary structure, and several other tracks). You can’t just scale a fluorescent product plate reader assay to get the data you need. We need sequencing tech, functional assays, protein-protein interaction assays, X-ray crystallography, and a dozen others, all at scale.

What I’d love to see companies like A-Alpha and Gordian and others do is see if they can use ML to improve the wet lab tech. Make the assays better, faster, cheaper with ML, the way Nanopore sequencers use ML to translate the electrical signals of DNA passing through the pore into a sequence. So many companies have these sweet assays that are very good. In my opinion, if we want transformative progress in biology, we should spend less time fitting the same data with different models, and more time improving and scaling wet lab assays using ML. Can we use ML to make the assay better, to make our processes better, to improve the amount and quality of data we generate? The thesis of TFA (and experience) suggests that using the data will be the easy part.

1. https://alexcarlin.bearblog.dev/why-is-progress-slow-in-gene...