by praccu 1584 days ago
Shameless self promotion: I wrote one of the more cited papers in the field [0], back in 2016.

A key challenge: very few labs have enough data.

Something I view as a key insight: a lot of labs are doing absurdly labor intensive exploratory synthesis without clear hypotheses guiding their work. One of our more useful tasks turned out to be interactively helping scientists refine their experiments before running them.

Another was helping scientists develop hypotheses for _why_ reactions were occurring, because they hadn't been able to build principled models of which properties predicted reaction formation.

Going all the way to synthesis is nice, but there's a lot of lower hanging fruit involved in making scientists more effective.

[0] https://www.nature.com/articles/nature17439

5 comments

This is true. Datasets with the quality and scale that molecular ML needs are rare and hard to get. Experimental design is also a huge value add, especially given the enormous search space (estimates suggest there are more possible drug-like structures than there are stars in the universe). The challenge is figuring out how to run the computational work in a tight marriage with the lab work, so the lab can support and rapidly explore the hypotheses generated by the computational predictions. Getting compute and lab to mesh productively is hard. Teams and projects have to be designed for it from the start to derive maximum benefit.

Also shameless plug: I started a company to do just that, anchored to generating custom million-to-billion point datasets and using ML to interpret and design new experiments at scale.

> A key challenge: very few labs have enough data.

It is also getting harder, not easier, to get.

I am working right now on a retrosynthesis project. Our external data provider is raising prices while removing functionality, and no one bats an eye. At the same time our own data is considered a business secret and therefore impossible to share.

As someone who does NLP research where the code, data and papers are typically free, this drives me insane.

Are you using NLP to guide what molecules are probably worthwhile to try and synthesize?
A bit. But my main project was to use NLP to identify failed reactions in old lab notebooks to use as negative training data.
Question: How are labs doing the exploratory work without a clear hypothesis? Are they essentially doing some version of brute force?
Experienced chemists can look at a molecule diagram and have an intuition as to its activity and similarity to other known molecules. It’s like most of science and math: most discoveries begin with intuition and are demonstrated rigorously afterwards. I believe Poincare said something to this effect.
Ok, so these experienced chemists can be replaced by AI now?
In the same way radiologists can be replaced by AI. So, no.
Radiologists have a high responsibility of detecting the right things.

Chemists can just try out things.

I don't think you can compare the two.

I was implying that you still need a human to make the final decision. AI can be a valuable aid in both fields. Doctors can't just let the AI do all the work, in the same way synthetic chemists can't blindly trust the AI to spit out correct and feasible results. Research time is expensive, so the effort needs to be evaluated first, and usually the intuition of said chemists trumps that of the AI.
Not the focus of the article, but analytical chemists need to do a lot of proper detecting themselves to be high-performing just like the radiologists do.
The brain is incredibly good at pattern matching while not necessarily being able to articulate why it came to a decision. Organic chemistry has these types of relations in spades. Take crystallization, for example. You can kinda brute force it; there's only a few dozen realistic solvents to try, but that's a single solvent system. Then there's binary and ternary solvent systems. Then there's heating/cooling profiles, antisolvent addition, all kinds of things. Hundreds or thousands of possible experiments.

You might just decide that a compound "needs" isopropanol/acetone, plus a bit of water, because something vaguely similar you encountered years ago crystallized well. You often start with some educated guesses and refine based on what you see.

But there's often no clear hypothesis, no single physical law the system obeys.
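
To put rough numbers on the solvent screen described above, here is a minimal sketch. It assumes 24 candidate solvents as a stand-in for "a few dozen" (that figure and the function name are illustrative, not from any real screening protocol) and only counts which solvents are mixed, ignoring ratios, temperature profiles, and antisolvent addition.

```python
from math import comb


def count_solvent_systems(n_solvents: int, max_components: int = 3) -> dict[int, int]:
    """Count distinct solvent systems by number of components.

    Only the choice of solvents is counted; mixing ratios, heating/cooling
    profiles, and antisolvent addition would multiply these numbers further.
    """
    return {k: comb(n_solvents, k) for k in range(1, max_components + 1)}


# Assumption: 24 candidate solvents stands in for "a few dozen".
counts = count_solvent_systems(24)
total = sum(counts.values())

for k, n in counts.items():
    print(f"{k}-solvent systems: {n}")
print(f"total before ratios and profiles: {total}")
```

Under that assumption there are 24 single, 276 binary, and 2024 ternary systems, 2324 in total, before any process variable is varied. That is why nobody screens exhaustively and educated guesses do the pruning.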

I'm trying to get a startup off the ground that tackles this.

Would love to chat more with you about this.

Me too, also tech nomad. I'll email you
> Something I view as a key insight: a lot of labs are doing absurdly labor intensive exploratory synthesis without clear hypotheses guiding their work.

This lets you stumble over unknown unknowns. Taylor et al. discovered high-speed steel by ignoring the common wisdom and doing a huge number of trials, arriving at a material and treatment protocol that improved on the then-state-of-the-art tool steels by an order of magnitude or more. The treatment mechanism was only understood 50-60 years later.