Here is a problem I've been noodling with. If you are a decent programmer, how does your LLM help you solve this problem?
Given a cheminformatics fingerprint definition based on SMARTS substructure patterns, come up with a screening filter, likely using a decision tree, which uses intermediate feature tests to prune search space faster than simply testing each pattern one-by-one.
This could be improved by an element count test: count the number of fluorines, say, and only run a pattern's test if the molecule being fingerprinted has enough of those atoms.
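So one stage might be to construct a list of element counts:
    # mol is assumed to be an RDKit-style molecule object
    ele_counts = [0]*200   # indexed by atomic number
    seen = set()           # atomic numbers present in the molecule
    for atom in mol.GetAtoms():
        eleno = atom.GetAtomicNum()
        ele_counts[eleno] += 1
        seen.add(eleno)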
then have a lookup table for each element, where entry i holds the patterns which require no more than i atoms of that element:
    ele_patterns = [
        # max known count, list of sets of matching patterns
        (0, [set()]),  # element 0
        (0, [set()]),  # hydrogen
        ..
        (20, [{all patterns which contain no carbon},
              {all patterns which require at most 1 carbon}, ...
              {all patterns which require at most 19 carbons}]),
        (10, [{all patterns which contain no fluorine}, ..
              {all patterns which require at most 9 fluorines}]),
        ...]
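To make the lookup concrete, here is a minimal sketch of building such a table and using it to narrow the candidates. It assumes the minimum element counts per pattern are already known; the PATTERN_MIN_COUNTS dict is hand-written for illustration (deriving it from SMARTS is a separate problem), "FC(F)F" is a made-up third pattern, and the function names are hypothetical. It also uses a dict keyed by atomic number rather than a 200-entry list, purely to keep the example short.

```python
from collections import Counter

# Hypothetical input: the minimum number of atoms of each element (keyed by
# atomic number) a molecule needs before the pattern could possibly match.
PATTERN_MIN_COUNTS = {
    "CC(=NNC=O)C": {6: 4, 7: 2, 8: 1},
    "S(=O)(=O)": {16: 1, 8: 2},
    "FC(F)F": {9: 3, 6: 1},
}

def build_element_table(pattern_min_counts):
    """Map element -> list whose entry i is the set of patterns
    requiring at most i atoms of that element."""
    elements = {e for req in pattern_min_counts.values() for e in req}
    table = {}
    for element in elements:
        max_count = max(req.get(element, 0) for req in pattern_min_counts.values())
        table[element] = [
            {p for p, req in pattern_min_counts.items() if req.get(element, 0) <= i}
            for i in range(max_count + 1)
        ]
    return table

def candidate_patterns(table, all_patterns, ele_counts):
    """Intersect the per-element sets; what survives still needs a real match."""
    candidates = set(all_patterns)
    for element, sets_by_count in table.items():
        # Counts above the largest requirement behave like the last entry.
        count = min(ele_counts.get(element, 0), len(sets_by_count) - 1)
        candidates &= sets_by_count[count]
    return candidates

table = build_element_table(PATTERN_MIN_COUNTS)
mol_counts = Counter({6: 2, 9: 3})  # a molecule with 2 carbons and 3 fluorines
print(sorted(candidate_patterns(table, PATTERN_MIN_COUNTS, mol_counts)))
# ['FC(F)F']
```

Only the surviving candidates would then go on to the full substructure test.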
However, this is not sophisticated enough to identify other tests, like the "CC(=NNC=O)C" example I gave before, or "S(=O)(=O)", which might be good tests at a higher level than the element count.
And clearly if there isn't a sulphur, or there are fewer than two oxygens, or fewer than two double bonds, then there's no need to test "S(=O)(=O)", suggesting a tree structure would be useful.
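As a rough sketch of what that tree might look like: each node holds one cheap test, and a failed test prunes everything beneath it. The Node class, the patterns_to_test function, and the feature dict (element symbols plus a "double_bonds" count) are all assumptions made for illustration, and the tree here is built by hand. Working out which intermediate tests to use, and in what order, is the actual hard part and is not addressed.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Node:
    """One cheap feature test guarding the patterns and sub-tests below it."""
    test: Callable[[dict], bool]
    patterns: list = field(default_factory=list)   # tested if this node passes
    children: list = field(default_factory=list)   # further, narrower tests

def patterns_to_test(node, features):
    """Return the patterns whose prerequisite tests all pass."""
    if not node.test(features):
        return []  # prune the whole subtree
    result = list(node.patterns)
    for child in node.children:
        result.extend(patterns_to_test(child, features))
    return result

# Hand-built tree: sulphur -> two oxygens -> two double bonds -> "S(=O)(=O)"
sulfonyl = Node(
    test=lambda f: f.get("S", 0) >= 1,
    children=[Node(
        test=lambda f: f.get("O", 0) >= 2,
        children=[Node(
            test=lambda f: f.get("double_bonds", 0) >= 2,
            patterns=["S(=O)(=O)"],
        )],
    )],
)
root = Node(test=lambda f: True, children=[sulfonyl])

print(patterns_to_test(root, {"S": 1, "O": 3, "double_bonds": 2}))  # ['S(=O)(=O)']
print(patterns_to_test(root, {"C": 6, "O": 2, "double_bonds": 1}))  # []
```

In the second call the sulphur test fails at the top, so neither the oxygen nor the double-bond test is ever evaluated.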
LLMs are pretty good at giving you what you ask for. Not so good at telling you that you're asking for the wrong thing.