Ask HN: How are you getting reliable code-gen performance out of LLMs?

Your post would make more sense to me if you were specific about the models. It's like asking how to get reliable transportation from a car without saying which cars you are considering.

o1-preview seems to be a step up from Claude 3.5 Sonnet.

For complex tasks, many open source coding LLMs are a joke compared to the SOTA closed ones.

I think there are two strategies that can work: 1) constrain the domain to a particular framework and put good documentation and examples for it in the prompts, and 2) create an error-correcting feedback loop where compiler and static-analysis output, runtime errors, and failed tests are fed back to the model automatically.
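
To make the second strategy concrete, here is a rough Python sketch of such a loop. Everything in it is my own assumption for illustration, not any vendor's API: `generate` stands in for whatever model call you use (it takes a prompt string and returns code), and `run_checks`, `repair_loop`, and `max_rounds` are hypothetical names. The "static analysis" step is just the built-in `compile()`, and the tests are plain asserts run in a separate interpreter; a real setup would likely swap in a linter, a type checker, and a proper test runner.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional


def run_checks(code: str, test_code: str, timeout: float = 10.0) -> Optional[str]:
    """Return None if the candidate passes, otherwise the error text to feed back."""
    # Cheap static check first: does the candidate even parse?
    try:
        compile(code, "<candidate>", "exec")
    except SyntaxError as exc:
        return f"SyntaxError: {exc}"

    # Runtime check: run candidate plus tests in a fresh interpreter so a crash
    # or infinite loop in generated code cannot take down the caller.
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code + "\n\n" + test_code + "\n")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return f"Timed out after {timeout} seconds"

    if proc.returncode != 0:
        return proc.stderr.strip() or f"Exited with status {proc.returncode}"
    return None


def repair_loop(
    generate: Callable[[str], str],  # hypothetical model call: prompt -> code
    task: str,
    test_code: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Ask for code, check it, and feed failures back until it passes or we give up."""
    prompt = task
    for _ in range(max_rounds):
        code = generate(prompt)
        error = run_checks(code, test_code)
        if error is None:
            return code
        # Feed the failing attempt and the exact error text back to the model.
        prompt = (
            f"{task}\n\nYour previous attempt:\n{code}\n\n"
            f"It failed with:\n{error}\n\nPlease fix it."
        )
    return None  # no passing candidate within max_rounds
```

The `task` string is also where the first strategy plugs in: prepend the framework documentation and examples there. Capping the rounds matters in practice, since a model that cannot fix an error will often just loop on variations of the same mistake.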