| I work at OpenAI (not on Codex) and have used it successfully for multiple projects so far. Here's my flow: - Always run more than one rollout of the same prompt -- they will turn out different - Look through the parallel implementations, see which is best (even if it's not good enough), then figure out what changes to your prompt would have helped nudge towards the better solution. - In addition, add new modifications to the prompt to resolve the parts that the model didn't do correctly. - Repeat loop until the code is good enough. If you do this and also split your work into smaller parallelizable chunks, you can find yourself spending a few hours only looping between prompt tuning and code review with massive projects implemented in a short period of time. I've used this for "API munging" but also pretty deep Triton kernel code and it's been massive. |
How can non-technical people tell what's "best"? You need to know what you're doing at this point: look for the right pitfalls, inspect everything in detail... This right here is the entire counter-argument to the claim that LLMs will eliminate SWE jobs.
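To make the point concrete, here's a rough Python sketch of the loop described in the quoted comment. Everything in it is hypothetical: `generate`, `score`, and `revise_prompt` are placeholder callables I made up for illustration, not part of Codex or any real API, and the `good_enough` threshold is an arbitrary assumption. It only shows the shape of the workflow, not how anyone actually runs it.

```python
from typing import Callable, Tuple


def refine_loop(
    prompt: str,
    generate: Callable[[str], str],            # hypothetical: one model rollout for a prompt
    score: Callable[[str], float],             # hypothetical: the reviewer's judgment of a rollout
    revise_prompt: Callable[[str, str], str],  # hypothetical: reviewer edits the prompt given the best rollout
    rollouts: int = 4,
    good_enough: float = 0.9,                  # arbitrary threshold, assumed for illustration
    max_rounds: int = 5,
) -> Tuple[str, str, float]:
    """Run several rollouts per round, keep the best, and revise the prompt until good enough."""
    best_code, best_score = "", float("-inf")
    for _ in range(max_rounds):
        # Always run more than one rollout of the same prompt.
        candidates = [generate(prompt) for _ in range(rollouts)]
        # Pick the best of the parallel implementations, even if it is not good enough yet.
        round_best = max(candidates, key=score)
        round_score = score(round_best)
        if round_score > best_score:
            best_code, best_score = round_best, round_score
        if best_score >= good_enough:
            break
        # Nudge the prompt toward the better solution and patch what the model got wrong.
        prompt = revise_prompt(prompt, round_best)
    return prompt, best_code, best_score
```

Note that `score` and `revise_prompt` are exactly the steps that can't be automated away here: they stand in for a person who can read the code, spot the pitfalls, and know which rollout is actually better.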