|
|
|
|
|
by lmeyerov
523 days ago
|
|
I'm more confused now. If this is a tough and high-value task, we would not use gpt-4o-mini on its own, eg, add more steps like a verifier & retry, or just do gpt-4o to begin with, and would more seriously consider fine-tuning in addition to the prompt engineering. The blog argued against that, but maybe I read too quickly. And agreed, people expect $ they invest into computer systems to do much better than their bad & avg employees. AI systems get the added challenge where they must do ~100% on what non-AI rules would catch ("why are you using AI?") + extra lift from AI ("what did this add?"). We generally get evaluated on matching experts (low bar), and exceeding them (high bar). Comparing to average staff is, frustratingly, a breakout. Each scenario is different obviously.. |
|
FWIW – across all of these, we already do automated prompt rewriting, self-reflection, verification, and a suite of other things that help maximize reliability / quality, but those tokens add up quickly and being able to dynamically switch over to a smaller model without degrading performance improves margin substantially.
Fine-tuning is a non-starter for a number of reasons, but that's a much longer post.