Hacker News
by sgk284 526 days ago
re: 90% – this particular case is a fairly subjective and creative task, where humans (and the LLM) are asked to follow a 22-page SOP. They've had a team of humans doing the task for 9 years, with exceptionally high variance in performance. The blended performance of the human team (~76%) is meaningfully below this 90% threshold – which speaks to the difficulty of the task.

It's, admittedly, a tough task to measure objectively though, in that it's like a code review. If a Principal Engineer pointed out 20 deficiencies in a code change and another Principal Engineer pointed out 18 of the same 20 things, but also pointed out 3 other things that the first reviewer didn't, it doesn't necessarily mean either review is wrong – they just meaningfully deviate from each other.

In this case, we chose an expert that we treat as an objective "source of truth".
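To make the code-review analogy concrete, here is a minimal sketch that scores the second reviewer against the first, treating the first as the "source of truth". The issue labels and the choice of precision, recall, and Jaccard overlap are assumptions for illustration, not the metric actually used in the case study.

```python
def agreement(reference: set, candidate: set) -> dict:
    """Score a candidate review against a reference review treated as truth."""
    overlap = reference & candidate
    return {
        # Share of the reference's findings that the candidate also found.
        "recall": len(overlap) / len(reference),
        # Share of the candidate's findings that the reference agrees with.
        "precision": len(overlap) / len(candidate),
        # Overlap relative to everything either reviewer raised.
        "jaccard": len(overlap) / len(reference | candidate),
    }

# Hypothetical labels: reviewer A raised 20 issues; reviewer B raised
# 18 of the same 20 plus 3 that A did not mention.
reviewer_a = {f"issue-{i}" for i in range(1, 21)}
reviewer_b = {f"issue-{i}" for i in range(1, 19)} | {"extra-1", "extra-2", "extra-3"}

scores = agreement(reviewer_a, reviewer_b)
print(scores)
```

With A as the reference, B gets 18/20 recall and 18/21 precision. Swapping the roles swaps the two numbers, which is why the choice of reference expert matters so much here.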

re: simple tasks – We run hundreds of thousands of tasks every month with more-or-less deterministic behavior (in that, we'll reliably do it correctly a million out of a million times). We chose a particularly challenging task for the case-study though.

re: in a paying business context – FWIW, most industries are filled with humans doing tasks where the rate of perfection is far below 90%.


I'm more confused now. If this is a tough and high-value task, we would not use gpt-4o-mini on its own: we'd add more steps, e.g. a verifier & retry, or just use gpt-4o to begin with, and would more seriously consider fine-tuning in addition to the prompt engineering. The blog argued against that, but maybe I read too quickly.
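For readers unfamiliar with the "verifier & retry" pattern, a rough sketch follows. `generate` and `verify` are hypothetical callables standing in for real model calls, and the model names are used only as labels; this is not the pipeline from the blog post.

```python
from typing import Callable, Optional, Sequence, Tuple

def run_with_verifier(
    task: str,
    generate: Callable[[str, str], str],   # hypothetical: (task, model) -> output
    verify: Callable[[str, str], bool],    # hypothetical: (task, output) -> accepted?
    models: Sequence[str] = ("gpt-4o-mini", "gpt-4o"),
    attempts_per_model: int = 2,
) -> Tuple[Optional[str], Optional[str], int]:
    """Try each model in order, retrying until the verifier accepts an output."""
    calls = 0
    for model in models:
        for _ in range(attempts_per_model):
            calls += 1
            output = generate(task, model)
            if verify(task, output):
                return output, model, calls
    # Nothing passed verification; the caller decides what to do (e.g. human review).
    return None, None, calls

# Stand-in stubs so the sketch runs without any API access.
def fake_generate(task: str, model: str) -> str:
    return "ok" if model == "gpt-4o" else "bad"

def fake_verify(task: str, output: str) -> bool:
    return output == "ok"

output, model, calls = run_with_verifier("moderate item 1", fake_generate, fake_verify)
print(output, model, calls)
```

Every retry and every escalation costs extra tokens, so the pattern trades cost for reliability.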

And agreed, people expect the $ they invest in computer systems to do much better than their bad & avg employees. AI systems get the added challenge where they must do ~100% on what non-AI rules would catch ("why are you using AI?") + extra lift from AI ("what did this add?"). We generally get evaluated on matching experts (low bar), and exceeding them (high bar). Comparing to average staff is, frustratingly, a breakout.

Each scenario is different, obviously.

One point of confusion might be that this is a tough but relatively low-value task (on a per-unit basis). The budget per item moderated is measured in small double-digit cents, but there are hundreds of thousands of items regularly being ingested.

FWIW – across all of these, we already do automated prompt rewriting, self-reflection, verification, and a suite of other things that help maximize reliability / quality, but those tokens add up quickly and being able to dynamically switch over to a smaller model without degrading performance improves margin substantially.
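One way to picture the "switch to a smaller model without degrading performance" decision is as a gate on an expert-labeled sample: pick the cheapest model that still clears a quality bar. The sketch below is an assumption about how such a gate could look; the model names, costs, and results are made up for illustration.

```python
from typing import Dict, List, Optional

def pick_cheapest_adequate(
    results: Dict[str, List[bool]],   # model -> per-item "matched the expert" flags
    cost_per_item: Dict[str, float],  # model -> hypothetical cost per item
    min_accuracy: float,
) -> Optional[str]:
    """Return the cheapest model whose accuracy on the sample meets the bar."""
    accuracy = {m: sum(flags) / len(flags) for m, flags in results.items() if flags}
    adequate = [m for m, acc in accuracy.items() if acc >= min_accuracy]
    if not adequate:
        return None  # no model clears the bar; keep the current setup
    return min(adequate, key=lambda m: cost_per_item[m])

# Made-up evaluation sample of 10 items per model.
results = {
    "small-model": [True] * 9 + [False],  # 9 of 10 match the expert
    "large-model": [True] * 10,           # 10 of 10 match the expert
}
cost_per_item = {"small-model": 0.01, "large-model": 0.10}  # hypothetical costs

chosen = pick_cheapest_adequate(results, cost_per_item, min_accuracy=0.9)
print(chosen)
```

Raising `min_accuracy` above what the small model achieves on the sample makes the gate fall back to the larger model.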

Fine-tuning is a non-starter for a number of reasons, but that's a much longer post.

I feel like using LLMs is going to be a skill to have, similar to the ability to google or type, since they can get good answers pretty well but give bad answers when you don't know the subject matter.
Agreed, and that's where teams like the OP come in.

OpenAI does great at training for general tasks, and we should not be disappointed when specialized tasks fail. Interestingly, OpenAI advertises increasingly many subjects they are special-casing, like math, code, & law, so holding them to standards is fair there IMO.

For specialized contexts OpenAI doesn't eval on, these merit hiring consultants / product to add the last-mile LLM data & tuning for the specific task. And at least in my experience, people paying money for AI experts & tech expect expert-level performance to be met and, ultimately, exceeded.