HN Mirror



	by fatso784 1072 days ago
	You’re missing the point here. It’s not even asking the LLM to evaluate the responses to the prompts (which is itself fraught for some tasks, and benchmarks are known to be limited; even OpenAI admits this, which is why they made evals). It’s one level removed from that: it’s asking what the LLM thinks of how well the prompt will do, in purely hypothetical terms. That’s hogwash: different LLMs perform very differently even on the same prompts. Try any tool that lets you compare model responses side by side. Unless I see actual use cases, this is yet another iteration of overtrusting AI. Here is what HN was talking about nearly three months ago, the exact same type of ‘auto-prompt-gen’ tool: https://news.ycombinator.com/item?id=35660751


duskwuff 1071 days ago

> Here is what HN was talking about nearly three months ago, the exact same type of ‘auto-prompt-gen’ tool.

I was reminded of the same thing. What a lot of it boils down to is that LLMs have no innate ability to self-reflect. They can pretend to do it, but no more effectively than an untrained human would.


ChikkaChiChi 1071 days ago

> They can pretend to do it, but no more effectively than an untrained human would.

Which is exactly as much as Generative AI should be trusted.
