| HN Mirror



	by lmeyerov 1159 days ago
	Prompt testing, especially for Q/A pairs where there are multiple right answers, has been bugging me a lot. The article is reasonable, but it also shows a big gap in tooling: once you write more interesting prompts, the techniques there feel closer to linting and typing than testing. They don't check the interesting parts.


than3 1159 days ago

> The article seems reasonable but ... closer to linting than testing... they don't check the interesting parts

Can you elaborate a bit more on what those interesting parts are?

It could just be a limitation of computation.


lmeyerov 1158 days ago

We are helping our users with Q/A tasks involving code generation, where the answers may be JSON, executable code, or markdown discussions involving either. We are tuning for a bunch of tools that follow that pattern so our users don't have to.

It's easy to make a labeled training set for grading our homework (catching regressions, ...) in the case of classifiers, and that's basically what the blog post showed.
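To make the classifier case concrete, here is a minimal sketch of grading against a labeled set to catch regressions. The examples, the labels, and `fake_classifier` (a stand-in for a real prompt plus model call) are all hypothetical, not something from the article.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical labeled examples: (input text, expected label).
LABELED_SET = [
    ("The build is green again, thanks!", "positive"),
    ("This release broke our pipeline.", "negative"),
    ("Deployment is scheduled for Tuesday.", "neutral"),
]


def accuracy(classify: Callable[[str], str],
             labeled: Sequence[Tuple[str, str]]) -> float:
    """Fraction of labeled examples the classifier gets right."""
    correct = sum(
        1 for text, label in labeled
        if classify(text).strip().lower() == label
    )
    return correct / len(labeled)


def check_no_regression(classify: Callable[[str], str],
                        labeled: Sequence[Tuple[str, str]],
                        baseline: float) -> bool:
    """True if accuracy has not dropped below a previously recorded baseline."""
    return accuracy(classify, labeled) >= baseline


def fake_classifier(text: str) -> str:
    # Stand-in for a prompt + model call; replace with the real thing.
    lowered = text.lower()
    if "thanks" in lowered:
        return "positive"
    if "broke" in lowered:
        return "negative"
    return "neutral"
```

Run `check_no_regression` after every prompt change, with the baseline taken from the last accepted version of the prompt.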

What about the Q/A tasks above? We can ask GPT-4 whether a generated A was a good answer for a Q, but that's asking it to grade itself. Likewise, in the code case, we can write unit tests for the answers. (Trick: we use the former to do the latter more quickly.) But I feel like there have to be better ways.
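For the code case, a rough sketch of what unit-testing a generated answer can look like. The function name `add`, the test cases, and the two sample answers are made up for illustration. Note that `exec` on model output is unsafe outside a sandbox, so treat this as a toy.

```python
from typing import Any, Sequence, Tuple


def passes_tests(source: str, func_name: str,
                 cases: Sequence[Tuple[tuple, Any]]) -> bool:
    """Load generated source and check one function against (args, expected) cases.

    WARNING: exec() runs arbitrary code. Only do this inside a sandbox.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)       # define whatever the answer defines
        func = namespace[func_name]   # KeyError if the function is missing
    except Exception:
        return False                  # syntax error, import error, missing name
    for args, expected in cases:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            return False              # the answer crashed on this input
    return True


# Hypothetical generated answers for "write add(a, b)".
GOOD_ANSWER = "def add(a, b):\n    return a + b\n"
BAD_ANSWER = "def add(a, b):\n    return a - b\n"
CASES = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
```

This only checks the behaviors someone thought to write cases for, which is part of why it still feels like a partial answer.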

Another issue: OpenAI always updates models based on use, so we have to be sure our tests are real holdout sets that never get back to them...


than3 1156 days ago

I don't think LLMs are going to be able to solve that. A number of things are assumed to be true but may not necessarily be true. This can lead to multiple possible answers (outputs) for the same inputs.

For example, determinism in code: it's required for computation and it's a property of the system, but generalizing a test for it is really hard. Knowing whether the property holds lets you make inferences about whether a system maintains it, but most of this is abstracted away at lower levels. Since the context can never be fully shared with an LLM for evaluation, and the LLM cannot automatically switch contexts when evaluation fails, this most likely will never be solvable by computers whenever a single input can produce two different outputs, at least from what I know about automata theory and computability.

It's generally considered a class of problems that can't be solved by Turing machines.
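A small sketch of why a general test for determinism is hard: the obvious empirical check is to call something repeatedly with the same input and collect the distinct outputs. Seeing more than one output proves nondeterminism, but seeing only one proves nothing. The helper names here are hypothetical.

```python
import itertools
from typing import Callable, Hashable, Set


def distinct_outputs(fn: Callable[[Hashable], Hashable],
                     arg: Hashable, runs: int = 20) -> Set[Hashable]:
    """Call fn(arg) `runs` times and return the set of outputs observed."""
    return {fn(arg) for _ in range(runs)}


def make_flaky() -> Callable[[int], int]:
    """Build a function whose output depends on hidden state (a call counter)."""
    counter = itertools.count()

    def flaky(x: int) -> int:
        # Alternates between x and x + 1 on successive calls.
        return x + next(counter) % 2

    return flaky


def double(x: int) -> int:
    return x * 2
```

`distinct_outputs(double, 3)` yields one value and `distinct_outputs(make_flaky(), 10)` yields two, but a function that misbehaves once in a million calls would look exactly like `double` under this probe.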

https://en.wikipedia.org/wiki/Theory_of_computation

https://medium.com/@tarcisioma/limits-of-computation-231bf28... (overview)

https://en.wikipedia.org/wiki/Undecidable_problem (crux of the problem)
