| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by specialp 1018 days ago
	Great accuracy as tested to a continually changing black box. GPT hits are also expensive and often have unpredictable latency. This would have to be integration tested to detect changes to GPT answers.

1 comments

adventured 1018 days ago

Correct me if I'm wrong, you can pick which dated GPT API to utilize and expect that to not act as a continually changing black box. I've been using the API for a long time and have been able to pick the version.

So for example: gpt-4-0314, or gpt-3.5-turbo-0613, etc.

The latency issue is definitely true. Ideally the cost could be limited to a very small percentage of hard cases (which you first have to identify).

link

eatonphil 1018 days ago

LLMs don't seem to be deterministic [0, 1, 2, 3]. So no, pinning the version wouldn't be enough.

[0] https://matt-rickard.com/foundational-models-are-not-enough

[1] https://arxiv.org/pdf/2308.02828.pdf

[2] https://www.sitation.com/non-determinism-in-ai-llm-output/

[3] https://towardsdatascience.com/the-magic-of-llms-prompt-engi...

link

adventured 1018 days ago

> So no, pinning the version wouldn't be enough.

You can to an extent dictate GPT's determinism with settings you can pass along in the API, combined with the parent already proclaiming they saw a 100% success rate.

So how do you know it wouldn't be enough? The parent is already saying their test suite indicates it is enough. What tests have you run counter to their claim to show it fails? And how do you know the parent can't increase the determinism even further beyond what they were already using in their testing (and decreasing the risk of negative outcomes by doing so)?

link