| HN Mirror



	by alecco 872 days ago
	It would be good to have some actual analysis of how each of these prompt features improves responses.

3 comments

resiros 872 days ago

Shameless promotion, I might have the tool for that :) https://github.com/agenta-ai/agenta We're building a platform for evaluating prompts (and more complex LLM workflows).

From what we've seen from users, the results for prompts are highly stochastic. It's hard to make generalizations. For example, a user building a sales assistant discovered that by simply changing the order of the sentences in the prompt, the accuracy improved significantly.
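
One way to check a claim like this on your own data is to score every ordering of the prompt sentences over repeated runs and look at both the mean and the spread. The following is a minimal sketch, not agenta's implementation: `accuracy`, `compare_orderings`, and `toy_model` are hypothetical names, and `toy_model` is a made-up stand-in for a real LLM call.

```python
import itertools
import random
import statistics


def accuracy(model, prompt, cases, trials=5, seed=0):
    """Return (mean, population std dev) of accuracy over repeated runs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        correct = sum(
            model(prompt, question, rng) == expected
            for question, expected in cases
        )
        scores.append(correct / len(cases))
    return statistics.mean(scores), statistics.pstdev(scores)


def compare_orderings(model, sentences, cases, trials=5):
    """Score every ordering of the prompt sentences."""
    results = {}
    for order in itertools.permutations(sentences):
        prompt = " ".join(order)
        results[prompt] = accuracy(model, prompt, cases, trials)
    return results


def toy_model(prompt, question, rng):
    """Hypothetical stand-in for an LLM call (assumption, not a real API).

    It is noisy, and it answers correctly more often when the prompt
    starts with the instruction sentence.
    """
    p_correct = 0.9 if prompt.startswith("Answer") else 0.6
    return "yes" if rng.random() < p_correct else "no"


if __name__ == "__main__":
    sentences = ["Answer with yes or no.", "You are a sales assistant."]
    cases = [("Is the customer interested?", "yes")] * 20
    for prompt, (mean, spread) in compare_orderings(toy_model, sentences, cases).items():
        print(f"{mean:.2f} +/- {spread:.2f}  {prompt}")
```

Reporting the spread next to the mean matters here: if the difference between two orderings is smaller than the run-to-run variation, it is probably noise.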

minimaxir 871 days ago

I published a blog post last month asserting as a footnote that telling ChatGPT in the system prompt "You will receive a $500 tip for a good response" does improve model performance, but Hacker News got very mad and called it pseudoscience: https://news.ycombinator.com/item?id=38782678

I am working on a new blog post to hopefully demonstrate this effect more academically.

phillipcarter 872 days ago

Unfortunately, this is extremely hard to do for two reasons:

1. The input space is boundless. Any natural language input, with any optional source of data, for any arbitrary use case is what's possible. But that means it's awfully hard to tell if a response can "improve" or not in advance without applying it to your use case.

2. The output space is so hard to measure! Usefulness can also mean different things to different people, especially once you get out of "better search engine" use cases and actually use GPT to produce a creative output.

visarga 872 days ago

Everyone is using LLM-as-a-judge to evaluate unbounded outputs.
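
For anyone unfamiliar with the idea, a rough sketch of pairwise LLM-as-a-judge looks something like this. Everything here is an assumption for illustration: `JUDGE_TEMPLATE`, `judge_pair`, `win_counts`, and `length_judge` are hypothetical names, and `length_judge` is a toy stand-in where a real judge model call would go. Each pair is judged twice with the positions swapped, since a judge may favor whichever response it sees first.

```python
from collections import Counter

JUDGE_TEMPLATE = (
    "Task: {task}\n\nResponse A:\n{a}\n\nResponse B:\n{b}\n\n"
    "Which response is better? Reply with exactly 'A' or 'B'."
)


def judge_pair(judge, task, first, second):
    """Ask the judge twice with positions swapped.

    Returns 'first', 'second', or 'tie' (inconsistent or unparseable verdicts).
    """
    v1 = judge(JUDGE_TEMPLATE.format(task=task, a=first, b=second)).strip().upper()
    v2 = judge(JUDGE_TEMPLATE.format(task=task, a=second, b=first)).strip().upper()
    if v1 == "A" and v2 == "B":
        return "first"
    if v1 == "B" and v2 == "A":
        return "second"
    return "tie"


def win_counts(judge, tasks, outputs_x, outputs_y):
    """Tally verdicts for two sets of outputs over the same tasks."""
    return Counter(
        judge_pair(judge, task, x, y)
        for task, x, y in zip(tasks, outputs_x, outputs_y)
    )


def length_judge(prompt):
    """Hypothetical stand-in for a judge model: prefers the longer response."""
    a = prompt.split("Response A:\n")[1].split("\n\nResponse B:\n")[0]
    b = prompt.split("Response B:\n")[1].split("\n\nWhich response")[0]
    return "A" if len(a) >= len(b) else "B"
```

With a real judge you would pass in a function that sends the prompt to your model of choice and returns its text reply. The toy judge also shows a weakness of the approach: a judge that simply prefers longer answers will still produce confident-looking win counts.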

phillipcarter 871 days ago

No, not everyone is. It's one of the ways you can do this, though.

visarga 871 days ago

Maybe I exaggerated a bit, but there are many papers today going this route.
