by alecco 872 days ago
It would be good to have some actual analysis of how each of these prompt features improves responses.
3 comments

Shameless promotion, I might have the tool for that :) https://github.com/agenta-ai/agenta We're building a platform for evaluating prompts (and more complex LLM workflows).

From what we've seen from users, the results for prompts are highly stochastic. It's hard to make generalizations. For example, a user building a sales assistant discovered that by simply changing the order of the sentences in the prompt, the accuracy improved significantly.
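To make the "highly stochastic" point concrete, here is a rough sketch of how one might compare sentence orderings without being fooled by noise: score every ordering over repeated trials and several seeds, then look at the spread as well as the mean. This is not Agenta's API. `toy_model_is_correct` is a hypothetical stand-in for a real LLM call plus grader, and its success rates are made up for illustration.

```python
import itertools
import random
import statistics


def toy_model_is_correct(prompt: str, rng: random.Random) -> bool:
    """Hypothetical stand-in for 'call the LLM, then grade the answer'.

    Assumption for illustration only: prompts that lead with the task
    sentence succeed a bit more often. Swap in a real call and grader.
    """
    success_rate = 0.8 if prompt.startswith("Task:") else 0.6
    return rng.random() < success_rate


def accuracy(prompt: str, trials: int, seed: int) -> float:
    """Fraction of correct answers over `trials` sampled responses."""
    rng = random.Random(seed)
    return sum(toy_model_is_correct(prompt, rng) for _ in range(trials)) / trials


def compare_orderings(sentences, trials=200, seeds=(0, 1, 2)):
    """Score every sentence ordering; return {prompt: (mean, stdev)} across seeds."""
    results = {}
    for order in itertools.permutations(sentences):
        prompt = " ".join(order)
        scores = [accuracy(prompt, trials, seed) for seed in seeds]
        results[prompt] = (statistics.mean(scores), statistics.stdev(scores))
    return results


if __name__ == "__main__":
    sentences = ["Task: qualify the lead.", "Be concise.", "Use a friendly tone."]
    for prompt, (mean, spread) in compare_orderings(sentences).items():
        print(f"{mean:.3f} +/- {spread:.3f}  {prompt}")
```

If the gap between two orderings is smaller than the seed-to-seed spread, it is probably not safe to call one of them better.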

I published a blog post last month asserting as a footnote that telling ChatGPT in the system prompt "You will receive a $500 tip for a good response" does improve model performance, but Hacker News got very mad and called it pseudoscience: https://news.ycombinator.com/item?id=38782678

I am working on a new blog post to hopefully demonstrate this effect more academically.

Unfortunately, this is extremely hard to do for two reasons:

1. The input space is boundless. Any natural language input, with any optional source of data, for any arbitrary use case, is possible. That means it's awfully hard to tell in advance whether a prompt change will "improve" responses without actually applying it to your use case.

2. The output space is so hard to measure! Usefulness can also mean different things to different people, especially once you get out of "better search engine" use cases and actually use GPT to produce a creative output.

Everyone is using LLM-as-a-judge to handle evaluation of unbounded outputs.
No, not everyone is. It's one of the ways you can do this, though.
Maybe I exaggerated a bit, but there are many papers today going this route.
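
For anyone unfamiliar with the idea, a minimal sketch of pairwise LLM-as-a-judge follows. The judge is asked twice with the answer positions swapped, and a win only counts if both verdicts agree, which is one common way to guard against position bias. Both judges here are toy stand-ins, not real model calls; the `Judge` signature and the "A"/"B" verdict format are assumptions for illustration.

```python
from typing import Callable

# Assumed interface: (question, answer_a, answer_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def pairwise_verdict(judge: Judge, question: str, first: str, second: str) -> str:
    """Query the judge in both positions; return "first", "second", or "tie"."""
    verdict_original = judge(question, first, second)
    verdict_swapped = judge(question, second, first)
    if verdict_original == "A" and verdict_swapped == "B":
        return "first"
    if verdict_original == "B" and verdict_swapped == "A":
        return "second"
    # The two verdicts contradict each other, so the preference is not trusted.
    return "tie"


def length_biased_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Toy stand-in for an LLM judge: prefers the longer answer."""
    return "A" if len(answer_a) >= len(answer_b) else "B"


def position_biased_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Toy stand-in for a judge that always picks whatever is shown first."""
    return "A"


if __name__ == "__main__":
    q = "Summarize the thread."
    print(pairwise_verdict(length_biased_judge, q, "short", "a much longer answer"))
    print(pairwise_verdict(position_biased_judge, q, "short", "a much longer answer"))
```

The swap only catches position bias. A judge that simply likes long answers, like the toy one above, still gets through, which is part of why this is one option rather than a solved problem.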