Shameless promotion, I might have the tool for that :) https://github.com/agenta-ai/agenta
We're building a platform for evaluating prompts (and more complex LLM workflows).
From what we've seen from users, the results for prompts are highly stochastic. It's hard to make generalizations. For example, a user building a sales assistant discovered that by simply changing the order of the sentences in the prompt, the accuracy improved significantly.
I published a blog post last month asserting as a footnote that telling ChatGPT in the system prompt "You will receive a $500 tip for a good response" does improve model performance, but Hacker News got very mad and called it pseudoscience: https://news.ycombinator.com/item?id=38782678
I am working on a new blog post to hopefully demonstrate this effect more academically.
Unfortunately, this is extremely hard to do for two reasons:
1. The input space is boundless: any natural language input, with any optional source of data, for any arbitrary use case. That makes it awfully hard to tell in advance whether a prompt change will "improve" a response without actually applying it to your use case.
2. The output space is so hard to measure! Usefulness can also mean different things to different people, especially once you get out of "better search engine" use cases and actually use GPT to produce a creative output.
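One way to make a claim like the tip effect less hand-wavy, given how noisy the outputs are, is to run each prompt variant many times, grade every response as pass/fail, and put a confidence interval on the difference in pass rates. The snippet below is only a rough sketch of that idea using a percentile bootstrap. Everything in it is an assumption for illustration: `fake_graded_run` is a hypothetical stand-in for "call the model and grade the answer", and the 0.60 and 0.75 success rates are made-up numbers, not measurements.

```python
import random
from statistics import mean


def bootstrap_diff_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(scores_b) - mean(scores_a)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the group sizes fixed.
        resample_a = rng.choices(scores_a, k=len(scores_a))
        resample_b = rng.choices(scores_b, k=len(scores_b))
        diffs.append(mean(resample_b) - mean(resample_a))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def fake_graded_run(success_rate, rng):
    # Hypothetical stand-in: a real version would send the prompt to the
    # model and grade the response as 1 (good) or 0 (bad).
    return 1 if rng.random() < success_rate else 0


rng = random.Random(42)
# Made-up success rates, purely to have data to resample.
baseline_scores = [fake_graded_run(0.60, rng) for _ in range(200)]
variant_scores = [fake_graded_run(0.75, rng) for _ in range(200)]

lower, upper = bootstrap_diff_ci(baseline_scores, variant_scores)
observed = mean(variant_scores) - mean(baseline_scores)
print(f"observed difference: {observed:.3f}")
print(f"95% bootstrap interval: [{lower:.3f}, {upper:.3f}]")
```

If the interval comfortably excludes zero, the variant is probably doing something on that particular test set. It says nothing about other use cases, which is exactly problem 1 above, and the pass/fail grading hides all of problem 2.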