Empirical testing of LLM few-shot examples shows that example choice is crucial

Hey there, I'm the founder of a company called Libretto, which is building tools to automate prompt engineering, and I wanted to share this blog post we just put out about empirical testing of few-shot examples.

We took a prompt from BIG-bench and created a few dozen variants of it, each with a different set of few-shot examples, and we found a 19 percentage point difference in accuracy between the worst and best sets. Funnily, the worst-performing set was one where the examples all happened to have one-word answers, and the LLM seemed to learn that replying with a one-word answer was more important than actually being accurate. Sigh.
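For anyone who wants to try this kind of comparison on their own prompt, here is a minimal sketch of the idea. It is illustrative only and not Libretto's actual harness: `model` is a hypothetical callable that takes a prompt string and returns the LLM's reply, the `Q:`/`A:` prompt layout and exact-match scoring are assumptions, and the example pool and test set should not overlap.

```python
import random
from typing import Callable

Example = tuple[str, str]  # (question, expected answer)


def build_prompt(instruction: str, shots: list[Example], question: str) -> str:
    """Assemble instruction + few-shot examples + the question to answer."""
    parts = [instruction]
    for q, a in shots:
        parts.append(f"Q: {q}\nA: {a}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def accuracy(
    model: Callable[[str], str],  # hypothetical: prompt in, reply text out
    instruction: str,
    shots: list[Example],
    test_set: list[Example],
) -> float:
    """Fraction of test questions answered correctly (case-insensitive exact match)."""
    correct = 0
    for question, expected in test_set:
        reply = model(build_prompt(instruction, shots, question))
        correct += reply.strip().lower() == expected.strip().lower()
    return correct / len(test_set)


def compare_shot_sets(
    model: Callable[[str], str],
    instruction: str,
    pool: list[Example],
    test_set: list[Example],
    n_variants: int = 24,
    k: int = 3,
    seed: int = 0,
) -> list[tuple[float, list[Example]]]:
    """Score n_variants random k-example sets drawn from pool, worst first."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_variants):
        shots = rng.sample(pool, k)
        results.append((accuracy(model, instruction, shots, test_set), shots))
    results.sort(key=lambda r: r[0])
    return results
```

The gap between the last and first entries, `results[-1][0] - results[0][0]`, is the best-versus-worst spread. Looking at the examples in the worst set is how you would spot a pattern like the one-word-answer problem.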

Moral of the story: which few-shot examples you choose matters, sometimes by a lot!