by shayanjm 1009 days ago
Interestingly, we initially thought that prompt length would be a big factor in the performance of this approach. In practice, though, we discovered that it's not as big a factor as we predicted. For instance, Prompt #3 was 410 tokens long, while Prompt #5 was only 88 tokens. The estimation for Prompt #3 aligned fairly well with the IG approach (0.746 cosine similarity, 0.643 Pearson correlation), while the estimation for Prompt #5 seemed to underperform (0.55 cosine similarity, 0.295 Pearson correlation). Meanwhile, Prompt #2 was 57 tokens long and performed quite well (0.852 cosine similarity, 0.789 Pearson correlation).
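
For readers who want to see what those two numbers measure, here is a minimal sketch of cosine similarity and Pearson correlation between two per-token attribution vectors. It is not the notebook's code: the function names and the sample scores are made up for illustration, and it assumes both vectors hold one score per token in the same order.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length score vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_similarity(centered_a, centered_b)


# Hypothetical per-token attribution scores (illustrative only).
estimated = [0.10, 0.40, 0.05, 0.30, 0.15]
reference = [0.12, 0.35, 0.10, 0.28, 0.15]

print(round(cosine_similarity(estimated, reference), 3))
print(round(pearson_correlation(estimated, reference), 3))
```

The two metrics can disagree: cosine similarity is sensitive to a constant offset shared by all scores, while Pearson correlation removes it by centering first. That is one reason a pair of vectors can score noticeably higher on one than on the other.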

Re: our definitions of average/long/short prompts -- we weren't really rigorous with those definitions. In general, we considered anything under 100 tokens "short", 100-300 "average", and 300+ "long".
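
Written out as code, those rough buckets might look like the sketch below. The function name is hypothetical, and the handling of exactly 300 tokens is an assumption: the comment lists both "100-300" and "300+", so this version reads "300+" literally and puts 300 in the top bucket.

```python
def length_bucket(token_count):
    """Map a prompt's token count to the informal buckets described above."""
    if token_count < 0:
        raise ValueError("token_count must be non-negative")
    if token_count < 100:
        return "short"
    if token_count < 300:  # assumption: exactly 300 falls in the top bucket
        return "average"
    return "long"


# Token counts quoted earlier in the thread.
for name, tokens in [("Prompt #2", 57), ("Prompt #5", 88), ("Prompt #3", 410)]:
    print(name, tokens, length_bucket(tokens))
```

With these cutoffs, Prompts #2 and #5 both land in "short" even though their results differ a lot, which fits the point that length alone does not explain the gap.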

Our intuition here is that the relationship between the performance of the estimation and the prompt structure is less about length and more about "ambiguity". Again, we don't really have a rigorous definition of that yet, but it's something we are working on. If you take a look at the prompts in the analysis notebook, you might get a sense of what I mean: prompts 1-3 are pretty straightforward and mechanical. Prompts 4 & 5 are a bit more open to interpretation. We see performance of the estimation degrade as prompts become more and more open to interpretation.

1 comment

Oh, it’s definitely ambiguity. The attention weight on any given token is going to vary based on its context, and less-ambiguous terms are more likely to be used “near” the other terms that matter. For example, if you tell GPT not to ‘omit’ code from a code sample, it has to disambiguate the meaning of omit. Tell it not to ‘elide’ any code, and it performs a lot better. “Prompt engineering” is far more linguistic than people seem to realize. It’s not just “say what you mean”: the model has an easier time when you “say what you mean in the most linguistically precise way possible”. Simplified, but workable: it’s a matter of finding less ambiguous, more context-specific tokens/words with a better tf-idf in the pre-training corpus, without getting too esoteric.

Another example: storytelling prompts that include “I dislike open-ended conclusions and other rhetorical hooks” often result in fewer (or no) closing statements like “as night fell, they wondered about their future.”

Edit: GPT-4 is surprisingly good at answering these things if asked to: https://chat.openai.com/share/b97ad65f-f005-49b4-a64e-eb537d...