Since we do not have ground truth, we only checked accuracy qualitatively -- no quantitative metrics. We did see a noticeable drop in accuracy with GPT-3.5 compared to GPT-4.
Are you measuring accuracy with data wrangling prompts? Would love to learn more about that.
Everything I do now is classification, and AUC-ROC is my metric. For your problem, my first thought is an up-down (right-or-wrong) accuracy metric, but the tricky question you might face is "do you accept both 'United States' and 'USA' as a correct answer?" The trouble of dealing with that is one reason I stick to classification problems.
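To make the 'United States' vs. 'USA' issue concrete, here is a minimal sketch of an up-down accuracy metric that canonicalizes answers before comparing them. The `ALIASES` table and the `canonicalize` and `accuracy` helpers are hypothetical names made up for illustration, and this assumes you have at least a small hand-labeled reference set to score against.

```python
# Hypothetical alias table: maps normalized variants to one canonical label.
# In practice you would build this for your own answer space.
ALIASES = {
    "usa": "united states",
    "us": "united states",
    "u.s.": "united states",
    "u.s.a.": "united states",
}


def canonicalize(answer):
    """Lowercase, collapse whitespace, then map known aliases to one label."""
    key = " ".join(answer.strip().lower().split())
    return ALIASES.get(key, key)


def accuracy(predictions, references):
    """Fraction of predictions that match their reference after canonicalizing."""
    predictions = list(predictions)
    references = list(references)
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one labeled example")
    hits = sum(
        canonicalize(p) == canonicalize(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


preds = ["USA", "United States ", "Canada"]
refs = ["United States", "usa", "Mexico"]
score = accuracy(preds, refs)  # first two count as correct, third does not
```

The alias table is where the judgment calls live: anything not listed is compared as a plain normalized string, so an unexpected variant counts as wrong until you add it.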
I'm skeptical of any claim that "A works better than B" without some numbers to back it up.