

	by PaulHoule 974 days ago
	Did you compile accuracy, F1 numbers, or anything like that? Do you have quantitative comparisons of the results you got with different models?


jarulraj 974 days ago

As we do not have ground truth, we only checked accuracy qualitatively, with no quantitative metrics. We did see a significant drop in accuracy with GPT-3.5 compared to GPT-4.

Are you measuring accuracy with data wrangling prompts? Would love to learn more about that.


PaulHoule 974 days ago

Everything I do now is classification, and AUC-ROC is my metric. For your problem, my first thought is an up-down (right or wrong) accuracy metric. The tricky part is deciding whether you accept both 'United States' and 'USA' as a correct answer, and the trouble of dealing with that is one reason I stick to classification problems.

I'm skeptical of any claim that "A works better than B" without some numbers to back it up.
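
To make the 'United States' vs 'USA' question concrete, here is a minimal sketch of an up-down accuracy metric that normalizes answers before comparing them. It assumes you have hand-labeled a small sample to serve as ground truth, which the thread says does not exist yet. The `ALIASES` table and the `canonical` and `accuracy` helpers are hypothetical names made up for illustration, not part of any tool discussed here.

```python
# Hypothetical alias table: maps lowercase surface forms to one canonical label.
# In practice this would have to be built for the specific column being cleaned.
ALIASES = {
    "usa": "united states",
    "us": "united states",
    "u.s.": "united states",
    "united states of america": "united states",
}


def canonical(answer: str) -> str:
    """Lowercase, collapse whitespace, then apply the alias table."""
    key = " ".join(answer.lower().split())
    return ALIASES.get(key, key)


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold label after normalization."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one labeled example")
    hits = sum(canonical(p) == canonical(g) for p, g in zip(predictions, gold))
    return hits / len(gold)


# Made-up example data: the first two rows count as correct, the third does not.
preds = ["USA", "France", "Germany"]
gold = ["United States", "France", "Italy"]
print(accuracy(preds, gold))
```

Running the same labeled sample through two models would give the kind of "A vs B" number asked for above. The alias table is where the judgment calls live, so the score is only as meaningful as those choices.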
