| HN Mirror


	by nihit-desai 1100 days ago
	Hi, one of the authors here. Good question! For this benchmarking, we evaluated performance on popular open source text datasets across a few different NLP tasks (details in the report). For each of these datasets, we specify task guidelines/prompts for the LLM and human annotators, and compare each of their performance against ground truth labels.

2 comments

morelisp 1100 days ago

You didn't answer the question at all, although to be fair the answer is both obvious and completely undermines your claim so I can see why you wouldn't.

natch 1100 days ago

>compare each of their performance against supposed ground truth labels.

Fixed it for you.

nihit-desai 1100 days ago

I mean, sure. For ground truth, we are using the labels that are part of the original dataset:

* https://huggingface.co/datasets/banking77
* https://huggingface.co/datasets/lex_glue/viewer/ledgar/train
* https://huggingface.co/datasets/squad_v2
* ...

(exhaustive set of links at the end of the report).

Is there some noise in these labels? Sure! But the relative performance with respect to these is still a valid evaluation.
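
To make the comparison concrete, here is a minimal sketch of scoring two annotators against the same dataset labels. The label values and the `accuracy` helper are hypothetical, made up for illustration; they are not taken from the report or from any of the datasets linked above.

```python
def accuracy(predictions, reference):
    """Fraction of predictions that match the reference labels."""
    if len(predictions) != len(reference):
        raise ValueError("predictions and reference must have the same length")
    matches = sum(p == r for p, r in zip(predictions, reference))
    return matches / len(reference)


# Hypothetical toy labels (assumption: invented for this example only).
dataset_labels = ["card", "loan", "card", "fraud", "loan", "card"]
llm_labels = ["card", "loan", "card", "fraud", "card", "card"]
human_labels = ["card", "card", "card", "fraud", "card", "loan"]

# Both annotators are scored against the same reference labels,
# so the two numbers can be compared directly.
llm_acc = accuracy(llm_labels, dataset_labels)
human_acc = accuracy(human_labels, dataset_labels)

print(f"LLM accuracy:   {llm_acc:.2f}")
print(f"Human accuracy: {human_acc:.2f}")
```

This only shows the mechanics of the comparison. Whether the resulting ranking is robust depends on how the label noise is distributed, for example whether it affects both annotators equally.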

natch 1100 days ago

Agreed, thanks for highlighting these links!
