by nihit-desai 1100 days ago
Hi, one of the authors here. Good question! For this benchmark, we evaluated performance on popular open source text datasets across a few different NLP tasks (details in the report).

For each of these datasets, we specify task guidelines/prompts for the LLM and human annotators, and compare each of their performance against ground truth labels.
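As a rough sketch of what that comparison looks like (this is not the code from the report; the label values and the `accuracy` helper below are made up for illustration, and plain exact-match accuracy is an assumption, since the actual metric may differ by task):

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty dataset")
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)


# Hypothetical toy labels; they carry no information about real results.
ground_truth = ["card_lost", "refund", "card_lost", "transfer"]
llm_labels = ["card_lost", "refund", "transfer", "transfer"]
human_labels = ["card_lost", "card_lost", "transfer", "transfer"]

# Score each annotator against the same reference labels.
scores = {
    "llm": accuracy(llm_labels, ground_truth),
    "human": accuracy(human_labels, ground_truth),
}
print(scores)
```

Both annotators are scored against the same reference labels, so any noise in those labels affects both scores.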


You didn't answer the question at all, although to be fair, the answer is both obvious and completely undermines your claim, so I can see why you wouldn't.
>compare each of their performance against supposed ground truth labels.

Fixed it for you.

I mean, sure. For ground truth, we are using the labels that are part of the original dataset:
* https://huggingface.co/datasets/banking77
* https://huggingface.co/datasets/lex_glue/viewer/ledgar/train
* https://huggingface.co/datasets/squad_v2
... (exhaustive set of links at the end of the report).

Is there some noise in these labels? Sure! But relative performance measured against them is still a valid evaluation.

Agreed, thanks for highlighting these links!