by nihit-desai
1100 days ago
Hi, one of the authors here. Good question! For this benchmark, we evaluated performance on popular open-source text datasets across a few different NLP tasks (details in the report). For each dataset, we specify task guidelines/prompts for the LLM and the human annotators, then compare the performance of each against ground-truth labels.
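As a rough illustration of that comparison, here is a minimal sketch. It is not the code used for the report: plain accuracy as the metric, the `accuracy` helper, and the toy labels are all assumptions made for illustration.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], ground_truth: Sequence[str]) -> float:
    """Fraction of predicted labels that match the ground-truth labels.

    Hypothetical helper: the report may use other metrics.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("cannot score an empty dataset")
    matches = sum(p == g for p, g in zip(predicted, ground_truth))
    return matches / len(ground_truth)


# Toy, made-up labels for a single dataset (not real benchmark data).
ground_truth = ["pos", "neg", "pos", "neg"]
llm_labels = ["pos", "neg", "neg", "neg"]
human_labels = ["pos", "pos", "neg", "neg"]

# Score each annotator against the same ground truth.
scores = {
    "llm": accuracy(llm_labels, ground_truth),
    "human": accuracy(human_labels, ground_truth),
}
print(scores)
```

Scoring both annotators against the same ground-truth labels keeps the two numbers directly comparable for a given dataset.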