by satisfice
108 days ago
It’s called testing. And from the reports and comments, there doesn’t seem to be much of it happening. The reason: it’s quite expensive to do well. I find that for every hypothesis I might have to run a thousand prompts to collect enough data for a conclusion. For instance, discovering how reliably different models can extract noun phrases from a text took hours of grinding. And that was for a small text. I haven’t yet run the process on a large text.
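To make the "thousand prompts" figure concrete, here is a rough sketch of the arithmetic, not the actual process used above. It assumes each prompt run can be scored pass/fail, and `fake_model_check` is a hypothetical stand-in for "send the prompt and compare the extracted noun phrases to the expected ones". The 0.9 pass rate is made up for illustration. The confidence range uses the standard Wilson score interval.

```python
import math
import random


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if trials == 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def run_trials(check_once, trials):
    """Call check_once() `trials` times and count how often it returns True."""
    return sum(1 for _ in range(trials) if check_once())


def fake_model_check(rng, true_rate=0.9):
    # Hypothetical stand-in for a real model call plus a correctness check.
    # The true_rate value is an assumption, not a measured number.
    return rng.random() < true_rate


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (30, 1000):
        passed = run_trials(lambda: fake_model_check(rng), n)
        low, high = wilson_interval(passed, n)
        print(f"n={n}: {passed}/{n} passed, 95% interval {low:.3f} to {high:.3f}")
```

With a few dozen runs the interval is wide enough that two models with noticeably different reliability can look the same; it only narrows to a few percentage points somewhere around a thousand runs, which is where the cost comes from.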
LLMs are like vendor code: you don't need to test them yourself, because people have already created benchmarks for that.