by andy99
218 days ago
I also work in LLM evaluation. My cynical take is that nobody is really using LLMs for stuff, and so benchmarks are mostly just made-up tasks (coding is probably the exception). If we had real, specific use cases, it would be easier to benchmark and know if one model is better, but it’s mostly all hypothetical. The more generous take is that you can’t benchmark advanced intelligence very well, whether LLM or person. We don’t have good procedures for assessing a person's fitness for purpose, e.g. for a job, and certainly not standardized question sets. Why would we expect to be able to do this with AI? I think both of these takes are present to some extent in reality.
|
We struggle a bit with processing and extracting this kind of insight in a privacy-friendly way, but there’s certainly a lot of data.