by Madmallard
237 days ago
You're going to have to put quotes around "fact checking" if you're using LLMs to do it. As for "comparing different approaches, summarizing and effectively utilizing the 'wisdom of the crowd' (and its success over time)", I fail to see how that is defensible either.
For comparisons, you can ask the model to evaluate along various axes, e.g. reliability, maintainability, cyclomatic complexity, API consistency, whatever, and they generally do fine.
We run multi-trial evals with multiple inputs across multiple semantic and deterministic metrics to produce statistical scores that we use for comparisons... basically creating benchmark suites, either by hand or generated. This also works well for guiding development.
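A rough sketch of what a multi-trial eval along these lines could look like. Everything in it is hypothetical: `score_candidates`, the metric names, and the stub candidates are illustrative stand-ins, and a real setup would call a model (and possibly an LLM judge for the semantic metrics) where the lambdas are.

```python
import statistics
from typing import Callable, Dict, List

# Hypothetical types: a candidate maps an input to an output,
# a metric maps (input, output) to a numeric score.
Candidate = Callable[[str], str]
Metric = Callable[[str, str], float]


def score_candidates(
    candidates: Dict[str, Candidate],
    inputs: List[str],
    metrics: Dict[str, Metric],
    trials: int = 5,
) -> Dict[str, Dict[str, Dict[str, float]]]:
    """Run every candidate on every input `trials` times and
    aggregate each metric into mean / population stdev / sample count."""
    results = {}
    for cand_name, run in candidates.items():
        samples = {metric_name: [] for metric_name in metrics}
        for text in inputs:
            for _ in range(trials):
                output = run(text)  # in practice: a (non-deterministic) model call
                for metric_name, metric in metrics.items():
                    samples[metric_name].append(metric(text, output))
        results[cand_name] = {
            metric_name: {
                "mean": statistics.mean(values),
                "stdev": statistics.pstdev(values),
                "n": len(values),
            }
            for metric_name, values in samples.items()
        }
    return results


# Stub candidates standing in for two approaches being compared.
candidates = {
    "upper": lambda s: s.upper(),
    "echo": lambda s: s,
}

# Deterministic metrics only; a semantic metric would slot in the same way.
metrics = {
    "is_upper": lambda inp, out: 1.0 if out == inp.upper() else 0.0,
    "length_ratio": lambda inp, out: len(out) / len(inp),
}

scores = score_candidates(candidates, ["abc", "Hello"], metrics, trials=3)
for name, per_metric in scores.items():
    print(name, per_metric)
```

With real model calls the per-metric stdev is what tells you whether a difference between two candidates is signal or just run-to-run noise.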