
Compiling and evaluating output are forms of fact checking. We've also done more extensive automated evaluations of "groundedness" by extracting factual statements and checking whether they are based on the input data or hallucinated. There are many techniques that work well.
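To make the groundedness idea concrete, here is a rough sketch, not what we actually run. It assumes one statement per sentence and uses plain token overlap as a crude stand-in for a model-based support check; the function names and the 0.8 threshold are made up for illustration.

```python
import re


def split_statements(text: str) -> list[str]:
    """Naive stand-in for statement extraction: one statement per sentence."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_grounded(statement: str, source: str, threshold: float = 0.8) -> bool:
    """Assumption: a statement counts as grounded when at least `threshold`
    of its tokens also appear in the source. A real check would use
    entailment or a judge model instead of token overlap."""
    stmt = tokens(statement)
    if not stmt:
        return True
    overlap = len(stmt & tokens(source)) / len(stmt)
    return overlap >= threshold


def groundedness(output: str, source: str, threshold: float = 0.8) -> float:
    """Fraction of statements in `output` that are grounded in `source`."""
    statements = split_statements(output)
    if not statements:
        return 1.0
    grounded = sum(is_grounded(s, source, threshold) for s in statements)
    return grounded / len(statements)
```

For example, with a source that says a service was deployed on Monday and an output that adds "It was written in Rust.", the second sentence shares too few tokens with the source and gets flagged, so the score drops below 1.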

For comparisons, you can ask the model to evaluate on various axes, e.g. reliability, maintainability, cyclomatic complexity, API consistency, whatever, and models generally do fine.
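A minimal sketch of that kind of axis-based scoring might look like the following. Everything here is an assumption for illustration: the 1 to 5 scale, the JSON reply format, and the `judge` callable, which stands in for whatever model client you use (it takes a prompt string and returns the model's text reply).

```python
import json
from typing import Callable

# Axes taken from the examples above; swap in whatever you care about.
AXES = ["reliability", "maintainability", "cyclomatic complexity", "API consistency"]


def build_prompt(candidate: str, axes: list[str]) -> str:
    """Build a rubric prompt asking for one integer score per axis."""
    axis_list = ", ".join(axes)
    return (
        "Score the following code from 1 (poor) to 5 (excellent) on each axis: "
        f"{axis_list}.\n"
        "Reply with a JSON object mapping each axis name to an integer.\n\n"
        f"{candidate}"
    )


def score_candidate(
    candidate: str, axes: list[str], judge: Callable[[str], str]
) -> dict[str, int]:
    """Ask the (hypothetical) judge for scores and validate the reply."""
    reply = judge(build_prompt(candidate, axes))
    raw = json.loads(reply)
    missing = [axis for axis in axes if axis not in raw]
    if missing:
        raise ValueError(f"judge omitted axes: {missing}")
    return {axis: int(raw[axis]) for axis in axes}
```

To compare two candidates, score both with the same axes and judge and compare per axis. Single scores are noisy, so repeating the call a few times is usually worth it.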

We run multi-trial evals with multiple inputs across multiple semantic and deterministic metrics to create statistical scores we use for comparisons... basically creating benchmark suites, either by hand or generated. This also works well for guiding development.
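As a rough illustration of the multi-trial setup (a sketch, not our actual harness): run each input several times, apply every metric to every output, and aggregate per metric. The names `run_suite` and `summarize`, and the assumption that each metric returns a float, are mine.

```python
import statistics
from typing import Callable

# Assumption: a metric maps (input, output) to a numeric score, e.g. in [0, 1].
Metric = Callable[[str, str], float]


def run_suite(
    system: Callable[[str], str],
    inputs: list[str],
    metrics: dict[str, Metric],
    trials: int = 5,
) -> dict[str, list[float]]:
    """Run `system` on every input `trials` times and score each output."""
    scores: dict[str, list[float]] = {name: [] for name in metrics}
    for text in inputs:
        for _ in range(trials):
            output = system(text)
            for name, metric in metrics.items():
                scores[name].append(metric(text, output))
    return scores


def summarize(scores: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Mean, sample standard deviation, and count for each metric."""
    summary = {}
    for name, values in scores.items():
        stdev = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[name] = {
            "mean": statistics.mean(values),
            "stdev": stdev,
            "n": len(values),
        }
    return summary
```

Deterministic metrics (does it compile, does it parse) and semantic ones (a judge score, groundedness) can sit side by side in the same `metrics` dict. Comparing two systems then comes down to comparing their summaries on the same inputs.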