by ricardobeat
397 days ago
> But even that evaluation is hard. LLM could hallucinate. I would say most time it works, but there are always few runs that failed to deliver

You can measure the success rate (%) over N runs on a set of problems, which gives you a number you can compare against other systems. A separate model does the evaluation. Existing frameworks like DeepEval facilitate this.
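To make that concrete, here is a minimal sketch of the bookkeeping in plain Python. It does not use DeepEval's API. `run_system` (the thing being tested) and `judge` (the separate evaluator model) are hypothetical callables you would supply yourself, and `n_runs` is an arbitrary default.

```python
from typing import Callable, Sequence


def success_rate(
    problems: Sequence[str],
    run_system: Callable[[str], str],   # hypothetical: calls the system under test
    judge: Callable[[str, str], bool],  # hypothetical: separate model grading the output
    n_runs: int = 10,
) -> dict[str, float]:
    """Return, for each problem, the fraction of runs the judge marked as passing."""
    rates: dict[str, float] = {}
    for problem in problems:
        # Run the same problem n_runs times and count how often the judge accepts it.
        passes = sum(
            1 for _ in range(n_runs) if judge(problem, run_system(problem))
        )
        rates[problem] = passes / n_runs
    return rates


def overall(rates: dict[str, float]) -> float:
    """Unweighted mean of the per-problem success rates."""
    return sum(rates.values()) / len(rates)
```

Keeping the per-problem rates around, not just the average, makes it easier to see whether the failures are spread evenly or concentrated in a few problems. The judge can be wrong too, so it is probably worth spot-checking a sample of its verdicts by hand.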