

by intended 1014 days ago

An LLM-based tool works very well in any case where 1) the user is a subject-matter expert (SME) and 2) the user can easily verify the generated output.

After that it's just a gradual creep into LLM ops and madness. I'm speaking from the other side of that descent into madness.

As obvious as it may be, production LLM tools work on your data. You can't simply use an external benchmark to verify if your tool works for your use case. You will always have to build evaluation processes.

I'd say there are two types of tests you will end up running.

1) Statistical Tests - AKA good old ML.
2) Semantic Tests - Here be dragons.

Semantic tests break down further based on HOW you are using the LLM (e.g. categorization, summarization).

The issue with semantic testing is the amount of human effort. It's more akin to setting up exams and grading the answers. Also, your student may be tripping randomly.
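One rough way to catch the "tripping randomly" part before spending human grading time is to run the same input several times and see how often the outputs agree. The sketch below assumes the outputs can be compared exactly (short labels, say); free-text summaries would need a human or some other comparison. The item IDs, the sample outputs, and the 0.8 threshold are all made up for illustration.

```python
from collections import Counter


def agreement_rate(outputs):
    """Fraction of runs that match the most common output for one input."""
    if not outputs:
        raise ValueError("need at least one output")
    _, top_count = Counter(outputs).most_common(1)[0]
    return top_count / len(outputs)


def flaky_items(runs_by_item, threshold=0.8):
    """Return item IDs whose repeated runs agree less often than the threshold."""
    return sorted(
        item
        for item, outputs in runs_by_item.items()
        if agreement_rate(outputs) < threshold
    )


# Hypothetical data: five runs of the same prompt per item.
runs_by_item = {
    "ticket-1": ["bug", "bug", "bug", "bug", "bug"],
    "ticket-2": ["bug", "billing", "bug", "other", "bug"],
}
flaky = flaky_items(runs_by_item)
print(flaky)
```

Items that come back flaky are the ones worth putting in front of a human first.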

Categorization - you can simplify it down to almost an ordinary ML workflow. Summarization? That takes effort to verify.
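For the categorization case, "almost ML workflows" can be as simple as comparing model labels against a hand-labeled set drawn from your own data. This is a minimal sketch using only the standard library; the label names and the two lists are hypothetical, and in practice `predicted` would come from your LLM tool.

```python
from collections import Counter


def per_label_metrics(gold, predicted):
    """Return overall accuracy and per-label precision/recall for parallel label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed
    report = {}
    for label in sorted(set(gold) | set(predicted)):
        predicted_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        report[label] = {
            "precision": tp[label] / predicted_count if predicted_count else 0.0,
            "recall": tp[label] / gold_count if gold_count else 0.0,
        }
    accuracy = sum(tp.values()) / len(gold) if gold else 0.0
    return accuracy, report


# Hypothetical hand-labeled sample vs. model output.
gold = ["billing", "billing", "bug", "bug", "other"]
predicted = ["billing", "bug", "bug", "bug", "other"]
accuracy, report = per_label_metrics(gold, predicted)
print(accuracy, report)
```

Nothing like this exists for summarization out of the box, which is why that side still needs someone reading the outputs.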