

	by furyofantares 225 days ago
	That's why you build extensive tooling to run your change hundreds of times in parallel against the context you're trying to fix, and then re-run hundreds of past scenarios in parallel to verify none of them breaks.
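A minimal sketch of what such a harness could look like, not the commenter's actual tooling. `run_agent` is a hypothetical stand-in for a real, non-deterministic model call (here it just fails randomly about 10% of the time), and the scenario format, run count, and threshold are all assumptions made for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def run_agent(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for a non-deterministic model or agent call."""
    rng = random.Random(f"{prompt}:{seed}")
    # Simulate an agent that produces a good answer roughly 90% of the time.
    return "ok" if rng.random() < 0.9 else "fail"


def pass_rate(prompt, check, runs=200, workers=16):
    """Run the same prompt `runs` times in parallel and return the pass fraction."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = list(pool.map(lambda seed: run_agent(prompt, seed), range(runs)))
    return sum(1 for out in outputs if check(out)) / runs


def regression_report(scenarios, runs=200, threshold=0.8):
    """Re-run every stored scenario; return those whose pass rate is below threshold.

    `scenarios` maps a name to a (prompt, check) pair, where `check` is a
    function that takes the agent output and returns True on success.
    """
    failures = {}
    for name, (prompt, check) in scenarios.items():
        rate = pass_rate(prompt, check, runs=runs)
        if rate < threshold:
            failures[name] = rate
    return failures
```

The idea would be to call `pass_rate` on the scenario you are trying to fix, then `regression_report` over the saved past scenarios, and only ship the change if the report comes back empty.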

2 comments

AdieuToLogic 225 days ago

In the event this comment is slathered in sarcasm:

  Well done!  :-D


ht96 225 days ago

Do you use a tool for this? Is there some sort of tool which collects evals from live inferences (especially those which fail)?


AdieuToLogic 225 days ago

There is no way to prove the correctness of non-deterministic (a.k.a. probabilistic) results for any interesting generative algorithm. All one can do is validate against a known set of tests, with the understanding that the set is unbounded over time.
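To make "validate against a known set of tests" concrete for a probabilistic system: one option (an assumption here, not something stated in the thread) is to treat each test as a pass rate over repeated runs and report a statistical lower bound rather than a single pass/fail. The sketch below uses the Wilson score interval; the function name and the 95% default are illustrative choices.

```python
import math


def wilson_lower_bound(passes: int, runs: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for an observed pass rate.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if runs == 0:
        return 0.0
    p = passes / runs
    denom = 1 + z * z / runs
    centre = p + z * z / (2 * runs)
    margin = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs))
    return (centre - margin) / denom
```

Even 100 passes out of 100 runs gives a lower bound below 1, which matches the point above: more runs raise confidence, but they never amount to a proof of correctness.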


cantor_S_drug 225 days ago

https://x.com/rerundotio/status/1968806896959402144

This is a use of Rerun that I haven't seen before!

This is pretty fascinating!!!

Typically people use Rerun to visualize robotics data - if I'm following along correctly... what's fascinating here is that Adam for his master's thesis is using Rerun to visualize Agent (like ... software / LLM Agent) state.

Interesting use of Rerun!

https://github.com/gustofied/P2Engine


aenis 225 days ago

For sure. For instance, Google has the ADK Eval framework. You write tests, and you can easily run them against a given input. I'd say it's a bit unpolished, as is the rest of the rapidly developing ADK framework, but it does exist.


saturatedfat 225 days ago

heya, building this. been used in prod for a month now, has saved my customer’s ass while building general workflow automation agents. happy to chat if ur interested.

darin@mcptesting.com

(gist: evals as a service)
