by jauntywundrkind 25 days ago
The fine-tuning loop, where we rerun tests and experiments on our prompts and setups again and again: I'm really looking forward to when we can start comparing our amalgamated rigs, harnesses, and prompts, all these systems. We are guided by intuition, and by a desire for the structure, clarity, and direction we think we add. But we lack common tools to assess and compare.

And even when we do compare, the thermal values, the entropy of our systems: that alone can lead us down very different paths. Even when all the rigging is controlled. (Which implies we need multiple experiments to compare against.)

1 comment

> we lack common tools to assess and compare

This has been bothering me for a while: the entire dev community is running on vibes when talking about AI. We're operating in an old paradigm, thinking that smart, logical additions to AGENTS.md result in good agent behavior, when in fact agent behavior is such a black box that measurement is necessary.

> Even when all the rigging is controlled. (Which implies we need multiple experiments to compare against.)

Even the rigging is hard to control. Anthropic has an interesting piece on this: https://www.anthropic.com/engineering/infrastructure-noise
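
For what it's worth, here's a rough sketch of what "multiple experiments to compare against" could look like in practice. It isn't from the linked piece. It assumes you've run the same task set several times under two setups and recorded pass/fail for each run. The data, the variable names, and the `bootstrap_diff_ci` helper are all made up for illustration, and the percentile bootstrap is just one reasonable way to put an interval on the difference.

```python
import random


def bootstrap_diff_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for mean(b) - mean(a).

    a, b: lists of per-run scores (here 1 = pass, 0 = fail).
    Returns (observed_diff, lower, upper).
    """
    rng = random.Random(seed)  # fixed seed so the analysis itself is repeatable
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the group sizes.
        ra = rng.choices(a, k=len(a))
        rb = rng.choices(b, k=len(b))
        diffs.append(sum(rb) / len(rb) - sum(ra) / len(ra))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    observed = sum(b) / len(b) - sum(a) / len(a)
    return observed, lower, upper


# Hypothetical results: one entry per repeated run of the same task.
baseline = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
with_new_agents_md = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]

diff, lower, upper = bootstrap_diff_ci(baseline, with_new_agents_md)
print(f"observed diff {diff:+.2f}, 95% interval [{lower:+.2f}, {upper:+.2f}]")
```

If the interval comfortably includes zero, the "improvement" from the new AGENTS.md may just be run-to-run noise, and the honest move is to run more trials before drawing a conclusion.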