HN Mirror



	by sally_glance 104 days ago
	This is the hard part - especially with larger initiatives, it takes quite a bit of work to evaluate what the current combination of harness + LLM is actually good at. Running experiments yourself is cumbersome and expensive, and public benchmarks are flawed. I wish providers would release at least a set of blessed example trajectories alongside new models. As it is, we're stuck with "yeah it seems this works well for bootstrapping a Next.js UI"...