


	by stephantul 37 days ago
	This is a bit rude. We didn't generate this project; we wrote it, much of it manually, and trained custom models. We'd been working in the real-time retrieval space for a while, and we thought coding was a good fit for this specific technology.


esperent 37 days ago

My comment above wasn't meant to be rude. And you do have extensive benchmarks against grep etc., so it's clear you understand the importance of that.

But I still think you're missing the harder but more important proof, which is agent evals. Have you done any of that?

I would personally love to find tools in this space that can make agents more efficient, and I do believe there's scope for massive improvements compared to default workflows. But my evals with RTK and Headroom have made me wary: a tool can look like it should work, make sense conceptually, pass non-agentic benchmarks, and still make an actual agentic workflow worse.


stephantul 37 days ago

It was directed at the parent who implied that we didn’t think about this.

I agree with your point about the evals and how you can get discontinuities: good search can be worse than bad search when agents can do many searches. We’re working on it.


esperent 37 days ago

When you share them, please also share the setup so people can easily rerun them. Nearly every eval I've seen shares the LLM session transcript but not the actual harness setup that was used.
