HN Mirror


	by skybrian 119 days ago
	Okay, but how would I write evals for my project's agents file? Any good examples out there?

2 comments

alexhans 118 days ago

I wrote https://ai-evals.io (community site) to make the concept approachable no matter what tools you choose to use.

You can learn how evals work from the repo that evaluates that site itself, https://github.com/Alexhans/eval-ception, and the same pattern should then be easy to apply to your own project.

skybrian 118 days ago

Doing an eval on itself is clever but confusing for the reader. How about a tutorial explaining how to do evals on something more normal?

alexhans 118 days ago

I'd be happy to. One thing that is tough is knowing what will resonate with the audience and not being too simple or too complex.

What do you think would resonate with you or with the audience you're thinking about?

That repo also has an illustrative eval for an Agent Skill in Airflow for localization:

https://github.com/Alexhans/eval-ception/tree/main/exams/air...

skybrian 118 days ago

How about taking a small, real open source project that has an AGENTS.md and showing how to add evals and optimize it?

The question I have is: what are we optimizing for and how do we measure it?

In your own repos, I see you have a fork of safepass, which seems like a nice simple project, but it doesn't have an agents file yet.
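One way to make the "what are we optimizing for and how do we measure it" question concrete is to treat the agents file as something you score against a fixed set of tasks, then compare pass rates before and after each change. The sketch below is only an illustration of that idea, not a method from this thread or from any of the linked tools: `EvalCase`, `run_evals`, and `fake_agent` are hypothetical names, the `make test` / `make lint` checks are invented examples, and a real setup would replace `fake_agent` with a call to an actual agent that has read the AGENTS.md.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the agent's output passes


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the agent and return the pass rate (0.0 to 1.0)."""
    if not cases:
        return 0.0
    passed = sum(1 for case in cases if case.check(agent(case.prompt)))
    return passed / len(cases)


# Hypothetical stand-in for a real agent call (assumption: a real harness
# would send the prompt to an agent that has the AGENTS.md in context).
def fake_agent(prompt: str) -> str:
    if "test" in prompt.lower():
        return "Run `make test` before committing."
    return "I am not sure."


# Invented example cases: each one encodes a fact the agents file should teach.
CASES = [
    EvalCase("knows test command", "How do I run the tests?",
             lambda out: "make test" in out),
    EvalCase("knows lint command", "How do I lint the code?",
             lambda out: "make lint" in out),
]

score = run_evals(fake_agent, CASES)
print(f"pass rate: {score:.0%}")
```

Under these assumptions, the thing being optimized is the pass rate on the case list, and an edit to the agents file counts as an improvement only if that number goes up without breaking cases that previously passed.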

popey 118 days ago

The agents are smart enough to write the evals too.

It's agents all the way down!

Submit a GitHub repo containing skills to Tessl, and it will generate the evals, run them, and present the results. https://tessl.io/registry/skills/submit

The evals and results are all shown, no login necessary, so you can assess them yourself. e.g. https://tessl.io/registry/skills/github/coreyhaines31/market... (click details to see the eval texts).

skybrian 118 days ago

At first glance this looks like an entire ecosystem full of slop, and running that eval just generates more? I'm looking for something a bit more curated.

sjmaplesec 117 days ago

No, the context can be human-created just as much as it can be LLM-generated. The suggestions are based on Anthropic best practices; they help agents activate and use the skills better, make the text clearer for the agent, etc.
