You can learn about them by evaluating that site (https://github.com/Alexhans/eval-ception), and then the pattern should be easy to test on your own project.
What do you think would resonate with you or with the audience you're thinking about?
That repo also has an illustrative eval for an Agent Skill in Airflow for localization:
https://github.com/Alexhans/eval-ception/tree/main/exams/air...
The question I have is: what are we optimizing for and how do we measure it?
In your own repos, I see you have a fork of safepass, which seems like a nice simple project, but it doesn't have an agents file yet.