Hacker News
by loumaciel 108 days ago
Happy to answer questions about the sandboxing, artifact format, or the benchmark setup.

The benchmark harness and datasets are in the repo if anyone wants to reproduce or extend the tests. Curious if others have run into the same context compaction issues with tool-heavy agents.