by operatingthetan 63 days ago
A more interesting benchmark would probably be one that scores the LLM on finding exploits in the benchmark itself.