Hacker News new | ask | show | jobs
by andai 73 days ago
Is there a benchmark for these long tasks? That kind of seems like the only number worth measuring.

(Of course at that point it involves memory and context management and so on, so you're testing the harness as well as the model.)