Hacker News
by imbusy111, 203 days ago
Funny to see tau2-bench on the list of benchmarks when tau2-bench is flawed and a 100% score is impossible unless you add the tasks to the training set:
https://github.com/sierra-research/tau2-bench/issues/89