by kevc
214 days ago
It feels like we are pretty far away from LLMs running a concession stand (see Andon Labs), so I'm not surprised it would struggle here. Still, the failure modes are super interesting, and having benchmarks seems to be the starting point for domain-specific improvements.