by shubhamintech 104 days ago
The conversation around measuring task duration misses what most product teams actually care about: not whether the agent can complete a long autonomous run, but whether users are getting value.

The signal that matters for shipped products is different: what are users trying to accomplish, where do they give up mid-conversation, and what does the agent consistently fail at from the user's perspective? Task duration is a capability benchmark. Intent and drop-off analytics are product health metrics.

Most teams building AI agents right now are flying completely blind on the latter. They have LLM observability (latency, token cost, evals) but zero visibility into user behavior patterns inside their agent. Those are two very different problems with two very different buyers.