by isoldex 33 days ago

Hi HN - author of Sentinel here.

I built Sentinel after using browser-use and Stagehand on a client project and hitting two recurring issues: flaky reliability on multi-step flows, and token costs that ate the budget on anything non-trivial. I suspected the root cause was architectural - both lean on the LLM re-reading large portions of the page each step - and tried Chrome's Accessibility Object Model (AOM) as the observation layer instead.

To check whether that architectural choice actually mattered, I built a 9-task benchmark comparing Sentinel, Stagehand, and browser-use on the same Gemini 3 Flash Preview model, with the same prompts and the same programmatic validators, 5 runs per task-tool combination. Raw per-run JSON is committed so you can recompute or challenge every number.
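
For anyone who wants to recompute the numbers, a minimal sketch of the aggregation might look like the following. The record keys (`tool`, `task`, `success`, `tokens`) are hypothetical and will likely need to be mapped to whatever the committed JSON actually uses; the choice of median as the token summary is also an assumption, not necessarily what the benchmark reports.

```python
from collections import defaultdict
from statistics import median


def summarize(runs):
    """Group per-run records by (tool, task) and compute simple aggregates.

    `runs` is an iterable of dicts. The keys "tool", "task", "success",
    and "tokens" are assumed names, not the benchmark's confirmed schema.
    """
    grouped = defaultdict(list)
    for run in runs:
        grouped[(run["tool"], run["task"])].append(run)

    summary = {}
    for key, items in grouped.items():
        passed = sum(1 for r in items if r["success"])
        summary[key] = {
            "runs": len(items),
            "success_rate": passed / len(items),
            "median_tokens": median(r["tokens"] for r in items),
        }
    return summary
```

Load the per-run files with `json.load`, flatten them into one list of records, and pass that list to `summarize`.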

Headline numbers:
- Tokens: Sentinel uses 3.1x-56.9x fewer than browser-use, 1.4x-13.3x fewer than Stagehand.
- Reliability: Sentinel 100% (45/45), browser-use 100% (45/45), Stagehand 86.7% (39/45).
- Speed: Sentinel is fastest on 5 of 9 tasks.
- The harder the task, the bigger the token gap.
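
The reliability percentages follow directly from the run counts: 9 tasks at 5 runs each gives 45 runs per tool. A quick sketch of that arithmetic, using only the pass counts quoted above (the variable names are illustrative):

```python
TASKS = 9
RUNS_PER_TASK = 5
TOTAL_RUNS = TASKS * RUNS_PER_TASK  # 45 runs per tool

# Pass counts as quoted in the headline numbers.
passed = {"Sentinel": 45, "browser-use": 45, "Stagehand": 39}


def success_rate(passed_runs, total_runs):
    """Return the success rate as a percentage rounded to one decimal."""
    return round(100 * passed_runs / total_runs, 1)


rates = {tool: success_rate(n, TOTAL_RUNS) for tool, n in passed.items()}
```

With 45 runs per tool, each failed run moves the rate by roughly 2.2 percentage points, so Stagehand's figure corresponds to 6 failed runs.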

Caveats up front:
- I built Sentinel - treat this as a starting point for your own verification, not an impartial survey. README has a full known-limitations section.
- Single model (Gemini 3 Flash Preview, which is also Stagehand's documented recommendation).
- 9 tasks is small; raw JSON is there if you want to add tasks or rerun on a different model.
- Each framework is used with its idiomatic API (Sentinel/Stagehand: discrete act()/extract(); browser-use: agent-loop prompt). Forcing them into the same call pattern would disadvantage whichever is optimized for the other.

Sentinel is already in production with paying clients (all self-hosted), which covers development costs. A managed offering is on the table if there's real demand: you'd pay infra + model usage at cost, no margin. Drop a comment if that would unblock you, otherwise I'd rather not maintain hosting nobody needs.