Hi HN - author of Sentinel here. I built Sentinel after using browser-use and Stagehand on a client
project and hitting two recurring issues: flaky reliability on
multi-step flows, and token costs that ate the budget on anything
non-trivial. I suspected the root cause was architectural - both
lean on the LLM re-reading large portions of the page each step -
and tried Chrome's Accessibility Object Model (AOM) as the
observation layer instead.

To check whether that architectural choice actually mattered, I
built a 9-task benchmark comparing Sentinel, Stagehand, and
browser-use on the same Gemini 3 Flash Preview model, with the same
prompts and the same programmatic validators, 5 runs per task-tool combo.
Raw per-run JSON is committed so you can recompute or challenge
every number.

Headline numbers:
- Tokens: Sentinel uses 3.1x-56.9x fewer than browser-use,
1.4x-13.3x fewer than Stagehand.
- Reliability: Sentinel 100% (45/45), browser-use 100% (45/45),
Stagehand 86.7% (39/45).
- Speed: Sentinel is fastest on 5 of 9 tasks.
- The harder the task, the bigger the token gap.

Caveats up front:
- I built Sentinel - treat this as a starting point for your own
verification, not an impartial survey. README has a full
known-limitations section.
- Single model (Gemini 3 Flash Preview, which is also Stagehand's
documented recommendation).
- 9 tasks is small; raw JSON is there if you want to add tasks
or rerun on a different model.
- Each framework is used with its idiomatic API (Sentinel/Stagehand:
discrete act()/extract(); browser-use: agent-loop prompt).
Forcing them into the same call pattern would disadvantage
whichever is optimized for the other.

Sentinel is already in production with paying clients (all
self-hosted), which covers development costs.
A managed offering is on the table
if there's real demand: you'd pay infra + model usage at cost, no
margin. Drop a comment if that would unblock you; otherwise I'd
rather not maintain hosting nobody needs.
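
For anyone who wants to recompute the reliability and token-ratio numbers from the raw per-run JSON, here is a minimal sketch of the arithmetic. The record layout (`tool`, `task`, `success`, `tokens`) and the helper names are assumptions for illustration, not the repo's actual schema, so adjust the field names to match the committed files.

```python
import json
from collections import defaultdict
from statistics import mean


def load_runs(path):
    """Load a list of per-run records from one JSON file (assumed layout)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def success_rates(runs):
    """Return {tool: (passed, total, percent)} across all tasks."""
    counts = defaultdict(lambda: [0, 0])  # tool -> [passed, total]
    for run in runs:
        counts[run["tool"]][1] += 1
        if run["success"]:
            counts[run["tool"]][0] += 1
    return {
        tool: (passed, total, round(100 * passed / total, 1))
        for tool, (passed, total) in counts.items()
    }


def token_ratios(runs, baseline, candidate):
    """Return {task: mean baseline tokens / mean candidate tokens}."""
    tokens = defaultdict(list)  # (tool, task) -> token counts per run
    for run in runs:
        tokens[(run["tool"], run["task"])].append(run["tokens"])
    tasks = {task for (_, task) in tokens}
    # Only compare tasks where both tools have at least one recorded run.
    return {
        task: mean(tokens[(baseline, task)]) / mean(tokens[(candidate, task)])
        for task in sorted(tasks)
        if (baseline, task) in tokens and (candidate, task) in tokens
    }
```

With 9 tasks and 5 runs each, every tool should show a total of 45 in `success_rates`, and a tool passing 39 of them comes out at 86.7%. `token_ratios(runs, "browser-use", "sentinel")` would give one per-task multiplier, and the min and max over tasks correspond to a range like the ones quoted above (assuming the headline ranges are ratios of per-task means, which is my reading rather than something the post states).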