Hacker News new | ask | show | jobs
by istinetz 87 days ago
very neat!

how do you combat silent failures?

for example, I am scraping website A, getting 500+ pdf files; then they change their layout, the ETL breaks, we autoregenerate it with Claude, but then we get only 450 PDFs. The orchestrator still marks it as a successful run, but we get only part of the data.

Or: the ETL for website B breaks. We use our agentic solution, we successfully repair it, and it completes without errors, but we start missing a few fields that were moved in another sub-page.

Did you encounter any such issues?

1 comments

Thanks for the comment, great question.

Quick clarification: the AI agent writes the config once and is out of the loop after that. You run crawls yourself or via cron. So the "auto-regenerate and silently get wrong data" scenario doesn't quite apply since there's no agent in the runtime loop.

But configs going stale is a real problem. Two things help:

1. The agent tests on 5 real pages before saving any config. Empty fields = rewrite before it hits production.

2. `./scrapai health --project <n>` tests all your spiders and flags extraction failures. We run it monthly via cron. Broken spider? Point the agent at it, it re-analyzes and fixes.

The gap: result count drops (your 500 to 450 example). Health checks catch broken extraction, not "fewer pages matched." We list structural change detection as an open contribution area in the README.