by istinetz
87 days ago
Very neat! How do you combat silent failures? For example, I am scraping website A and getting 500+ PDF files; then they change their layout, the ETL breaks, we auto-regenerate it with Claude, but then we get only 450 PDFs. The orchestrator still marks it as a successful run, but we get only part of the data. Or: the ETL for website B breaks. We use our agentic solution, we successfully repair it, and it completes without errors, but we start missing a few fields that were moved to another sub-page. Did you encounter any such issues?
Quick clarification: the AI agent writes the config once and is out of the loop after that. You run crawls yourself or via cron. So the "auto-regenerate and silently get wrong data" scenario doesn't quite apply since there's no agent in the runtime loop.
But configs going stale is a real problem. Two things help:
1. The agent tests on 5 real pages before saving any config. Empty fields = rewrite before it hits production.
2. `./scrapai health --project <n>` tests all your spiders and flags extraction failures. We run it monthly via cron. Broken spider? Point the agent at it, it re-analyzes and fixes.
The gap: result count drops (your 500 to 450 example). Health checks catch broken extraction, not "fewer pages matched." We list structural change detection as an open contribution area in the README.
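For anyone who wants a stopgap for that gap, here is a rough sketch of a post-crawl check you could run yourself. It is not part of scrapai; `count_dropped`, `fill_rates`, the 5% tolerance, and the item/field shapes are all hypothetical names and assumptions for illustration. It compares the current item count against the median of earlier runs and reports how often each field is non-empty.

```python
from statistics import median


def count_dropped(current: int, previous: list[int], tolerance: float = 0.05) -> bool:
    """Return True if `current` is more than `tolerance` below the median
    item count of earlier runs. The 5% default is an arbitrary assumption."""
    if not previous:
        return False  # no baseline yet, nothing to compare against
    baseline = median(previous)
    return current < baseline * (1 - tolerance)


def fill_rates(items: list[dict], fields: list[str]) -> dict[str, float]:
    """Return, per field, the fraction of items with a non-empty value."""
    total = len(items)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for item in items if item.get(field) not in (None, "", [])) / total
        for field in fields
    }


if __name__ == "__main__":
    # Hypothetical history of item counts from earlier successful crawls.
    history = [500, 510, 505]
    if count_dropped(450, history):
        print("WARNING: item count fell well below the recent median")

    # Hypothetical scraped items; the field names are made up for the example.
    items = [{"title": "a", "date": ""}, {"title": "b", "date": "2024"}]
    print(fill_rates(items, ["title", "date"]))
```

Using the median rather than the last run keeps one unusually large or small crawl from skewing the baseline. A field whose fill rate falls sharply between runs is a hint that it moved to another sub-page, which is the second case in the question.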