You’re not only sacrificing simplicity for determinism but also stability. What good is being deterministic when the underlying web page keeps changing all the time, breaking your selectors? This seems like a more stable approach.
Well, evidently it's good for the potential customer above. Deterministic can mean multiple things, though. For example, if you are scraping the main headline on a page with a selector and the class is .headline, when .headline goes away and the content becomes
<div class="t1110 C373">My cool fashions!</div>
then some would say the crawler should pick up the new version of the title, while others would say it should warn you that the selector no longer works.
If the crawler determines via AI what the headline is, and you have crawled 10,000 pages before finding out that it got the headline wrong, you might be soured on the idea of AI making this kind of decision for you and be more amenable to being warned. But then you have to do a lot more work with your crawler than you might otherwise want to do.
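To make the "warn me" option concrete, here is a minimal sketch using only Python's standard library. The names SelectorMissing, ClassTextParser and extract_headline are made up for illustration, and matching on a single class name is a simplifying assumption, not how any particular crawler works.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class SelectorMissing(Exception):
    """Raised when the expected class no longer appears on the page."""


class ClassTextParser(HTMLParser):
    """Collects the text of the first element carrying a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0        # > 0 while inside the matched element
        self.matched = False  # True once the class has been seen
        self.done = False     # True once the matched element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted_class in classes:
            self.matched = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_headline(html, wanted_class="headline"):
    """Return the headline text, or raise instead of guessing."""
    parser = ClassTextParser(wanted_class)
    parser.feed(html)
    parser.close()
    if not parser.matched:
        raise SelectorMissing(f"no element with class {wanted_class!r}")
    return "".join(parser.chunks).strip()
```

With the renamed div from the example above, extract_headline raises SelectorMissing rather than returning a guess. The other camp would catch that exception and fall back to an AI extractor at that point.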
I find it encouraging that the homepage isn't AI generated, maybe the code also isn't, and the project may live more than a month before requiring a rewrite!
Would you run the LLM extractor across every page? Especially for larger scale projects, such as scraping entire product catalogues, this sounds very expensive. Maybe you could use the AI to generate selectors from examples that can then be applied to all other pages of the same structure?
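A rough sketch of that idea: pay for the model once per site and reuse the selector it proposes. Here propose_selector is a hypothetical stand-in for whatever LLM call would generate a selector from a sample page, and keying the cache on the hostname assumes every page on a host shares one layout, which is a crude simplification.

```python
from urllib.parse import urlparse


class SelectorCache:
    """Remembers one selector per host so the model runs once per site."""

    def __init__(self, propose_selector):
        # propose_selector: hypothetical callable (html -> selector string),
        # standing in for the expensive LLM step.
        self.propose_selector = propose_selector
        self.selectors = {}
        self.llm_calls = 0

    def selector_for(self, url, sample_html):
        host = urlparse(url).netloc
        if host not in self.selectors:
            # Only the first page seen for a host pays for the model.
            self.llm_calls += 1
            self.selectors[host] = self.propose_selector(sample_html)
        return self.selectors[host]


# Usage with a fake proposer, just to show the call pattern.
cache = SelectorCache(lambda html: ".product-title")
for path in ("/p/1", "/p/2", "/p/3"):
    selector = cache.selector_for("https://shop.example" + path, "<html>...</html>")
print(selector, cache.llm_calls)
```

In practice you would probably want to invalidate a cached selector when it stops matching, which ties back to the selector-breakage discussion earlier in the thread.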
This will definitely take the burden of choosing selectors off clients (mostly non-technical people). I recently ran a scraping service business for this specific reason. I hope AI can be a great middle player here. Let's see how it turns out.
Good luck Kai.
I don't think this community agrees with you that web scraping is inherently unethical.
In fact I think many (most?) in this community would argue that web scraping is an almost fundamental feature of the web itself, and that attempts at preventing it are more unethical than scraping.
What's wrong with proxy rotation? Big Tech attempts to prevent any scraping of their content whatsoever. In the context of that web, proxy rotation is table stakes.
When scraping websites, it’s often necessary to change your IP address to bypass the website’s anti-scraping measures. To achieve this, there are proxy services designed with web scraping in mind, so it’s easy to programmatically change your IP address from within a scraper program.
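For anyone who hasn't seen it, the mechanics are roughly this: a round-robin sketch with Python's standard library. The proxy URLs are placeholders (a real proxy service would hand you its own endpoints and credentials), and rotating_openers is a name invented for this example.

```python
import itertools
import urllib.request

# Placeholder endpoints; these are assumptions, not real proxies.
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
]


def rotating_openers(proxies):
    """Yield (proxy, opener) pairs forever, cycling through the proxies."""
    for proxy in itertools.cycle(proxies):
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        yield proxy, urllib.request.build_opener(handler)


# Each request takes the next opener, so consecutive requests
# leave through different proxies:
#   proxy, opener = next(openers)
#   body = opener.open(url, timeout=10).read()
openers = rotating_openers(PROXIES)
```

Round-robin is the simplest policy. Picking a proxy at random or retiring ones that start getting blocked are common variations.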
You basically switch out the proxy you use to scrape. Services by Google or others can identify scrapers because they'll use the same proxy to request paged