by motoxpro 1118 days ago
Maybe the tagline could be “extract anything from the web without the selectors” or something. Then put your AI stuff in the subtitle.

Either way, the grammar doesn’t quite work right now.

Definitely a cool idea though!

Web scraping needs to be 100% deterministic. My only question is if/how you’ve achieved that.

2 comments

You’re not only sacrificing simplicity for determinism but also stability. What good is being deterministic when the underlying web page keeps changing all the time, breaking your selectors? This seems like a more stable approach.
However, if the selectors break, you can notice that quite easily.
That’s true. Perhaps using the LLM approach you could extract a deterministic selector, and notify the users if it changes in some meaningful way.
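A rough sketch of that idea in Python, in case it helps. Everything named here is made up for illustration: `propose_selector` is a hypothetical stand-in for whatever LLM call would pick the selector, and `notify` is any callback you like. Treating every change as "meaningful" is an assumption too; a real version would probably want a smarter comparison.

```python
from typing import Callable, Dict


class SelectorCache:
    """Remember one selector per site and report when it changes."""

    def __init__(
        self,
        propose_selector: Callable[[str], str],  # hypothetical LLM-backed function
        notify: Callable[[str], None],           # hypothetical alerting callback
    ) -> None:
        self._propose = propose_selector
        self._notify = notify
        self._selectors: Dict[str, str] = {}

    def refresh(self, site: str, html: str) -> str:
        """Re-derive the selector for a site and notify if it differs from the stored one."""
        new = self._propose(html)
        old = self._selectors.get(site)
        if old is not None and old != new:
            self._notify(f"{site}: selector changed from {old!r} to {new!r}")
        self._selectors[site] = new
        return new
```

Between refreshes the stored selector is used as-is, so day-to-day extraction stays deterministic and the model is only consulted when you choose to re-check.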
> Web scraping needs to be 100% deterministic

Says who?

Well, evidently the potential customer above. Deterministic can mean multiple things, though. For example, say you are scraping the main headline on a page with a selector, and the class is .headline. Then .headline goes away and the content becomes <div class="t1110 C373">My cool fashions!</div>

Someone can say that the crawler should get the new version of the title, but others can say it should warn you that the selector no longer works.

If the crawler determines via AI what the headline is, and after crawling 10000 pages it turns out the crawler got the headline wrong, then you might be soured on the idea of AI making this kind of decision for you and be more amenable to being warned. But then you have to do a lot more work with your crawler than you might otherwise want to do.
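For what it's worth, the "warn me" behavior is cheap to sketch with just the Python standard library. This is a minimal illustration, not anyone's actual crawler: `ClassTextCollector`, `SelectorBroken`, and `extract_headline` are names invented here, and it only matches on a single class name rather than a full CSS selector.

```python
from html.parser import HTMLParser
from typing import List

# Elements that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SelectorBroken(Exception):
    """Raised when the expected class matches nothing on the page."""


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying a given class."""

    def __init__(self, css_class: str) -> None:
        super().__init__()
        self.css_class = css_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.matches: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1
        elif self.css_class in classes:
            self.depth = 1
            self.matches.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.matches[-1] += data


def extract_headline(html: str, css_class: str = "headline") -> str:
    """Return the first matching element's text, or fail loudly instead of guessing."""
    parser = ClassTextCollector(css_class)
    parser.feed(html)
    parser.close()
    if not parser.matches:
        raise SelectorBroken(f"no element with class {css_class!r}")
    return parser.matches[0].strip()
```

With the old markup this returns the headline; with the `t1110 C373` version it raises `SelectorBroken`, which is the point where you could either alert a human or hand the page to an AI fallback.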