Hacker News
by jawns 1205 days ago
How does crul handle the dynamic nature of the web?

Yes, content changes, but so does structure. If I'm interested in content that shows up in a news feed div, and that div is renamed or moved as part of a site redesign, what happens?

I've worked on a bunch of tools in the past that do similar things, and structural changes were the kryptonite for all of them.

A secondary problem is when you use particular content as a reference point, and that content is later updated. Now your reference point is gone!

1 comments

At this point we're treating it as a foundational concept to build around: web content changes, so our best option for now is to make the query as easy as possible to change and to alert when things break.
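
To make "alert when things break" concrete, here is a minimal sketch of one way such a check could work. It is not crul's actual implementation. The `news-feed` class name, the `check_selector` helper, and the alert text are all hypothetical, and it uses only Python's standard-library HTML parser.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Counts elements that carry a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.matches = 0

    def handle_starttag(self, tag, attrs):
        # A valueless class attribute is reported as None, hence the `or ""`.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.matches += 1


def check_selector(html, class_name):
    """Return a warning string if the expected class is missing, else None."""
    finder = ClassFinder(class_name)
    finder.feed(html)
    if finder.matches == 0:
        return (
            f"ALERT: no element with class '{class_name}'; "
            "the page structure may have changed"
        )
    return None
```

Run against each fetched page, a check like this separates "the div is gone" from "the div is there but empty", which is usually the first thing you want to know when a query stops returning data.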

We have done some preliminary work on AI and other pattern-recognition techniques to handle structural changes better, but there is still a lot of work to do.

But the expanding and querying concepts also make a lot of sense with APIs, which tend to be a little more stable.

IMO, this is the hardest part of maintaining a web scraper. We had ~100 scripts to scrape ~1000 clients' sites, and it took at least 50 hours a week to keep up with changes.

The second hardest part was that 30% of our clients used the same hosting provider, which would start to fail at 10-20 req/s. We had to throttle requests to those sites by IP, cluster-wide.
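
For anyone wondering what throttling by IP looks like, here is a rough single-process sketch. It is not the setup described above: the `PerIPThrottle` class and the 5 req/s limit are assumptions for illustration, and a cluster-wide version would need to keep this state in a store shared by all workers.

```python
import time


class PerIPThrottle:
    """Enforces a minimum interval between requests to the same IP."""

    def __init__(self, max_rps, clock=time.monotonic):
        self.min_interval = 1.0 / max_rps
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.next_allowed = {}  # ip -> earliest time the next request may start

    def reserve(self, ip):
        """Reserve the next slot for `ip` and return the wait time in seconds."""
        now = self.clock()
        start = max(now, self.next_allowed.get(ip, now))
        self.next_allowed[ip] = start + self.min_interval
        return start - now
```

A worker would call `time.sleep(throttle.reserve(ip))` before each request. Keying on the resolved IP rather than the hostname is what makes sites on the same shared host count against one limit.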

This makes sense and I am curious about this. Was there consistency between those 1k client sites or were they all rather different? Mind if I reach out?