HN Mirror


by jawns 1205 days ago

How does crul handle the dynamic nature of the web?

Yes, content changes, but so does structure. If I'm interested in content that shows up in a news feed div, and that div is renamed or moved as part of a site redesign, what happens?

I've worked on a bunch of tools in the past that do similar things, and structural changes were the kryptonite for all of them.

A secondary problem is when you use particular content as a reference point, and that content is later updated. Now your reference point is gone!


portInit 1205 days ago

At this point we're considering it a foundational concept to build around - web content changes, so our best option currently is to make the query as easy as possible to change, and alert when things break.

We have done some preliminary work on AI-based pattern recognition to handle structural changes better, but there is still a lot of work to do.

But the expanding and querying concepts also make a lot of sense with APIs, which tend to be a little more stable.
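To make the "easy to change, and alert when things break" idea concrete, here is a minimal sketch. It is not crul's actual implementation: the pattern list, `extract_first`, and the `alert` callback are all hypothetical names, and a real scraper would use a proper HTML parser rather than regular expressions.

```python
import re
from typing import Callable, List, Optional

# Hypothetical patterns for the same content, newest layout first.
# Keeping them in one list means a redesign only requires editing this list.
NEWS_FEED_PATTERNS: List[str] = [
    r'<div id="news-feed">(.*?)</div>',
    r'<div class="feed">(.*?)</div>',
]


def extract_first(
    html: str,
    patterns: List[str],
    alert: Callable[[str], None],
) -> Optional[str]:
    """Return the first match among the patterns, or alert and return None."""
    for pattern in patterns:
        match = re.search(pattern, html, re.DOTALL)
        if match:
            return match.group(1).strip()
    # Nothing matched: the page structure probably changed, so tell someone.
    alert(f"no pattern matched; tried {len(patterns)} patterns")
    return None
```

Passing `print` (or anything that posts to a chat channel) as `alert` is enough to find out about a breakage when it happens instead of discovering empty results later.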


robbs 1205 days ago

IMO, this is the hardest part of maintaining a web scraper. We had ~100 scripts to scrape ~1000 clients' sites and it took, at minimum, 50 hours a week to keep up with changes.

The second hardest part was that 30% of our clients used the same hosting provider, which would start to fail at 10-20 req/s. We had to throttle the sites by IP, cluster-wide.
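For anyone wondering what throttling by IP might look like, a rough single-process sketch follows. It is an illustration, not the setup described above: `PerIpThrottle` is a hypothetical name, the 5 req/s cap is an assumed safe margin below the 10-20 req/s failure point, and a cluster-wide version would need the per-IP state in a shared store rather than a local dict.

```python
import time
from typing import Callable, Dict


class PerIpThrottle:
    """Enforce a minimum interval between requests to the same IP."""

    def __init__(
        self,
        max_rate_per_sec: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / max_rate_per_sec
        self.clock = clock
        self.sleep = sleep
        # Earliest time at which the next request to each IP is allowed.
        self.next_allowed: Dict[str, float] = {}

    def wait(self, ip: str) -> float:
        """Block until a request to `ip` is allowed; return seconds waited."""
        now = self.clock()
        ready_at = self.next_allowed.get(ip, now)
        delay = max(0.0, ready_at - now)
        if delay > 0:
            self.sleep(delay)
        # Reserve the next slot, starting from when this request actually goes out.
        self.next_allowed[ip] = max(now, ready_at) + self.min_interval
        return delay
```

The key has to be the resolved IP rather than the hostname (for example via `socket.gethostbyname`), since many different client domains can sit on the same host. Call `throttle.wait(ip)` right before each request.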


portInit 1205 days ago

This makes sense and I am curious about this. Was there consistency between those 1k client sites or were they all rather different? Mind if I reach out?
