by gt50201 2236 days ago
At a previous company we scraped customers' web pages to load customer and product information, because the number of systems and teams involved in connecting to their CDP meant it would have taken months to get up and running. A few of the problems we had to solve were the web pages changing underneath us and some pages not being formatted the same way. We ended up detecting those changes server-side so that we were alerted to update the scraper. How do you all deal with these use-cases?

Also, we tried to leverage https://www.diffbot.com/, but the lack of accuracy, the incomplete data, and the cost never justified its use.
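
For anyone wondering what the server-side change detection above might look like, here is a minimal sketch, not what we actually ran. It assumes you keep a baseline signature per page and alert when a fresh fetch no longer matches it. The names (StructureCollector, structure_signature, layout_changed) are made up for illustration, and it only uses the Python standard library.

```python
import hashlib
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StructureCollector(HTMLParser):
    """Records the tag/class path of every element and ignores text content."""

    def __init__(self):
        super().__init__()
        self.stack = []   # labels of currently open elements
        self.paths = []   # one path string per element, in document order

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        label = tag + "".join("." + c for c in sorted(classes.split()))
        self.paths.append("/".join(self.stack + [label]))
        if tag not in VOID_TAGS:
            self.stack.append(label)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerate sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break


def structure_signature(html):
    """Hash of the page's element structure (hypothetical helper)."""
    collector = StructureCollector()
    collector.feed(html)
    collector.close()
    joined = "\n".join(collector.paths)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def layout_changed(baseline_signature, html):
    """True when the page structure no longer matches the stored baseline."""
    return structure_signature(html) != baseline_signature
```

Because text is ignored, a price or product-name change leaves the signature alone, while a new wrapper element or a renamed class flips it. In practice you would probably hash only the region the scraper reads, otherwise every banner tweak raises an alert.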

1 comment

Yes, very good question (I've answered a similar one on detecting whether the page has finished loading).

It's a deceptively hard problem. Essentially, we fingerprint the element. If the page changes, finding the element after it has moved or changed comes down to how effective our fingerprinting and search algorithms are.

The algorithms behind that are good enough for most use-cases now, but it's something we're continuously iterating on.
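
To make the fingerprint-and-search idea concrete, here is a rough sketch. It is not our production algorithm, which isn't described in this thread; it just assumes a fingerprint made of the tag, attributes, direct text, and ancestor tags, and a search that scores every element on the new page against it. The weights, the threshold, and all names (ElementCollector, similarity, find_element) are arbitrary choices for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ElementCollector(HTMLParser):
    """Collects one record per element: tag, attrs, ancestor tags, direct text."""

    def __init__(self):
        super().__init__()
        self.open = []       # currently open elements
        self.elements = []   # every element seen, in document order

    def handle_starttag(self, tag, attrs):
        el = {"tag": tag,
              "attrs": {k: v or "" for k, v in attrs},
              "path": [e["tag"] for e in self.open],
              "text": ""}
        self.elements.append(el)
        if tag not in VOID_TAGS:
            self.open.append(el)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element.
        for i in range(len(self.open) - 1, -1, -1):
            if self.open[i]["tag"] == tag:
                del self.open[i:]
                break

    def handle_data(self, data):
        if self.open:
            self.open[-1]["text"] += data.strip()


def parse_elements(html):
    collector = ElementCollector()
    collector.feed(html)
    collector.close()
    return collector.elements


def jaccard(a, b):
    """Overlap of two collections as sets: 1.0 is identical, 0.0 is disjoint."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def similarity(fp, el):
    """Weighted score in [0, 1]; the weights are assumptions, not tuned values."""
    score = 0.3 * (fp["tag"] == el["tag"])
    score += 0.4 * jaccard(fp["attrs"].items(), el["attrs"].items())
    score += 0.2 * jaccard(fp["text"].split(), el["text"].split())
    score += 0.1 * jaccard(fp["path"], el["path"])
    return score


def find_element(fp, html, threshold=0.5):
    """Best-scoring element on the page, or None if nothing is close enough."""
    best = max(parse_elements(html),
               key=lambda el: similarity(fp, el), default=None)
    if best is None or similarity(fp, best) < threshold:
        return None
    return best
```

The fingerprint here is just the record for the element as it looked on the old page. If a button gains a class and gets wrapped in a new section, it still scores highest, because its tag, id, and text are unchanged. Returning None below the threshold is the signal that a human needs to take a look.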