Hacker News
by autonomousErwin 308 days ago
I wonder if not just checking the site every day (or minute) would solve for this.

It's not necessarily the structure of the source data (the DOM, the HTML, etc.) but rather the translator that needs to be contractually consistent. The translator in this case is the service for the endpoints.


> I wonder if not just checking the site every day (or minute) would solve for this.

No, because a webpage makes no promise not to change. Even if you check every minute, can your system handle random one-minute periods of unpredictable behavior? What if they remove data? What if the meaning of the data changes (e.g. instead of a maximum value for some field they now show the average value)? How would your system deal with that? What if they are running an A/B test and 10% of your ‘API’ requests return a different page?

This is not a technical problem and the solution is not a technical one. You need to have some kind of relationship with the entity whose data you are consuming or be okay with the fact that everything can just stop working at any random moment in time.

You download the HTML, hash it with SHA-512, then run the AI and the web scraping and cache the API content.

When the cache is invalidated, you refetch the HTML, check the SHA-512 hash to see if anything changed, then proceed based on yes or no.

Or something like that. It's not fast, but hashing and comparing is fast compared to inference anyway.
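
A minimal Python sketch of that hash-then-compare idea, to make the flow concrete. The class name `HashGatedCache` and the `fetch_html` and `extract` callables are hypothetical stand-ins (for an HTTP fetch and for the slow AI/scraping step); nothing here comes from the project being discussed.

```python
import hashlib
from typing import Callable


class HashGatedCache:
    """Re-run an expensive extraction only when the page's SHA-512 digest changes."""

    def __init__(self, fetch_html: Callable[[str], str], extract: Callable[[str], dict]):
        self._fetch_html = fetch_html  # hypothetical: returns the raw HTML for a URL
        self._extract = extract        # hypothetical: the slow AI/scraping step
        self._entries: dict[str, tuple[str, dict]] = {}  # url -> (digest, extracted data)

    def refresh(self, url: str) -> tuple[dict, bool]:
        """Refetch the page and return (data, changed)."""
        html = self._fetch_html(url)
        digest = hashlib.sha512(html.encode("utf-8")).hexdigest()

        cached = self._entries.get(url)
        if cached is not None and cached[0] == digest:
            # Byte-identical page: reuse the earlier extraction, skip inference.
            return cached[1], False

        # New or changed page: pay for extraction once and remember the digest.
        data = self._extract(html)
        self._entries[url] = (digest, data)
        return data, True
```

Calling `refresh(url)` whenever the cache entry expires gives the yes/no branch described above. Note that a whole-page hash changes on any byte difference (timestamps, tokens, rotating ads), and it only tells you that something changed, not whether the extracted data still means the same thing.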

I’m not sure what that would solve? Your API call is still broken. Best case you’re serving stale data.

That's just part and parcel of relying on third parties - you should always price in the maintenance burden of keeping up with potential changes on their end. That burden is a lot lower if the third party cooperates with you and provides an explicit contract and backwards compatibility, but it's still not zero.

It’s not about the maintenance cost, it’s about continuity of service. If you scrape a website things may break at any time. If you use a proper API and have a contract with the supplier you will have the opportunity to make any changes before things break.