by ilaksh 1419 days ago
Stop using regular expressions. Honestly, it sounds like you were having trouble with them and blamed JavaScript.

Regular expressions are the worst.

Also, scraping web pages and then storing the raw HTML seems like a bad idea. You would want to extract the data at the point of scraping.

Also, regular expressions are the worst way to extract data from HTML.

I think the real problem is the business model and technical approach. It almost sounds like you are scraping a massive number of websites using regexes. That is guaranteed to break somewhere on an almost daily basis, since developers change the HTML and regexes are very brittle.
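To make the brittleness concrete, here is a minimal sketch in Python using only the standard library. The markup, the "price" class, and the function names are all made up for illustration; I have no idea what the actual scraped pages look like. The point is only that a harmless change to the attributes breaks the regex, while a parser-based lookup keeps working.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup: the same price before and after a harmless redesign.
OLD_HTML = '<span class="price">19.99</span>'
NEW_HTML = '<span data-sku="42" class="price bold">19.99</span>'

# A typical scraping regex, tied to the exact attribute layout.
PRICE_RE = re.compile(r'<span class="price">([^<]+)</span>')


def price_by_regex(html):
    """Return the first price matched by the regex, or None."""
    match = PRICE_RE.search(html)
    return match.group(1) if match else None


class PriceParser(HTMLParser):
    """Collects the text of any <span> whose class list includes 'price'."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "price" in classes:
            self.in_price = True

    def handle_endtag(self, tag):
        # Simplification: assumes price spans contain no nested spans.
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())


def price_by_parser(html):
    """Return the first price found by the parser, or None."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices[0] if parser.prices else None
```

On OLD_HTML both functions find the price. On NEW_HTML the regex finds nothing, because an extra attribute and an extra class name changed the opening tag, while the parser version still finds it. A parser will not save you from a real redesign, but it does survive the cosmetic changes that happen all the time.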

You are probably taking advantage of a lot of other people's work, possibly quite a bit of it copyrighted and certainly most of it NOT curated with the idea of some other company profiting off of it.

So you deserve your pain.

But don't blame JavaScript or lack of familiarity with new versions when it's all of your other life choices that are the real problem.

1 comment

I would store the HTML as well, for later use / hoarding.