| HN Mirror


by geuis 3710 days ago

I built something almost identical in 2011. It really doesn't have as much utility in practice as you'd initially think. CSS selectors are an interesting idea for extracting data from pages, but they're extremely fragile. You have to either parse the page's raw HTML using something like jsdom, or run it through a headless browser like Phantom. In the first case, it completely fails for any modern SPA (Angular, React, etc.), because the data isn't in the HTML until the JavaScript runs. In the second case, Phantom is painfully slow and difficult to interact with, and often doesn't run/render an SPA the way a regular browser does.
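To make the fragility concrete, here is a minimal sketch in Python using only the standard library's `html.parser`. It is a stand-in for a real CSS selector engine, not what the original tool did: `select_text`, the sample pages, and the class names are all made up for illustration, and it only matches a tag plus one class.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of every <tag> element carrying the given class."""

    def __init__(self, tag, cls):
        super().__init__()
        self.tag, self.cls = tag, cls
        self.depth = 0      # > 0 while we are inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested elements of the same tag so we close correctly.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag and self.cls in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def select_text(html, tag, cls):
    """Rough equivalent of the CSS selector `tag.cls` (hypothetical helper)."""
    parser = ClassTextExtractor(tag, cls)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


# Hypothetical pages: a server-rendered list, an SPA shell, and a small refactor.
STATIC_PAGE = '<ul><li class="title">First post</li><li class="title">Second post</li></ul>'
SPA_SHELL = '<div id="app"></div><script src="bundle.js"></script>'
REFACTORED = STATIC_PAGE.replace('class="title"', 'class="post-title"')

print(select_text(STATIC_PAGE, "li", "title"))   # data is in the raw HTML
print(select_text(SPA_SHELL, "li", "title"))     # nothing to find before JS runs
print(select_text(REFACTORED, "li", "title"))    # one renamed class, nothing found
```

The last two calls return empty lists for different reasons: the SPA shell never contained the data, and the refactored page renamed the class the selector depended on.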

You can write tests around whether your selectors are returning data, but even simple refactors from a dev team quickly break your selector profiles multiple times a week or month.
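A test like that might look roughly like the sketch below. It assumes a "selector profile" is just a mapping from field name to selector, and it uses the limited XPath support in the standard library's `xml.etree.ElementTree` as a stand-in for CSS selectors (so the sample pages have to be well-formed XML). `PROFILE`, `broken_selectors`, and the pages are hypothetical names, not anything from the original project.

```python
import xml.etree.ElementTree as ET

# Hypothetical selector profile: field name -> ElementTree XPath expression.
PROFILE = {
    "title": ".//h1[@class='title']",
    "price": ".//span[@class='price']",
}


def broken_selectors(page, profile):
    """Return the names of profile fields whose selector yields no text."""
    root = ET.fromstring(page)
    broken = []
    for name, path in profile.items():
        texts = [(el.text or "").strip() for el in root.findall(path)]
        if not any(texts):
            broken.append(name)
    return sorted(broken)


# The same page before and after a dev team renames one class.
BEFORE = "<div><h1 class='title'>Widget</h1><span class='price'>9.99</span></div>"
AFTER = "<div><h1 class='title'>Widget</h1><span class='cost'>9.99</span></div>"

print(broken_selectors(BEFORE, PROFILE))  # no broken fields
print(broken_selectors(AFTER, PROFILE))   # the "price" selector no longer matches
```

This only tells you that a selector stopped returning data, not what it should have been changed to, which is why the maintenance cost stays high.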

Just wasn't worth the hassle.

3 comments

mickael-kerjean 3709 days ago

There are some solutions for running an SPA in a real browser, even in a headless environment.

The trick is to emulate X11 with Xvfb and control the browser with Selenium WebDriver.

Phantom isn't the only choice, just the one most people talk about
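A rough sketch of that setup in Python, assuming the third-party `selenium` and `pyvirtualdisplay` packages plus Xvfb, Firefox, and geckodriver are installed on the machine. The function name `fetch_rendered_html` and the display size are my own choices for illustration.

```python
def fetch_rendered_html(url):
    """Load `url` in a real Firefox running on a virtual X11 display.

    Assumes Xvfb, Firefox, and geckodriver are installed. The imports are
    done lazily so this module still loads where the packages are missing.
    """
    from pyvirtualdisplay import Display  # thin wrapper that starts/stops Xvfb
    from selenium import webdriver

    # Start an in-memory X server; the browser draws to it instead of a screen.
    with Display(visible=False, size=(1280, 1024)):
        driver = webdriver.Firefox()
        try:
            # get() returns once the page's load event fires; an SPA may still
            # be rendering after that, so real use usually needs explicit waits.
            driver.get(url)
            return driver.page_source
        finally:
            driver.quit()
```

The returned string is the DOM as serialized by the browser after its scripts have run, so it can be fed to whatever selector library you already use.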

As for websites that aren't JS-heavy, it's fairly trivial to find a library that will parse the DOM for you; pretty much every language has one.

NicoJuicy 3709 days ago

Done it too, to scrape HN from the CLI :p

LoSboccacc 3710 days ago

Done it as well. At the time it was specialized for organizing web comics (this was way before the Google Reader days).

The real issue is that popularity will get you blocked fairly quickly. See also: Yahoo Pipes.
