|
|
|
|
|
by pshc
5324 days ago
|
|
I was scraping with jQuery for a while but it felt like an awful lot of overhead. In the case of simpler scraping tasks that happen a lot I've actually gone back to nuts and bolts with HTML5[1]'s tokenizer and a custom state machine that only accumulates the data I want. At no time is any DOM node actually created in memory, let alone the entire DOM tree. It means I feel safer running many of these in parallel on a VPS. It also means I can write a nice streaming API where you start emitting data the moment you get enough input. Buffering input just feels wrong in node.js. But jQuery is a great scraper if your transformation is complex and non-streamable.
[1] https://github.com/aredridel/html5 |
|