| HN Mirror


by yobbo 1411 days ago

Many years ago I wrote a scraper module for a scripting language that exposed a fake DOM to an embedded JS engine, SpiderMonkey. The DOM was just an inert object graph, readable both from the scripting language and from inside the JS context. The documents were parsed by libxml2, so the resulting DOMs were not identical to Mozilla's, for example, but it was fast and efficient.
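To make the "inert object graph" idea concrete, here is a minimal sketch in Python. It is not the original module: the standard library's `html.parser` stands in for libxml2, the `Node`, `GraphBuilder` and `parse` names are made up for illustration, and the step of exposing the graph to a JS engine is left out entirely.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for the sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """Inert DOM-like node: plain data, no browser behaviour attached."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or {})
        self.parent = parent
        self.children = []
        self.text = ""

    def find_all(self, tag):
        """Return all descendants with the given tag name, depth first."""
        found = []
        for child in self.children:
            if child.tag == tag:
                found.append(child)
            found.extend(child.find_all(tag))
        return found


class GraphBuilder(HTMLParser):
    """Builds a Node tree from the parser's start/end/data callbacks."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore strays.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse(html):
    builder = GraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


doc = parse('<form action="/login"><input name="user"><input name="pw"></form>')
print([n.attrs["name"] for n in doc.find_all("input")])  # ['user', 'pw']
```

Nothing in the graph reacts to being read or written, which is what keeps this kind of DOM cheap; it also means the tree will differ from what a real browser builds for malformed markup.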

The purpose was to enable "live interactive" scraping of forms/JS/AJAX sites, with a web frontend controlling maybe 10 scrapers for each user. When that project fell through, I stopped maintaining it, and the SpiderMonkey API has long since moved on.

It works for simple sites that don't require the DOM to actually do anything (for example, triggering an image load with some magic URL). Many simple DOM behaviours can be implemented on top of it, though.
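As a rough illustration of adding one such behaviour, the sketch below gives a fake image element a `src` property whose assignment has a side effect. It is an analogy in Python, not the original SpiderMonkey binding; `FakeImage` and the fetch log are hypothetical names, and no network request is actually made.

```python
class FakeImage:
    """Hypothetical image element: assigning `src` records a load request."""

    def __init__(self, fetch_log):
        self._src = ""
        self._fetch_log = fetch_log

    @property
    def src(self):
        return self._src

    @src.setter
    def src(self, url):
        self._src = url
        # A real browser would start loading the image here. The sketch only
        # records the URL so the host can decide whether to fetch it.
        self._fetch_log.append(url)


requested = []
img = FakeImage(requested)
img.src = "/track.gif?id=42"
print(requested)  # ['/track.gif?id=42']
```

The host side can then replay the logged URLs with its own HTTP client, which covers the "magic URL" case without the DOM itself doing any I/O.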