by neilv 2151 days ago
This looks like it might come in handy.

I started working with Web scraping around '95 (initially for a personalized-newspaper metaphor for Web software agent reporting), and wrote HtmlChewer, an HTML parser in Java designed for that purpose. A while later, I moved my rapid R&D work to Scheme, where I wrote the `htmlprag` permissive parser, now known as the `html-parsing` package in Racket and other Scheme dialects.

By the time I was using Scheme, my scraping usually started with an XPath query to get a starting-point subtree of the DOM, then used a mix of arbitrary code and sometimes a proprietary pattern-based destructuring DSL to extract info from that subtree. Sometimes I also ran filtering/transformation algorithms across a big free-form-ish text subtree (e.g., to simplify the articles that a custom crawler had scraped from a site, when building a labeled corpus for an ML research project).
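
A minimal sketch of that two-step approach (XPath to pick a subtree, then ordinary code to extract from it). It assumes Python's standard-library `xml.etree.ElementTree` rather than the Scheme tooling described above, and the page fragment, the `story` id, the `ad` class, and the `extract_story` name are all hypothetical. ElementTree only accepts well-formed markup, so real-world HTML would need a permissive parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """<html><body>
  <div id="nav"><a href="/">Home</a></div>
  <div id="story">
    <h1>Example headline</h1>
    <p>First paragraph.</p>
    <p class="ad">Buy now!</p>
    <p>Second <b>paragraph</b>.</p>
  </div>
</body></html>"""


def extract_story(page: str) -> dict:
    root = ET.fromstring(page)

    # Step 1: an XPath query selects the starting-point subtree.
    story = root.find(".//div[@id='story']")
    if story is None:
        raise ValueError("story subtree not found")

    # Step 2: ordinary code walks the subtree, filtering and flattening it.
    paragraphs = [
        "".join(p.itertext()).strip()   # flatten inline markup such as <b>
        for p in story.findall("p")     # direct <p> children only
        if p.get("class") != "ad"       # drop paragraphs marked as ads
    ]
    title = story.findtext("h1", default="").strip()
    return {"title": title, "paragraphs": paragraphs}


if __name__ == "__main__":
    print(extract_story(PAGE))
```

Keeping the XPath step small and doing the rest in plain code means only the one query needs revisiting when the surrounding page layout changes.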

Of course we've always had resilience problems for Web scraping, even as the Web changed dramatically.

In general, my scraping methods usually end up hand-crafted (and this started before in-browser development tools with element pickers and DOM editors existed), and much of the guesswork/art of it was in coming up with queries and transforms that seemed like they might keep working the next time the site changed its HTML. In 2004 I did make a small tool to automate a "starting point" for hand-crafting such an XPath query: https://www.neilvandyke.org/racket/webscraperhelper/
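
To illustrate the general idea of an automated "starting point" query, here is a rough sketch that derives a positional path to a chosen element. It is not the algorithm used by the tool linked above; it assumes Python's standard-library `xml.etree.ElementTree`, and `SAMPLE` and `starting_xpath` are hypothetical names. The result is meant to be edited by hand afterwards.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
SAMPLE = (
    "<html><body>"
    "<div id='nav'><p>Menu</p></div>"
    "<div id='item'><p>Widget</p><p class='price'>9.99</p></div>"
    "</body></html>"
)


def starting_xpath(root: ET.Element, target: ET.Element) -> str:
    """Return a positional path from root to target, e.g. './body[1]/div[2]/p[2]'."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}

    steps = []
    node = target
    while node is not root:
        parent = parents[node]  # KeyError if target is not inside root
        # 1-based position among siblings that share the same tag.
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent

    return "./" + "/".join(reversed(steps)) if steps else "."


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    price = root.find(".//p[@class='price']")
    print(starting_xpath(root, price))
```

A path generated this way is tied to the exact layout of the page, which is why it is only a starting point: the hand-crafting step replaces the fragile positional steps with ones more likely to survive a redesign.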

1 comment

> In general, my scraping methods usually end up hand-crafted

I've tried many of these XPath generators and even built a few myself. There's still nothing that matches human-built ones. The best and most stable selectors are context-aware. For example, to get a comment's text, a human would build a CSS selector like `.article .comments-box .comment p::text`, and there's no way for a generator to know this object relationship structure without AI involvement or training on a large sample.

This becomes especially noticeable when parsing complex webpages whose HTML can be highly dynamic. While the tree structure is often unstable, the core object relationship is almost always stable. In other words, comment text will always be under a comment paragraph, under a comment box, under the article.
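
A small sketch of that difference, assuming Python's standard-library `xml.etree.ElementTree` and two hypothetical versions of the same page. The class names come from the selector above; everything else (`OLD_LAYOUT`, `NEW_LAYOUT`, `comment_texts`, the banner and `section` wrapper) is made up for illustration. ElementTree matches `class` by exact string equality, so this is cruder than a real CSS class selector.

```python
import xml.etree.ElementTree as ET

# Hypothetical page before a redesign.
OLD_LAYOUT = """<html><body>
<div class="article">
  <h1>Title</h1>
  <div class="comments-box">
    <div class="comment"><p>Nice post</p></div>
    <div class="comment"><p>Thanks</p></div>
  </div>
</div>
</body></html>"""

# Same content after a redesign: a banner was added and the comments were
# wrapped in a <section>. The object relationships are unchanged.
NEW_LAYOUT = """<html><body>
<div class="banner"><p>Subscribe!</p></div>
<div class="article">
  <h1>Title</h1>
  <section>
    <div class="comments-box">
      <div class="comment"><p>Nice post</p></div>
      <div class="comment"><p>Thanks</p></div>
    </div>
  </section>
</div>
</body></html>"""

# Positional path: encodes where things sit in the tree.
POSITIONAL = "./body/div[1]/div[1]/div/p"

# Context-aware path: encodes "paragraph under comment under comments box
# under article", at any depth.
CONTEXTUAL = (
    ".//*[@class='article']//*[@class='comments-box']"
    "//*[@class='comment']//p"
)


def comment_texts(page: str, path: str) -> list:
    root = ET.fromstring(page)
    return ["".join(p.itertext()).strip() for p in root.findall(path)]


if __name__ == "__main__":
    for name, page in (("old", OLD_LAYOUT), ("new", NEW_LAYOUT)):
        print(name, "positional:", comment_texts(page, POSITIONAL))
        print(name, "contextual:", comment_texts(page, CONTEXTUAL))
```

The positional path finds both comments in the old layout and nothing in the new one, while the contextual path keeps working because it depends only on the article, comment box, comment, paragraph relationship.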