Hacker News
by jandrese 2105 days ago
There are multiple ways of doing web scraping, and some are definitely more fragile than others.

I've found fully specified XPaths to be a mistake, for example. It only takes one tiny change to the page to break the script. On the other hand, despite numerous warnings that it would be a disaster, I've had a lot of luck maintaining regexes, even after major page reworks.
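To make the fragility point concrete, here is a minimal sketch using only the Python standard library. The two page snippets, the path, and the price pattern are all made up for illustration. `xml.etree.ElementTree` supports only a subset of XPath, but a fully specified path fails the same way there: one extra wrapper element and the lookup returns nothing, while a regex keyed on the shape of the data still matches.

```python
import re
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same page; v2 adds one wrapper <div>.
PAGE_V1 = "<html><body><div><span class='price'>$19.99</span></div></body></html>"
PAGE_V2 = "<html><body><div><div><span class='price'>$19.99</span></div></div></body></html>"

# Fully specified path from the root element (assumed layout of v1).
FULL_PATH = "body/div/span"

# Regex keyed on what the data looks like, not on where it sits in the tree.
PRICE_RE = re.compile(r"\$\d+\.\d{2}")


def by_path(page):
    """Return the text at FULL_PATH, or None if the path no longer matches."""
    node = ET.fromstring(page).find(FULL_PATH)
    return node.text if node is not None else None


def by_regex(page):
    """Return the first price-looking string in the raw markup, or None."""
    match = PRICE_RE.search(page)
    return match.group(0) if match else None


for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
    print(name, "path:", by_path(page), "regex:", by_regex(page))
```

The regex has its own failure mode, of course: it would happily grab some other price that shows up earlier on the page.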

1 comment

Sure, but A) as implemented in code, the path doesn’t necessarily have to be fully explicit in terms of tags. You could look for a child whose text contains a given string rather than one with a specific class or id, for example (or get fancier with semantic similarity on tags or text). B) There’s a trade-off between speed and fragility. It could make a difference if your tree is deep enough that traversing it iteratively becomes a limiting factor compared to one long XPath. Granted, you don’t typically encounter this in a standard-issue HTML tree, but the lxml docs, for example, correctly note that XPaths can be way faster when the nesting is super deep.
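A rough sketch of point A, using only the Python standard library: find a cell by the text it contains and read the cell next to it, without relying on any class, id, or fixed depth. The page snippet and the `value_for_label` helper are hypothetical. With lxml, roughly the same idea could be written as the XPath `//td[contains(text(), "Price")]/following-sibling::td[1]`.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: a label/value table buried at some arbitrary depth.
PAGE = (
    "<html><body><div><table>"
    "<tr><td>Name</td><td>Widget</td></tr>"
    "<tr><td>Price</td><td>$19.99</td></tr>"
    "</table></div></body></html>"
)


def value_for_label(page, label):
    """Return the text of the cell right after the first cell containing label.

    Matching is case-insensitive and ignores how deep the table is nested.
    """
    root = ET.fromstring(page)
    for row in root.iter("tr"):          # visit every <tr>, at any depth
        cells = list(row)
        for i, cell in enumerate(cells[:-1]):
            if label.lower() in (cell.text or "").lower():
                return cells[i + 1].text
    return None


print(value_for_label(PAGE, "price"))
```

This survives added wrapper elements and renamed classes, but it does depend on the label text staying put, so it trades one kind of fragility for another.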