by irjustin 2058 days ago
Anyone who does scraping or automated browser work eventually comes across XPath.

In some ways, XPath is like regex. It's got insane power, but comes with a relatively steep learning curve. Remember reading regex for the first time? What? But unlike regex, relatively few people use it.

I avoided XPath until I couldn't anymore. I could do a lot with CSS selectors, but eventually the DOM traversal became difficult to reason about with just CSS.
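To make that concrete, here is a minimal sketch of the kind of traversal that gets awkward with CSS alone: picking a row by the text of one of its cells, then reading a different cell. The markup is invented for illustration, and the example uses Python's standard-library `xml.etree.ElementTree`, which only understands a subset of XPath 1.0.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, made up for this example.
PAGE = """
<table>
  <tr><td>Subtotal</td><td>90.00</td></tr>
  <tr><td>Total</td><td>99.00</td></tr>
</table>
"""

root = ET.fromstring(PAGE)

# "The <tr> that has a <td> child whose text is 'Total',
#  then the second <td> of that row."
total = root.findtext("./tr[td='Total']/td[2]")
print(total)
```

The predicate `[td='Total']` filters the row by its child's text, which is the part classic CSS selectors had no way to express.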

After taking the dive, I found it's so powerful. Read a single XPath expression and, as with regex, you can fully understand what the thing is going after and how it will get there.

There are functions in XPath 2.0 that I would love to have, but Nokogiri for Rails is stuck in a 1.0 world with no plan to go to 2.0. Sad, but I'll live.

4 comments

> In some ways, XPath is like regex. It's got insane power, but comes with a relatively steep learning curve. Remember reading regex for the first time? What? But unlike regex, relatively few people use it.

IMO the learning curve of XPath is not that steep though. It has a somewhat alien syntax, but the only thing I remember giving me trouble is axes, because most tutorials stick to the abbreviated "shortcut" syntax, so the first time you encounter an explicit axis everything goes pear-shaped.
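For anyone who hasn't hit this yet, a rough sketch of how the shortcut syntax maps onto explicit axes. The equivalences in the comments come from the XPath 1.0 abbreviation rules. The document is made up, and Python's standard-library `xml.etree.ElementTree` only accepts the abbreviated forms, so the explicit-axis versions appear as comments only.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, invented for this example.
DOC = ET.fromstring("""
<ul id="menu">
  <li class="item"><a href="/a">A</a></li>
  <li class="item current"><a href="/b">B</a></li>
</ul>
""")

# Abbreviated step      Explicit-axis equivalent (XPath 1.0)
# li                    child::li
# @href                 attribute::href
# .                     self::node()
# ..                    parent::node()
# //                    /descendant-or-self::node()/

# .//a  ==  self::node()/descendant-or-self::node()/child::a
links = DOC.findall(".//a")

# .//a[@href='/b']/..  ==  the same walk, filtered on attribute::href,
# followed by parent::node()
parents = DOC.findall(".//a[@href='/b']/..")

print([a.text for a in links])
print([p.get("class") for p in parents])
```

Axes with no shortcut at all, such as `following-sibling::` or `ancestor::`, are where the abbreviated syntax runs out, and they need a full XPath 1.0 engine rather than ElementTree.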

> There are functions in XPath 2.0 that I would love to have, but Nokogiri for Rails is stuck in a 1.0 world with no plan to go to 2.0. Sad, but I'll live.

Nokogiri should support function extensions[0], and most of the XPath 2.0 functions were originally extensions to 1.0[1], so even if these functions are not distributed with Nokogiri you should be able to add them yourself.

Incidentally, Nokogiri seems to optionally depend on libexslt, which is the EXSLT implementation in C for libxml2/libxslt, so EXSLT should be available either as an option or by building it yourself.

[0] https://github.com/sparklemotion/nokogiri/commit/eb56525fbcc...

[1] http://exslt.org

Many moons ago I worked somewhere that used XPath extensively.

Definitely a serious learning curve. Some of the developers really struggled with it, others went crazy on it.

I made a pivot table maker with it. It was crazy fast vs the JS version I originally tried back in the pre-V8 engine days. The JS version would basically die after you got past a trivial amount of data; the XSLT one was instant regardless of the amount of data.

I agree. I think the original author completely missed the point and conflates lack of mainstream usage with dead tech. If you never run into problems that XPath addresses, of course you’ll never use XPath. It’s not for everyday use. And certainly shouldn’t be billed as a CSS selector replacement.

I think their complaints about browser support are fair (orthogonal to whether the newer versions are any good, which most of the comments here are talking about!)

In a self-managed environment, like a PC or server, you're right that popularity makes little difference.

Similarly, when I have control of the source code, CSS selectors are fine (I can always throw in another ID or class name). When I don't have control of the source code, I might have to use XPath if CSS selectors are insufficient.

If you need to do web scraping, learning XPath is very helpful.