| HN Mirror

by irjustin 2058 days ago

Anyone who does scraping or automated browser work eventually comes across XPath.

In some ways, XPath is like regex: it's insanely powerful, but it comes with a relatively steep learning curve. Remember reading a regex for the first time? What? But unlike regex, relatively few people use it.

I avoided XPath until I couldn't anymore. I could do a lot with CSS selectors, but eventually the DOM traversal became difficult to reason about with just CSS.
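To make that concrete, here is a minimal sketch of a lookup that CSS selectors cannot express, because CSS cannot match on an element's text content. The markup, the `PAGE` constant, and the `price_of` helper are all made up for illustration, and it uses Python's standard-library `xml.etree.ElementTree`, which supports only a subset of XPath 1.0.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical markup with no ids or classes to hang a CSS selector on.
PAGE = """
<html><body>
  <table>
    <tr><td>Widget</td><td>9.99</td></tr>
    <tr><td>Gadget</td><td>24.50</td></tr>
  </table>
</body></html>
""".strip()


def price_of(product: str, markup: str = PAGE) -> Optional[str]:
    """Return the text of the second cell in the row whose cell text equals `product`."""
    root = ET.fromstring(markup)
    # tr[td='...'] keeps rows that have a <td> child with exactly that text;
    # td[2] then picks the second <td> of the matching row.
    # Note: interpolating untrusted input into the expression is not safe;
    # this is fine only for a fixed, trusted product name.
    cell = root.find(f".//tr[td='{product}']/td[2]")
    return cell.text if cell is not None else None


print(price_of("Widget"))
```

The whole lookup ("the row containing this text, then its second cell") lives in one expression, which is the part that gets awkward when CSS selectors have to be combined with hand-written traversal code.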

Having taken the dive, I find it so powerful. Read a single XPath expression and, as with a regex, you can fully understand what it is going after and how it will get there.

There are functions in XPath 2.0 that I would love to have, but Nokogiri for Rails is stuck in the 1.0 world with no plans to move to 2.0. Sad, but I'll live.

4 comments

masklinn 2058 days ago

> In some ways, XPath is like regex: it's insanely powerful, but it comes with a relatively steep learning curve. Remember reading a regex for the first time? What? But unlike regex, relatively few people use it.

IMO the learning curve of XPath is not that steep, though. It has a somewhat alien syntax, but the only thing I remember giving me trouble is axes: most tutorials stick to the "shortcut" syntax, so the first time you encounter an explicit axis everything goes pear-shaped.
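For anyone who has only seen the shortcut syntax, each abbreviation is shorthand for an explicit axis step. The sketch below is a toy expander written for illustration (`expand_abbreviations` is a hypothetical helper, not part of any library); it assumes simple location paths without predicates or string literals and applies the abbreviation rules from the XPath 1.0 specification.

```python
def expand_abbreviations(path: str) -> str:
    """Rewrite XPath 1.0 shortcut syntax using explicit axes.

    Toy helper: handles only simple location paths with no predicates
    and no string literals.
    """
    # "//" is short for "/descendant-or-self::node()/"
    path = path.replace("//", "/descendant-or-self::node()/")
    steps = []
    for step in path.split("/"):
        if step == "":
            steps.append(step)                       # leading "/" of an absolute path
        elif step == ".":
            steps.append("self::node()")             # "." is the context node
        elif step == "..":
            steps.append("parent::node()")           # ".." is the parent
        elif step.startswith("@"):
            steps.append("attribute::" + step[1:])   # "@x" is the attribute axis
        elif "::" in step:
            steps.append(step)                       # already has an explicit axis
        else:
            steps.append("child::" + step)           # a bare name test means child
    return "/".join(steps)


print(expand_abbreviations("//div/@id"))
print(expand_abbreviations("../title"))
```

Axes such as `following-sibling::`, `ancestor::`, and `preceding::` have no shortcut form at all, so they only ever appear in the explicit `axis::test` notation.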

> There are functions in XPath 2.0 that I would love to have, but Nokogiri for Rails is stuck in the 1.0 world with no plans to move to 2.0. Sad, but I'll live.

Nokogiri should support function extensions[0], and most of the XPath 2.0 functions were originally extensions to 1.0[1], so even if these functions are not distributed with Nokogiri you should be able to add them yourself.

Incidentally, Nokogiri seems to optionally depend on libexslt, which is the C implementation of EXSLT for libxml2/libxslt, so EXSLT should be available either as an option or by building it yourself.

[0] https://github.com/sparklemotion/nokogiri/commit/eb56525fbcc...

[1] http://exslt.org

mattmanser 2058 days ago

Many moons ago I worked somewhere that used XPath extensively.

Definitely a serious learning curve: some of the developers really struggled with it, others went crazy on it.

I made a pivot table maker with it. It was crazy fast compared with the JS version I originally tried back in the pre-V8 engine days. The JS version would basically die once you got past a trivial amount of data; the XSLT one was instant regardless of the amount of data.

jinushaun 2058 days ago

I agree. I think the original author completely missed the point and conflated lack of mainstream usage with dead tech. If you never run into problems that XPath addresses, of course you'll never use XPath. It's not for everyday use, and it certainly shouldn't be billed as a CSS selector replacement.

chriswarbo 2058 days ago

I think their complaints about browser support are fair (orthogonal to whether the newer versions are any good, which most of the comments here are talking about!)

In a self-managed environment, like a PC or server, you're right that popularity makes little difference.

brixon 2058 days ago

Similarly, when I have control of the source code, CSS selectors are fine (I can always throw in another ID or class name). When I don't have control of the source code, I might have to use XPath if CSS selectors are insufficient.

t7s 2058 days ago

If you need to do web scraping, learning XPath is very helpful.
