by berkes 1746 days ago
I'd like to add my support here too, but with a note.

When scraping and parsing (or writing an integration-test DSL), I always start out with CSS selectors. But I always hit cases where they fall short or require hoop-jumping, and then fall back on XPath. I then have a codebase with both CSS selectors and XPath, which is arguably worse than having only one method.

I suspect that here, one uses this tool until CSS selector limitations get in the way, after which one switches to another tool(chain).


I've not had much friction using either; they are "close enough" that the time to (re)write a query from one to the other is not very significant.
Do you mind giving an example? I'm having trouble following where CSS is limited for selection.
XPath does general data processing, not just selection.

E.g. when you have a list of numbers on the website, XPath can calculate the sum or the maximum of the numbers

Or you have a list of names "Last name, First name", then you can remove the last name and sort the first names alphabetically. Or count how often each name occurs and return the most popular name.

Then it goes back to selection, e.g. select all numbers that are smaller than the average. Or calculate the most popular name, then select all elements containing that name
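
For contrast, here is a minimal Python sketch of the same kind of task when the query layer only selects and the host language does the processing. It uses the standard library's xml.etree.ElementTree, which implements only a small XPath subset (no functions such as sum()). The sample markup and variable names are made up for illustration. In full XPath 1.0 the total would be roughly sum(//li), and the below-average filter something like //li[. < sum(//li) div count(//li)].

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample page: a list of numbers and a list of names.
DOC = """
<html><body>
  <ul>
    <li>4</li><li>10</li><li>1</li><li>5</li>
  </ul>
  <ol>
    <li>Potter, Harry</li><li>Granger, Hermione</li><li>Dursley, Harry</li>
  </ol>
</body></html>
"""

root = ET.fromstring(DOC)

# Selection is all the query does here ...
number_items = root.findall(".//ul/li")
name_items = root.findall(".//ol/li")

# ... so the processing happens in Python instead of in the query.
values = [int(li.text) for li in number_items]
total = sum(values)
largest = max(values)
average = total / len(values)

# Drop the last name, sort the first names, find the most frequent one.
first_names = sorted(li.text.split(", ", 1)[1] for li in name_items)
most_popular = Counter(first_names).most_common(1)[0][0]

# "Back to selection": elements below the average, and elements
# containing the most popular name.
below_average = [li for li in number_items if int(li.text) < average]
with_popular_name = [li for li in name_items if most_popular in li.text]

print(total, largest, [li.text for li in below_average])
print(first_names, most_popular, [li.text for li in with_popular_name])
```

The point of the comparison is only where the logic lives: with a selection-only query language it ends up spread across host-language code, whereas the XPath described above keeps it in one expression.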

Like the other commenter says: parent/child. But also selecting by content (e.g. "click the button with the delete-icon" or "find the link with '@harrypotter'"), selecting by attributes (e.g. click the pager item that goes to the next page), or selecting items outside of body (e.g. og-tags, title, etc.). All are doable in CSS3 selectors, but everything shouts that they are not meant for this, whereas XPath does this far more naturally.
The element(s) before an element: //h3/preceding-sibling::p[1]
Match something's parent: //title/..
Match all ancestors: //title[@id = 'abc']/ancestor::comment

Element with src or href attr: //*[@src or @href]
or multiple conditions: //article[@state = "approved" and not(comments/comment)]

Element with more than two children: //ul[count(li) > 2]
Element with matching descendants: //article[.//video]

Element text containing substring: //p[contains(text(), "Foo")]
Attribute ending with substring: //a[ends-with(@href, ".jpg")]

Numerical attribute selection: //product[@price > round(2.5 * @discount)]
//product[sum(//*[starts-with(name(), 'price-')]/@price) > 0]

Attribute values: //a/@href
Text values with spaces normalised: //a/normalize-space(text())

Match all attributes or elements or text nodes: //user/@* or //user/node() or //user/text() or //user/comment()

Basically from any node in a document you can select its ancestors, children, descendants, siblings, attributes etc, and filtering has the same power as selecting does - in CSS there's :not() that can apply to selection or filtering, with :has() finally on the way and no :or(). CSS selectors match against HTML elements and they're great for that almost all of the time, but while you can filter by attribute value including substring and even by regular expression, for text there's :empty.

But for a query syntax you need to be able to select attributes and text content as well as elements. Either extend XPath to support #id and .class syntax

//#user-xyz//note/text()
//code.language-js/@name

or extend CSS to at least allow selecting attrs and text

#user-xyz note :text
code.language-js @name

The former is more powerful, the latter a quick hack (if they only appear at the end of the selector anyway) with instant payoff.
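
A rough Python sketch of the "only at the end of the selector" hack, applied to ElementTree paths rather than real CSS, since the standard library's ElementTree can only return elements. The helper name select and the sample markup are hypothetical. The [@id='...'] and [@class='...'] predicates stand in for #id and .class, and match the attribute value exactly, so an element with several classes would not match.

```python
import xml.etree.ElementTree as ET


def select(root, query):
    """Hypothetical helper: an ElementTree path plus an optional trailing
    /text() or /@attr step, which ElementTree itself does not support."""
    path, _, last = query.rpartition("/")
    if last == "text()":
        # Trailing text step: return each matched element's leading text.
        return [el.text or "" for el in root.findall(path)]
    if last.startswith("@"):
        # Trailing attribute step: return the attribute value where present.
        name = last[1:]
        return [el.get(name) for el in root.findall(path)
                if el.get(name) is not None]
    # No special trailing step: plain element selection.
    return root.findall(query)


# Made-up sample document.
DOC = """
<html><body>
  <div id="user-xyz"><note>first</note><note>second</note></div>
  <code class="language-js" name="demo.js">let x = 1;</code>
  <code class="language-py" name="demo.py">x = 1</code>
</body></html>
"""
root = ET.fromstring(DOC)

notes = select(root, ".//*[@id='user-xyz']//note/text()")
names = select(root, ".//code[@class='language-js']/@name")
print(notes, names)
```

Because the special step is peeled off the end before the engine sees the query, nothing in the element-matching machinery has to change, which is what makes it a quick hack rather than a real extension.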

Searching text content is my main remaining use of XPath.
Well, the big one is selecting a parent from the child.
You could do this with the :has() CSS pseudo-class[0], though inverted (select a parent that _has_ the child matching a selector).

Looks like that pseudo-class has not been implemented in the kuchiki library that htmlq uses, though.

[0]: https://developer.mozilla.org/en-US/docs/Web/CSS/:has

You can do it either way in XPath, thanks to how you can use a path expression and/or predicates almost everywhere in a query:

  # Find all elements li and select the parent element for each
  //li/.. 

  # Find all element nodes with a child element named li
  //*[li]

  # Non-abbreviated queries
  /descendant::li/parent::*
  /descendant::*[child::li]

  # CSS using :has
  :has(> li)
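
The two abbreviated XPath forms above can be tried with Python's standard library, whose ElementTree supports both the .. step and the [tag] predicate. This is a small sketch on made-up markup. ElementTree paths are relative to the element they are called on, hence the leading dot, and .//* does not include that starting element itself.

```python
import xml.etree.ElementTree as ET

# Made-up sample document: two lists and one paragraph.
DOC = """
<html><body>
  <ul><li>a</li><li>b</li></ul>
  <ol><li>c</li></ol>
  <p>no list items here</p>
</body></html>
"""
root = ET.fromstring(DOC)

# Child-first: find every li, then step up to its parent.
parents_via_child = root.findall(".//li/..")

# Parent-first: find every element that has an li child.
parents_via_predicate = root.findall(".//*[li]")

print([el.tag for el in parents_via_child])
print([el.tag for el in parents_via_predicate])
```

Both queries should return the same ul and ol elements here; the p element is skipped because it has no li child.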