Hacker News
by dfederschmidt 1746 days ago
This looks very useful, big fan of all the ^[a-z]+q$ utilities out there. But as a user, I would probably want to use XPath[0] notation here. Maybe that is just me. A quick search revealed xidel[1] which seems to be similar, but supports XPath.

[0]: https://en.wikipedia.org/wiki/XPath
[1]: https://github.com/benibela/xidel

5 comments

I'd like to state my support for the author's choice of CSS selectors in this particular use case. I think it's a natural fit for this domain and already very well known, perhaps even known better than XPath.
I'd like to add my support here too, but with a note.

When scraping and parsing (or writing an integration test DSL), I always start out with CSS selectors. But I always hit cases where they fall short or require hoop-jumping, and then fall back on XPath. I then have a codebase with both CSS selectors and XPath, which is arguably worse than having only one method.

I suspect that here, one uses this tool until CSS selector limitations get in the way, after which one switches to another tool(chain).

I've not had much friction using either, they are "close enough" that the time to (re)write a query from one to the other is not very significant.
Do you mind giving an example? I'm having trouble following where CSS is limited for selection.
XPath does general data processing, not just selection.

E.g. when you have a list of numbers on the website, XPath can calculate the sum or the maximum of the numbers.

Or you have a list of names "Last name, First name"; then you can remove the last name and sort the first names alphabetically. Or count how often each name occurs and return the most popular name.

Then it goes back to selection, e.g. select all numbers that are smaller than the average. Or calculate the most popular name, then select all elements containing that name.
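To make the sum / maximum / most-popular-name idea concrete, here is a minimal Python sketch of the same steps. It is only an approximation of what the XPath would do: the sample markup is made up, and since the standard library's ElementTree implements only a subset of XPath (no sum() or max()), the aggregation is done in plain Python rather than inside the query.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample page; the markup and values are made up for illustration.
PAGE = """
<html><body>
  <ul id="numbers"><li>3</li><li>10</li><li>5</li></ul>
  <ul id="names"><li>Smith, Anna</li><li>Jones, Bob</li><li>Brown, Anna</li></ul>
</body></html>
"""

root = ET.fromstring(PAGE)

# Roughly what sum(...) and max(...) over the list items would give in XPath 2.0+.
number_items = root.findall(".//ul[@id='numbers']/li")
numbers = [float(li.text) for li in number_items]
total = sum(numbers)
largest = max(numbers)

# Back to selection: keep the items whose value is below the average.
average = total / len(numbers)
below_average = [li for li in number_items if float(li.text) < average]

# Strip the last name, sort the first names, and find the most common one.
first_names = sorted(
    li.text.split(",", 1)[1].strip()
    for li in root.findall(".//ul[@id='names']/li")
)
most_popular = Counter(first_names).most_common(1)[0][0]

# Select every element whose text contains the most popular name.
with_popular_name = [
    el for el in root.iter() if el.text and most_popular in el.text
]
```

In an XPath 2.0+ processor such as xidel, each of these steps could stay inside the query itself, which is the point being made above.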

Like the other commenter says: parent/child. But also selecting by content (e.g. "click the button with the delete-icon" or "find the link with '@harrypotter'"), selecting by attributes (e.g. click the pager item that goes to the next page), or selecting items outside of body (e.g. og-tags, title etc). All are doable in CSS3 selectors, but everything shouts that they are not meant for this, whereas XPath does this far more naturally.
The element(s) before an element: //h3/preceding-sibling::p[1]
Match something's parent: //title/..
Match all ancestors: //title[@id = 'abc']/ancestor::comment

Element with src or href attr: //*[@src or @href]
Multiple conditions: //article[@state = "approved" and not(comments/comment)]

Element with more than two children: //ul[count(li) > 2]
Element with matching descendants: //article[.//video]

Element text containing substring: //p[contains(text(), "Foo")]
Attribute containing substring: //a[ends-with(@href, ".jpg")]

Numerical attribute selection: //product[@price > round(2.5 * @discount)]
//product[sum(//*[starts-with(name(), 'price-')]/@price) > 0]

Attribute values: //a/@href
Text values with spaces normalised: //a/normalize-space(text())

Match all attributes or elements or text nodes: //user/@* or //user/node() or //user/text() or //user/comment()
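For readers who want to try a few of these without an XPath engine, here is a hedged Python sketch of rough equivalents. The document is invented for illustration, and because the standard library's ElementTree has no count(), contains() or ends-with(), each predicate is written as ordinary Python; the XPath each line approximates is given in the comment.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, made up for illustration.
DOC = """
<root>
  <ul id="short"><li>a</li><li>b</li></ul>
  <ul id="long"><li>a</li><li>b</li><li>c</li></ul>
  <p>Foo bar</p>
  <p>baz</p>
  <a href="cat.jpg">cat</a>
  <a href="index.html">home</a>
  <img src="dog.png"/>
</root>
"""
root = ET.fromstring(DOC)

# //*[@src or @href]
linked = [el for el in root.iter() if "src" in el.attrib or "href" in el.attrib]

# //ul[count(li) > 2]
long_lists = [ul for ul in root.iter("ul") if len(ul.findall("li")) > 2]

# //p[contains(text(), "Foo")]  (p.text is the first text node only)
foo_paragraphs = [p for p in root.iter("p") if "Foo" in (p.text or "")]

# //a[ends-with(@href, ".jpg")]
jpg_links = [a for a in root.iter("a") if a.get("href", "").endswith(".jpg")]

# //a/@href
hrefs = [a.get("href") for a in root.iter("a") if a.get("href") is not None]
```

Each list comprehension is noticeably longer than the expression it stands in for, which is arguably the argument for having predicates in the query language.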

Basically from any node in a document you can select its ancestors, children, descendants, siblings, attributes etc, and filtering has the same power as selecting does - in CSS there's :not() that can apply to selection or filtering, with :has() finally on the way and no :or(). CSS selectors match against HTML elements and they're great for that almost all of the time, but while you can filter by attribute value including substring and even by regular expression, for text there's :empty.

But for a query syntax you need to be able to select attributes and text content as well as elements. Either extend XPath to support #id and .class syntax

  //#user-xyz//note/text()
  //code.language-js/@name

or extend CSS to at least allow selecting attributes and text

  #user-xyz note :text
  code.language-js @name

The former is more powerful, the latter a quick hack (if they only appear at the end of the selector anyway) with instant payoff.

Searching text content is my main remaining use of XPath.
Well, the big one is selecting a parent from the child.
You could do this with the :has() CSS pseudo-class[0], though inverted (select a parent that _has_ the child matching a selector).

Looks like that pseudo-class has not been implemented in the kuchiki library that htmlq uses though.

[0]: https://developer.mozilla.org/en-US/docs/Web/CSS/:has

You can do it either way in XPath thanks to how you can use a path expression and/or predicates almost everywhere in a query

  # Find all elements li and select the parent element for each
  //li/.. 

  # Find all element nodes with a child element named li
  //*[li]

  # Non-abbreviated queries
  /descendant::li/parent::*
  /descendant::*[child::li]

  # CSS using :has
  :has(> li)
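The first two queries can be tried with nothing but Python's standard library, since ElementTree's limited XPath subset does cover the `..` step and the `[tag]` child predicate. A minimal sketch, assuming a made-up document; ElementTree paths must be relative, hence the leading dot.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, made up for illustration.
DOC = """
<body>
  <ul id="menu"><li>one</li><li>two</li></ul>
  <ol id="steps"><li>first</li></ol>
  <div id="empty"/>
</body>
"""
root = ET.fromstring(DOC)

# //li/.. : find every li, then step up to its parent
# (ElementTree yields each parent only once).
parents_via_child = root.findall(".//li/..")

# //*[li] : find every element that has an li child.
parents_via_predicate = root.findall(".//*[li]")

for el in parents_via_child:
    print(el.tag, el.get("id"))
```

Both lists contain the same two elements here, reached from opposite directions.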
The Playwright people had to solve this for themselves: you can mix CSS and XPath selectors because the two are syntactically distinct, and they added a few small custom modifications to help with selectors. Playwright-compatible selectors would be nice.
My web scraping tends to start with xidel. If I need a little bit more power I'll use xmlstarlet. If neither of those is enough, I'll use Python's beautifulsoup package :)
I like xmlstarlet too, if only because it's old enough that I can reliably get it in package repositories and the dependency footprint is tiny (less an issue now with this tool written in Rust, but previously I was comparing to NPM- and PyPI-based affairs).
lxml is one of the most pleasing to use Python libraries ever, managing to wrap a hot mess of XML APIs in a consistent and Pythonic fashion that you rarely need to escape. IIRC I used beautifulsoup to parse the HTML of a site, and then lxml to either find items and fields by CSS in IPython for quick and dirty data munging, or knock up an XSLT file to transform what I'd scraped into good data in an XML file :)
Thanks, this looks more powerful. It supports CSS, XPath and XQuery. Maybe I could learn a bit of XQuery when I have a use case for it :)
Well, here’s your first lesson then: if you prepend (: to your comment it will become a valid XQuery document!

(: XQuery comments are marked by mirrored smilie faces, like this. :)

Well, yes, but also no

An empty query is not valid. There needs to be something besides the comment

Nice - I've been writing XQuery for years and I had no clue
Everything that isn't a (: happy comments :) is a FLWOR:

  (: For each user with at least one comment, emit the name and the comment ids. :)
  <users>
  {
    for $user in //user
    let $comments := //comment[@uid = $user/@id]
    where count($comments) > 0
    order by $user/lastName, $user/firstName
    return <user id="{ $user/@id }">
      <name>{ concat($user/firstName, " ", $user/lastName) }</name>
      <comments count="{ count($comments) }">
      {
        for $c in $comments return <comment id="{ $c/@id }" />
      }
      </comments>
    </user>
  }
  </users>
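For anyone who has not met FLWOR before, here is a rough Python rendering of the same for / let / where / order by / return steps, using only the standard library. It is a sketch under assumptions: the input document is invented, with element and attribute names guessed from the query, and this is not how an XQuery engine evaluates it.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; names follow the query, the data is made up.
DOC = """
<site>
  <user id="1"><firstName>Ann</firstName><lastName>Zed</lastName></user>
  <user id="2"><firstName>Bob</firstName><lastName>Abel</lastName></user>
  <user id="3"><firstName>Cy</firstName><lastName>Mute</lastName></user>
  <comment id="c1" uid="1"/>
  <comment id="c2" uid="2"/>
  <comment id="c3" uid="1"/>
</site>
"""
root = ET.fromstring(DOC)
users_out = ET.Element("users")

# for ... order by: iterate the users sorted by last name, then first name.
ordered_users = sorted(
    root.iter("user"),
    key=lambda u: (u.findtext("lastName"), u.findtext("firstName")),
)
for user in ordered_users:
    # let: bind the comments whose uid matches this user's id.
    comments = [c for c in root.iter("comment") if c.get("uid") == user.get("id")]
    # where: skip users without comments.
    if not comments:
        continue
    # return: build the <user> element for the result.
    out = ET.SubElement(users_out, "user", id=user.get("id"))
    name = ET.SubElement(out, "name")
    name.text = f'{user.findtext("firstName")} {user.findtext("lastName")}'
    wrapper = ET.SubElement(out, "comments", count=str(len(comments)))
    for c in comments:
        ET.SubElement(wrapper, "comment", id=c.get("id"))

result = ET.tostring(users_out, encoding="unicode")
print(result)
```

The imperative version needs explicit sorting, filtering and element construction, where the FLWOR expression states all of that declaratively in one place.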
It's the bastard child of SQL and XPath 2 lol.

http://www.stylusstudio.com/xquery-flwor.html

I kinda liked XQuery, but it never seemed to get much traction.
This looks really neat! It supports a bunch of different query types, and can even do things like follow links to get info about the linked-to pages!

It's also in nixpkgs, though for some reason the nixpkgs derivation is marked as linux-only (i.e. not Darwin). (Edit: probably because the fpc dependency is also Linux-only, with a linux-specific patch and a comment suggesting that supporting other platforms would require adding per-platform patches)

Part of the problem with this is that HTML is mostly not valid XML.