
	by dfederschmidt 1746 days ago
	This looks very useful, big fan of all the ^[a-z]+q$ utilities out there. But as a user, I would probably want to use XPath[0] notation here. Maybe that is just me. A quick search revealed xidel[1] which seems to be similar, but supports XPath. [0]https://en.wikipedia.org/wiki/XPath [1]https://github.com/benibela/xidel

akie 1746 days ago

I'd like to state my support for the author's choice of CSS selectors in this particular use case. I think it's a natural fit for this domain and already very well known, perhaps even known better than XPath.

berkes 1746 days ago

I'd like to add my support here too, but with a note.

When scraping and parsing (or writing an integration test DSL), I always start out with CSS selectors. But I always hit cases where they fall short or require hoop-jumping, and then fall back on XPath. I then have a codebase with both CSS selectors and XPath, which is arguably worse than having only one method.

I suspect that here, one uses this tool until CSS selector limitations get in the way, after which one switches to another tool(chain).

Jenk 1746 days ago

I've not had much friction using either, they are "close enough" that the time to (re)write a query from one to the other is not very significant.

alpha_squared 1746 days ago

Do you mind giving an example? I'm having trouble following where CSS is limited for selection.

benibela 1745 days ago

XPath does general data processing, not just selection.

E.g. when you have a list of numbers on the website, XPath can calculate the sum or the maximum of the numbers

Or you have a list of names "Last name, First name", then you can remove the last name and sort the first names alphabetically. Or count how often each name occurs and return the most popular name.

Then it goes back to selection, e.g. select all numbers that are smaller than the average. Or calculate the most popular name, then select all elements containing that name
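
A rough sketch of that kind of processing, written in Python. The standard library's xml.etree.ElementTree implements only a small XPath subset with no aggregate functions, so here the path expression only selects the nodes and the arithmetic is done in plain Python; an engine with fuller XPath support could evaluate it inside the query. The sample fragment is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; not taken from any real site.
page = ET.fromstring(
    "<ul>"
    "<li>4</li><li>10</li><li>1</li><li>5</li>"
    "</ul>"
)

# Select every <li>, then convert its text to a number.
numbers = [int(li.text) for li in page.findall(".//li")]

total = sum(numbers)      # what sum(//li) computes in XPath
largest = max(numbers)    # what max(//li) computes in XPath 2.0 and later
average = total / len(numbers)

# Back to selection: keep only the items strictly below the average.
below_average = [li for li in page.findall(".//li") if int(li.text) < average]

print(total, largest, [li.text for li in below_average])
```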

berkes 1746 days ago

Like the other commenter says: parent/child. But also selecting by content (e.g. "click the button with the delete-icon" or "find the link with '@harrypotter'"), selecting by attributes (e.g. click the pager-item that goes to the next page), or selecting items outside of body (e.g. og-tags, title etc). All are doable in CSS3 selectors, but everything shouts that they are not meant for this, whereas XPath does this far more naturally.
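
A minimal sketch of those three cases, using the limited XPath subset in Python's standard xml.etree.ElementTree. That subset only supports exact text matches, not substring tests like contains(), and it needs well-formed XML input. The page, URLs, and attribute values below are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page; real-world HTML usually needs an HTML parser first.
page = ET.fromstring(
    "<html>"
    "<head><title>Demo</title>"
    '<meta property="og:title" content="Demo page"/></head>'
    "<body>"
    '<a href="/u/harrypotter">@harrypotter</a>'
    '<a href="/page/2" rel="next">More</a>'
    "</body>"
    "</html>"
)

# By content: the link whose full text is exactly "@harrypotter".
by_text = page.find(".//a[.='@harrypotter']")

# By attribute: the pager item pointing at the next page.
by_attr = page.find(".//a[@rel='next']")

# Outside <body>: the title and an og: tag in <head>.
title = page.findtext("./head/title")
og_title = page.find(".//meta[@property='og:title']").get("content")

print(by_text.get("href"), by_attr.get("href"), title, og_title)
```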

spiralx 1745 days ago

The element(s) before an element: //h3/preceding-sibling::p[1]
Match something's parent: //title/..
Match all ancestors: //title[@id = 'abc']/ancestor::comment

Element with src or href attr: //*[@src or @href]
Multiple conditions: //article[@state = "approved" and not(comments/comment)]

Element with more than two children: //ul[count(li) > 2]
Element with matching descendants: //article[.//video]

Element text containing substring: //p[contains(text(), "Foo")]
Attribute ending with substring: //a[ends-with(@href, ".jpg")]

Numerical attribute selection: //product[@price > round(2.5 * @discount)]
//product[sum(//*[starts-with(name(), 'price-')]/@price) > 0]

Attribute values: //a/@href
Text values with spaces normalised: //a/normalize-space(text())

Match all attributes or elements or text nodes: //user/@* or //user/node() or //user/text() or //user/comment()

Basically, from any node in a document you can select its ancestors, children, descendants, siblings, attributes etc, and filtering has the same power as selecting does. In CSS there's :not(), which can apply to selection or filtering, with :has() finally on the way and no :or(). CSS selectors match against HTML elements and they're great for that almost all of the time, but while you can filter by attribute value, including by substring and even by regular expression, for text there's only :empty.

But for a query syntax you need to be able to select attributes and text content as well as elements. Either extend XPath to support #id and .class syntax

  //#user-xyz//note/text()
  //code.language-js/@name

or extend CSS to at least allow selecting attrs and text

  #user-xyz note :text
  code.language-js @name

The former is more powerful, the latter a quick hack (if they only appear at the end of the selector anyway) with instant payoff.

unspecified 1745 days ago

Searching text content is my main remaining use of XPath.

vlunkr 1746 days ago

Well, the big one is selecting a parent from the child.

androceium 1745 days ago

You could do this with the :has() CSS pseudo-class[0], though inverted (select a parent that _has_ the child matching a selector).

Looks like that pseudo-class has not been implemented in the kuchiki library that htmlq uses though.

[0]: https://developer.mozilla.org/en-US/docs/Web/CSS/:has

spiralx 1745 days ago

You can do it either way in XPath, thanks to how you can use a path expression and/or predicates almost everywhere in a query:

  # Find all elements li and select the parent element for each
  //li/.. 

  # Find all element nodes with a child element named li
  //*[li]

  # Non-abbreviated queries
  /descendant::li/parent::*
  /descendant::*[child::li]

  # CSS using :has
  :has(> li)
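
For anyone who wants to try the first two forms, here is a small sketch using Python's standard xml.etree.ElementTree, whose XPath subset happens to support both the `..` step and the `[tag]` predicate. Paths there are relative to the starting element, hence the leading dot. The sample markup and ids are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with two lists and one unrelated paragraph.
doc = ET.fromstring(
    "<body>"
    '<ul id="a"><li>one</li><li>two</li></ul>'
    '<ol id="b"><li>three</li></ol>'
    '<p id="c">no list items here</p>'
    "</body>"
)

# Find all li elements and step up to the parent of each.
parents_via_step = doc.findall(".//li/..")

# Find all elements that have a child element named li.
parents_via_predicate = doc.findall(".//*[li]")

step_ids = [e.get("id") for e in parents_via_step]
predicate_ids = [e.get("id") for e in parents_via_predicate]
print(step_ids, predicate_ids)
```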

mirekrusin 1746 days ago

The Playwright people had to solve this for themselves: you can mix CSS and XPath selectors because they are syntactically distinct, and they added a few small custom modifications to help with selectors. Playwright-compatible selectors would be nice.

chriswarbo 1746 days ago

My web scraping tends to start with xidel. If I need a little bit more power I'll use xmlstarlet. If neither of those is enough, I'll use Python's beautifulsoup package :)

mikepurvis 1745 days ago

I like xmlstarlet too, if only because it's old enough that I can reliably get it in package repositories and the dependency footprint is tiny (less an issue now with this tool written in Rust, but previously I was comparing to NPM- and PyPI-based affairs).

spiralx 1745 days ago

lxml is one of the most pleasing-to-use Python libraries ever, managing to wrap a hot mess of XML APIs in a consistent and Pythonic fashion that you rarely need to escape. IIRC I used beautifulsoup to parse the HTML of a site, and then lxml to either find items and fields by CSS in IPython for quick and dirty data munging, or knock up an XSLT file to transform what I'd scraped into good data in an XML file :)

exyi 1746 days ago

Thanks, this looks more powerful. It supports CSS, XPath and XQuery. Maybe I could learn a bit of XQuery when I have a use case for it :)

dmit 1746 days ago

Well, here’s your first lesson then: if you prepend (: to your comment it will become a valid XQuery document!

(: XQuery comments are marked by mirrored smilie faces, like this. :)

benibela 1745 days ago

Well, yes, but also no

An empty query is not valid. There needs to be something besides the comment

bdcravens 1745 days ago

Nice - I've been writing XQuery for years and I had no clue

spiralx 1745 days ago

Everything that isn't a (: happy comments :) is a FLWOR:

  <users>
  {
    for $user in //user
    let $comments := //comment[@uid = $user/@id]
    where count($comments) > 0
    order by $user/lastName, $user/firstName
    return <user id="{ $user/@id }">
      <name>{ concat($user/firstName, " ", $user/lastName) }</name>
      <comments count="{ count($comments) }">
      {
        for $c in $comments return <comment id="{ $c/@id }" />
      }
      </comments>
    </user>
  }
  </users>

It's the bastard child of SQL and XPath 2 lol.

http://www.stylusstudio.com/xquery-flwor.html
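
For readers who don't know XQuery, here is roughly what each FLWOR clause in a query like that does, spelled out in plain Python with the standard xml.etree.ElementTree. This is a sketch of the semantics, not XQuery itself, and it assumes an input made of user and comment elements; the sample data is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; element and attribute names follow the query's naming.
doc = ET.fromstring(
    "<site>"
    '<user id="1"><firstName>Ann</firstName><lastName>Zed</lastName></user>'
    '<user id="2"><firstName>Bob</firstName><lastName>Abe</lastName></user>'
    '<user id="3"><firstName>Cy</firstName><lastName>Moe</lastName></user>'
    '<comment id="c1" uid="1"/><comment id="c2" uid="2"/><comment id="c3" uid="2"/>'
    "</site>"
)

rows = []
for user in doc.iter("user"):                                    # for
    comments = [c for c in doc.iter("comment")
                if c.get("uid") == user.get("id")]               # let
    if comments:                                                 # where
        rows.append((user, comments))
rows.sort(key=lambda r: (r[0].findtext("lastName"),
                         r[0].findtext("firstName")))            # order by

users_out = ET.Element("users")
for user, comments in rows:                                      # return
    u = ET.SubElement(users_out, "user", id=user.get("id"))
    name = ET.SubElement(u, "name")
    name.text = f'{user.findtext("firstName")} {user.findtext("lastName")}'
    cs = ET.SubElement(u, "comments", count=str(len(comments)))
    for c in comments:
        ET.SubElement(cs, "comment", id=c.get("id"))

print(ET.tostring(users_out, encoding="unicode"))
```

The user without comments is dropped by the "where" step, and the rest come out ordered by last name, then first name.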

phlummox 1741 days ago

I kinda liked XQuery, but it seemed to never have got much traction.

lilyball 1745 days ago

This looks really neat! It supports a bunch of different query types, and can even do things like follow links to get info about the linked-to pages!

It's also in nixpkgs, though for some reason the nixpkgs derivation is marked as linux-only (i.e. not Darwin). (Edit: probably because the fpc dependency is also Linux-only, with a linux-specific patch and a comment suggesting that supporting other platforms would require adding per-platform patches)

waynenilsen 1746 days ago

part of the problem with this is that HTML is mostly not valid XML