Hacker News new | ask | show | jobs
by lobster_johnson 4269 days ago
Depends a lot on the libraries, too. I had to scrape a bunch of HTML recently, which I prefer to use XPath for; the library I used -- HXT, if I remember correctly, it was the horrible one that uses arrows -- made my program perform on par with Ruby, and when I benchmarked it, I found it was allocating about 2GB of data throughout the program, while parsing a document that was probably around 100KB.
1 comments

I believe HXT uses the default representation of strings as lists of chars, instead of more efficient packed representations. This likely contributes to the excessive memory usage.
Sure. As a decidely unseasoned Haskell user, however, it's hard to sympathize with inefficient libraries for something as established as XML.

There may be other, faster libs that I don't know about, but I couldn't find them. I tried HaXml first (from which HXT is apparently derived), but the parser choked on my document and the author didn't come forward with a fix when I reported the problem (by email, the project isn't on Github). There is one called HXML, but I think it's dead. The TagSoup library might have worked, but I don't think so. It's not easy jumping into a new language and then coming up against library issues that prevent you from finishing your first project.

The "String problem" is definitely one of the most unfortunate parts of Haskell. Using a linked list of chars for a string is just laughable from a performance and resources standpoint. The good news is that the problem should be solved now: we have Data.Text for unicode strings, and Data.ByteString for binary/ASCII/UTF-8 strings. Both are very efficient and implement a robust API for common string operations. The bad news is that there are still far too many libraries that use the old crummy data type for strings, including much of the Prelude. And, I guess in the interest of simplicity, many beginner tutorials tend to use String as well. This is quite unfortunate, but it does seem to be changing: Aeson uses Data.Text, the ClassyPrelude ditches String almost entirely (keeping it for Show only), and in general most modern libraries avoid String.

Hopefully HXT will be updated to use modern string types soon. In the meantime, I believe that xml-conduit (http://hackage.haskell.org/package/xml-conduit) might be what is desired.

Why don't you think TagSoup would have worked? I've used it for quite a few use cases.

edit: Then to make things dead simple, add on dom-selector:

http://hackage.haskell.org/package/dom-selector

It enables using css selectors like so:

    queryT [jq| h2 span.titletext |] root
The project wasn't that recent, so I don't quite remember, but I would have wanted something like dom-selector, and that one didn't come up in my searches for solutions.

It's interesting that XML libs have to invent operators and obnoxious syntax (like HXT's arrow usage, or coincidentally the fact that HXT's parser uses the IO type, which is just crazy talk). dom-selector seems to have the same problem. I prefer readable functions, not DSLs where my code suddenly descends into this magic bizarro-world of operator soup for a moment.

Lenses would make tree-based extraction easier, I think, although lenses aren't easy to understand or that easy to read. Tree traversal with lenses and zippers seems unnecessarily complicated to me.

In a scraper you just want to collect items recursively, and return empty/Nothing values for anything that fails a match: Collect every item that contains a <div class="h-sku productinfo">, map its h2 to a title and its <div class="price"> to a price, and then combine those two fields into a record. It's something that should result in eminently readable code, not just because it's a conceptually trivial task, but also because someday you need to go back to the code and remember how it works.

> I prefer readable functions, not DSLs where my code suddenly descends into this magic bizarro-world of operator soup for a moment.

Bizarro world of operator soup? I don't really follow you. That dom selector code just compiles down into functions itself. I don't see how anything could be any clearer than a css selector for selecting an html element.