| HN Mirror



	by timdeve 1289 days ago
	I've been using it for quick web scraping scripts and it's really nice.


nerpderp82 1289 days ago

What libraries do you use? I do most of my scraping in Python using beautifulsoup.


Borkdude 1289 days ago

Babashka doesn't have a built-in HTML parsing library but it supports it through pods:

https://github.com/babashka/pod-registry

Pods can be written in any language and they can expose functions to babashka by implementing a protocol.

One pod exposing HTML parsing is:

https://github.com/retrogradeorbit/bootleg

Here is an example of how to use that:

https://github.com/babashka/pod-registry/blob/master/example...


timdeve 1289 days ago

As other people have said, Bootleg + Hickory.

Here is an admittedly not very clean example that grabs stream URLs from hltv.org:

https://github.com/TimDeve/.dotfiles/blob/master/scripts/gen...

Also a basic RSS reader using the Clojure XML lib:

https://github.com/TimDeve/.dotfiles/blob/master/scripts/gen...


noblepayne 1289 days ago

As mentioned by the one and only Borkdude, bootleg is a nice option for this.

It includes the Hickory library: https://github.com/clj-commons/hickory

I'm a former BeautifulSoup user and have found the combination of (1) having the scraped data presented in plain Clojure data structures, and (2) Hickory's built-in selectors, to be a very nice experience.

Happy scraping!
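
For readers coming from Python, like the person asking, here is a rough sketch of the two points above: HTML turned into plain nested data, then queried with small selector functions. This is only an analogy written with Python's standard library. It is not Hickory's or bootleg's API, and the names `TreeBuilder`, `parse`, `select`, and `tag` are hypothetical helpers made up for this illustration. The node shape (type, tag, attrs, content) loosely mirrors Hickory's map format.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Builds a tree of plain dicts and lists from HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = {"type": "document", "content": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"type": "element", "tag": tag,
                "attrs": dict(attrs), "content": []}
        self.stack[-1]["content"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Keep non-blank text as plain strings inside the parent's content.
        if data.strip():
            self.stack[-1]["content"].append(data)


def parse(html):
    """Parse an HTML string into a nested dict/list tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def select(node, pred):
    """Return every element under node for which pred(element) is true."""
    found = []
    for child in node["content"]:
        if isinstance(child, dict):
            if pred(child):
                found.append(child)
            found.extend(select(child, pred))
    return found


def tag(name):
    """Selector that matches elements by tag name."""
    return lambda el: el["tag"] == name


tree = parse('<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>')
links = [el["attrs"]["href"] for el in select(tree, tag("a"))]
```

Because the result is ordinary dicts and lists, the usual tools (comprehensions, filtering, printing) work on it directly, which is roughly the appeal described above for Clojure maps and vectors.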


aeonik 1289 days ago

Not OP but I use Reaver with good results. It supports all of JSoup's selectors, and makes it very clean to extract data from HTML.

The documentation is a little lacking, though; I had to look up other examples on GitHub to figure out how to use all the features.

https://github.com/mischov/reaver


nathell 1289 days ago

I plan to port my scraping framework (Skyscraper, https://github.com/nathell/skyscraper) to babashka one day. I’m not sure how easy it will be, though, since it uses core.async (which I believe bb has limited support for) and SQLite via clojure.java.jdbc.


kot-behemoth 1288 days ago

Core.async is listed as “built-in” in the Babashka Toolbox (https://babashka.org/toolbox). Might be worth checking if the compatibility has improved.

And bb supports Honey SQL and SQL pods (https://github.com/babashka/babashka-sql-pods), so you might already be compatible!


nerpderp82 1289 days ago

Thanks everyone, all great pointers. I'll do my next scrape in Babashka!
