Hacker News
HTML parsing in Elixir with leex and yecc (notes.eellson.com)
67 points by eellson 3437 days ago
5 comments

Another great use of leex and yecc -- a SQL parser from folks at Basho:

https://github.com/basho/riak_ql

Specifically:

https://github.com/basho/riak_ql/blob/develop/src/riak_ql_le...

https://github.com/basho/riak_ql/blob/develop/src/riak_ql_pa...

It is a very concise and well-written piece of software.

The best library I've found for this sort of thing is gumbo. https://github.com/google/gumbo-parser

With its help I've created scrapers and crawlers that digest even the most disgusting HTML.

Hmm... This seems more like XML parsing to me than HTML parsing - in particular, there's no handling of (completely valid) omitted end tags.

Definitely interesting though.
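
To illustrate the omitted-end-tag point, here is a minimal Python sketch (not the article's Elixir code). It uses the standard library's html.parser to log tag events for a valid HTML fragment, then runs a naive "every start tag needs a matching end tag" check, which is roughly what an XML-style grammar assumes. The names TagLogger, tag_events and is_strictly_balanced are made up for this example.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records start/end tag events exactly as they appear in the input."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(markup):
    """Return the list of (kind, tag) events found in the markup."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


def is_strictly_balanced(events):
    """XML-style check: each end tag must close the most recent open tag."""
    stack = []
    for kind, tag in events:
        if kind == "start":
            stack.append(tag)
        elif not stack or stack.pop() != tag:
            return False
    return not stack


# Valid HTML: the </li> and </p> end tags may be omitted.
html_doc = "<ul><li>one<li>two</ul><p>first<p>second"
events = tag_events(html_doc)

print(events)
print(is_strictly_balanced(events))
```

The fragment is legal HTML, but the strict check rejects it because no </li> or </p> ever arrives. A parser following the HTML spec has to close those elements implicitly, which is the part a grammar with mandatory close tags cannot express.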

The article mentions Floki, which incidentally just added support for the servo/html5ever parser, written in Rust.

https://github.com/hansihe/ex_html5ever

Excellent article about creating parsers, though, even if HTML parsing is a particularly difficult problem.

Floki is a great lib; I used it to write a very basic URL polling CLI tool in just 72 lines of code: https://github.com/vikeri/proba/blob/master/lib/proba.ex
As the author says, this is a toy project to learn Elixir; don't use in production, especially not on dynamic/user content.