Hacker News new | ask | show | jobs
by westurner 1477 days ago
BeautifulSoup is an API for multiple parsers https://beautiful-soup-4.readthedocs.io/en/latest/#installin... :

  BeautifulSoup(markup, "html.parser") 
  BeautifulSoup(markup, "lxml")
  BeautifulSoup(markup, "lxml-xml")
  BeautifulSoup(markup, "xml") 
  BeautifulSoup(markup, "html5lib")
Looks like lxml w/ xpath is still the fastest with Python 3.10.4 from "Pyquery, lxml, BeautifulSoup comparison" https://gist.github.com/MercuryRising/4061368 ; which is fine for parsing (X)HTML(5) that validates<

(EDIT: Is xml/html5 a good format for data serialization? defusedxml ... Simdjson, Apache arrow.js)

2 comments

I was curious, so I tried that performance test you linked to on my machine with the various parsers:

    ==== Total trials: 100000 =====
    bs4 lxml total time: 110.9
    bs4 html.parser total time: 87.6
    bs4 lxml-xml total time: 0.5
    bs4 xml total time: 0.5
    bs4 html5lib total time: 103.6
    pq total time: 8.7
    lxml (cssselect) total time: 8.8
    lxml (xpath) total time: 5.6
    regex total time: 13.8 (doesn't find all p)
bs4 is damn fast with the lxml-xml or xml parsers
You want a proper html 5 parser that can handle non valid documents. And the fastest one is https://github.com/kovidgoyal/html5-parser over 30x faster than html5lib