| HN Mirror


by westurner 1477 days ago

BeautifulSoup is an API for multiple parsers https://beautiful-soup-4.readthedocs.io/en/latest/#installin... :

  BeautifulSoup(markup, "html.parser") 
  BeautifulSoup(markup, "lxml")
  BeautifulSoup(markup, "lxml-xml")
  BeautifulSoup(markup, "xml") 
  BeautifulSoup(markup, "html5lib")
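
A minimal sketch of trying the same markup against each of those parser names. The `count_paragraphs` helper and the sample markup are hypothetical, made up for illustration; only "html.parser" ships with Python, so parsers that are not installed are skipped by catching bs4's `FeatureNotFound`:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup, not from the linked benchmark.
MARKUP = "<html><body><p>one</p><p>two</p></body></html>"

PARSERS = ("html.parser", "lxml", "lxml-xml", "xml", "html5lib")


def count_paragraphs(markup, parsers=PARSERS):
    """Return {parser name: number of <p> tags found, or None if unavailable}."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # lxml or html5lib is not installed in this environment.
            results[name] = None
            continue
        results[name] = len(soup.find_all("p"))
    return results


if __name__ == "__main__":
    for name, count in count_paragraphs(MARKUP).items():
        print(name, "not installed" if count is None else count)
```

Comparing the counts across parsers is a quick sanity check that a faster parser is still finding the same elements.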

Looks like lxml with XPath is still the fastest on Python 3.10.4, going by "Pyquery, lxml, BeautifulSoup comparison" https://gist.github.com/MercuryRising/4061368 ; which is fine for parsing (X)HTML(5) that validates.

(EDIT: Is xml/html5 a good format for data serialization? defusedxml ... Simdjson, Apache arrow.js)

2 comments

driscoll42 1477 days ago

I was curious, so I tried that performance test you linked to on my machine with the various parsers:

    ==== Total trials: 100000 =====
    bs4 lxml total time: 110.9
    bs4 html.parser total time: 87.6
    bs4 lxml-xml total time: 0.5
    bs4 xml total time: 0.5
    bs4 html5lib total time: 103.6
    pq total time: 8.7
    lxml (cssselect) total time: 8.8
    lxml (xpath) total time: 5.6
    regex total time: 13.8 (doesn't find all p)

bs4 is damn fast with the lxml-xml or xml parsers
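
For anyone who wants to reproduce this kind of comparison without extra installs, here is a rough stdlib-only sketch. It is not the gist's code: `PCounter`, `count_with_parser`, `count_with_regex`, `time_it`, and the sample markup are all hypothetical names made up for illustration. It also shows one plausible way a regex can miss `p` tags (uppercase names or attributes), which is an assumption and may not be why the gist's regex misses them:

```python
import re
import timeit
from html.parser import HTMLParser

# Hypothetical sample: three paragraphs, two of which a naive regex misses.
SAMPLE = '<html><body><p>a</p><P class="x">b</P><p id="c">c</p></body></html>'


class PCounter(HTMLParser):
    """Counts <p> start tags; HTMLParser lowercases tag names for us."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.count += 1


def count_with_parser(markup):
    counter = PCounter()
    counter.feed(markup)
    counter.close()
    return counter.count


NAIVE_P = re.compile(r"<p>")  # matches only a bare lowercase <p>


def count_with_regex(markup):
    return len(NAIVE_P.findall(markup))


def time_it(func, markup, trials=1000):
    """Total seconds for `trials` calls, in the spirit of 'total time' above."""
    return timeit.timeit(lambda: func(markup), number=trials)


if __name__ == "__main__":
    for func in (count_with_parser, count_with_regex):
        print(func.__name__, func(SAMPLE), round(time_it(func, SAMPLE), 4))
```

Printing the match count next to the timing makes it obvious when a fast result is fast only because it found fewer elements.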


aumerle 1477 days ago

You want a proper HTML5 parser that can handle invalid documents. The fastest one is https://github.com/kovidgoyal/html5-parser , which is over 30x faster than html5lib.
