HN Mirror

by egberts1 1171 days ago

I know of no Regex pattern that can handle all the old and new HTML, HTML5 included. Believe me: as one who is looking to put an HTML parser on an FPGA/ASIC for higher speed, I've ventured down this rabbit hole a few times in the fruitless pursuit of this elusive pure Regex pattern for HTML et al. The problem is Regex's lack of support for multiple state machines and the interactions needed between them.
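To make the state problem concrete, here is a minimal Python sketch (the patterns and sample strings are made up for illustration, not taken from any real parser). A single pattern has no counter for nesting depth, so the lazy form stops at the first closing tag and the greedy form runs on to the last one:

```python
import re

# Hypothetical patterns for illustration only.
LAZY_DIV = re.compile(r"<div>(.*?)</div>", re.DOTALL)
GREEDY_DIV = re.compile(r"<div>(.*)</div>", re.DOTALL)

nested = "<div>outer <div>inner</div> tail</div>"
siblings = "<div>a</div><div>b</div>"

# Lazy matching pairs the outer <div> with the inner </div>.
lazy_capture = LAZY_DIV.search(nested).group(1)

# Greedy matching swallows the boundary between two sibling elements.
greedy_capture = GREEDY_DIV.search(siblings).group(1)

print(repr(lazy_capture))    # 'outer <div>inner'
print(repr(greedy_capture))  # 'a</div><div>b'
```

Neither capture is a well-formed element body, which is the kind of failure that pushes you toward a stack or a second state machine.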

Perl came closest to giving me the smallest HTML parser.

Before doing any simplistic Regex on HTML, several preprocessing passes of Regex are probably required, probably in this order (my 20-year-old memory is failing here):

- De-CDATA

- De-pairing of quotes

- De-symbolization of HTML symbols, entities, and codes (de-escaping)

- Handling of lone unterminated tags (e.g. <p> with no </p>)

All of that comes before you can even attempt pairing <XXX> with </XXX> and getting at the HTML tags and attributes.

In short, additional scripting is required to apply multiple Regex patterns before one can even get into properly parsing the HTML.
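A rough Python sketch of what such scripting could look like, following the pass order listed above. Everything here is an assumption for illustration: the pattern names, the `preprocess` function, and the sample input are hypothetical, and the sketch ignores comments, script/style bodies, unquoted attributes, and void elements such as <br> (which it would report as unterminated).

```python
import html
import re

CDATA = re.compile(r"<!\[CDATA\[.*?\]\]>", re.DOTALL)
# A tag whose attribute area may hold quoted strings containing '>'.
TAG = re.compile(r"""<(/?)([A-Za-z][A-Za-z0-9]*)((?:"[^"]*"|'[^']*'|[^'">])*)>""")
ATTR = re.compile(r"""([A-Za-z_:][-\w:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def preprocess(text):
    """Run the passes in order and return (tags, unterminated)."""
    # Pass 1: de-CDATA, so CDATA payloads are never scanned for tags.
    text = CDATA.sub("", text)

    tags = []          # (name, is_closing, attributes)
    open_stack = []    # names of tags still waiting for a close
    unterminated = []  # tags that never received a matching close

    # Pass 2: find tags, keeping quoted attribute values whole.
    for closing, name, attr_text in TAG.findall(text):
        name = name.lower()
        # Pass 3: de-escape entities inside attribute values.
        attrs = {
            key.lower(): html.unescape(dq or sq)
            for key, dq, sq in ATTR.findall(attr_text)
        }
        tags.append((name, bool(closing), attrs))

        # Pass 4: track lone unterminated tags with an explicit stack,
        # which is exactly the state a single Regex cannot carry.
        if not closing:
            open_stack.append(name)
        elif name in open_stack:
            while open_stack[-1] != name:
                unterminated.append(open_stack.pop())
            open_stack.pop()

    unterminated.extend(reversed(open_stack))
    return tags, unterminated


sample = '<div title="x > y &amp; z"><p>one<p>two<![CDATA[<b>no tag</b>]]></div>'
tags, unterminated = preprocess(sample)
print([name for name, _, _ in tags])
print(tags[0][2])
print(unterminated)
```

Note that the stack in pass 4 lives in the script, not in any pattern, so this is Regex plus glue logic rather than a pure Regex solution.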

The simplest I've gotten uses both bash logic and Regex, but it fails on certain HTML codes.

Federico Tomassetti, a well-known expert on domain-specific languages and transpilers, covers nearly all the valid libraries in many modern languages just for parsing HTML.

Federico makes it easier for a first-time writer of HTML parsing code to take that first step: selecting an HTML parser library.

https://tomassetti.me/parsing-html/

1 comment

stevefan1999 1170 days ago

Regex, if extended, can go as far as Turing-complete.

Meanwhile a regular expression (the OG Regex) is just an NFA and should be easier to implement in a circuit. The problem is that an NFA circuit can still need exponential expansion if it is converted to a DFA (the power-set construction, where each DFA state encodes one set of possible NFA states), and with Turing-complete Regex you have the halting problem. The first is an exponential worst case and the second is undecidable in general, so both are hellish to solve.
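The exponential expansion can be seen with a textbook example, sketched here in Python (the function names are my own, and this is an illustration rather than anything circuit-specific). The NFA for "the n-th symbol from the end is a" has n + 1 states, yet the power-set construction reaches 2^n DFA states:

```python
def nth_from_end_nfa(n):
    """NFA over {a, b} accepting strings whose n-th symbol from the end is 'a'.

    States are 0..n. State 0 loops on both symbols and guesses which 'a'
    is the distinguished one; state n is the accepting state.
    """
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for state in range(1, n):
        delta[(state, "a")] = {state + 1}
        delta[(state, "b")] = {state + 1}
    return delta


def count_dfa_states(delta, start=0, alphabet="ab"):
    """Power-set construction: count DFA states reachable from {start}."""
    start_set = frozenset({start})
    seen = {start_set}
    work = [start_set]
    while work:
        current = work.pop()
        for symbol in alphabet:
            # The DFA successor is the union of all NFA successors.
            target = frozenset(
                nxt
                for state in current
                for nxt in delta.get((state, symbol), ())
            )
            if target not in seen:
                seen.add(target)
                work.append(target)
    return len(seen)


for n in (2, 4, 8):
    print(n + 1, "NFA states ->", count_dfa_states(nth_from_end_nfa(n)), "DFA states")
```

In hardware terms the NFA can be kept as one flip-flop per state updated in parallel, which avoids the blowup, whereas a fully determinized table pays for every reachable subset.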

egberts1 1170 days ago

Yeah. Doing this Regex stack-tracking of multiple but separate data streams at the firmware level requires some CPU assist for this "halting" problem.
