HN Mirror


by kbenson 2811 days ago

The exact wording of the statement matters. You can parse HTML with a powerful regular expression, but it's not a good tool for the job. That said, I find it a wonderful tool for extracting specific portions of an HTML document.

If you actually just care about retrieving a few specific bits of data within a page, I've found parsing libraries (including ones that allow for CSS selectors) to be just as brittle to changes as regular expression extraction, and not all that much easier to use, given a good grasp of both technologies.

That said, if you need to alter an HTML document in some non-trivial way, parsing is probably the way to go.
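To make the "extract specific portions" case concrete, here is a minimal sketch in Python. The page fragment, the "price" class name, and the extract_price helper are all made up for illustration; none of them come from a real site or library.

```python
import re

# Hypothetical page fragment; the markup and the "price" class are assumptions.
html = '''
<div class="item">
  <h2>Widget</h2>
  <span class="price">$19.99</span>
</div>
'''

# Targeted pattern: find a <span> carrying class="price" and capture the dollar amount.
# This does not parse the document; it only looks for one specific shape of text.
PRICE_RE = re.compile(
    r'<span[^>]*class="price"[^>]*>\s*\$([0-9]+\.[0-9]{2})\s*</span>'
)

def extract_price(page):
    """Return the first price found as a float, or None if the pattern is absent."""
    match = PRICE_RE.search(page)
    return float(match.group(1)) if match else None

print(extract_price(html))
```

The pattern only pulls one value out of the page. It makes no attempt to understand nesting or to rewrite the document, which is the case where a real parser is the better fit.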


jandrese 2811 days ago

We once had two versions of a particular app. One used BeautifulSoup to parse the page and pull out the relevant elements. The other used some crusty old regex patterns. At the end of the day, the regex version required about half the maintenance that the tag soup version did. IMHO the difference was that the regex version took some of the content into consideration, unlike the tag-only version, which was more sensitive to otherwise invisible changes under the hood.
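One way to picture the "content vs. tag-only" difference is the contrived sketch below. Both page versions, the "Price:" label, and the class names are hypothetical, and the tag-only extractor uses the standard library's html.parser as a stand-in rather than BeautifulSoup. It shows one failure mode, not a general result.

```python
import re
from html.parser import HTMLParser

# Two hypothetical versions of the same page: a redesign changed the tag and class,
# but the visible text ("Price: $19.99") stayed the same.
OLD_PAGE = '<p>Price: <span class="price">$19.99</span></p>'
NEW_PAGE = '<p>Price: <div class="amt-v2"><b>$19.99</b></div></p>'

# Content-anchored: key on the visible label, skip any tags in between,
# then capture the amount.
CONTENT_RE = re.compile(r'Price:\s*(?:<[^>]+>\s*)*\$([0-9]+\.[0-9]{2})')

def by_content(page):
    match = CONTENT_RE.search(page)
    return match.group(1) if match else None

class TagOnly(HTMLParser):
    """Structure-only extractor: takes the text inside <span class="price">."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False

    def handle_data(self, data):
        if self.inside and self.value is None:
            self.value = data.strip().lstrip("$")

def by_tag(page):
    parser = TagOnly()
    parser.feed(page)
    parser.close()
    return parser.value

for name, page in (("old", OLD_PAGE), ("new", NEW_PAGE)):
    print(name, by_content(page), by_tag(page))
```

The reverse also holds: if the visible label changed from "Price:" to something else while the tags stayed put, the content-anchored pattern would be the one that breaks.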


Vindicis 2811 days ago

Not to mention the sheer difference in performance between the two. I've found regexes to be orders of magnitude faster than parsers, for extracting data, that is.
