by logn 1949 days ago
I think scraping is just inherently brittle, whether you go by DOM traversal or by regex. AI may have the best potential. Regex can be slightly more brittle, as you point out, with commented-out HTML or myriad other problems, but it can also be less brittle than DOM traversal if you craft more lenient patterns. The main problem I found was regexes performing poorly because of recursion and stack overflows (Google's RE2 library addresses this). My favorite performance trick is to use a negated character class rather than a dot: /<foo[^>]*>/
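To make the negated-class trick concrete, here is a rough sketch in Python using the standard `re` module. The sample HTML string and the variable names are made up for illustration, and this only shows matching behavior, not a benchmark.

```python
import re

# Hypothetical sample input, not from any real page.
html = '<foo id="a" class="b">text</foo> <bar>other</bar>'

# Greedy dot: runs to the end of the line, then backtracks to the last '>'.
greedy = re.compile(r'<foo.*>')
# Lazy dot: stops at the first '>', but checks for it after every character.
lazy = re.compile(r'<foo.*?>')
# Negated class: can never cross a '>', so there is nothing to backtrack over.
negated = re.compile(r'<foo[^>]*>')

greedy_match = greedy.search(html).group(0)
lazy_match = lazy.search(html).group(0)
negated_match = negated.search(html).group(0)

print(greedy_match)   # the whole sample string
print(negated_match)  # just the opening <foo ...> tag
```

The greedy dot overshoots to the final `>` in the input, while the negated class stops at the end of the opening tag. A side effect worth knowing: `[^>]` also matches newlines, whereas `.` does not unless `re.DOTALL` is set, so the negated form handles tags split across lines. It still breaks on an attribute value that contains a literal `>`, and it would also match a tag like `<foobar>`.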