Hacker News
by
latenightcoding
227 days ago
When I used to crawl the web, battle-tested Perl regexes were more reliable than anything else; commented-out URLs would have been added to my queue.
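To make the point about commented-out URLs concrete, here is a minimal Python sketch (not the Perl the commenter used). The sample page, the URL pattern, and the names `regex_urls`, `LinkParser`, and `parser_urls` are all made up for illustration. It compares a plain regex scan with the standard library's `html.parser`, which treats `<!-- ... -->` as a comment and never reports the anchor inside it.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page: one live link, one link inside an HTML comment.
SAMPLE_HTML = """
<html><body>
<a href="https://example.com/live">live</a>
<!-- <a href="https://example.com/hidden">old link</a> -->
</body></html>
"""

# Deliberately simple URL pattern (an assumption, not a full URL grammar).
URL_RE = re.compile(r'https?://[^\s"\'<>]+')


def regex_urls(html):
    """Return every URL-looking string, including ones inside comments."""
    return URL_RE.findall(html)


class LinkParser(HTMLParser):
    """Collect href values from <a> tags that the parser actually sees."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(value)


def parser_urls(html):
    """Return hrefs found by the HTML parser; commented-out markup is skipped."""
    parser = LinkParser()
    parser.feed(html)
    return parser.urls


if __name__ == "__main__":
    print(regex_urls(SAMPLE_HTML))
    print(parser_urls(SAMPLE_HTML))
```

On this sample the regex scan returns both URLs, while the parser returns only the live one. Whether the extra URL is a bug or a feature depends on what the crawler's queue is for.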
rightbyte
227 days ago
DOM navigation for fetching some data is for tryhards. Using a regex to grab the correct paragraph or div or whatever is fine, and is more robust against things moving around on the page.
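A rough Python sketch of the "things moving around" argument, under heavy assumptions: the two page versions, the `id="price"` marker, and the helpers `grab_by_regex` and `grab_by_path` are invented for illustration, and the pages are tidy enough to parse as XML. The regex keys on the element itself, while the positional path depends on where the element sits in the tree.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page before and after a redesign; only the nesting changes.
PAGE_V1 = '<div><p id="price">Price: $19.99</p></div>'
PAGE_V2 = (
    '<section><div class="wrap"><span>'
    '<p id="price">Price: $19.99</p>'
    '</span></div></section>'
)

# Matches the target paragraph wherever it appears in the markup.
PRICE_RE = re.compile(r'<p[^>]*\bid="price"[^>]*>(.*?)</p>', re.S)


def grab_by_regex(html):
    """Return the paragraph text, or None if the pattern is not found."""
    match = PRICE_RE.search(html)
    return match.group(1).strip() if match else None


def grab_by_path(html):
    """Return the text of a <p> that is a direct child of the root element."""
    root = ET.fromstring(html)
    node = root.find("./p")
    return node.text if node is not None else None


if __name__ == "__main__":
    for page in (PAGE_V1, PAGE_V2):
        print(grab_by_regex(page), grab_by_path(page))
```

Here the regex finds the paragraph in both versions, and the fixed path only finds it in the first. A DOM query keyed on the `id` rather than on position would also survive the move, so this sketch only covers the case of position-dependent navigation.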
chaps
227 days ago
Doing both is fine! It's just that, once you've figured out your regex and such, hardening/generalizing demands DOM iteration. It sucks, but it is what it is.
horseradish7k
227 days ago
But not when crawling. You don't know the page format in advance - you don't even know what the page contains!