Hacker News
by
olegp
6429 days ago
Very few pages have well-formed markup. The few large scraping projects I've seen started out with a markup-based approach and then switched to regular expressions.
What experiences has everyone else had?
3 comments
Harkins
6429 days ago
I've had a lot of success with BeautifulSoup. It turns terrible HTML into a usable DOM tree.
joseakle
6429 days ago
I am using Beautiful Soup; it works with malformed markup.
DenisM
6429 days ago
Uhm. I thought any valid HTML is also valid SGML? Are you sure you're not confusing it with XML markup?
ryanwaggoner
6429 days ago
I think he's saying that the HTML-parsing approach only works when the HTML is well-formed, and for most sites it isn't.
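For anyone who wants to try the three approaches discussed in this thread side by side, here is a rough sketch. It is not anyone's actual scraper: the `BROKEN_HTML` sample, the `LinkCollector` class, and the three `links_*` helpers are all made up for illustration. It uses only the Python standard library, with `html.parser` standing in for a lenient parser such as Beautiful Soup, so treat it as an approximation of what those tools do, not a description of how they work internally.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical malformed page: unclosed <p>, <li>, and <a> tags,
# plus an unquoted attribute value.
BROKEN_HTML = """
<html><body>
<p>First paragraph
<ul>
<li><a href=/one>One
<li><a href="/two">Two</a>
</ul>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags, closed or not."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_strict(markup):
    """Strict XML parse: returns None when the markup is not well-formed."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None
    return [a.get("href") for a in root.iter("a")]


def links_lenient(markup):
    """Event-based HTML parse that tolerates unclosed tags."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


def links_regex(markup):
    """Regex fallback: grabs href values without building any tree."""
    pattern = r'<a\s+[^>]*href=["\']?([^"\'\s>]+)'
    return re.findall(pattern, markup, flags=re.IGNORECASE)


if __name__ == "__main__":
    print("strict: ", links_strict(BROKEN_HTML))
    print("lenient:", links_lenient(BROKEN_HTML))
    print("regex:  ", links_regex(BROKEN_HTML))
```

On this sample the strict XML parse gives up, while the lenient parser and the regex both recover the two links. The regex is only as good as the pattern, though, and will miss things like attributes split across unusual whitespace or `href` values containing quotes, which is the usual trade-off when a project moves from parsing to pattern matching.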