| HN Mirror

	by dsshimel 4897 days ago
	I was stunned when I got the notification email to moderate him =D

aleyan 4897 days ago

I was surprised by his recommendation. I found lxml, with its weak XPath implementation, to be a poor tool for parsing XML. How can it be a good tool for parsing non-conforming HTML?

Perhaps my reply to Linus was snarky, but in my experience Beautiful Soup was easy to use for parsing HTML, whereas lxml for parsing XML was not. Granted, the XML I had to parse used namespaces, but there was no reason for that to be difficult, and no reason for poorly documenting which XPath features are supported.
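For readers who have not hit the namespace problem, here is a minimal sketch of what it looks like. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages; lxml's xpath() method takes a comparable namespaces mapping. The document, the URIs, and the prefixes "a" and "ex" are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: a default namespace plus a prefixed one.
XML = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:ex="http://example.com/ns">
  <entry><title>First</title><ex:score>3</ex:score></entry>
  <entry><title>Second</title><ex:score>7</ex:score></entry>
</feed>"""

# These prefixes are local to the query. They need not match the prefixes
# used in the document; only the namespace URIs have to match.
NS = {"a": "http://www.w3.org/2005/Atom", "ex": "http://example.com/ns"}

root = ET.fromstring(XML)

# An unprefixed path finds nothing, because every element here is namespaced.
unqualified = root.findall(".//title")

# Qualifying every step of the path with a mapped prefix finds the elements.
titles = [t.text for t in root.findall(".//a:entry/a:title", NS)]
scores = [int(s.text) for s in root.findall(".//a:entry/ex:score", NS)]
```

The usual stumbling block is the default namespace: elements written without a prefix in the document still need a prefix in the query.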

hnriot 4897 days ago

No, you were right and he was wrong. Just because he made a copy of Unix doesn't mean he knows how to parse HTML with Python. Anyone who's done a lot of web scraping or (as in my case) crawling/indexing will see the merit of your response.

rosenjon 4897 days ago

Was that actually Linus Torvalds? I just assumed it was someone trolling as him. For web scraping, PhantomJS is much better than Beautiful Soup or any xml parsing library. Lots of stuff happens in JS these days, so you need programmatic access to the DOM to really grab data efficiently. In fact, lots of sites hide important values inside JS in order to thwart libraries that post-process html/xml like Beautiful Soup.
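A rough sketch of the "values hidden inside JS" situation, using only the Python standard library. The page, the element id "price", and the variable name "data" are invented for illustration. A static parser sees an empty element, while the value sits in the script text. Pulling it out with a regular expression works only when the value is a plain JSON literal; anything computed at runtime would need a real JS engine, which is the case for a headless browser.

```python
import json
import re
from html.parser import HTMLParser

# Hypothetical page: the visible element stays empty until the script runs.
PAGE = """<html><body>
<span id="price"></span>
<script>
var data = {"price": 19.99, "currency": "USD"};
document.getElementById("price").textContent = data.price;
</script>
</body></html>"""


class ScriptCollector(HTMLParser):
    """Collects the static text of #price and the raw text of <script> blocks."""

    def __init__(self):
        super().__init__()
        self.current = None
        self.price_text = ""
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.current = "script"
            self.scripts.append("")
        elif tag == "span" and dict(attrs).get("id") == "price":
            self.current = "price"

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        if self.current == "script":
            self.scripts[-1] += data
        elif self.current == "price":
            self.price_text += data


parser = ScriptCollector()
parser.feed(PAGE)

# The static markup carries no price: parser.price_text is still empty here.
# The regex assumes the object is a single-line JSON literal assigned to `data`.
match = re.search(r"var data = (\{.*?\});", parser.scripts[0])
data = json.loads(match.group(1))
```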

berlinbrown 4897 days ago

It is interesting when you read a lot of his stuff. It seems that he absolutely did not want to worry about "user" space, i.e. user software.

I always wondered: did he ever write user-level code? He did write 'git' eventually. But I wonder what else, what does he think of web application development, etc?

icelancer 4897 days ago

>But I wonder what else, what does he think of web application development, etc?

He doesn't.
