by dsshimel 4850 days ago
I was stunned when I got the notification email to moderate him =D
1 comments

I was surprised by his recommendation. I found lxml, with its weak XPath implementation, to be a poor tool for parsing XML. How can it be a good tool for parsing non-conforming HTML?

Perhaps my reply to Linus was snarky, but in my experience Beautiful Soup was easy to use for parsing HTML, whereas lxml for parsing XML was not. Granted, the XML I had to parse used namespaces, but there was no reason for it to be difficult, and there was no reason to document so poorly which XPath features are supported.
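For anyone wondering why namespaces make XPath-style queries painful, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml so it runs without extra installs; lxml's xpath() takes a similar namespaces mapping. The document, the namespace URIs, and the prefixes "f" and "m" are made up for illustration and are not the XML I actually had to parse.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: a default namespace plus a prefixed one.
doc = """<feed xmlns="http://example.com/ns/feed" xmlns:m="http://example.com/ns/meta">
  <entry><m:title>First</m:title></entry>
  <entry><m:title>Second</m:title></entry>
</feed>"""

root = ET.fromstring(doc)

# Prefixes used in the query; they do not have to match the document's prefixes,
# only the namespace URIs matter.
ns = {"f": "http://example.com/ns/feed", "m": "http://example.com/ns/meta"}

# An unqualified path matches nothing, because every <entry> lives in the
# default namespace of the document.
unqualified = root.findall("entry")

# Qualifying each step with a mapped prefix finds the elements.
qualified = root.findall("f:entry", ns)
titles = [t.text for t in root.findall("f:entry/m:title", ns)]

print(len(unqualified), len(qualified), titles)
```

The trap is the first query: it looks right and silently returns nothing, which is most of what makes namespaced XML feel harder than it should.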

No, you were right and he was wrong. Just because he made a copy of Unix doesn't mean he knows how to parse HTML with Python. Anyone who's done a lot of web scraping or (as in my case) crawling/indexing will see the merit of your response.
Was that actually Linus Torvalds? I just assumed it was someone trolling as him. For web scraping, PhantomJS is much better than Beautiful Soup or any XML parsing library. Lots of stuff happens in JS these days, so you need programmatic access to the DOM to really grab data efficiently. In fact, lots of sites hide important values inside JS in order to thwart libraries that post-process HTML/XML, like Beautiful Soup.
It is interesting when you read a lot of his stuff. It seems that he absolutely did not want to worry about "user" space, i.e. user software.

I always wondered: did he ever write user-level code? He did write 'git' eventually. But I wonder what else, what does he think of web application development, etc?

>But I wonder what else, what does he think of web application development, etc?

He doesn't.