| HN Mirror

:thumbs up:

A second iteration of review:

* encoding detection from <meta> tags doesn't normalize encodings - Python doesn't use the same names as HTML;

* I'm still not sure encoding detection is correct, as it is unclear what are priorities in the current implementation. It should be 1) Content-Type header; 2) BOM marks; 3) encoding in meta tags (or xml declared encoding if you support it); 4) content-based guessing - chardet, etc., or just a default value. I.e. encoding in meta should have less priority than Content-Type header, but more priority than chardet, and if I understand it properly, response.text is decoded both using Content-Type header and chardet.

* lxml's fromstring handles XML (XHTML) encoding declarations, and it may fail in case of unicode data (http://lxml.de/parsing.html#python-unicode-strings), so passing response.text to fromstring is not good. At the same time, relying on lxml to detect encoding is not enough, as http headers should have a higher priority. In parsel we're re-encoding text to utf8, and forcing utf8 parser for lxml to solve it: https://github.com/scrapy/parsel/blob/f6103c8808170546ecf046....

* when extracting links, it is not enough to use raw @href attribute values, as they are allowed to have leading and trailing whitespaces (see https://github.com/scrapy/w3lib/blob/c1a030582ec30423c40215f...)

* absolute_links doesn't look correct for base urls which contain path. It also may have issues with urls like tel:1122333, or mailto:.

For encoding detection we're using https://github.com/scrapy/w3lib/blob/c1a030582ec30423c40215f... in Scrapy. It works well overall; its weakness is that it doesn't require a HTML tree, and doesn't parse it, extracting meta information only from first 4Kb using a regex (4Kb limit is not good). Other than that, it does all the right things AFAIK.