|
>  the future of publishing at W3C

That is an amazing example. It's not even "double UTF-8", it's UTF-8 six times (including the one to get it on the Web), it's been decoded as Latin-1 twice and Windows-1252 three times, and at the end there's a non-breaking space that's been converted to a space. All to represent what originated as a single non-breaking space anyway.

Which makes me happy that my module solves it.

>>> from ftfy.fixes import fix_encoding_and_explain
>>> fix_encoding_and_explain(" the future of publishing at W3C")
('\xa0the future of publishing at W3C',
 [('encode', 'sloppy-windows-1252', 0),
  ('transcode', 'restore_byte_a0', 2),
  ('decode', 'utf-8-variants', 0),
  ('encode', 'sloppy-windows-1252', 0),
  ('decode', 'utf-8', 0),
  ('encode', 'latin-1', 0),
  ('decode', 'utf-8', 0),
  ('encode', 'sloppy-windows-1252', 0),
  ('decode', 'utf-8', 0),
  ('encode', 'latin-1', 0),
  ('decode', 'utf-8', 0)])
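
For anyone wondering how a single non-breaking space piles up like that, here is a rough sketch of two such layers using only the Python standard library. The helper names `mangle` and `unmangle` are made up for illustration and are not part of ftfy. It also uses Python's strict `windows-1252` codec rather than ftfy's sloppy variant, so it only works when every byte happens to be defined in that code page.

```python
def mangle(text, wrong_codec):
    """Encode text as UTF-8, then mis-decode the bytes with a single-byte codec."""
    return text.encode("utf-8").decode(wrong_codec)


def unmangle(text, wrong_codec):
    """Undo one layer: re-encode with the wrong codec, then decode as UTF-8."""
    return text.encode(wrong_codec).decode("utf-8")


original = "\xa0the future of publishing at W3C"

# Layer 1: the NBSP becomes bytes C2 A0, which Latin-1 reads as two characters.
once = mangle(original, "latin-1")

# Layer 2: those two characters become bytes C3 82 C2 A0, read as four characters.
twice = mangle(once, "windows-1252")

print(len(original), len(once), len(twice))

# Peeling the layers off in reverse order restores the original string.
restored = unmangle(unmangle(twice, "windows-1252"), "latin-1")
print(restored == original)
```

Each round adds characters in front of the non-breaking space, which is why the quoted headline grew such a long prefix. Reversing it by hand requires knowing which codec was misapplied at each layer, and that sequence is what the list of steps in the output above spells out.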
|