HN Mirror


by rspeer 4040 days ago

> ÃƒÆ’Ã‚Æ’ÃƒÂ¢Ã‚â‚¬Ã‚Å¡ÃƒÆ’Ã‚â€šÃƒâ€šÃ‚Â the future of publishing at W3C

That is an amazing example.

It's not even "double UTF-8": it's been encoded as UTF-8 six times (including the one to get it on the Web) and decoded as Latin-1 twice and Windows-1252 three times, and at the end there's a non-breaking space that's been converted to a space. All to represent what originated as a single non-breaking space anyway.

Which makes me happy that my module (ftfy) solves it.

    >>> from ftfy.fixes import fix_encoding_and_explain
    >>> fix_encoding_and_explain("ÃƒÆ’Ã‚Æ’ÃƒÂ¢Ã‚â‚¬Ã‚Å¡ÃƒÆ’Ã‚â€šÃƒâ€šÃ‚Â the future of publishing at W3C")
    ('\xa0the future of publishing at W3C',
     [('encode', 'sloppy-windows-1252', 0),
      ('transcode', 'restore_byte_a0', 2),
      ('decode', 'utf-8-variants', 0),
      ('encode', 'sloppy-windows-1252', 0),
      ('decode', 'utf-8', 0),
      ('encode', 'latin-1', 0),
      ('decode', 'utf-8', 0),
      ('encode', 'sloppy-windows-1252', 0),
      ('decode', 'utf-8', 0),
      ('encode', 'latin-1', 0),
      ('decode', 'utf-8', 0)])
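
For anyone wondering how layers like these pile up, here is a minimal standard-library sketch of the mechanism. It is not ftfy's algorithm: the helper names `garble` and `ungarble` and the two-layer `layers` list are made up for illustration, and unlike ftfy it has to be told which wrong codecs were used. It also uses Python's strict `cp1252` codec rather than ftfy's `sloppy-windows-1252`, so it would fail on the handful of bytes that Windows-1252 leaves undefined.

```python
# Hypothetical helpers for illustration only; ftfy detects the layers itself.

def garble(text, wrong_codecs):
    """Simulate mojibake: encode as UTF-8, then decode with the wrong codec,
    once per entry in wrong_codecs."""
    for codec in wrong_codecs:
        text = text.encode("utf-8").decode(codec)
    return text


def ungarble(text, wrong_codecs):
    """Undo garble() by reversing each layer, last one first: re-encode with
    the wrong codec to recover the bytes, then decode them as UTF-8."""
    for codec in reversed(wrong_codecs):
        text = text.encode(codec).decode("utf-8")
    return text


# The string starts with a single non-breaking space (U+00A0).
original = "\xa0the future of publishing at W3C"

# Assumed layers for this demo: one Latin-1 mistake, then one Windows-1252 one.
layers = ["latin-1", "cp1252"]

garbled = garble(original, layers)
print(ascii(garbled))                     # escaped view of the damaged text
print(ascii(ungarble(garbled, layers)))   # escaped view of the repaired text

# Reversing the same layers restores the original exactly.
assert ungarble(garbled, layers) == original
```

Two layers are already enough to turn the leading non-breaking space into four characters. The hard part, which this sketch skips entirely, is working out the sequence of codecs from the damaged text alone.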

3 comments

voltagex_ 4040 days ago

Hey, is there any way I could automate this kind of fix? It'd be awesome for web scraping.


rspeer 4040 days ago

Automating this fix is precisely what I'm showing off. And yes, it's damn useful for web scraping.

https://github.com/LuminosoInsight/python-ftfy


gamache 4040 days ago

Neato! I wrote a shitty version of 50% of that two years ago, when I was tasked with uncooking a bunch of data in a MySQL database as part of a larger migration to UTF-8. I hadn't done that much pencil-and-paper bit manipulation since I was 13.


haberman 4040 days ago

Awesome module! I wonder if anyone else had ever managed to reverse-engineer that tweet before.
