Hacker News
by SimonSapin 4040 days ago
This is actually where the name comes from; I found it too funny to pass up: https://simonsapin.github.io/wtf-8/#acknowledgments https://twitter.com/koalie/status/506821684687413248

Sorry for hijacking it!

>  the future of publishing at W3C

That is an amazing example.

It's not even "double UTF-8"; it's UTF-8 six times (including the one to get it on the Web). It's been decoded as Latin-1 twice and as Windows-1252 three times, and at the end there's a non-breaking space that's been converted to a space. All to represent what originated as a single non-breaking space anyway.

Which makes me happy that my module solves it.

    >>> from ftfy.fixes import fix_encoding_and_explain
    >>> fix_encoding_and_explain(" the future of publishing at W3C")
    ('\xa0the future of publishing at W3C',
     [('encode', 'sloppy-windows-1252', 0),
      ('transcode', 'restore_byte_a0', 2),
      ('decode', 'utf-8-variants', 0),
      ('encode', 'sloppy-windows-1252', 0),
      ('decode', 'utf-8', 0),
      ('encode', 'latin-1', 0),
      ('decode', 'utf-8', 0),
      ('encode', 'sloppy-windows-1252', 0),
      ('decode', 'utf-8', 0),
      ('encode', 'latin-1', 0),
      ('decode', 'utf-8', 0)])
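
For readers who want to see how layers like the ones in that explanation pile up, here is a minimal sketch using only Python's standard codecs. It is not ftfy's code: the helper names `mangle` and `unmangle` are made up for illustration, and it uses the strict `cp1252` codec rather than ftfy's `sloppy-windows-1252`, so it only works when every byte involved happens to be defined in Windows-1252.

```python
def mangle(text: str, wrong_codec: str) -> str:
    """Simulate one round of mojibake: UTF-8 bytes read back with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


def unmangle(text: str, wrong_codec: str) -> str:
    """Undo one round: recover the original bytes, then decode them as UTF-8."""
    return text.encode(wrong_codec).decode("utf-8")


original = "\u00a0"                    # a single non-breaking space
round1 = mangle(original, "latin-1")   # now 2 characters
round2 = mangle(round1, "cp1252")      # now 4 characters

# ascii() shows the escaped code points, so the junk is visible in any terminal.
print(ascii(round1), ascii(round2))

# Reversing the rounds in the opposite order gets the original back.
restored = unmangle(unmangle(round2, "cp1252"), "latin-1")
print(restored == original)
```

Every non-ASCII character becomes at least two characters per round, which is how one non-breaking space can grow into a long run of junk after a handful of rounds.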

Hey, is there any way I could automate this kind of fix? It'd be awesome for web scraping.

Automating this fix is precisely what I'm showing off. And yes, it's damn useful for web scraping.

https://github.com/LuminosoInsight/python-ftfy
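
To give a rough idea of what "automating" means here, the following is a deliberately naive sketch with the standard library only. It is an assumption-laden toy, not how ftfy works: `naive_fix` is a hypothetical name, it tries just two codecs, and it has none of the safeguards a real tool needs to avoid "fixing" text that was never broken.

```python
def naive_fix(text: str, max_rounds: int = 10) -> str:
    """Repeatedly undo 'UTF-8 decoded as Latin-1/Windows-1252' while it still works."""
    for _ in range(max_rounds):
        for codec in ("latin-1", "cp1252"):
            try:
                # Recover the bytes the text was wrongly decoded from,
                # then decode them as UTF-8.
                candidate = text.encode(codec).decode("utf-8")
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue  # this codec cannot explain the text; try the next one
            if candidate != text:
                text = candidate
                break  # one layer removed; start another round
        else:
            # Neither codec changed anything, so stop.
            return text
    return text


# "café" mangled once via Latin-1 and once via Windows-1252.
broken = "caf\u00c3\u0192\u00c2\u00a9"
print(naive_fix(broken))
```

Plain ASCII passes through unchanged, but short strings that legitimately contain characters such as "Ã©" would be mangled by this loop, which is the kind of false positive a dedicated library has to guard against.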

Neato! I wrote a shitty version of 50% of that two years ago, when I was tasked with uncooking a bunch of data in a MySQL database as part of a larger migration to UTF-8. I hadn't done that much pencil-and-paper bit manipulation since I was 13.

Awesome module! I wonder if anyone else had ever managed to reverse-engineer that tweet before.

The term "WTF-8" has been around for a long time. Here's an example from 2008:

http://www-uxsup.csx.cam.ac.uk/~fanf2/hermes/doc/qsmtp/draft...

I love this.

    The key words "WHAT", "DAMNIT", "GOOD GRIEF", "FOR HEAVEN'S SAKE",
    "RIDICULOUS", "BLOODY HELL", and "DIE IN A GREAT BIG CHEMICAL FIRE"
    in this memo are to be interpreted as described in [RFC2119].