by rspeer 3735 days ago
The fortunate thing is, almost all of the broken sequences are unambiguous enough to be signs that the text should be encoded back to bytes in the encoding it was mistakenly decoded with, and then re-decoded as UTF-8. (This doesn't hold for just any encoding mixup -- if you mix up Big5 with EUC-JP, you might as well throw out your text and start over -- but it works for UTF-8 and the most common other encodings because UTF-8 is well-designed.)
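To make the encode-then-re-decode idea concrete, here's a rough sketch using only the standard library. The function name `fix_mojibake` and the default of `cp1252` as the mistaken encoding are my own assumptions for illustration. It only undoes a single round of mis-decoding and is not how ftfy itself is implemented.

```python
def fix_mojibake(text, wrong_encoding="cp1252"):
    """Try to undo one round of UTF-8 bytes mis-decoded as `wrong_encoding`.

    Hypothetical helper for illustration only. If the round trip fails,
    the text is assumed to be fine already and is returned unchanged.
    """
    try:
        # Recover the original bytes, then decode them the way they
        # should have been decoded in the first place.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside `wrong_encoding`, or the
        # recovered bytes are not valid UTF-8: leave the text alone.
        return text


# Build a broken string the same way the bug usually happens.
broken = "caf\u00e9".encode("utf-8").decode("cp1252")
print(fix_mojibake(broken))
```

Correctly decoded text such as "café" passes through untouched, because its cp1252 bytes are not valid UTF-8 and the decode step raises. That strictness is what keeps the false positive rate low in this simple case.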

So if you want a Python library that can do this automatically with an extremely low rate of false positives: https://github.com/LuminosoInsight/python-ftfy