by rspeer
3735 days ago
The fortunate thing is that almost all of the broken sequences are unambiguous enough to be signs that the text should be encoded back to bytes in the mistaken encoding and then re-decoded as UTF-8. (This is not the case for every encoding mixup -- if you mix up Big5 with EUC-JP, you might as well throw out your text and start over -- but it works for UTF-8 and the most common other encodings because UTF-8 is well-designed.) So if you want a Python library that can do this automatically with an extremely low rate of false positives: https://github.com/LuminosoInsight/python-ftfy
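To make the encode-then-re-decode idea concrete, here is a minimal standard-library sketch. It is not ftfy's implementation or API: `fix_mojibake` is a hypothetical helper, and it assumes the text was UTF-8 that got misread as Windows-1252. A real library has to handle many more cases than this.

```python
def fix_mojibake(text, wrong_encoding="windows-1252"):
    """Hypothetical helper: undo UTF-8 bytes that were decoded as `wrong_encoding`.

    Assumes a single layer of mixup. Returns the text unchanged if the
    round trip fails, which is what usually happens for text that was
    never broken in the first place.
    """
    try:
        # Recover the original bytes, then decode them the way they
        # should have been decoded.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Simulate the mixup: correct UTF-8 bytes read with the wrong codec.
good = "caf\u00e9"
broken = good.encode("utf-8").decode("windows-1252")

print(repr(broken))
print(repr(fix_mojibake(broken)))
print(repr(fix_mojibake(good)))
```

The last call shows why false positives are rare in this direction: correctly decoded non-ASCII text almost never forms a valid UTF-8 byte sequence when re-encoded, so the decode step raises and the text is left alone.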