by MrBra
4564 days ago
Same here, I had the same issues when working on a simple IRC client. Basically what I did to make it work was: read the string with a given encoding (say UTF-8), then check whether the result was a valid string in that encoding (no invalid characters). If it was, that meant it had found the correct encoding; otherwise loop again and try another encoding (I used cp1252 as a second guess), until the resulting string had no invalid characters. It used to work pretty well and didn't crash anymore when facing "unexpected" characters... I am curious whether String#scrub is implemented in a similar way?
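For anyone who wants to try it, here is a rough sketch of that fallback loop in Ruby. The helper name `decode_with_fallback` and the candidate list are made up for illustration, and this is not how the original client was written, just one way the idea could look. Note that single-byte encodings such as Windows-1252 accept almost any byte sequence, so the last candidate acts more as a catch-all than as real detection.

```ruby
# Hypothetical helper: try each candidate encoding in order and return the
# text as UTF-8 from the first one that yields a valid string.
CANDIDATE_ENCODINGS = %w[UTF-8 Windows-1252].freeze # assumed guesses

def decode_with_fallback(bytes, candidates = CANDIDATE_ENCODINGS)
  candidates.each do |name|
    # Reinterpret the raw bytes under this encoding (no bytes are changed).
    attempt = bytes.dup.force_encoding(name)
    next unless attempt.valid_encoding?

    begin
      return attempt.encode("UTF-8")
    rescue Encoding::UndefinedConversionError
      # Byte is structurally valid but has no character assigned; try the next guess.
      next
    end
  end

  # Nothing fit: fall back to UTF-8 with invalid bytes replaced.
  bytes.dup.force_encoding("UTF-8").scrub
end

utf8_input   = "caf\xC3\xA9".b # "café" encoded as UTF-8
cp1252_input = "caf\xE9".b     # "café" encoded as Windows-1252

from_utf8   = decode_with_fallback(utf8_input)
from_cp1252 = decode_with_fallback(cp1252_input)
```

Both calls should come back as the same UTF-8 string; the second one only succeeds on the second guess.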
|
Rather, String#scrub simply replaces invalid bytes in the input, by default with the Unicode replacement character � (or with "?" if the string is not in a Unicode encoding).
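A small sketch of that behaviour, assuming Ruby 2.1 or later (where String#scrub is built in). The sample strings and variable names are only for illustration.

```ruby
# "\xFF" can never appear in well-formed UTF-8, so this string is invalid.
bad = "abc\xFFdef"
was_valid = bad.valid_encoding?

# Default: each invalid byte sequence becomes U+FFFD in a Unicode encoding.
default_fix = bad.scrub

# An explicit replacement string can be given; "" drops the bytes entirely.
question_fix = bad.scrub("?")
dropped      = bad.scrub("")

# A block receives the invalid bytes and returns their replacement.
hex_fix = bad.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }

# In a non-Unicode encoding the default replacement is "?".
ascii_fix = "abc\xFF".dup.force_encoding("US-ASCII").scrub
```

Nothing here tries to guess an encoding; the string keeps the encoding it already had and only the bad bytes are swapped out.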
This is, for instance, what many editors and other software I've used will do too -- if you say to open a file in encoding X, and some bytes in it are invalid for encoding X, they will be replaced with � or ? in display.
I find it a pretty useful thing in my own software, where input is _supposed_ to be in a given known encoding, but upstream providers sometimes provide data with corrupt bytes, errors, or sub-passages in the wrong encoding. It's not really my software's job to come up with the 'real' encoding -- and there may be no 'correct' encoding, since often the error is corrupt bytes or mixed encodings -- but it is my software's job to show what can be shown without raising.
I think I've seen other gems that try to use heuristics to guess or discover an appropriate encoding for text with no known encoding. But String#scrub is not that. Here are some gems that say they'll do that (I have no experience with any of them): https://github.com/brianmario/charlock_holmes ; https://github.com/jmhodges/rchardet ; https://github.com/janx/chardet2