by jrochkind1
4559 days ago
String#scrub does not try to identify a proper encoding from a mystery encoding. Rather, String#scrub simply replaces the bytes in the input that are invalid for the string's current encoding, by default with the Unicode replacement char � (or simply "?" if not in a Unicode encoding). This is, for instance, what many editors and other software I've used will do too -- if you say to open a file in encoding X, and some bytes in it are invalid for encoding X, they will be replaced with � or ? in display. I find it a pretty useful thing in my own software, where input is _supposed_ to be in a given known encoding, but upstream providers sometimes provide data with corrupt bytes, errors, or sub-passages in the wrong encoding. It's not really my software's job to come up with the 'real' encoding -- and there may be no 'correct' encoding, since often the error is corrupt bytes or mixed encodings -- but it is my software's job to show what can be shown without raising. I think I've seen other gems that try to use heuristics to guess or discover an appropriate encoding for text with no known encoding. But String#scrub is not that. Here are some gems that say they'll do that (I have no experience with any of them): https://github.com/brianmario/charlock_holmes ; https://github.com/jmhodges/rchardet ; https://github.com/janx/chardet2
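To make the behavior described above concrete, here is a minimal sketch, assuming Ruby 2.4 or later (String#scrub itself arrived in 2.1; `unpack1` needs 2.4). The sample bytes and variable names are made up for illustration; only `scrub`, `valid_encoding?`, `force_encoding` and `unpack1` are real core methods.

```ruby
# Input that is *supposed* to be UTF-8 but carries one stray byte (0xFF),
# which is never valid in UTF-8. The sample string is hypothetical.
raw = "abc\xFFdef".dup.force_encoding(Encoding::UTF_8)

raw.valid_encoding?            # false, so many string operations could raise

# Default: each invalid byte sequence becomes U+FFFD in a Unicode encoding.
cleaned = raw.scrub

# An explicit replacement string, or an empty one to drop the bad bytes.
marked   = raw.scrub("?")
stripped = raw.scrub("")

# Block form: receives the invalid bytes, returns what to put in their place.
hexed = raw.scrub { |bytes| "<#{bytes.unpack1('H*')}>" }

# In a non-Unicode encoding the default replacement is "?".
ascii_cleaned = "abc\xFFdef".dup.force_encoding(Encoding::US_ASCII).scrub

puts cleaned, marked, stripped, hexed, ascii_cleaned
```

Note that nothing here guesses at an encoding: the string keeps whatever encoding it was tagged with, and only the bytes that are invalid for that encoding are touched.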