Hacker News
by jrochkind1 4559 days ago
String#scrub does not try to identify a proper encoding from a mystery encoding.

Rather, String#scrub simply removes invalid bytes from the input, by default replacing them with the Unicode replacement character � (or simply "?" if the string is not in a Unicode encoding).
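
To make that concrete, here is a minimal sketch, assuming Ruby 2.1 or later (where String#scrub is built in). The sample bytes and variable names are made up for illustration:

```ruby
# Assumes Ruby 2.1+. The sample bytes are invented for this example.
bad = "abc\xFFdef".dup.force_encoding(Encoding::UTF_8)  # 0xFF is never valid in UTF-8

is_valid = bad.valid_encoding?   # the string is tagged UTF-8 but holds a bad byte

default_scrub = bad.scrub        # invalid byte replaced with U+FFFD
custom_scrub  = bad.scrub("?")   # explicit replacement string
dropped       = bad.scrub("")    # empty replacement drops the byte entirely

# Block form: receives the invalid bytes and returns the replacement text.
hex_marked = bad.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }

# In a non-Unicode encoding the default replacement is "?".
ascii_scrub = "abc\xFF".dup.force_encoding(Encoding::US_ASCII).scrub
```

Note that scrub returns a new string and leaves the original untouched; scrub! is the in-place variant.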

This is, for instance, what many editors and other software I've used will do too -- if you say to open a file in encoding X, and some bytes in it are invalid for encoding X, they will be replaced with � or ? in display.

I find it pretty useful in my own software, where input is _supposed_ to be in a given known encoding, but upstream providers sometimes provide data with corrupt bytes, errors, or sub-passages in the wrong encoding. It's not really my software's job to come up with the 'real' encoding -- and there may be no 'correct' encoding, since often the error is corrupt bytes or mixed encodings -- but it is my software's job to show what can be shown without raising.
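
Roughly the kind of thing I mean, as a sketch only: displayable is a hypothetical helper name, the sample bytes are invented, and I'm assuming the expected encoding is UTF-8. Without scrubbing, something as ordinary as a regex match raises on the bad input; after scrubbing, the rest of the text is still usable.

```ruby
# Hypothetical helper: the name and signature are illustrative only.
# Tags the raw bytes with the encoding they are supposed to be in,
# and scrubs only if that turns out not to be valid.
def displayable(raw_bytes, expected = Encoding::UTF_8)
  text = raw_bytes.dup.force_encoding(expected)
  text.valid_encoding? ? text : text.scrub
end

# A stray Latin-1 byte (0xE9) inside data that is supposed to be UTF-8.
upstream = "caf\xE9 au lait".b

# Without scrubbing, a plain regex match raises ArgumentError
# ("invalid byte sequence in UTF-8").
error = begin
  upstream.dup.force_encoding(Encoding::UTF_8) =~ /lait/
  nil
rescue ArgumentError => e
  e
end

# After scrubbing, the bad byte shows as U+FFFD and the match works.
shown   = displayable(upstream)
matched = shown =~ /lait/
```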

I think I've seen other gems that try to use heuristics to guess or discover an appropriate encoding for text with no known encoding. But String#scrub is not that. Here are some gems that say they'll do that (I have no experience with any of them): https://github.com/brianmario/charlock_holmes ; https://github.com/jmhodges/rchardet ; https://github.com/janx/chardet2
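
For contrast with scrub, here is a deliberately naive stdlib-only sketch of what "guessing" means. This is not the API of any of the gems above, and first_valid_encoding is a hypothetical name. It just returns the first candidate encoding in which the bytes happen to be valid:

```ruby
# Hypothetical helper, NOT the API of charlock_holmes, rchardet, or chardet2.
# Returns the first candidate encoding in which the bytes are valid, or nil.
def first_valid_encoding(bytes, candidates = [Encoding::UTF_8, Encoding::ISO_8859_1])
  candidates.find { |enc| bytes.dup.force_encoding(enc).valid_encoding? }
end

utf8_guess   = first_valid_encoding("caf\xC3\xA9".b)  # well-formed UTF-8 bytes
latin1_guess = first_valid_encoding("caf\xE9".b)      # truncated UTF-8, falls through
```

Every byte sequence is valid ISO-8859-1, so a validity check alone can only say "plausibly UTF-8 or not". That limitation is presumably why dedicated detectors rely on statistical heuristics instead.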

1 comment

Thanks for the clarification on #scrub. With regard to guessing a given encoding, I remember trying some (or probably all) of those gems, or the ones that were around at the time I was writing the app, but for one reason or another (I think some had not been updated recently) I couldn't get them working, so I came up with my own little solution. Anyway, thanks for taking the time to list them.