I am pleased someone else is excited about String#scrub.
Oh wow, I actually ran into this recently, but I couldn't figure out what to search for to fix it. Nice to see that it's built in now.
(It was for a tiny weekend project where I was reading logs from the ZNC IRC bouncer to present a nice web UI for them. For some reason there were sometimes invalid characters, and I didn't really understand why. It's possible that I was just reading them with the wrong input encoding, but I think I tried a few different ones.)
Same here, I had the same issues when working on a simple IRC client. Basically, what I did to make it work was: read the string with a given encoding (say UTF-8), then check whether the resulting string was valid in that encoding (no invalid characters). If it was, that meant I had found the correct encoding; otherwise I looped and tried another encoding (I used cp1252 as a second guess) until the resulting string had no invalid characters. It worked pretty well and didn't crash anymore when facing "unexpected" characters... I am curious whether String#scrub is implemented in a similar way?
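For anyone curious, the loop described above could be sketched roughly like this. The helper name `guess_encoding` and the constant `CANDIDATE_ENCODINGS` are made up for illustration; this is not the code from the IRC client, just one way to write "try encodings until one is valid" using `String#force_encoding` and `String#valid_encoding?`.

```ruby
# Hypothetical candidate list: UTF-8 first, cp1252 as the second guess.
CANDIDATE_ENCODINGS = [Encoding::UTF_8, Encoding::Windows_1252].freeze

# Hypothetical helper: returns a copy of the bytes tagged with the first
# candidate encoding under which they are valid, or nil if none fits.
def guess_encoding(raw_bytes, candidates = CANDIDATE_ENCODINGS)
  candidates.each do |encoding|
    # force_encoding only relabels the bytes; it does not transcode them.
    attempt = raw_bytes.dup.force_encoding(encoding)
    return attempt if attempt.valid_encoding?
  end
  nil
end

line = guess_encoding("caf\xE9".b) # 0xE9 alone is not valid UTF-8
line.encoding                      # => #<Encoding:Windows-1252>
```

One caveat with this approach: a single-byte encoding such as cp1252 accepts most byte sequences, so "valid" here only means "not rejected", not "correctly identified".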
String#scrub does not try to identify a proper encoding from a mystery encoding.
Rather, String#scrub simply replaces invalid bytes in the input, by default with the Unicode replacement char � (or simply "?" if not in a Unicode encoding).
This is, for instance, what many editors and other software I've used will do too -- if you say to open a file in encoding X, and some bytes in it are invalid for encoding X, they will be replaced with � or ? in display.
I find it a pretty useful thing in my own software, where input is _supposed_ to be a given known encoding, but upstream providers sometimes provide data with corrupt bytes, errors, or sub-passages in wrong encoding. It's not really my software's job to come up with the 'real' encoding -- and there may be no 'correct' encoding, often the error is corrupt bytes or mixed encodings -- but it is my software's job to show what can be shown without raising.
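A quick illustration of that behavior, assuming Ruby 2.1 or later where String#scrub is built in. The sample string is made up; `\xFF` is used because it is never a valid byte in UTF-8.

```ruby
# Requires Ruby >= 2.1 (String#scrub is built in there).
dirty = "abc\xFFdef"   # UTF-8 string containing one invalid byte

dirty.valid_encoding?  # => false
dirty.scrub            # => "abc\uFFFDdef" (default: U+FFFD replacement char)
dirty.scrub("")        # => "abcdef"       (custom replacement: drop the byte)

# A block receives the invalid bytes and returns their replacement.
dirty.scrub { |bytes| "<#{bytes.unpack('H*').first}>" } # => "abc<ff>def"

# In a non-Unicode encoding the default replacement is "?".
ascii = "abc\xFFdef".dup.force_encoding(Encoding::US_ASCII)
ascii.scrub            # => "abc?def"
```

Note that nothing here tries to guess what the bad byte "should" have been; it is only made safe to display.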
Thanks for the clarification on #scrub. With regard to guessing a given encoding, I remember trying some (or probably all) of those gems, or at least those that were around at the time I was writing the app, but for one reason or another (I think some had not been updated recently) I couldn't get them working, so I came up with my own little solution.
Anyway, thanks for taking the time to list them.
Awesome, thanks. Looks like that backport may be only for ruby 2.0 (not 1.9), and it is a compiled C extension.
It won't take many lines of pure ruby code to do it for ruby 1.9 too, although presumably not performing quite as well as a C version.
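To give an idea of how few lines that could be, here is a minimal sketch of a pure-ruby fallback. The method name `scrub_fallback` is hypothetical and this is not the code of any published gem; it assumes a UTF-8 (or other Unicode) string and relies on `each_char` yielding invalid bytes as separate "characters" that fail `valid_encoding?`.

```ruby
# Hypothetical pure-ruby stand-in for String#scrub, assuming UTF-8 input.
# Walks the string character by character and swaps anything that is not
# validly encoded for the replacement string.
def scrub_fallback(text, replacement = "\uFFFD")
  # Fast path: nothing to fix, return an untouched copy.
  return text.dup if text.valid_encoding?

  text.each_char.map { |char| char.valid_encoding? ? char : replacement }.join
end

scrub_fallback("abc\xFFdef")     # => "abc\uFFFDdef"
scrub_fallback("abc\xFFdef", "") # => "abcdef"
```

This may not match the built-in method in every edge case (for example, how many replacement chars are emitted for a truncated multi-byte sequence), and it will be slower than a C version.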
At any rate, this is definitely something I and people I know need to do all the time, although apparently most ruby devs never need to do it; but I'm glad it's finally made it into stdlib.
I've completed a pure-ruby polyfill that should work on 1.9 as well as 2.0, on any ruby interpreter including jruby. (It does have some issues, which are mentioned in the readme.)
For those who need #scrub behavior in ruby 1.9 or 2.0: I wrote a gem a while ago to do this, before I was aware of the upcoming String#scrub API. Maybe I'll change it now to provide a String#scrub 'backfill', perhaps even with monkey patching?
https://github.com/jrochkind/ensure_valid_encoding