| HN Mirror


by mikkelewis 4559 days ago

Other than the new generational GC, I'm most excited about String#scrub and String#freeze.

String#scrub: https://github.com/ruby/ruby/blob/1e8a05c1dfee94db9b6b825097...

String#freeze example:

https://github.com/rails/rails/pull/12879

1 comments

jrochkind1 4559 days ago

I am pleased someone else is excited about String#scrub.

For those who need #scrub behavior in ruby 1.9 or 2.0, I wrote a gem a while ago to do this -- I wrote it before I was aware of the upcoming String#scrub api -- I'll maybe change it to provide a String#scrub 'backfill' now, with monkey patching even?

https://github.com/jrochkind/ensure_valid_encoding


ibrahima 4559 days ago

Oh wow, I actually ran into this recently, I couldn't figure out what to search for to fix it though. Nice to see that it's built in now.

(It was for a tiny weekend project where I was reading logs from the ZNC IRC bouncer to present a nice web UI for them. For some reason there were sometimes invalid characters and I didn't really understand why. It's possible that I was just reading them with the wrong input encoding, but I think I tried a few different ones.)


MrBra 4558 days ago

Same here, I had the same issues when working on a simple IRC client. Basically what I did to make it work was: read the string with a given encoding (say UTF-8), then check whether the result was a valid string in that encoding (no invalid characters). If it was, that meant I had found the correct encoding; otherwise I looped again and tried another encoding (I used cp1252 as a second guess), until the resulting string had no invalid characters. It used to work pretty well and didn't crash anymore when facing "unexpected" characters... I'm curious whether String#scrub is implemented in a similar way?
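The try-each-encoding loop described above could be sketched roughly as follows. The method name `first_valid_encoding` and the default candidate list are hypothetical, chosen only for illustration; this is not the commenter's actual code.

```ruby
# Hypothetical helper: tag the raw bytes with each candidate encoding in turn
# and return the first interpretation that contains no invalid byte sequences.
# Returns nil if no candidate fits.
def first_valid_encoding(bytes, candidates = [Encoding::UTF_8, Encoding::Windows_1252])
  candidates.each do |enc|
    # force_encoding only relabels the bytes; it does not convert them.
    attempt = bytes.dup.force_encoding(enc)
    return attempt if attempt.valid_encoding?
  end
  nil
end

raw     = "caf\xE9".b                  # 0xE9 alone is not valid UTF-8
guessed = first_valid_encoding(raw)    # falls through to Windows-1252
as_utf8 = guessed.encode("UTF-8")      # transcode once an encoding is chosen
```

One caveat with this approach: as far as I know, Ruby accepts every byte as valid in single-byte encodings such as Windows-1252, so a single-byte candidate acts as a catch-all and should come last in the list.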


jrochkind1 4558 days ago

String#scrub does not try to identify a proper encoding from a mystery encoding.

Rather, String#scrub simply removes invalid bytes from the input, by default replacing them with the Unicode replacement char � (or simply "?" if not in a Unicode encoding).
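A short illustration of that behavior, assuming Ruby 2.1 or later where String#scrub is built in. The sample strings and variable names are made up for the example.

```ruby
bad = "abc\xFFdef"          # the byte 0xFF never occurs in valid UTF-8

default = bad.scrub         # invalid byte replaced with U+FFFD
custom  = bad.scrub("?")    # an explicit replacement string
# With a block, the invalid bytes are yielded and the block's result is used.
hex     = bad.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }

# In a non-Unicode encoding the default replacement is "?".
ascii = "abc\xFF".dup.force_encoding("US-ASCII").scrub
```

Note that scrub returns a new string and leaves the original untouched; String#scrub! modifies the receiver in place.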

This is, for instance, what many editors and other software I've used will do too -- if you say to open a file in encoding X, and some bytes in it are invalid for encoding X, they will be replaced with � or ? in display.

I find it a pretty useful thing in my own software, where input is _supposed_ to be a given known encoding, but upstream providers sometimes provide data with corrupt bytes, errors, or sub-passages in wrong encoding. It's not really my software's job to come up with the 'real' encoding -- and there may be no 'correct' encoding, often the error is corrupt bytes or mixed encodings -- but it is my software's job to show what can be shown without raising.
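To make the "without raising" point concrete, here is a minimal sketch. The log line and the `action=` field are invented for the example; the behavior shown is that matching a regex against a string with invalid bytes raises ArgumentError, while the scrubbed copy can be processed normally.

```ruby
# Hypothetical upstream data that is supposed to be UTF-8 but has a corrupt byte.
raw = "user=jo\xFFe action=login"

raised = false
begin
  raw =~ /action=(\w+)/       # regex matching refuses invalid byte sequences
rescue ArgumentError
  raised = true
end

clean  = raw.scrub                   # corrupt byte becomes U+FFFD
action = clean[/action=(\w+)/, 1]    # now safe to match against
```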

I think I've seen other gems that try to use heuristics to guess or discover an appropriate encoding for text with no known encoding. But String#scrub is not that. Here are some gems that say they'll do that (I have no experience with any of them): https://github.com/brianmario/charlock_holmes ; https://github.com/jmhodges/rchardet ; https://github.com/janx/chardet2


MrBra 4549 days ago

Thanks for the clarification on #scrub. With regard to guessing a given encoding, I remember trying some (or probably all) of those gems, or those that were around at the time I was writing the app, but for one reason or another (I think some had not been updated recently) I couldn't get them working, so I came up with my own little solution. Anyway, thanks for taking the time to list them.


shanemhansen 4558 days ago

IRC has a weird encoding by default. I think it's called Latin1/irc hybrid.


radq 4559 days ago

Looks like there is already a gem backporting String#scrub: https://github.com/hsbt/string-scrub/

It was mentioned in the changelog.


jrochkind1 4559 days ago

Awesome, thanks. Looks like that backport may be only for ruby 2.0 (not 1.9), and is a compiled C extension.

It won't take many lines of pure ruby code to do it for ruby 1.9 too, although presumably not performing quite as well as a C version.
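As a rough idea of how few lines a pure-Ruby version might need, here is a minimal sketch. The method name `scrub_sketch` is hypothetical, and this is not the code of any of the gems mentioned in this thread. It relies on each_char yielding invalid bytes as separate one-byte strings that fail valid_encoding?.

```ruby
# Hypothetical pure-Ruby approximation of String#scrub for UTF-8 input.
def scrub_sketch(str, replacement = "\uFFFD")
  str.each_char.map { |ch| ch.valid_encoding? ? ch : replacement }.join
end

cleaned = scrub_sketch("abc\xFFdef")
```

This replaces each invalid byte individually, so its output can differ from the built-in String#scrub when several invalid bytes appear in a row, and it is likely much slower than a C implementation.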

At any rate, this is definitely something I and people I know need to do all the time, although apparently most ruby devs never need to do it; but I'm glad it's finally made it into stdlib.


jrochkind1 4558 days ago

That backport works on MRI 2.0.

I've completed a pure-ruby polyfill that should work on 1.9 as well as 2.0, any ruby interpreter including jruby. (It does have some issues mentioned in the readme).

https://github.com/jrochkind/scrub_rb
