
by berdario 4195 days ago

(I'll try to keep this short, since I feel this is quite offtopic, if we want to discuss this further I suppose we could find a better venue... maybe even email?)

I assume that with "Ruby 1.9 solution", you refer to the fact that Ruby source code is by default evaluated as UTF-8, right?

That's definitely a good thing, but with Python3 that wasn't the only change brought into the language.

I said "if Ruby ever decides to fix", because the need for a change is not obvious and not universally accepted: it's basically the same issue as automatic type coercion (aka weak/strong typing) and early (or late) raising/handling of exceptions.

Basically: In Python2 and Ruby you have one or two String types (runtime types, in this discussion I only care about them), with the Ruby Strings tagged with the encoding internally used. In python the types are just "anything goes" (binary strings, the old Python2 string) and unicode (the actual internal encoding is an implementation detail).

The problem (if you agree that it is one) is that you can easily mix-and-match them, and everything will work fine only as long as the operation makes sense. When it won't anymore you'll get an exception.

This is a problem when you don't completely control the type/encoding of your input (e.g. if you have an HTTP request and your string depends on the type/charset specified in the Content-Type).

A dumb example of what could happen:

    a = "Ё".force_encoding "ISO-8859-5"
    a2 = "®".force_encoding "ISO-8859-1"
    # a + a2 will fail with Encoding::CompatibilityError

A similar thing can happen in Python2, while Python3 will reject the same operation as soon as the types come in contact with each other (still at runtime, but it'll be like doing `1+"1"` in Python or Ruby: you'll spot it right away).

I wrote a quite lengthy blog post about this change in Python3, but I haven't translated it into English yet; if there's some interest I could try to do it a bit sooner.

Anyhow, I don't want to start a flame war or anything like that. I just wanted to explain why the Python3 choice was made, and why a destructive change might have had its merits. While I prefer Python3's approach, and I'm definitely not a Ruby developer, I still appreciate these updates to Ruby: for example, I actually touched first hand the internal encoding-handling code of Ruby (Rubinius) some time ago with a friend of mine: http://git.io/7kM4Gw and I can benefit from the new GC code in new rubies, which makes metasploit 4 times faster to load.

2 comments

wycats 4195 days ago

The primary difference between Ruby and Python, from where I'm sitting, is that Ruby's change was purely at the API level, so it was feature-detectable and shimmable by libraries like Rails. I talk elsewhere in this thread about the effort we had to make, but the point is, we could do it.

In contrast, Python 3 changed the meaning of "foo". It also supported only u"foo" in Python 2 (to opt-in to unicode strings) and only b"foo" in Python 3 (to opt-in to byte-strings) for a fairly long period of time, making it extremely, extremely awkward (at best) to write a program with shims as abstractions that let most of the program remain oblivious to the differences.

Python 3.3 and 2.7 finally landed a lot of fixes to this kind of problem, but they landed fairly late, after most of the community had already got a sense of the relative difficulty of a transition to Python 3 that maintained support for Python 2 at the same time.

Both Ruby and JavaScript have taught me the value of a transition path to a new version that allows people to write libraries that support both the old and new version at the same time. Communities move a little at a time, especially long-term production projects. The best way to move them is via libraries that can serve as a bridge and target both the old and new version together.

jrochkind1 4195 days ago

Actually, the "ruby 1.9" solution is having Strings tagged with encoding at all -- prior to ruby 1.9 they were not, they were just bytes.

This was a pretty major change, I think I'd call it a 'destructive' change, it was indeed a big pain upgrading apps from ruby 1.8 to 1.9, and character encoding was the major issue generally.

I'm not sure I understand what you're saying about python 2 vs 3, or what you think needs to be changed in ruby. If I understand right, you're saying that it ought to be guaranteed to raise if you try to concatenate strings with different encoding.

Instead, at present, for encodings that are ascii-compatible (which is most encodings), ruby will let you concatenate if both strings (or just the argument and not necessarily the receiver? I forget) are entirely composed of ascii-compatible chars, otherwise it will raise.
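
A minimal sketch of the rule described above, as observed on current MRI (the helper name `concat_outcome` and the sample strings are made up for illustration). It suggests that concatenation is allowed when either side is ASCII-only, and refused only when both sides carry non-ASCII bytes in different encodings:

```ruby
# Report the encoding of a + b, or the exception class if Ruby refuses to join them.
# (concat_outcome is a hypothetical helper, not part of Ruby.)
def concat_outcome(a, b)
  (a + b).encoding.name
rescue Encoding::CompatibilityError => e
  e.class.name
end

latin1_ascii  = "cafe".encode("ISO-8859-1")       # ASCII-only bytes, tagged ISO-8859-1
latin1_accent = "caf\u00E9".encode("ISO-8859-1")  # contains the non-ASCII byte 0xE9
utf8_accent   = "na\u00EFve"                      # UTF-8 with a non-ASCII character

puts concat_outcome(latin1_ascii, utf8_accent)   # ASCII-only receiver: allowed
puts concat_outcome(utf8_accent, latin1_ascii)   # ASCII-only argument: allowed
puts concat_outcome(latin1_accent, utf8_accent)  # both non-ASCII: refused
```

The same code path therefore succeeds or raises depending purely on the input data, which is the unpredictability being discussed.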

I think you're probably (although I'm not 100% confident) right that it would be better to 'fail fast' and always raise, requiring explicit treatment of encodings, instead of depending on the nature of the arguments (which may have come from I/O), which makes bugs less predictable. There continues to be a lot of confusion about how char encodings work in rubyland, and it's possible a simpler model would be less confusing (although I suspect char encoding issues are confusing to some extent no matter what, by their nature).

In general, even as it is, I find dealing with char encodings more sane in ruby (1.9+) than any other language I've worked in (but I haven't worked in python).

If ruby ever decides to make things even more strict, I don't think it'll actually be as disruptive as the 1.8 to 1.9 transition. Anyone who ever deals with not-entirely-ascii text (and who doesn't?) basically already had to deal with the issue. Ruby was trying to make the transition easier on the developer by allowing some circumstances where it would let you get away with being sloppy with encodings -- I'm not sure if it succeeded in making it any easier, the transition was pretty challenging anyway, and "fail fast" might actually have been easier, I think I agree if that's what you're saying.

I don't know enough about python to have an answer, but I continue to be curious about what differences resulted in the entire ruby community pretty much coming along on the ruby 1.8 to 1.9 jump (and subsequent less disruptive jumps), while the python community seems to have had more of a disjoint. I don't know if it was helped by ruby's attempt to make the encoding switch less painful with its current behavior. Or if it's as simple as the 100-ton gorilla of Rails being able to make the community follow in ruby-land.

MichaelGG 4195 days ago

Excuse my density here, but why not just force all strings to be UTF8 and call it a day? Anything in another encoding would need to get converted. What am I missing that Ruby and Python need these complicated problems but other platforms don't have these issues?

Xylakant 4195 days ago

Han unification is the primary reason not to force utf8, especially in a language that has strong roots in Japan. (Sorry to be short, I'm on a smartphone. Googling should provide sufficient answers)

wycats 4195 days ago

Han unification is one problem; another problem is that not all encodings can be round-tripped losslessly through Unicode. Shift-JIS, for example, has multiple separate characters that convert into the same character in Unicode, and therefore cannot be converted back into their original form reliably.

wycats 4195 days ago

Citation: http://support.microsoft.com/kb/170559

MichaelGG 4195 days ago

The shift JIS issue seems to be a fault in the design of shift JIS, resulting in even symbols like square root not having a canonical encoding. At what point do you just draw the line and tell developers that they need to deal with such things themselves? No one is taking away byte arrays. Fragmenting the userbase seems suboptimal.

MichaelGG 4195 days ago

Wow. That's pretty ugly, thanks for the info. But for people not wanting to use Unicode... Does that not mean they simply cannot use strings in Java, .Net, Windows (to some extent), etc.? It just seems sorta not feasible at this point to not use Unicode. And according to Wikipedia, Unicode now has a way to select which language variant of a unified character. So is unification not as big a problem if people use selectors?

And what's the practical alternative? Keeping things in country specific encodings?

berdario 4195 days ago

> Actually, the "ruby 1.9" solution is having Strings tagged with encoding at all -- prior to ruby 1.9 they were not, they were just bytes.

Whoops, you're right... I confused the version, what I had in mind is the "source code as UTF-8 by default", which wasn't introduced in Ruby1.9, but in Ruby2.0

> If ruby ever decides to make things even more strict, I don't think it'll actually be as disruptive as the 1.8 to 1.9 transition.

Admittedly, I almost never touched ruby1.8, so I've no idea how hard the transition from ruby1.8 actually was.

I'm under the impression that before ruby1.9, Ruby was simply encoding-oblivious, and for any encoding-sensitive piece of code, people simply relied on things like libuconv. Am I mistaken?

If that's the case, the change from 1.8 to 1.9 was painful for sure, but it was more the case of actually caring about encoding for the very first time in a codebase.

This is quite recent (and it deals with Jruby, which is different underneath): http://blog.rayapps.com/2013/03/11/7-things-that-can-go-wron...

but from reading this blog post, I'm under the impression that most of the breakage that you'd get with the move to Ruby1.9 wouldn't be in exceptions, but in string corruption.

Migrating to a fail-fast approach (like Python3), imho makes things more difficult ecosystem-wise, because you'll get plenty of exceptions even just when importing the library when first trying to use/update it.

With the Ruby1.9 upgrade, you could've used a library even if it was not 100% compatible and correctly working with Ruby1.9, I'd assume. This could let people gradually migrate and port their code, while reporting issues about corruption and fixing them as they appear.

Instead, if you're the author of a big Python2 library that relies on the encoding, maybe you won't prioritize the porting work, because you realize how much work it is, and the fact that unless you've actually correctly migrated 100% of the codebase, your users won't benefit from it (and so you have less of an incentive to start porting a couple of classes/modules/packages).

That'd be compounded by the fact that, in Python2 like in Ruby, you actually already have your libraries and your codebase working in an internationalized environment... things might get more robust, but in the meantime everything will break, and the benefit is neither immediately available nor obvious.

The last straw is then obviously the community and memes: I don't believe that Python developers are more conservative (the ones that use virtualenv at least, and it's most of them in the web development industry I'd assume... things might be different in the scientific computing, ops, desktop guis, pentest, etc. industries), or that they intrinsically prefer stabler things. Not more than Ruby developers at least.

But for sure, memes like "Python2 and Python3 are two different languages" can demoralize and stifle initiatives to port libraries. And also some mistakes happened without any doubt (mistakes that embittered part of the community), but they were recognized only in hindsight: I'm talking about not keeping the u'' literal (which has been reintroduced in Python3.3) and proposing 2to3 as a tool to be used at build/installation time, instead of only as a helper during migration to a single Python2/3 codebase.

> If I understand right, you're saying that it ought to be guaranteed to raise if you try to concatenate strings with different encoding.

Let's say that while I'd prefer if Ruby behaved like this, I'm not advocating at all for such a change, due to all the problems I just mentioned, and the fact that I wouldn't want any such responsibility :)

jrochkind1 4195 days ago

> I'm under the impression that before ruby1.9, Ruby was simply encoding-oblivious, and for any encoding-sensitive piece of code, people simply relied on things like libuconv.

True.

> but from reading this blog post, I'm under the impression that most of the breakage that you'd get with the move to Ruby1.9 wouldn't be in exceptions, but in string corruption.

Eh... I don't know. In my experience, the encoding-related problems arising in the 1.9 move indeed generally arose as exceptions raised -- but because of ruby's attempt to let you get away with mixed encodings when they are both ascii compatible, you could _sometimes_ get those exceptions only on _certain input_, which could definitely make it terrible.

I am trying to think of any cases where you'd get corrupt bytes... the only ones I can think of are where you tried to deal with the transition without really understanding what was going on, by blindly calling `force_encoding` on Strings, when you were forcing them to a different encoding than they really were. You'd have to explicitly take (wrong) action to get corrupted bytes, you wouldn't get them on an upgrade otherwise -- you'd get raises, or you'd get working okay (if you stuck to ascii-compat bytes only).
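
To illustrate how a blind `force_encoding` can corrupt text without raising, here is a small sketch (the helper names `mislabel_then_transcode` and `label_correctly` are hypothetical). `force_encoding` only changes the tag on the bytes, so a wrong tag followed by a transcode quietly produces different characters:

```ruby
# Relabel the bytes as ISO-8859-1 (changes the tag only), then transcode to UTF-8.
# This is the "wrong action": the bytes are really UTF-8.
def mislabel_then_transcode(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Tag the bytes with the encoding they are really in.
def label_correctly(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8)
end

raw = "caf\xC3\xA9".b  # the UTF-8 bytes of "caf" + U+00E9, as an untagged binary string

puts label_correctly(raw)                         # the intended text
puts mislabel_then_transcode(raw)                 # mojibake: "caf" + U+00C3 + U+00A9
puts mislabel_then_transcode(raw).valid_encoding? # still "valid", so nothing raises
```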

Of course, one of your dependencies might be doing the wrong thing too, and infect your code with strings it returned to you -- it wouldn't have to be _you_ that did the wrong thing.

YAML serialization/de-serialization is sort of a special case, made worse by the fact that there was a transition between YAML engines in the stdlib too at that point, and that _neither_ really dealt with encodings properly, and they both did it differently! (Really, the whole yaml ecosystem, which is popular in rubyland, wasn't designed thinking properly about encodings).

Encoding of course can be tricky and confusing no matter what -- if you actually don't know what encoding your input is in, you can get corrupted bytes and/or exceptions. That's kind of an inherent part of dealing with encoding though. Once ruby 1.9 arrived, you couldn't get away with not understanding encoding anymore. I think there wasn't quite enough education and proper tools when ruby 1.9 came out (and still), perhaps the Japanese/English language barrier (and context difference! Japanese coders have different sorts of common issues with encoding) was part of that. String#scrub (replace invalid bytes with replacement chars) wasn't in the stdlib until very recently, and it was hard for me to get anyone to understand this was a problem when I needed it!
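
For reference, a short sketch of what `String#scrub` (available since Ruby 2.1) does with a string whose bytes are not valid for its tagged encoding. The `broken_utf8` helper is made up for the example:

```ruby
# Build a UTF-8-tagged string containing one byte (0xFF) that is never valid in UTF-8.
def broken_utf8
  "abc\xFFdef".b.force_encoding(Encoding::UTF_8)
end

s = broken_utf8
puts s.valid_encoding?        # false
puts s.scrub.valid_encoding?  # true: the bad byte became U+FFFD
puts s.scrub("?")             # abc?def
```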

> With the Ruby1.9 upgrade, you could've used a library even if it was not 100% compatible and correctly working with Ruby1.9, I'd assume. This could let people gradually migrate and port their code, while reporting issues about corruption and fixing them as they appear.

Yes, that was sometimes (but not always) true. I'm not sure how much the encoding-related stuff contributed to that. On the other hand, in general, they were trying to mostly keep ruby 1.9 backwards compatible with ruby 1.8 (perhaps unlike Python 2/3). And in fact, the main reason this wouldn't be true, and code written for 1.8 wouldn't work on 1.9 -- was encoding.

So actually, the fact that they, in some cases (where all strings involved were strictly ascii), allowed you to ignore encoding problems -- might have actually been part of the success. Even though in other ways it actually makes encoding a lot harder to deal with -- I think I'd agree with you that I'd prefer fail-fast, in the end, and not the current thing it does where, only in cases where all strings involved are pure ascii, it lets you get away with it.

But in the end, since the 1.8->1.9 transition was so successful, I guess we've got to say whatever they did was the right (or at least "a right") move.

I think switching to eliminate the "if all strings have exclusively ascii-compat chars" exception would actually be less disruptive at this point. But I could be wrong. And people were so burned by how difficult the 1.8->1.9 upgrade could be sometimes (largely because of encoding), there might be reluctance to touch it again any time close to soon.

It was _not_ an easy upgrade, although it may have been easier than python2->3, and it was possible to write libraries that would work in both (sometimes with special conditionals checking for ruby version -- especially around encoding!). I think the fact that Rails supported 1.9 very quickly (and then _stopped_ supporting 1.8 after that) is also huge, since Rails has a sort of unique place in ruby that even Django doesn't have in python. I also think you are right that the ruby community is less change-averse than the python community (for better _and_ worse -- the ruby 1.9 and rails 3 transition was the beginning, for me, of starting to kind of hate how much work I had to do in rubyland just to keep everything working with supported versions of language and dependencies).

There's actually way more we can say about this, but this is a huge book already, haha. One difference in encoding between python and ruby I think is, in ruby 1.9+, if a string is tagged with an encoding but contains bytes invalid for that encoding (that do not represent a legal sequence of chars), you'll get an exception if you try to concat it to anything else -- even a string of the same encoding. I don't _think_ that happens in python? Ruby also doesn't have a canonical internal encoding, strings can be in _any_ encoding it recognizes, tagged with that encoding and containing internal bytes in memory that are actually the bytes for that encoding (any one ruby knows about). I am not aware of any other language that made that choice -- I think it came about because of experience in the Japanese context, although at this point, I think anyone would be insane not to keep all of their in-memory strings in UTF-8 and transcode them on I/O, and I kind of wish the language actually encouraged/required that. But, hey, I program in a U.S. context. And like I said, my experience dealing with encoding in ruby has been better than my experience in any other language I've worked in (and I do have to deal with encoding a lot in the software I write) -- I definitely like it better than Java, which did decide all strings had to have a canonical internal encoding (if only it wasn't the pre-unicode-consolidation "UCS-2"!! perhaps that experience, of choosing the in-retrospect wrong canonical internal encoding, influenced ruby's choice)
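
The "no canonical internal encoding" point can be seen directly by looking at the bytes a string holds after transcoding. This is only a sketch (`bytes_in` is a hypothetical helper), but it shows that the in-memory bytes are those of whatever encoding the string is tagged with:

```ruby
# Return the in-memory bytes of text after transcoding it to the named encoding.
def bytes_in(text, encoding_name)
  text.encode(encoding_name).bytes
end

e_acute = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE

p bytes_in(e_acute, "UTF-8")       # [195, 169]
p bytes_in(e_acute, "ISO-8859-1")  # [233]
p bytes_in("\u3042", "Shift_JIS")  # [130, 160]  (HIRAGANA LETTER A)
```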

wycats 4195 days ago

I agree with the vast majority of what you've said.

One thing worth noting is that there was a TREMENDOUS effort that I headed up in the Rails 3 era to very aggressively attempt to reduce the number of encoding-related problems in Rails, and to make sure that common mistakes produced clear error messages.

I wrote two somewhat lengthy blog posts at the time[1][2] for a contemporary historical perspective just as the difficulty with encodings started to heat up.

One of the goals of the Rails 3 effort was to ensure that strings that made their way into Rails came in as UTF-8. That involved being very careful with templates (I wrote a bit of a novel in the docs that remains to this day[3]), figuring out how to ensure that browser forms submitted their data in UTF-8 (even in IE6[4]), and working with Brian Lopez on mysql2 to ensure that all strings coming in from MySQL were properly tagged with encodings.

I also did a lot of person-to-person evangelism to try to get C bindings to respect the `default_internal` encoding setting, which Rails sets to UTF-8.

The net effect of all of that work is that while people experienced a certain amount of encoding-related issues in Rails 3, it was dramatically smaller than the kinds of errors we were seeing when experimental Ruby 1.9 support was first added to Rails 2.3.

---

P.S. I completely agree that the ASCII-7 exception was critical to keeping things rolling in the early days, but I personally would have liked an opt-in setting that would raise an exception when concatenating BINARY that happened to contain ASCII-7-only bytes with an ASCII-compatible string. In practice, this exception allowed a number of obscure C bindings to continue to produce BINARY strings well into the encoding era, and they were responsible for a large percentage (in my experience) of weird production-only bugs.

Specifically, you would have development and test environments that only tested with ASCII characters (people's names, for example). Then, in production, the occasional user would type in something like "José", producing a hard-to-reproduce encoding compatibility exception. This kind of problem is essentially eliminated with libraries that are encoding-aware at the C boundary that respect `default_internal`.
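
A rough sketch of that production-only failure, using stand-ins rather than any real C binding (`from_legacy_binding`, `from_aware_binding` and `greet` are all hypothetical, and the example does not touch `Encoding.default_internal`). The BINARY string concatenates fine while it is ASCII-only, and fails only once a non-ASCII name shows up:

```ruby
# Stand-in for a C binding that ignores encodings and hands back BINARY strings.
def from_legacy_binding(text)
  text.b
end

# Stand-in for a binding that tags its output as UTF-8 at the boundary.
def from_aware_binding(text)
  text.b.force_encoding(Encoding::UTF_8)
end

# Join a name onto a UTF-8 string that already contains a non-ASCII character.
def greet(name)
  "\u00A1Hola, " + name
rescue Encoding::CompatibilityError
  :incompatible
end

p greet(from_legacy_binding("Jose"))        # works: the BINARY string is ASCII-only
p greet(from_legacy_binding("Jos\u00E9"))   # :incompatible, only with real-world input
p greet(from_aware_binding("Jos\u00E9"))    # works: bytes are tagged as UTF-8
```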

[1]: http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer...

[2]: http://yehudakatz.com/2010/05/17/encodings-unabridged/

[3]: https://github.com/rails/rails/blob/master/actionview/lib/ac...

[4]: http://stackoverflow.com/a/3348524

berdario 4195 days ago

> There's actually way more we can say about this, but this is a huge book already, haha.

Yeah, so much for "I'll try to keep this short" :P

> One difference in encoding between python and ruby I think is, in ruby 1.9+, if a string is tagged with an encoding but contains bytes invalid for that encoding (that do not represent a legal sequence of chars), you'll get an exception if you try to concat it to anything else -- even a string of the same encoding. I don't _think_ that happens in python?

True, also strings in python are immutable, so unless there's some weird way to access the underlying char* with the CPython C API, I don't think that you can have an invalid sequence of bytes inside a unicode string

(obviously you can have codepoint U+FFFD, if you set errors='replace' when decoding)

> Ruby also doesn't have a canonical internal encoding, strings can be in _any_ encoding it recognizes, tagged with that encoding and containing internal bytes in memory that are actually the bytes for that encoding (any one ruby knows about). I am not aware of any other language that made that choice -- I think it came about because of experience in the Japanese context, although at this point, I think anyone would be insane not to keep all of your in-memory strings in UTF-8 and transcode them on I/O, and kind of wish the language actually encouraged/required that. But, hey, I program in a U.S. context.

Yeah, some time ago I looked into the differences of Python/Ruby encoding, and I wrote down these notes that I just uploaded:

https://gist.github.com/berdario/9b6bd24cafe3817e4773

There are indeed some characters/ideograms that cannot be converted to unicode codepoints, but even if we try to obtain them, we westerners are none the wiser, since we cannot print them to our terminals in a utf-8 locale

About the edit you just added:

> I definitely like it better than Java, which did decide all strings had to have a canonical internal encoding (if only it wasn't the pre-unicode-consolidation "UCS-2"!! perhaps that experience, of choosing the in-retrospect wrong canonical internal encoding, influenced ruby's choice)

Yes, but I think that this issue is made more complex by Java's efforts to keep bytecode compatibility.

In a language like Python/Ruby, the bytecode is only an internal implementation detail, upon which you shouldn't rely (you should rely only on the semantics of the source code). If the actual encoding of unicode strings is also kept as an internal implementation detail, this issue could've been avoided (without switching to linear time algorithms for string handling):

Just migrate to UTF-32 (or to a dynamic fixed width encoding like in Python3.3) as the in-memory representation, when parsing strings from the source code, and everything would've continued to work.

I think that it had more to do with the Han unification, rather than with the fear of picking the "wrong encoding"

gsnedders 4195 days ago

> (obviously you can have codepoint U+FFFD, if you set errors='replace' when decoding)

Which is totally fine because U+FFFD REPLACEMENT CHARACTER is a totally valid character.
