| HN Mirror



	by MichaelGG 3694 days ago
	But the Rust version of the blank check doesn't handle any encoding but UTF-8. Helix could wrap up a char iterator for it I suppose, one that calls rb_enc_codepoint_len? And isn't there some common C lib that exposes Unicode functions like is_whitespace? Granted, using a cargo crate is easier than finding and adding a .h, and far easier than getting and linking another lib.
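For reference, a minimal sketch of the kind of blank check under discussion, assuming the input has already been coerced to a valid `&str`. The function name `is_blank` is an illustration, not Helix's actual code.

```rust
/// Returns true when every character is Unicode whitespace.
/// An empty string counts as blank. The input is a `&str`, so it is
/// already guaranteed to be valid UTF-8; other encodings never get here.
pub fn is_blank(s: &str) -> bool {
    s.chars().all(char::is_whitespace)
}
```

Calling `is_blank(" \t\u{3000}")` treats the ideographic space as whitespace, since `char::is_whitespace` follows the Unicode `White_Space` property.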

3 comments

wycats 3694 days ago

The Rust version does a type coercion from Ruby VALUE to Rust String. The type coercions are defined generically using Rust traits (see the Helix README), so once somebody defines RubyString -> String, everyone benefits.

In this case, the coercion needs to ask Ruby for the encoding tag and ask Ruby to validate the encoding (which it does often enough that it's often cached), but after that we can safely coerce directly into a UTF-8 string.

If we wanted to support other encodings, we could fall back to using Ruby's transcoding support (string.encode("UTF-8")) and again, once someone does the work once it'll work for all Helix users.
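A rough sketch of what such a coercion could look like, written without linking against Ruby. Everything here is hypothetical: `RubyString`, `Encoding`, `Coerce`, and `CoercionError` are stand-ins invented for illustration and are not Helix's real types. The point is only that the encoding tag is checked and the bytes are validated before a `String` is produced.

```rust
/// Hypothetical encoding tag attached to a Ruby string.
#[derive(Debug, Clone, PartialEq)]
pub enum Encoding {
    Utf8,
    Other(String),
}

/// Hypothetical stand-in for a Ruby string: raw bytes plus an encoding tag.
pub struct RubyString {
    pub bytes: Vec<u8>,
    pub encoding: Encoding,
}

#[derive(Debug, PartialEq)]
pub enum CoercionError {
    UnsupportedEncoding(String),
    InvalidBytes,
}

/// Hypothetical coercion trait: implemented once, reused by every caller.
pub trait Coerce<T> {
    fn coerce(self) -> Result<T, CoercionError>;
}

impl Coerce<String> for RubyString {
    fn coerce(self) -> Result<String, CoercionError> {
        match self.encoding {
            // The tag says UTF-8, but the bytes are still validated.
            Encoding::Utf8 => {
                String::from_utf8(self.bytes).map_err(|_| CoercionError::InvalidBytes)
            }
            // A fuller version could transcode here instead of refusing.
            Encoding::Other(name) => Err(CoercionError::UnsupportedEncoding(name)),
        }
    }
}
```

With this shape, a string tagged with some other encoding is rejected at the boundary rather than being silently reinterpreted as UTF-8.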

MichaelGG 3694 days ago

I was just pointing out that the C and Rust versions provided weren't quite equivalent.

chancancode 3694 days ago

You are definitely correct, this is definitely a bug.

Helix is set up to do the right thing – it already goes through a coercion protocol, so we can easily add the encoding check there. We just missed that detail when porting the code; will fix it soon.

I suppose that echoes my point about how systems programming is hard to get right; there are just too many details you have to remember!

This is why having a shared solution like Helix is beneficial. By moving all the unsafe code into a common library, it's more likely that someone will notice the problem and fix it for everyone.

This actually touches on an interesting point I would like to elaborate on. When we say {Helix/Rust/Ruby} is safe, there is an important caveat – {Helix/Rust/Ruby} themselves could of course have bugs. I have definitely experienced segfaults on Ruby myself.

While true, this caveat is not particularly interesting. It is not a sleight of hand. Moving code around doesn't magically remove human errors; that's not the point. It's about establishing clear boundaries for responsibility. (This is why unsafe blocks in Rust are great.)

When you get a segfault on Ruby, you know for certain that your code is not the problem. Sure, you might be doing something weird, but it is part of the contract that the VM is not supposed to crash no matter what you do. As a result, memory safety is just not a thing you have to constantly worry about when programming in Ruby.

It is the same thing as saying JavaScript code on a website "cannot" crash the browser, segfaults in user-space code "cannot" cause a kernel panic or malicious code "cannot" fry your chip. All of these could of course (and do) happen – but from the programmer's perspective, you can work with the assumption that they are not going to happen (and when they do, it's someone else's fault). It's not "cannot" in the "mathematically proven" sense, but it's just a useful abstraction boundary.

steveklabnik 3694 days ago

Akira Matsuda actually suggested at RailsConf that maybe Rails handling non-UTF-8 encodings was not necessary, and maybe phasing it out was a good idea.

I wasn't present for the talk, just saw his slides.

nateberkopec 3694 days ago

Akira was talking about a specific context - view rendering. Which makes sense, who the hell ever renders a view in anything other than UTF-8?

Checking input, however, is a whole 'nother ballgame.

steveklabnik 3694 days ago

He was talking about the view layer, that's true. Even then though, your source is likely to be in UTF-8, and Rails' form helpers add

  <form accept-charset="UTF-8">

so these days, the non-UTF-8 usage in Rails apps should be pretty tiny, I would think? It'd be stuff coming from outside of forms.

aidenn0 3694 days ago

The Rust version works on strings not on bytes. Strings don't have encodings.

hetman 3694 days ago

What do you mean? All Rust strings are UTF-8 encoded, and all Ruby strings have an associated encoding.

steveklabnik 3694 days ago

All Rust Strings and &strs are UTF-8 encoded; there are also other string types.

x5n1 3694 days ago

huh? strings have encodings. rust strings are bytes encoded in utf-8.

https://doc.rust-lang.org/book/strings.html

aidenn0 3694 days ago

In Rust a string is a sequence of Unicode scalar values. I personally find it unfortunate that they dictate the storage of it at the API level, but that is a necessary evil for presenting a consistent ABI with foreign code.

I did not know that strings in Ruby have encodings. Is there a reason for that? I personally don't like mixing characters and opaque byte sequences as they are very different.

burntsushi 3694 days ago

> In Rust a string is a sequence of unicode scalar values.

The representation of a Rust String in memory is guaranteed valid UTF-8. To me, a "sequence of Unicode scalar values" is an abstract description, because it could be implemented via UTF-8, UTF-16 or UTF-32.
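To make the distinction concrete, here is a small sketch that counts the same text at three levels using only the standard library. The helper name `sizes` is made up for this example.

```rust
/// Returns (Unicode scalar values, UTF-8 bytes, UTF-16 code units) for `s`.
/// The first number is the abstract view; the second is what a Rust
/// `String` actually stores; the third is what a UTF-16 representation
/// of the same text would need.
pub fn sizes(s: &str) -> (usize, usize, usize) {
    let scalar_values = s.chars().count();
    let utf8_bytes = s.len();
    let utf16_units = s.encode_utf16().count();
    (scalar_values, utf8_bytes, utf16_units)
}
```

For `"a\u{e9}\u{1F600}"` (a, é, and an emoji) this gives 3 scalar values, 7 UTF-8 bytes, and 4 UTF-16 code units.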

> I personally find it unfortunate that they dictate the storage of it at the API level

It is extraordinarily convenient and provides a very transparent way to analyze the performance of string operations.

For transcoding, there is the in-progress `encoding` crate: https://github.com/lifthrasiir/rust-encoding

I note that Go does things very similarly (`string` is conventionally UTF-8) and it works famously for them. They have a much more mature set of encoding libraries, but they work the same as the equivalent libraries would work in Rust: transcode to and from UTF-8 at the boundaries. See: https://godoc.org/golang.org/x/text
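As an illustration of "transcode at the boundaries", here is a standard-library-only sketch for one simple legacy encoding, ISO-8859-1 (Latin-1). It does not use the `encoding` crate mentioned above, and the function names are hypothetical. Latin-1 is convenient for a sketch because each byte maps to the Unicode code point with the same numeric value.

```rust
/// Decodes Latin-1 bytes into a UTF-8 `String` on the way in.
/// Each byte 0x00..=0xFF becomes the code point with the same value.
pub fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Encodes text back to Latin-1 on the way out.
/// Returns `None` if any character falls outside the Latin-1 range.
pub fn string_to_latin1(s: &str) -> Option<Vec<u8>> {
    s.chars()
        .map(|c| if (c as u32) <= 0xFF { Some(c as u8) } else { None })
        .collect()
}
```

Everything between those two calls can then work on ordinary `String` and `&str` values without caring where the data came from.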

MichaelGG 3694 days ago

Ruby's Japanese heritage is probably why it handles encodings like that - I think there were multiple encs it had to deal with at once or something. Also Unicode doesn't completely handle all kanji in that there's some that have an old style not available in Unicode. But maybe that's not relevant.

aidenn0 3694 days ago

Unicode now handles all the Kanji in JIS. I wouldn't be surprised if Ruby predated that. It almost certainly predates good library support for all the Kanji in JIS.

GolDDranks 3694 days ago

I think the problem isn't whether it handles all the Kanji in JIS – it does. But the problem is that JIS at the time was so common that it didn't necessarily make sense to settle exclusively for then-less-used UTF-8. That would make re-encodings necessary at interfaces and on IO.

steveklabnik 3694 days ago

Ruby encoding stuff changed a lot over its history; it was one of the big changes from 1.8 to 1.9.

twelvechairs 3694 days ago

It's a better way of doing things - you can handle things in their native format rather than have to arbitrarily convert to UTF8 (which is an 'encoding' itself).

[edit] I remember a talk where Matz was asked this specific question and tried to explain it clearly but seemed confused as to how the questioner could have such a poor grasp of unicode (the difference between monolingual americans and japanese i guess)

kibwen 3694 days ago

String is just a typedef for Vec<u8> with some extra convenience functions for working with UTF-8. There's nothing stopping anyone from just using Vec<u8> to handle non-UTF-8 data in their native format, nor stopping anyone from writing convenience types like String for other encodings.
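A short sketch of that relationship, with hypothetical helper names. Strictly speaking, `String` is a distinct type that wraps a `Vec<u8>` and enforces UTF-8 validity rather than a literal alias, which is why the conversion into it can fail while the conversion out of it cannot.

```rust
/// Tries to treat raw bytes as text. If they are not valid UTF-8,
/// the original bytes are handed back untouched so the caller can
/// keep working with them as a plain `Vec<u8>`.
pub fn bytes_to_text(bytes: Vec<u8>) -> Result<String, Vec<u8>> {
    String::from_utf8(bytes).map_err(|e| e.into_bytes())
}

/// Going the other way always succeeds and does not copy:
/// a `String` gives up its underlying byte vector.
pub fn text_to_bytes(text: String) -> Vec<u8> {
    text.into_bytes()
}
```

Bytes in a non-UTF-8 encoding simply stay on the `Err` side as a `Vec<u8>`, where nothing stops you from handling them in their native format.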

twelvechairs 3694 days ago

Yeah right so Ruby effectively has just made a bunch of these (and done the hard work for you of defining how to convert between them and work with them all in similar ways), and the higher-level class which includes UTF8 and a whole bunch of others is called 'String'. It's really what you want from a high-level language - to just work with different encodings out of the box, but not have to convert to a standard internal type (like UTF8) to do so.

lobster_johnson 3694 days ago

The reason is that Ruby supports non-Unicode encodings that are not subsets of Unicode. Not possible if your string is Unicode.

MichaelGG 3694 days ago

Right, so how do you get from Ruby strings (various encodings) to a Rust string? The sample code just calls std::str::from_utf8_unchecked(s) which is obviously not dealing with Ruby encodings.
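The difference between the checked and unchecked standard-library calls, as a minimal sketch. The wrapper names are hypothetical; `std::str::from_utf8` and `std::str::from_utf8_unchecked` are the real functions.

```rust
/// Checked conversion: bytes that are not valid UTF-8 become an error value.
pub fn to_str_checked(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Unchecked conversion, shown only to spell out the contract.
///
/// # Safety
/// The caller must guarantee that `bytes` is valid UTF-8.
/// Passing anything else is undefined behaviour.
pub unsafe fn to_str_unchecked(bytes: &[u8]) -> &str {
    // No validation happens here; the caller's promise is all there is.
    unsafe { std::str::from_utf8_unchecked(bytes) }
}
```

Bytes from a Ruby string in another encoding would be rejected by the checked version, whereas the unchecked version would accept them and break the UTF-8 guarantee that the rest of the Rust code relies on.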

aidenn0 3694 days ago

Yeah, that's a clear bug. I was not aware that Ruby strings had encodings.

x5n1 3693 days ago

All strings have encodings. It is not possible to represent a string which is a series of bytes except with encodings. I guess you probably mean default encoding or no encoding support... which implies ASCII, better known as US-ASCII.

aidenn0 3692 days ago

In the high-level language I am most familiar with (Common Lisp), strings do not have encodings because they are vectors of characters, not vectors of bytes. How the string is actually stored in memory is an implementation detail.

Encoding is purely an artifact of I/O if your language has a character type that can represent all possible characters you might want read or write.

Rust's strings are almost this; if there were no way to get a string's raw representation, nor perform bytewise slices, then how the string was stored in RAM would be an implementation detail rather than part of the public API. Rust, being a systems language, probably does need to specify this so that it doesn't incur encode/decode overhead when dealing with foreign code that can understand utf-8.
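A small sketch of where the storage format shows through in the public API: slice ranges on a `&str` are byte offsets into the UTF-8 data, not character indices. The helper name `byte_slice` is made up for this example.

```rust
/// Slices `s` by byte offsets. Returns `None` instead of panicking when
/// the range is out of bounds or would cut a multi-byte character in half.
pub fn byte_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}
```

For `"\u{e9}a"` (é followed by a), `as_bytes()` exposes the three stored bytes `0xC3 0xA9 0x61`, `byte_slice(s, 0, 2)` yields the é, and `byte_slice(s, 0, 1)` yields `None` because offset 1 falls inside the two-byte character.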