by desdiv 1544 days ago
>JEP400: UTF-8 by Default

This changes the default charset of the Java APIs to UTF-8.

I read that the Java 8 JVM's internal string representation is UTF-16 [0][1]. Is that still the case after JEP400?

[0] https://docs.oracle.com/javase/8/docs/technotes/guides/intl/...

[1] http://tutorials.jenkov.com/java/strings.html

4 comments

Internally strings will either be UTF-16, or if a string can be represented in LATIN-1 it may use a more compact representation.

JEP400 is about I/O. Before this change, when you created something like a FileWriter without specifying the charset, the platform default was used. This has long been recognized as a common footgun, hence the change to a default that is more likely to be what the developer actually wants.
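For illustration, here is a minimal sketch of passing the charset explicitly, which behaves the same on either side of the change. The class name, method name, and temp-file prefix are made up for the example; the `FileWriter(File, Charset)` constructor needs Java 11 or later.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetDemo {
    // Writes text with an explicit charset and returns the raw bytes on disk,
    // so the result never depends on whatever the default charset happens to be.
    static byte[] writeAndReadBack(String text) {
        try {
            Path tmp = Files.createTempFile("charset-demo", ".txt"); // hypothetical name
            try (FileWriter w = new FileWriter(tmp.toFile(), StandardCharsets.UTF_8)) {
                w.write(text);
            }
            byte[] bytes = Files.readAllBytes(tmp);
            Files.delete(tmp);
            return bytes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = writeAndReadBack("\u00e9"); // U+00E9, e with acute accent
        System.out.println(bytes.length); // 2: U+00E9 takes two bytes in UTF-8
    }
}
```

With `new FileWriter(file)` and no charset, the same call could produce one byte or two depending on the machine's default, which is the footgun in question.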

Since Java 9 (or 11, as 9 is not an LTS release), String internals have been reworked to use either 8 or 16 bits per char. [1]

[1] https://openjdk.java.net/jeps/254

BMP forever! The most disgusting thing I've read today.
Am I misunderstanding? UTF-16 can represent all Unicode characters, not just the BMP.
UTF-16 represents a fairly reasonable compromise, not sure what your disgust is for.

UTF-32 (with no BMP concept) doubles the memory usage of most international text and quadruples the memory usage of ASCII text (which is the most common), yet characters outside the BMP are barely used outside of emoji.
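To put rough numbers on that, here is a small sketch comparing encoded sizes. The class and method names are invented for the example, and the UTF-32 size is computed as 4 bytes per code point rather than by actually encoding.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizeDemo {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF_16BE is used so that no byte-order mark is added to the count.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // UTF-32 is fixed width: 4 bytes per code point.
    static int utf32Bytes(String s) {
        return 4 * s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String ascii = "hello";
        String greek = "\u03b3\u03b5\u03b9\u03b1"; // four Greek letters, all in the BMP
        String emoji = "\uD83D\uDE00";             // U+1F600, outside the BMP

        for (String s : new String[] {ascii, greek, emoji}) {
            System.out.println(utf8Bytes(s) + " / " + utf16Bytes(s) + " / " + utf32Bytes(s));
        }
        // prints:
        // 5 / 10 / 20
        // 8 / 8 / 16
        // 4 / 4 / 4
    }
}
```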

Native UTF-8 in memory makes character indexing a non-constant time operation, which would bite people badly in cases where they've written a loop over the indexes. This is of course the point at which you say, ah but what is a character exactly. If you go down this route you end up with Swift and Emoji Flag Calculus classes. The string APIs become incredibly convoluted or inefficient for the common cases. It hardly seems worth any kind of backwards compatibility break for this.

So Java does the pragmatic thing: String can switch between 8 or 16 bits per "character" and this is basically always good enough. If you care about working with emoji or Egyptian hieroglyphs in memory, then you either have to deal with combining characters or just bite the bullet and decode to UTF-32.
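The "decode to UTF-32" route can be sketched with `String.codePoints()`, which yields one `int` per code point. The class and method names here are just for illustration.

```java
class CodePointDemo {
    // Decodes a String into an array with one int per Unicode code point,
    // which is effectively UTF-32 in memory.
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        int[] cps = toCodePoints(s);

        System.out.println(s.length());                 // 4 UTF-16 code units
        System.out.println(cps.length);                 // 3 code points
        System.out.println(Integer.toHexString(cps[1])); // 1f600
    }
}
```

After that, `cps[i]` is a constant-time lookup of the i-th code point, at the cost of 4 bytes per entry.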

> Native UTF-8 in memory makes character indexing a non-constant time operation

The only reason Java's UTF-16 has constant-time indexing is that it uses a braindead definition of character: a UTF-16 code unit.
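A small sketch of what that means in practice: `charAt` indexes code units, so it can land on half of a surrogate pair, while indexing by code point goes through `offsetByCodePoints`, which in general has to walk the string. The helper name `codePointAtIndex` is made up for the example.

```java
class CharAtDemo {
    // Returns the code point at a given code-point index (not char index).
    // offsetByCodePoints must skip over surrogate pairs, so in general this
    // is linear in the index rather than constant time.
    static int codePointAtIndex(String s, int codePointIndex) {
        int charIndex = s.offsetByCodePoints(0, codePointIndex);
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDE00x"; // U+1F600 followed by 'x'

        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(s.charAt(1) == 'x');                     // false: index 1 is the low surrogate
        System.out.println((char) codePointAtIndex(s, 1));          // x
    }
}
```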

If you want constant time character indexing you need to go UTF-32. But obviously the downsides are too great for most users. So in practice everyone uses UTF-8 because it is usually the most memory efficient.

Plus it turns out that character indexing isn't actually that common of an operation, so it is really the right move for almost every application.

UTF-32 isn't really a solution either, unless you consider a scalar value to be a character; I bet almost nobody wants U+0308 to be "a character"...
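A quick sketch of the U+0308 case: "u" followed by the combining diaeresis is two code points (so two UTF-32 units) even though it displays as one character. The class and method names are invented for the example. Note that NFC normalization only collapses the pair here because a precomposed U+00FC exists; it is not a general answer to "what is a character".

```java
import java.text.Normalizer;

class CombiningDemo {
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Composes base + combining mark into a precomposed form where one exists.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "u\u0308"; // 'u' + COMBINING DIAERESIS

        System.out.println(codePointCount(decomposed));                              // 2
        System.out.println(Character.getType('\u0308') == Character.NON_SPACING_MARK); // true
        System.out.println(codePointCount(nfc(decomposed)));                         // 1 (U+00FC)
    }
}
```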
But in practice the Java definition of a character basically always works, because characters that aren't in the BMP are vanishingly rare in real software outside of emoji, and of course, Java long pre-dates emoji.
Yes, that won't change anything internal. However:

> ...JVM's internal string representation is UTF-16

Hasn't been true for a while. They switched to using a byte array internally for storage, plus an encoding flag. Currently that's either UTF-16 or Latin 1, unless compact strings are disabled, in which case it's all UTF-16.

You're talking about implementation details of java.lang.String. The interface it exposes is still UTF-16.

Latin 1 has the special property that each of its fixed-width code units maps onto a single UTF-16 code unit. It is for that reason alone that CharSequence implementors can use it as an alternative to UTF-16. Imagine trying to implement `char charAt(int index)` if you're backed by a UTF-8 byte array (or UTF-32, for that matter)!
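To make the Latin 1 point concrete, here is a toy CharSequence backed by a Latin 1 byte array. It is only a sketch of the idea, not how java.lang.String is actually written, and `Latin1Sequence` is a made-up name. The key line is `charAt`: every Latin 1 byte widens directly to the UTF-16 code unit with the same value, so indexing stays a plain array lookup.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Latin1Sequence implements CharSequence {
    private final byte[] bytes;

    Latin1Sequence(byte[] bytes) {
        this.bytes = bytes.clone();
    }

    // Assumes every char in s is <= U+00FF; others would be replaced by '?'.
    static Latin1Sequence of(String s) {
        return new Latin1Sequence(s.getBytes(StandardCharsets.ISO_8859_1));
    }

    @Override
    public int length() {
        return bytes.length; // one byte per UTF-16 code unit
    }

    @Override
    public char charAt(int index) {
        // Mask to undo sign extension, then widen: byte 0xE9 becomes U+00E9.
        return (char) (bytes[index] & 0xFF);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return new Latin1Sequence(Arrays.copyOfRange(bytes, start, end));
    }

    @Override
    public String toString() {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

For example, `Latin1Sequence.of("caf\u00e9").charAt(3)` returns `'\u00e9'` using half the storage of a `char[]`.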

From a programmer's perspective, Java is pretty much as UTF-16 as ever.

No, it's only for I/O. It would lead to way too much breakage if they changed how strings behave.

I generally like the idea of using UTF-8 strings, but if they didn't want to break string indexing, indexing would take O(str.length)...
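A rough sketch of why indexing into UTF-8 is linear: finding the n-th code point means scanning from the start and counting lead bytes. The class and method names are made up, and the code assumes the input is valid UTF-8.

```java
import java.nio.charset.StandardCharsets;

class Utf8IndexDemo {
    // Returns the byte offset where the code point with the given index starts.
    // There is no way to jump there directly, so the cost is O(n) in the offset.
    static int byteOffsetOfCodePoint(byte[] utf8, int codePointIndex) {
        int seen = 0;
        for (int i = 0; i < utf8.length; i++) {
            // Continuation bytes look like 10xxxxxx; any other byte starts a code point.
            boolean isLead = (utf8[i] & 0xC0) != 0x80;
            if (isLead) {
                if (seen == codePointIndex) {
                    return i;
                }
                seen++;
            }
        }
        throw new IndexOutOfBoundsException("code point index " + codePointIndex);
    }

    public static void main(String[] args) {
        // 'a' (1 byte), U+00E9 (2 bytes), U+1F600 (4 bytes), 'b' (1 byte)
        byte[] utf8 = "a\u00e9\uD83D\uDE00b".getBytes(StandardCharsets.UTF_8);

        System.out.println(utf8.length);                     // 8
        System.out.println(byteOffsetOfCodePoint(utf8, 2));  // 3
        System.out.println(byteOffsetOfCodePoint(utf8, 3));  // 7
    }
}
```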