Internally strings will either be UTF-16, or if a string can be represented in LATIN-1 it may use a more compact representation.
JEP 400 is about I/O. Before this change, creating something like a FileWriter without specifying a charset meant the platform default was used. That has long been recognized as a common footgun, hence the change to a default (UTF-8) that is more likely to be what the developer actually wants.
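For anyone who wants to see what that looks like in code, here is a minimal sketch of sidestepping the default entirely by passing the charset explicitly. The class and method names are made up for illustration, and `FileWriter(File, Charset)` assumes Java 11 or later.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the JDK.
final class ExplicitCharset {
    // Writes text to a temp file through a FileWriter with an explicit
    // charset and returns the bytes that ended up on disk. Because the
    // charset is named, the result does not depend on the platform default.
    static byte[] writeUtf8(String text) {
        try {
            Path tmp = Files.createTempFile("jep400-demo", ".txt");
            try (Writer w = new FileWriter(tmp.toFile(), StandardCharsets.UTF_8)) {
                w.write(text);
            }
            byte[] bytes = Files.readAllBytes(tmp);
            Files.delete(tmp);
            return bytes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With `new FileWriter(file)` and no charset, the same text could come out as different bytes on different machines before JEP 400. Naming the charset gives the same bytes everywhere, on old and new JDKs alike.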
UTF-16 is a fairly reasonable compromise; I'm not sure what the disgust is about.
UTF-32 (with no BMP concept) doubles the memory usage of most international text and quadruples the memory usage of ASCII text (which is the most common), yet characters outside the BMP are barely used outside of emoji.
Native UTF-8 in memory makes character indexing a non-constant time operation, which would bite people badly in cases where they've written a loop over the indexes. This is of course the point at which you say, ah but what is a character exactly. If you go down this route you end up with Swift and Emoji Flag Calculus classes. The string APIs become incredibly convoluted or inefficient for the common cases. It hardly seems worth any kind of backwards compatibility break for this.
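To make the indexing cost concrete, here is a rough sketch of what "give me the n-th code point" has to do over a UTF-8 byte array. It assumes well-formed input, and the class and method names are hypothetical, not any real library's API.

```java
// Hypothetical helper, not part of the JDK.
final class Utf8Index {
    // Returns the byte offset at which the n-th code point (0-based) starts.
    // There is no way to jump straight there: every earlier sequence has to
    // be stepped over, so the cost grows linearly with n.
    static int offsetOfCodePoint(byte[] utf8, int n) {
        int offset = 0;
        for (int i = 0; i < n; i++) {
            if (offset >= utf8.length) {
                throw new IndexOutOfBoundsException("code point index " + n);
            }
            offset += sequenceLength(utf8[offset]);
        }
        if (offset >= utf8.length) {
            throw new IndexOutOfBoundsException("code point index " + n);
        }
        return offset;
    }

    // Length in bytes of the UTF-8 sequence that starts with this lead byte.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
        throw new IllegalArgumentException("not a UTF-8 lead byte: " + b);
    }
}
```

A `for (int i = 0; i < len; i++)` loop that calls something like this on every iteration goes quadratic, which is exactly the trap for code written against a constant-time `charAt`.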
So Java does the pragmatic thing: String can switch between 8 and 16 bits per "character", and this is basically always good enough. If you care about working with emoji or Egyptian hieroglyphs in memory, then you either have to deal with surrogate pairs and combining characters yourself or just bite the bullet and decode to UTF-32.
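As a small sketch of the "decode to UTF-32" route: `String.codePoints()` hands back one `int` per code point, which is effectively a UTF-32 array. The wrapper class is hypothetical; the `String` methods are standard.

```java
// Hypothetical wrapper, only here to give the example a home.
final class CodePoints {
    // Decodes a String into one int per Unicode code point. Surrogate pairs
    // are merged into a single value. Combining sequences are NOT merged;
    // they remain separate code points.
    static int[] toUtf32(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00";   // 'a' followed by U+1F600 as a surrogate pair
        System.out.println(s.length());                      // counts UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // counts code points
        System.out.println(Integer.toHexString(toUtf32(s)[1]));
    }
}
```

Indexing into the resulting `int[]` is constant time per code point, at the cost of four bytes per element.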
> Native UTF-8 in memory makes character indexing a non-constant time operation
The only reason that Java's UTF-16 has constant time indexing is that they use a braindead definition of character, which is "UTF-16 code unit".
If you want constant time character indexing you need to go UTF-32. But obviously the downsides are too great for most users. So in practice everyone uses UTF-8 because it is usually the most memory efficient.
Plus it turns out that character indexing isn't actually that common of an operation, so it is really the right move for almost every application.
But in practice the Java definition of a character basically always works, because characters that aren't in the BMP are vanishingly rare in real software outside of emoji, and of course, Java long pre-dates emoji.
Yes, that won't change anything internal. However:
> ...JVM's internal string representation is UTF-16
Hasn't been true for a while. They switched to using a byte array internally for storage, plus an encoding flag. Currently that's either UTF-16 or Latin-1, unless compact strings are disabled, in which case it's all UTF-16.
You're talking about implementation details of java.lang.String. The interface it exposes is still UTF-16.
Latin 1 has the special property that each of its fixed-width code units maps onto a single UTF-16 code unit. It is for that reason alone that CharSequence implementors can use it as an alternative to UTF-16. Imagine trying to implement `char charAt(int index)` if you're backed by a UTF-8 byte array (or UTF-32, for that matter)!
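To illustrate why Latin-1 slots in so easily, here is a sketch of a CharSequence backed by a Latin-1 byte array. It is a made-up class for illustration, not how java.lang.String is actually written.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class, not the JDK's implementation.
final class Latin1Sequence implements CharSequence {
    private final byte[] bytes;
    private final int start;
    private final int end;

    Latin1Sequence(byte[] bytes) {
        this(bytes, 0, bytes.length);
    }

    private Latin1Sequence(byte[] bytes, int start, int end) {
        this.bytes = bytes;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    // One byte is one UTF-16 code unit: mask off the sign and widen.
    // No scanning and no lookup table, so this stays constant time.
    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return (char) (bytes[start + index] & 0xFF);
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException(from + ".." + to);
        }
        return new Latin1Sequence(bytes, start + from, start + to);
    }

    @Override
    public String toString() {
        return new String(bytes, start, end - start, StandardCharsets.ISO_8859_1);
    }
}
```

The whole trick is the `& 0xFF` widening in `charAt`. A UTF-8 backing array has no equivalent, because the byte at position `index` may sit in the middle of a multi-byte sequence.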
From a programmer's perspective, Java is pretty much as UTF-16 as ever.