That'd be catastrophic for performance in Ruby - every string allocation would always have to be reified, and would always need to access a shared data structure.
I'll take your word for that. It's the opposite in Lua, which is several times faster than stock Ruby. If your runtime takes it as a given that every string will be interned, there are all sorts of assumptions this enables which mutation invalidates.
> If your runtime takes it as a given that every string will be interned, there are all sorts of assumptions this enables which mutation invalidates.
Yes, but if you're regularly creating strings, which is what Ruby web servers do all the time, then your intern table is going to become a white-hot hotspot, contended by all threads all the time.
It's more accurate to say that Ruby has mutable strings as part of its semantics, and that can't be changed, and would interact disastrously with immutability.
Languages with immutable strings simply architect a web server around that semantic, so the problem you're describing doesn't happen. A mutable string is replaced with (in Lua) a mutable array containing immutable string fragments, which are made into a string using the builtin table.concat.
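For anyone who doesn't read Lua, here is a rough Java analogue of that pattern, as a sketch only. FragmentBuffer and its methods are hypothetical names made up for illustration; String.join stands in for table.concat.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a mutable list of immutable fragments, joined once at the end.
final class FragmentBuffer {
    private final List<String> parts = new ArrayList<>();

    FragmentBuffer add(String fragment) {
        parts.add(fragment); // only the list mutates; the fragments themselves never change
        return this;
    }

    String build() {
        // Plays the role of Lua's table.concat: a single join over all fragments.
        return String.join("", parts);
    }

    public static void main(String[] args) {
        FragmentBuffer response = new FragmentBuffer()
            .add("HTTP/1.1 200 OK\r\n")
            .add("Content-Length: 2\r\n\r\n")
            .add("ok");
        System.out.print(response.build());
    }
}
```

The point is that no intermediate concatenated strings are created while the response is being assembled, so only the final result would ever need to be interned.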
> A mutable string is replaced with (in Lua) a mutable array containing immutable string fragments, which are made into a string using the builtin table.concat.
That's what TruffleRuby does internally! We give you 'mutable' Ruby strings, but really they're made up of fragments of immutable strings.
Yeah - if all strings are interned that's ok for comparison, but atrocious for string allocation. If not all strings are interned that's better for allocation, but atrocious for string comparison.
We already have the best solution... symbols. An alternative set of semantics that work well with our hardware.
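To make that trade-off concrete, here is a toy sketch in Java. InternDemo and its single-threaded HashMap table are assumptions for illustration; a real runtime would need a concurrent table with weak references, which is exactly where the contention comes from.

```java
import java.util.HashMap;
import java.util.Map;

final class InternDemo {
    // Hypothetical toy intern table; not how any real runtime implements it.
    private static final Map<String, String> TABLE = new HashMap<>();

    // Every "allocation" of an interned string pays for a hash plus a table lookup.
    static String intern(String s) {
        String existing = TABLE.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    public static void main(String[] args) {
        String a = intern(new String("hello"));
        String b = intern(new String("hello"));
        String c = new String("hello"); // never interned

        System.out.println(a == b);      // true: interned strings compare by identity
        System.out.println(a == c);      // false: identity is useless without interning
        System.out.println(a.equals(c)); // true: falls back to comparing contents
    }
}
```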
Always measure. I use Lua at $job, where it's processing millions of SIP messages (text! Lots of parsing) in real time (it's part of the call processing flow), and over the past seven years or so I've only had to tune the code once (maybe two or three years ago now?).
Honestly, many more language runtimes should perform a lazy version of interning: have String.equals(String) (or its equivalent in your language), when it determines that two Strings are equal, fix up the String internal representations to share more state and hit the fast path for future String comparisons.
In older Sun/Oracle JVMs, String looked something like
    class String {
        private char[] contents;
        private int length; // substring() used to share contents with parent string
        private int offset;
    }
Since they switched to memoizing hashCode calculation, it looks something like
    class String {
        private char[] contents;
        private int length; // Allows Substrings to share contents with parents if they are a prefix of the parent
        private int hashCode;
    }
The fast path for equals() should first check for address equality between the two Strings, and then between their contents arrays. If that fails, it should return false right away when the lengths differ, or when both Strings have memoized hash codes that are not equal. Only then does it fall through to the slow path that compares characters. If both strings compare as equal in the slow path, then one of them should be patched up to share the other's contents, which saves memory and also hits the fast path the next time equals() is called on them.
When two Strings compare as equal, you probably want to use some heuristic to determine which one is likely to live longer into the future, and switch the other String to use the former's contents. Maybe in the Java case you'd only enable this optimization if length == contents.length to avoid some trickiness around prefix sharing optimizations. For a mark-sweep-compact garbage collector where lower addressed objects are more likely than not to have been allocated earlier, using the lower addressed contents as a tie breaker is a decent heuristic. Even in cases where address comparison is only as good as a coin flip in guessing age, it's a decent arbitrary tie breaker that in the long run leads to equal strings all converging on sharing the same contents char[].
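Here is a minimal sketch of the scheme described above, under some assumptions. LazyString is a hypothetical stand-in for a runtime's string type, not java.lang.String. Java code can't see object addresses, so System.identityHashCode is used as an arbitrary tie breaker in place of the lower-address heuristic. The sketch is not thread safe.

```java
import java.util.Arrays;

// Hypothetical stand-in for a runtime's string type; not java.lang.String.
final class LazyString {
    private char[] contents; // may later be swapped for an equal array owned by another string
    private int hash;        // 0 is the "not memoized yet" sentinel

    LazyString(String s) {
        this.contents = s.toCharArray();
    }

    int memoizedHash() {
        if (hash == 0) hash = Arrays.hashCode(contents);
        return hash;
    }

    boolean sharesContentsWith(LazyString other) {
        return contents == other.contents;
    }

    boolean lazyEquals(LazyString other) {
        // Fast paths: same object, or already sharing one backing array.
        if (this == other || contents == other.contents) return true;
        // Fast rejections: different lengths, or memoized hashes that differ.
        if (contents.length != other.contents.length) return false;
        if (hash != 0 && other.hash != 0 && hash != other.hash) return false;
        // Slow path: compare character by character.
        if (!Arrays.equals(contents, other.contents)) return false;
        // Equal but not shared: converge on one array so the next call is fast.
        // Assumption: identityHashCode is only an arbitrary, stable-enough tie breaker.
        if (System.identityHashCode(contents) <= System.identityHashCode(other.contents)) {
            other.contents = contents;
        } else {
            contents = other.contents;
        }
        return true;
    }

    public static void main(String[] args) {
        LazyString a = new LazyString("hello");
        LazyString b = new LazyString("hello");
        System.out.println(a.sharesContentsWith(b)); // false: two separate arrays
        System.out.println(a.lazyEquals(b));         // true, via the slow path
        System.out.println(a.sharesContentsWith(b)); // true: one array was dropped
    }
}
```

Because equal contents always produce equal hashes, swapping the array never invalidates a memoized hash code.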
Note that maximum String length in Java is 2^31-1 chars, so the sign bit of String.length is free for use as a flag, either for keeping track of whether hashCode is memoized (if you don't want to use a sentinel value to indicate non-memoized hashCode) or for marking this String's contents as the preferred version (for instance, if its backing store is in a shared library's read-only data section).
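The bit packing itself is tiny. A sketch, where PackedLength and its method names are hypothetical and the meaning of the flag is left to the caller:

```java
// Hypothetical helper showing a length and one flag packed into a single int.
final class PackedLength {
    private static final int FLAG = 0x80000000;        // the sign bit
    private static final int LENGTH_MASK = 0x7FFFFFFF; // the low 31 bits

    static int pack(int length, boolean flag) {
        if (length < 0) throw new IllegalArgumentException("negative length");
        return flag ? (length | FLAG) : length;
    }

    static int length(int packed) {
        return packed & LENGTH_MASK;
    }

    static boolean flag(int packed) {
        return packed < 0; // the flag is set exactly when the sign bit is set
    }

    public static void main(String[] args) {
        int packed = pack(Integer.MAX_VALUE, true);
        System.out.println(length(packed) == Integer.MAX_VALUE); // true
        System.out.println(flag(packed));                        // true
    }
}
```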
In languages with mutable strings, if maximum string lengths are also 2^31-1 characters (or if String hash codes can be safely truncated to 31 bits, etc.), then a single bit could be found to keep track of which contents arrays are copy-on-write.
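A sketch of how that copy-on-write bit might be used, again with made-up names (CowString, share, setCharAt) and no thread safety. It is deliberately conservative: once a string has been marked shared, it copies on its next write even if the other owner has already made its own copy.

```java
import java.util.Arrays;

// Hypothetical mutable string whose length field carries a copy-on-write flag in its sign bit.
final class CowString {
    private static final int SHARED = 0x80000000;
    private char[] contents;
    private int lengthAndFlag;

    CowString(String s) {
        contents = s.toCharArray();
        lengthAndFlag = contents.length;
    }

    private CowString(char[] shared, int length) {
        contents = shared;
        lengthAndFlag = length | SHARED;
    }

    // Hands out a second string over the same array; both sides become copy-on-write.
    CowString share() {
        lengthAndFlag |= SHARED;
        return new CowString(contents, length());
    }

    int length() {
        return lengthAndFlag & ~SHARED;
    }

    void setCharAt(int index, char c) {
        if (index < 0 || index >= length()) throw new IndexOutOfBoundsException();
        if (lengthAndFlag < 0) { // flag set: take a private copy before the first write
            contents = Arrays.copyOf(contents, length());
            lengthAndFlag &= ~SHARED;
        }
        contents[index] = c;
    }

    @Override
    public String toString() {
        return new String(contents, 0, length());
    }

    public static void main(String[] args) {
        CowString a = new CowString("hello");
        CowString b = a.share();
        b.setCharAt(0, 'j');
        System.out.println(a); // hello
        System.out.println(b); // jello
    }
}
```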