Hacker News new | ask | show | jobs
by corresation 4566 days ago
This all sounds rather terrible for Ruby, doesn't it? It isn't so much that the short string is faster (though I'm left unclear whether it itself is on the stack/heap, though given the GC nature of Ruby and practical considerations of the language, it must be the heap), but rather that the cost of the short string is also added to the long string in the heap (assumed) allocation of the RString (which becomes larger and thus more difficult to malloc).

If this is intended to sit on the stack, which I find highly unlikely (especially given the timings that seem to be the delta between one malloc and two, and would be much more significant if it were a stack allocation versus a heap allocation. This is not comparable to small string optimizations for the stack in C++), maybe. But otherwise it seems like a poorly considered hack.

The string type could as easily have been dynamically allocated based upon the length of the string, where the ptr by default points inside that same allocated block. If the string is expanded it can then be realloced and the string alloced somewhere else. No waste, a single allocation, etc.

2 comments

    allocation of the RString (which becomes larger and thus more difficult to malloc), and the 23 string bytes that will sit unused for longer strings.
I got the distinct impression (backed up by an actual code snippet defining the max embedded string size) that the 23 byte limit was calculated to exactly match the size of the data that would otherwise have to be stored for a heap string anyway. Thus, it doesn't actually take any extra space in the struct, and those 23 bytes do not go unused in other strings.
You're exactly right on the union: I tried to edit out my error on that before someone noticed (my principal point is about the mallocs), but your comment appeared right as I saved. Shame be upon me.

So this holds (on a 64-bit machine) 8-bytes for the pointer, 8-bytes for the length (string not null terminated), and 8-bytes for the capacity. Alternately, via a union, it stores 24 bytes of string (null terminated). It knows whether it is a or b via a separate flag that it holds separately in RBasic.

I retract my jab about memory loss, but it still sounds rather terrible. Every bit of code dealing with strings needs to validate flags on every use to determine what it is dealing with, alternate between length specified or null terminated, etc. Ugh.

Look up a C++ string implementation sometime, and you'll find that this is almost exactly how most efficient C++ string implementations do this too.

And it doesn't need to alternate between length specified or null terminated - they're all length specified (or how do you think short Ruby strings are also able to store ASCII NUL)

> but rather that the cost of the short string is also added to the long string

The key point is that for short strings, you're not adding it to the cost of the long string, but instead overwriting the data that you would track the long string.

In other words, the memory layout you get is this:

     short:
       [Rbasic]['s', 't', 'r', 'i', 'n', 'g', '\0']
     long:
       [Rbasic][ length ][ dataptr ][capa or value]
Which leads to another interesting question: How does the optimization interact with null in the middle of short strings, since only long strings have a length stored? Does Ruby check for embedded null when creating a string, and disable this optimization?