| This is so interesting! I wonder if anyone has ever tried an implementation that prioritizes even more minimal memory usage for small strings? Of the three implementations discussed, the smallest (on a 64-bit system) is 24 bytes. How about 8 bytes: the union of a single pointer and a char[8]? For strings of 6 or fewer characters, it uses the first 6 bytes to store the string, one for the null terminator, and the last byte to store the length, but with a trick (see below). For strings of 7 or more characters, it uses all 8 to store the address of a larger block of memory storing the capacity, size, and string bytes. The trick is to use the last byte as a way to know whether the bytes are encoding a short string or a pointer. Since pointers will never be odd, we just store an odd number in that last byte. For example, you could store (size << 1) | 0x1 So if the last bit is 1, it's a short string. The size is bytes[7] >> 1, and the data() pointer is just equal to the string itself. If the last bit is 0, treat the whole data as a pointer to a structure that encodes the capacity and size and then string bytes as usual. |
Generally when you do memory compactification of non-serialized structures it’s for cache related performance improvements, so unless the common case is overwhelmingly small strings there’s probably not a benefit. It’s worth testing to see where this is an improvement, but my guess is that it’s in extremely limited regimes, but maybe some bit manipulations instead of traditional byte scan could yield improvements, likewise it may be worth going to larger SSO sizes for SIMD and a larger scope of SSO cases.
If you’re interested in data structure optimizations like this there’s a cppcon talk about Facebook’s string implementation that has some clever tricks.