Hacker News new | ask | show | jobs
by david2ndaccount 72 days ago
Pascal strings might be the only string design worse than C strings. C Strings at least let you take a zero copy substring of the tail. Pascal strings require a copy for any substring! Strings should be two machine words - length + pointer (aka what is commonly called a string view). This is no different than any other array view. Strings are not a special case.
5 comments

Yeah, I too feel that storing the array's length glued to the array's data is not that good of an idea, it should be stored next to the pointer to the array aka in the array view. But the thrall of having to pass around only a single pointer is quite a strong one.
> I too feel that storing the array's length glued to the array's data is not that good of an idea, it should be stored next to the pointer to the array aka in the array view.

That’s not cache-friendly, though. I think the short string optimization (keeping short strings alongside the string length, but allocating a separate buffer for longer strings. See https://devblogs.microsoft.com/oldnewthing/20240510-00/?p=10... for how various C++ compilers implement that) may be the best option.

> That’s not cache-friendly, though.

How so? The string implementations in that post are pretty much that:

    struct string
    {
        char* ptr;
        size_t size;
        union {
            size_t capacity;
            char buf[16];
        };
The pointer and the size are stored together, and they may optionally be located right next to the string's actual data, but only for very small, locally-allocated, short-lived strings; but in normal usage, that pointer points somewhere into the heap.
> they may optionally be located right next to the string's actual data, but only for very small, locally-allocated, short-lived strings

Only for small strings. Locally allocated and short-lived aren’t required for short string optimization to take an effect.

Also, I can’t find a good reference, but “only for small strings” in many programs means “for most strings”.

Is there a reason for the string not to be a struct, so that you're still just passing around a pointer to that struct (or even just passing it by value)?
I might guess that GP is referring not to interface ergonomics (for which a struct is a perfectly satisfactory solution, as you describe), but to implementation efficiency. A pointer is one word. A slice / string view is two words: a length and a pointer. A pointer to a slice is one word, but requires an additional indirection. I personally agree that slices are probably the best all-around choice, but taking double the memory (and incurring double the register pressure, etc.) is a trade-off that's fair to mention.
> C Strings at least let you take a zero copy substring of the tail

This is a special-case optimisation that I'm happy to lose in favour of the massive performance and security benefits otherwise.

Isn't length + pointer... Basically a Pascal string? Unless I am mistaken.

I think what was unsaid in your second point is that we really need to type-differentiate constant strings, dynamic strings, and string 'views', which Rust does in-language, and C++ does with the standard library. I prefer Rust's approach.

If I recall correctly a pascal string has the length before the string. Ie to get the length you dereference the pointer and look backwards N bytes to get the length. A pascal string is still a single pointer.

You cannot cheaply take an arbitrary view of the interior string - you can only truncate cheaply (and oob checks are easier to automate). That’s why pointer + length is important because it’s a generic view. For arrays it’s more complicated because you can have a stride which is important for multidimensional arrays.

> Isn't length + pointer... Basically a Pascal string? Unless I am mistaken.

Length + pointer is a record string, a pascal string has the length at the head of the buffer, behind the pointer.

Many years ago when reading Redis code I saw the same pattern: they pass around simple pointer to data, but there is a fixed length metadata just before that.
I assume it’s either Antirez’s sds or a variant / ancestor thereof, yes. It stores a control block at the head of the string, but the pointer points past that block, so it has metadata but “is” a C string.
Pascal strings store the string's length by its data, whereas fat pointers store the length by the address of the data.

The main difference is that if a string's length is by its data, you can't easily construct a pointer to part of that data without copying it into a new string, whereas if instead the length is by the data's address, you can cheaply construct pointers to any substring (by coming up with new length+address pairs) without having to construct entire new strings.

C strings also allow you to do a 0 copy split by replacing all instances of the delimeter with null (although you need to keep track of the end-of-list seperatly).
You also need to own the buffer otherwise you’re corrupting someone else’s data, or straight up segfaulting.
As long as you clearly document that the incoming data is going to be modified, it's not a problem. And in a lot of cases, the data either comes from the network or is read from the file - so the buffer is going to be discarded at the end anyway... why not reuse it?

And yes, today it would be easier to make a copy of the data... but remember we are talking about 90's, where RAM is measured in megabytes and your L1 cache may be only 8KB or so.

The "zero copy substring" in C is in general not a valid C string since it is not guaranteed to be zero-terminated. For both languages one could define a string view as a struct with a pointer plus size information. So, I do not see why Pascal is worse in this regard than C.
x86 had 6 general-purpose working registers total. Using length + pointers would have caused a lot of extra spills.
“Sure your software crashes and your machines get owned, but at least they’re not-working very fast!”
Right. This is so often the excuse for terrible designs in C and C++. It's wrong, "But it's faster". No, it's just wrong, only for correct answers does it matter whether you were faster. If just any answer was fine there's no need to write any of this software.