Hacker News new | ask | show | jobs
by InclinedPlane 6025 days ago
I'd say using null-terminated strings rather than pascal style length embedded strings is C's biggest mistake. Responsible for so many inefficiencies (strlen is O(n) instead of O(1) as it should be) and, worse yet, so many incredibly serious security vulnerabilities.

All to avoid having to incur a 1-3 byte per string overhead or figuring out how to efficiently work around a 255 character limit.

3 comments

Other than strlen, string operations can be faster with null-terminated strings. Plus, there are other benefits:

If you want to turn some arbitrary data into a string, just put a 0 byte where you want it to end.

Instead of having to increment and compare both a counter and a pointer (or store a final pointer for comparison) in string-manipulation operations, you just increment the pointer. It makes for very concise loops in strchr, etc.

Tokenizing a string is just a matter of throwing down 0 bytes where the tokens are (as strtok does).

Passing a tailing subset of a string to a function is as easy as adding an integer to the string pointer, rather than requiring a memcpy and length calculation.

Null-terminating arbitrary data to turn it into a string is great--if your arbitrary data doesn't have any 0 bytes in it naturally.

Strtok is just as easily handled by handling strings as a 2-item struct: a size_t for length and a pointer (separating the size data from the character array). In fact, that would kick ass: you can non-destructively tokenize, pass around substring arguments, etc.

Storing size separately from the character data, rather than as a prefix to the character data, would indeed alleviate some of the potential problems of a size+data string. I could see myself using such an implementation in situations where the minimal overhead of allocating a separate area on the stack or heap for the size+pointer struct is negligible. Expanding strings would certainly be a lot easier, and it would open the posibility (as you mentioned) for complex data sharing among strings, similarly to Qt4.
On modern processors simple arithmetic like incrementing is effectively free. Control dependence is expensive. Data dependence is somewhere between. Fat pointers do not add any dependence. Storing the end has little cost given shadow registers.

If this had been adopted 20 years ago the likely outcome would be that cpus might have fractionally more integer resources. It wouldn't affect the wallclock speed of any code.

All of the operations you describe are done just as easily with fat pointers with the additional advantage of being applicable to truly arbitrary data, not just null clean strings.

It is really an instance of the same problem. Walther Brights "fat pointer" proposal would give you length embedded strings for free.
Length-embedded strings are little better than the 0-terminated ones. You cannot take a substring without copying.

The "fat pointer" approach has been used in D for nearly 10 years now, and has proven itself to be very effective.

You cannot take a substring without copying.

That's not a bug, it's a feature. Indeed, immutable strings should almost always be the default choice due to security concerns alone. C is optimized for the rare case, modern languages with immutable strings (like C#) are optimized for the common case, this is how it should be.

I think you missed the point. Even with immutable strings, some representations will allow you to take a substring without copying data and some will not. Length-prefixed strings do not.