by lelanthran 1615 days ago
I got tired of running into this problem and decided to simply eat the cost of using `char *` in my string library.

And that is why most such efforts eventually die.

WG14 could naturally work toward something like SDS for strings and arrays, but of course doing that is outside their goals.

> WG14 could naturally work toward something like SDS for strings and arrays, but of course doing that is outside their goals.

Maybe it is, but even if it weren't, sds strings are a poor choice. I used them extensively in a private project and ran into these problems:

1. Typedef'ing `sds` to a pointer type. This leaves no indication to the reader of the code that any `sds`-typed variable needs an `sdsfree`. IOW, for every other standard type it is clear when the data object needs a `free`, `fclose`, etc. This is a big deal: it's difficult to change the typedef for sds because of the way it returns pointers.

2. Not compatible with current string functions, strike 1: storing binary data in the strings, such as the nul character, makes them silently lose data when used with current string functions that accept `const char *`. This is a very big deal!

3. Not compatible with current string functions, strike 2: an sds string is only compatible with current string functions that take a `const char *`. This isn't such a big deal (for example, it provides a replacement for `strtok` as the standard `sds` type won't work for `strtok`) but it's unnecessarily incompatible.

4. With the current way it's exposed to a caller, you cannot use `const sds` variables anywhere, which removes a lot of compiler-checking. Trying to use `const` on any sds variable is pointless as you get none of the error-checking.
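
A rough sketch of points 1, 2 and 4, in case it helps. It uses a hypothetical `mystr` type that imitates the general sds idea (a length header stored just before the character data). None of these names are the real sds API, and the real layout differs in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sds-style string: the typedef hides the pointer (point 1). */
typedef char *mystr;

struct mystr_hdr {
    size_t len;   /* number of payload bytes, excluding the trailing NUL */
    char buf[];   /* payload; callers only ever see a pointer to this */
};

/* Build a string from `len` raw bytes, which may contain NUL bytes. */
mystr mystr_newlen(const void *data, size_t len)
{
    struct mystr_hdr *h = malloc(sizeof *h + len + 1);
    if (h == NULL)
        return NULL;
    h->len = len;
    memcpy(h->buf, data, len);
    h->buf[len] = '\0';
    return h->buf;
}

/* Recover the stored length by stepping back to the header. */
size_t mystr_len(const mystr s)
{
    assert(s != NULL);
    const struct mystr_hdr *h =
        (const struct mystr_hdr *)(s - offsetof(struct mystr_hdr, buf));
    return h->len;
}

/* Must be used instead of free(); nothing in the type says so (point 1). */
void mystr_free(mystr s)
{
    if (s != NULL)
        free(s - offsetof(struct mystr_hdr, buf));
}

/* Point 4: `const mystr` means `char *const`, not `const char *`, so this
   write to the characters compiles without any diagnostic. */
void mystr_clobber(const mystr s)
{
    assert(s != NULL);
    s[0] = '?';
}
```

With a string built as `mystr_newlen("ab\0cd", 5)`, `mystr_len` reports 5 while `strlen` reports 2, which is the silent data loss from point 2. Passing the result to plain `free` instead of `mystr_free` would compile cleanly and corrupt the heap.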

While sds solves many problems with raw C strings, those problems can be solved by adding standard library functions that work with existing C strings. In addition, sds adds a few problems of its own.

"C strings" really aren't anything worth talking about. People take them way too seriously and then complain that they are "unsafe" or "hard to use". Look, C gives you memory to work with and the rest is up to you. Almost the only thing you want from C with regards to strings is string literals.

It should be obvious that most "string" APIs from libc like strcat, strcpy, but especially strtok are ridiculously bad and are only in the libc because of history. Don't use them.
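
To make the `strtok` complaint concrete: it writes NUL bytes into its input and keeps hidden static state between calls. Below is a minimal sketch of a tokenizer that does neither and hands back pointer-plus-length slices. `struct slice` and `next_token` are names made up for this example, not a libc API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A non-owning view into someone else's buffer (hypothetical type). */
struct slice {
    const char *ptr;
    size_t len;
};

/* Return the next token in *cursor, delimited by any byte in `delims`,
   and advance *cursor past it. A result with ptr == NULL and len == 0
   means the input is exhausted. The input is never modified. */
struct slice next_token(const char **cursor, const char *delims)
{
    assert(cursor != NULL && *cursor != NULL && delims != NULL);

    const char *p = *cursor + strspn(*cursor, delims); /* skip delimiters */
    size_t n = strcspn(p, delims);                      /* measure token */

    struct slice tok = { n ? p : NULL, n };
    *cursor = p + n;
    return tok;
}
```

Because all state lives in the caller's cursor, this works on string literals and can be used for several inputs at once, neither of which is true of `strtok`.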

Even strlen() is rarely a good idea to use, and you can (should?) replace strlen("abc") by sizeof "abc" - 1.
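
A small sketch of the `sizeof` trick. The macro name `LIT_LEN` and the helper `has_http_prefix` are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of a string literal, computed at compile time.
   Only valid for literals: given a `char *`, sizeof would yield the
   pointer size instead. Pasting "" on both sides makes most
   non-literal arguments a compile error. */
#define LIT_LEN(s) (sizeof("" s "") - 1)

/* The result is a constant expression, so it can be checked statically. */
static_assert(LIT_LEN("abc") == 3, "literal length is known at compile time");

/* Return nonzero if `str` begins with the literal "http://". */
int has_http_prefix(const char *str)
{
    assert(str != NULL);
    return strncmp(str, "http://", LIT_LEN("http://")) == 0;
}
```

One caveat: for a literal with an embedded NUL such as `"a\0b"`, `sizeof` counts all three characters while `strlen` stops at the first NUL, so the two are only interchangeable for ordinary literals.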

My point regarding WG14 wasn't to add SDS as they are, rather vocabulary types for strings and arrays in the same spirit as SDS.

When they exist as vocabulary types, the ecosystem can rely on their existence and slowly adopt their use, similarly to threads support introduction in C11, for example.

> My point regarding WG14 wasn't to add SDS as they are, rather vocabulary types for strings and arrays in the same spirit as SDS.

Well, yes, I'd love to see some proper string support too, so at least we're in agreement about that :-)

But, overhauling C with additional (memory-safe) array types and string types that are nonetheless still compatible with legacy uses is probably a non-starter anyway. The only way forward would be to add a new type that isn't compatible, which is unpalatable to a lot of people (myself included).

Adding memory-safe functions and/or semantics is easier, but will probably not cover 100% of the memory-safety desired.

> When they exist as vocabulary types, the ecosystem can rely on their existence and slowly adopt their use, similarly to threads support introduction in C11, for example.

Threads, I feel, are a poor example for two reasons: 1) Hardly any code uses the `thrd_t` type, for a variety of reasons, and 2) There was no need for a `thrd_t` type to be backward compatible with anything.

For full memory safety with C, the only option is C Machines, meaning hardware memory tagging.

Already in use for a decade in Solaris SPARC, and eventually mainstream across all variations of ARM CPUs.

Unfortunately, Intel botched their MPX implementation and now it is gone.

Apart from plain old fixed buffers, which is what is supported by C just fine and which covers 99% of string processing needs in the areas that C as a language is suited for anyway, ... there are 14 known ways of doing "strings" depending on circumstance, so I don't think it would be a good idea to introduce one mandatory version of them into the C standard. There is already C++ which has std::string, and there are a lot of GC'ed and scripting languages that are more suited for quick and dirty string processing.

The fact that C++ was able to eventually standardize on a single string type (despite the same mess of many dozens of incompatible implementations) shows that it is possible and desirable. It's not like raw buffers will go anywhere if you add a higher-level type. Nor does it have to be perfect - only "good enough" for use across the API surfaces of various libraries.

Just because it's possible to standardize on a string type in C it doesn't mean it's desirable. Also consider that it's not possible to copy C++'s string type because its ergonomics build heavily on RAII.

'const char *' arguments work just fine as parameters in libraries, and I don't see much of a use case (and instead more hazards) for a library that "resizes" a string argument destructively (like std::string does). The typical way to go about this is for the library to make a copy of the input string. On API boundaries, for memory that is needed longer than the function call lifetime, it is almost always an excellent idea to simply copy it. For data that doesn't make sense to copy (be it because of size or because only one side really needs it), the data should instead simply be created on the right side of the fence from the beginning.
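
The copy-at-the-boundary idea can be sketched roughly as below. `struct widget`, `widget_set_name` and `widget_destroy` are hypothetical names standing in for any library object that needs a string beyond the lifetime of the call.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical library object that keeps a name after the call returns. */
struct widget {
    char *name;   /* owned by the widget; released in widget_destroy() */
};

/* Copy the caller's string instead of keeping the caller's pointer.
   Returns 0 on success, -1 on allocation failure (widget left unchanged). */
int widget_set_name(struct widget *w, const char *name)
{
    assert(w != NULL && name != NULL);

    size_t n = strlen(name) + 1;   /* include the terminating NUL */
    char *copy = malloc(n);
    if (copy == NULL)
        return -1;
    memcpy(copy, name, n);

    free(w->name);                 /* free(NULL) is a no-op */
    w->name = copy;
    return 0;
}

/* Release the widget's private copy. */
void widget_destroy(struct widget *w)
{
    assert(w != NULL);
    free(w->name);
    w->name = NULL;
}
```

The caller stays free to reuse, modify or release its own buffer right after the call, and the library never has to reason about who owns the argument.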

I don't see myself needing a standardized string type because I'm not passing around string "objects", or concatenating them, as would be done in quick and dirty scripts. I honestly can't recall where that kind of thing would have been a good idea for my work in the last couple of years, and I'm much in favour of not growing the standard out of proportion. As said, if you desire C++ kind of ergonomics and want to solve more scripting-like tasks, there is already C++ and a ton of other languages.

What I can recall is skimming through a lot of C projects over the years that try to do object-oriented and scripting-type programming in C (often the code wanted to be C++ or Java or even Python but had to be C for some external reason), and that code is always, invariably, an unmaintainable mess where it's impossible to have any level of confidence that there are no memory errors and leaks. C is simply not suited for that style of programming.