by WalterBright 3208 days ago
That falls over as soon as you integrate with anybody else's C code, including the operating system APIs, and with C string literals :-(

If it was as easy as you say, it would have happened.

And heaven knows I wrote my own string packages, one after the other, and so did everyone else. I eventually abandoned all of them. C's abstraction abilities are simply not good enough to do a decent string encapsulation.


No other language solves this perfectly either, certainly not in a way that interoperates _across_ languages and environments.[1] Which is pretty much the whole point of the article. But what C excels at is the ability to write code which can examine and work with the representation of most string-like objects exported from any environment. The difficulty of doing so is a function of how opaque and complex the alien implementation is.

I gave up on trying to solve strings in C applications a long time ago, too, much as you have. I did so not because I found C too inexpressive, but because I realized that I was trying to shoe-horn too many concepts into a "string". A string is almost by definition the wrong data structure--either too abstract or not abstract enough--for almost everything. Not coincidentally, that was about the same time I stopped abusing regular expressions for parsing data.

[1] Even C++ didn't solve this. We're still in the midst of a std::string ABI compatibility break in the C++ ecosystem. Granted, it's been about 12 years since the last one, but these last fairly long because systems software (i.e. infrastructure software) has a really long tail.

Not to mention that in C++ there are plenty of string implementations predating std::string (e.g. Qt's QString, ROOT's TString).
*shrug* It doesn't fall over. I've done it, the OpenBSD team has done it, DJB has done it. Maybe something is wrong with your implementation that I can help you with?
I'm curious. Got links?
OpenBSD takes a fairly minimalist approach, which is vaguely described here: http://www.freebsdforums.org/forums/showthread.php?threadid=... They basically replace the unsafe functions with things that are easier to use. Their idea is that it isn't the format of the C-string that causes security issues (null-terminated string), it's the poorly defined functions (with weird corner cases that are hard to get right). It's worked well for their use cases.
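A minimal sketch of what "easier to use" can mean here, assuming strlcpy()-style semantics: the destination is always NUL-terminated, and the return value is the length of the source, so truncation is a single comparison. `bounded_copy` is a hypothetical name used for illustration; this is not OpenBSD's source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper with strlcpy()-style semantics (not OpenBSD's code).
 * Copies at most dstsize-1 bytes and NUL-terminates whenever dstsize > 0.
 * Returns strlen(src); a result >= dstsize means the copy was truncated. */
size_t bounded_copy(char *dst, const char *src, size_t dstsize)
{
    assert(dst != NULL && src != NULL);
    size_t srclen = strlen(src);
    if (dstsize > 0) {
        size_t n = srclen < dstsize - 1 ? srclen : dstsize - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return srclen;
}
```

A caller would check `bounded_copy(buf, input, sizeof buf) >= sizeof buf` to detect truncation, instead of reasoning about whether a terminator was written.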

DJB did something similar in qmail. I don't recall the details, but you can look at the source code as easily as I can, and it eliminated security problems.

When I'm working in Java, I find that most of my string parsing uses the split() function. This is a pain in C, because even if you had a split() function you'd need to deal with memory allocations. Most of these are solved with a memory pool. In my own library, I also added runtime, grammar-based parsing functionality. So to parse a CSV line you might do something like this:

    char *g = " S   -> WORD | WORD , S;"
              "WORD -> [^,]";
    results = parsegram(g, inputString);
Grammar parsing + memory pools makes string parsing in C easier than in Java. The biggest difficulty with this kind of library is that, to do it right, you need to be something of a Unicode expert, and that's tough.
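To make the memory-pool point concrete, here is a rough sketch of a split() whose fields all live in one pool, so a single call releases everything. The names `pool`, `pool_alloc`, and `pool_split` are hypothetical and are not taken from the library mentioned above; the pool is a fixed-size bump allocator and the split is byte-oriented (no Unicode awareness).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed-size pool: all allocations are released by pool_free(). */
typedef struct {
    char  *base;
    size_t used;
    size_t cap;
} pool;

int pool_init(pool *p, size_t cap)
{
    p->base = malloc(cap);
    p->used = 0;
    p->cap  = p->base ? cap : 0;
    return p->base != NULL;
}

void *pool_alloc(pool *p, size_t n)
{
    size_t align = sizeof(void *);
    n = (n + align - 1) / align * align;     /* round up to pointer size */
    if (n > p->cap - p->used) return NULL;   /* pool exhausted */
    void *out = p->base + p->used;
    p->used += n;
    return out;
}

void pool_free(pool *p)
{
    free(p->base);
    p->base = NULL;
    p->used = p->cap = 0;
}

/* Splits s on sep, copying each field into the pool.
 * Returns the field count, or -1 if the pool or the out array is too small. */
int pool_split(pool *p, const char *s, char sep, char **out, int max)
{
    assert(p != NULL && s != NULL && out != NULL && sep != '\0');
    int count = 0;
    for (;;) {
        const char *end = strchr(s, sep);
        size_t len = end ? (size_t)(end - s) : strlen(s);
        if (count == max) return -1;
        char *field = pool_alloc(p, len + 1);
        if (!field) return -1;
        memcpy(field, s, len);
        field[len] = '\0';
        out[count++] = field;
        if (!end) return count;
        s = end + 1;
    }
}
```

The caller never frees individual fields; it calls `pool_free()` once when the parsed line is no longer needed.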
I used snprintf(), too, but it is only a minor improvement. Problematic in C is something as simple as concatenating strings:

    Mystring s,t;
    t = "hello";
    t = cat(s,s);
    t = cat(s,s,s);
    t = cat("hello",s);
    t = cat(s,"world");
    t = cat("hello","world");
Even such a simple use case is fraught with major problems:

1. who allocates needed memory?

2. who frees it?

3. can the compiler constant fold cat("hello","world") ? Does the result wind up allocating memory anyway?

4. what about the lack of function overloading to handle the permutations?

Here's roughly what that would look like using Bernstein's C string library (which was not only used in qmail).

    #include "stralloc.h"
    ...
    static stralloc s, t;
    ...
    if (!stralloc_ready(&s, 0)) die_nomem();

    if (!stralloc_copys(&t, "hello")) die_nomem();

    if (!stralloc_copy(&t, &s)) die_nomem();
    if (!stralloc_cat(&t, &s)) die_nomem();

    if (!stralloc_copy(&t, &s)) die_nomem();
    if (!stralloc_cat(&t, &s)) die_nomem();
    if (!stralloc_cat(&t, &s)) die_nomem();

    if (!stralloc_copys(&t, "hello")) die_nomem();
    if (!stralloc_cat(&t, &s)) die_nomem();

    if (!stralloc_copy(&t, &s)) die_nomem();
    if (!stralloc_cats(&t, "hello")) die_nomem();

    if (!stralloc_copys(&t, "hello")) die_nomem();
    if (!stralloc_cats(&t, "world")) die_nomem();
Yes, that does work. But it's not without problems, not the least of which is that it's just not attractive to look at. For example, concatenating "hello" and "world" allocates memory, when it should instead give you a "helloworld" string literal. In fact, simply initializing `s` with a string literal needlessly allocates memory, and that's antithetical to performance. Calling die_nomem() leaks memory if it does anything but terminate the program. All those tests for memory exhaustion are tedious.
> Even such a simple use case is fraught with major problems:
>
> 1. who allocates needed memory?
>
> 2. who frees it?

That's also a major feature. It allows people to write systems that are resilient in the face of tight memory limitations. It's not cool when a language forces string operations to allocate & duplicate memory willy-nilly.

> 3. can the compiler constant fold cat("hello","world") ? Does the result wind up allocating memory anyway?

I fail to see how that's a major problem. Why are you concatenating string literals? How common is that?
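For reference, a small sketch of the one case where C does fold literals: adjacent string literals are joined during translation, so no runtime allocation or call is involved. This only covers literals written next to each other (directly or via a macro); it says nothing about what a compiler does with a call such as cat("hello","world"). The macro and variable names are made up for illustration, and `static_assert` assumes a C11 compiler.

```c
#include <assert.h>

/* Adjacent string literals are concatenated at translation time. */
#define GREETING "hello" "world"

/* 10 characters plus the terminating NUL, checked at compile time. */
static_assert(sizeof(GREETING) == 11, "literals were joined at compile time");

static const char greeting[] = GREETING;
```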

> 4. what about the lack of function overloading to handle the permutations?

I consider lack of overloading to be a feature. Overloading is one of the things that are way too easily abused, and it makes code auditing harder than it needs to be. Please just type out the different function names so I can see exactly what is going to be called when I read the code. Or use the sprintf family of variadic functions.
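As a sketch of the snprintf() route, assuming the caller owns the buffer (so nothing is allocated or freed by the helper): `concat2` is a hypothetical name, and the only library call is standard snprintf().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: joins a and b into a caller-owned buffer.
 * Returns 0 on success, -1 if the result did not fit. On truncation the
 * buffer still holds a NUL-terminated prefix of the result. */
int concat2(char *dst, size_t dstsize, const char *a, const char *b)
{
    assert(dst != NULL && dstsize > 0 && a != NULL && b != NULL);
    int n = snprintf(dst, dstsize, "%s%s", a, b);
    return (n >= 0 && (size_t)n < dstsize) ? 0 : -1;
}
```

With a stack buffer such as `char buf[16]`, `concat2(buf, sizeof buf, "hello", "world")` answers "who allocates" and "who frees" with "nobody", at the cost of a fixed upper bound on the result.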

It's the opposite. I've seen lots of code written in C that pretends to be out-of-memory safe. I've never once seen such a program that actually is out-of-memory safe. Invariably the code paths triggered by malloc returning null are never exercised.
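One way to exercise those paths, sketched under the assumption that the code under test allocates through a wrapper instead of calling malloc() directly: a countdown makes the Nth allocation fail, so a test can walk every failure point. `test_malloc`, `fail_after`, and `entry` are hypothetical names for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical test hook: fail_after successful allocations are allowed,
 * then every later one fails. -1 means never fail. */
static long fail_after = -1;

void *test_malloc(size_t n)
{
    if (fail_after == 0) return NULL;
    if (fail_after > 0) fail_after--;
    return malloc(n);
}

typedef struct { char *key; char *value; } entry;

static char *dup_str(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = test_malloc(n);
    if (p) memcpy(p, s, n);
    return p;
}

/* Returns 1 on success. On failure returns 0 and leaves nothing allocated. */
int entry_init(entry *e, const char *k, const char *v)
{
    assert(e != NULL && k != NULL && v != NULL);
    e->key   = dup_str(k);
    e->value = e->key ? dup_str(v) : NULL;
    if (!e->key || !e->value) {
        free(e->key);               /* the cleanup path that rarely runs */
        e->key = e->value = NULL;
        return 0;
    }
    return 1;
}

void entry_free(entry *e)
{
    free(e->key);
    free(e->value);
    e->key = e->value = NULL;
}
```

A test can then loop `fail_after` over 0, 1, ... and check that `entry_init()` fails cleanly at each step; running that loop under a leak checker is what actually tests the claim of OOM safety.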

With a GC and exceptions you can theoretically be quite resistant to OOM conditions, not that anyone really cares.

I assume you're referring to OpenBSD here; they didn't use snprintf(). They used asnprintf(), which solves the problem of who should allocate (but not who should free).
From the link:

"That means that we have been going through the tree cleaning out all calls to sprintf(), strcpy(), and strcat(). Instead, these things are being rewritten to use asprintf(), snprintf(), strlcpy(), and strlcat()."

Maybe the author made a typo.
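For readers unfamiliar with the allocate-on-format style, here is a rough sketch of an asprintf()-like helper built only from standard C99 calls (asprintf() itself is a BSD/GNU extension). `fmt_alloc` is a hypothetical name. It shows the trade-off mentioned above: the function decides who allocates, but the caller still has to remember to free().

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical asprintf()-style helper using only standard C99 functions.
 * Returns a malloc()ed string that the caller must free(), or NULL on error. */
char *fmt_alloc(const char *fmt, ...)
{
    assert(fmt != NULL);
    va_list ap, ap2;
    va_start(ap, fmt);
    va_copy(ap2, ap);                        /* arguments are walked twice */
    int n = vsnprintf(NULL, 0, fmt, ap);     /* first pass: measure only */
    va_end(ap);

    char *out = NULL;
    if (n >= 0 && (out = malloc((size_t)n + 1)) != NULL)
        vsnprintf(out, (size_t)n + 1, fmt, ap2);   /* second pass: format */
    va_end(ap2);
    return out;
}
```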