| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by ktRolster 3208 days ago
	That particular problem (strlen/strcpy/memcpy) comes from the problems of the standard library string functions. It can be solved by creating your own string library. Then string manipulation is easy.

2 comments

rurban 3207 days ago

This problem was actually solved, but almost nobody uses it. Safe variants of most of those string, memory, io, wchar, stdlib and misc functions are defined in the C11 standard Annex K (finally after 9 years), but nobody is using it, and rather propose to keep using known unsafe variants like the truncating versions with an n. Like snprintf and not the safe variant sprintf_s. glibc, bsd, darwin, musl, newlib: nobody cares to implement the safe bounds checking variants. They solely rely on the compile time size checks, which fail to check any dynamic boundaries. Only Microsoft, Android, Cisco and Embarcadero implement the safe libc functions. I recently took over Cisco's safelibc (MIT licensed) and extended it to more platforms, all C11 api's, and an improved testsuite. And boy was I surprised to find so many missing API's, upstream libc bugs and wrong API's everywhere. Flawless were only musl and the BSD's. But musl is lacking with it's errno and of course zero C11. Only ReactOS has a proper testsuite for their libc. Glibc is somewhat ok, but I still find crashes daily.

https://github.com/rurban/safeclib

So why is nobody else implementing C11? I'll write a blog post when I finished my C11 efforts. Maybe at least FreeBSD will take it then.

pjmlp 3206 days ago

Annex K is not safe, just pretends to be.

By tracking the pointer and sizes as separate function arguments, the possibility of mixing parameters, leading to memory corruption is still there.

This is the major motivation why almost nobody uses it and it was made into an optional annex.

rurban 3204 days ago

No. The major motivation not to use it was _FORTIFY_SOURCE with it's compile checks for compile-time known buffer sizes and it's accompanying _chk functions. This leaves out all dynamic buffers.

You cannot mix PTR + LONG args without serious compile-time errors

pjmlp 3204 days ago

I don't have any idea how _FORTIFY_SOURCE works, other than it is GCC specific and as such no place in ANSI C.

What I know is that having something like strcpy_s() does not provide any actual safety, because with the prototype "strcpy_s(char * restrict s1, rsize_t s1max, const char * restrict s2)" there is no guarantee that s1max is a valid size for s1.

rurban 3202 days ago

This is what the _chk functions do. In most cases it know the compile-time size of s1. But in dynamic cases the _s functions are far better than the truncating 'n' versions. Read the rationale.

rurban 3207 days ago

Done here: https://rurban.github.io/safeclib/doc/safec-3.0/d1/dae/md_do...

WalterBright 3208 days ago

That falls over as soon as you integrate with anybody else's C code, including the operating system APIs, and with C string literals :-(

If it was as easy as you say, it would have happened.

And heaven knows I wrote my own string packages, one after the other, and so did everyone else. I eventually abandoned all of them. C's abstraction abilities are simply not good enough to do a decent string encapsulation.

wahern 3208 days ago

No other language solves this perfectly either, certainly not in a way that interoperates _across_ languages and environments.[1] Which is pretty much the whole point of the article. But what C excels at is the ability to write code which can examine and work with the representation of most string-like objects exported from any environment. The difficulty of doing so is a function of how opaque and complex the alien implementation.

I gave up on trying to solve strings in C applications a long time ago, too, much as you have. I did so not because I found C too inexpressive, but because I realized that I was trying to shoe-horn too many concepts into a "string". A string is almost by definition the wrong data structure--either too abstract or not abstract enough--for almost everything. Not coincidentally, that was about the same time I stopped abusing regular expressions for parsing data.

[1] Even C++ didn't solve this. We're still in the midst of a std::string ABI compatibility break in the C++ ecosystem. Granted, it's been about 12 years since the last one, but these last fairly long because systems software (i.e. infrastructure software) has a really long tail.

cozzyd 3208 days ago

Not to mention that in C++ there are plenty of string implementations predating std::string (e.g QT's QString, ROOT's TString)

ktRolster 3208 days ago

shrug It doesn't fall over. I've done it, the openBSD team has done it. DJB has done it. Maybe something is wrong with your implementation that I can help you with?

WalterBright 3208 days ago

I'm curious. Got links?

ktRolster 3208 days ago

OpenBSD takes a fairly minimalist approach, which is vaguely described here: http://www.freebsdforums.org/forums/showthread.php?threadid=... They basically replace the unsafe functions with things that are easier to use. Their idea is that it isn't the format of the C-string that causes security issues (null-terminated string), it's the poorly defined functions (with weird corner cases that are hard to get right). It's worked well for their use cases.

DJB did something similar in qmail, I don't recall the details but you can look at the source code as easily as I can, and it eliminated security problems.

When I'm working in Java, I find that most of my string parsing uses the split() function. This is a pain in C, because even if you had a split() function you'd need to deal with memory allocations. Most of these are solved with a memory pool. In my own library, I also added runtime, grammar-based parsing functionality. So to parse a CSV line you might do something like this:

    char *g = " S   -> WORD | WORD , S;"
              "WORD -> [^,]";
    results = parsegram(g, inputString);

Grammar parsing + memory pools makes string parsing in C easier than in Java. The biggest difficulty with this kind of library is to do it right, you need to be something of a unicode expert, and that's tough.

WalterBright 3208 days ago

I used snprintf(), too, but it is only a minor improvement. Problematic in C is something as simple as concatenating strings:

    Mystring s,t;
    t = "hello";
    t = cat(s,s);
    t = cat(s,s,s);
    t = cat("hello",s);
    t = cat(s,"world");
    t = cat("hello","world");

Even such a simple use case is fraught with major problems:

1. who allocates needed memory?

2. who free's it?

3. can the compiler constant fold cat("hello","world") ? Does the result wind up allocating memory anyway?

4. what about the lack of function overloading to handle the permutations?

JdeBP 3208 days ago

Here's roughly what that would look like using Bernstein's C string library (which was not only used in qmail).

    #include "stralloc.h"
    ...
    static stralloc s, t;
    ...
    if (!stralloc_ready(&s, 0)) die_nomem();

    if (!stralloc_copys(&t, "hello")) die_nomem();

    if (!stralloc_copy(&t, &s)) die_nomem();
    if (!stralloc_cat(&t, &s)) die_nomem();

    if (!stralloc_copy(&t, &s)) die_nomem();
    if (!stralloc_cat(&t, &s)) die_nomem();
    if (!stralloc_cat(&t, &s)) die_nomem();

    if (!stralloc_copys(&t, "hello")) die_nomem();
    if (!stralloc_cat(&t, &s)) die_nomem();

    if (!stralloc_copy(&t, &s)) die_nomem();
    if (!stralloc_cats(&t, "hello")) die_nomem();

    if (!stralloc_copys(&t, "hello")) die_nomem();
    if (!stralloc_cats(&t, "world")) die_nomem();

clarry 3208 days ago

> Even such a simple use case is fraught with major problems: > > 1. who allocates needed memory? > > 2. who free's it?

That's also a major feature. It allows people to write systems that are resilient in the face of tight memory limitations. It's not cool when a language forces string operations to allocate & duplicate memory willy-nilly.

> 3. can the compiler constant fold cat("hello","world") ? Does the result wind up allocating memory anyway?

I fail to see how that's a major problem. Why are you concatenating string literals? How common is that?

> 4. what about the lack of function overloading to handle the permutations?

I consider lack of overloading to be a feature. Overloading is one of the things that are way too easily abused, and it makes code auditing harder than it needs to be. Please just type out the different function names so I can see exactly what is going to be called when I read the code. Or use the sprintf family of variadic functions.

ktRolster 3208 days ago

I assume you're referring to OpenBSD here, they didn't use snprintf(). They used asnprintf(), which solves the problem of who should allocate (but not who should free).