Hacker News new | ask | show | jobs
by noam_k 812 days ago
Very cool stuff!

I haven't done much C development lately, so I'm curious how often `strcpy` and `strcat` are used. Last I checked they're almost as big no-nos as using goto. (Yes, I know goto is often preferred in kernel dev...) Can anyone share on how helpful the c-string analyses are to them?

6 comments

The use of goto is unambiguously correct and elegant in some contexts. Unwavering avoidance of goto can lead to unnecessarily ugly, convoluted code that is difficult to maintain. It usually isn't common but it has valid uses.

While use of functions like `strcpy` are less advisable, there are contexts in which they are guaranteed to be correct unless other strong (e.g. language-level) invariants are broken, in which case you have much bigger problems. In these somewhat infrequent cases, there is a valid argument that notionally safer alternatives may be slightly less efficient for no benefit.

strcpy and friends don't really have any benefits beyond just being there. The "safer" versions are still unsafe in many cases, while being less performant and more annoying to use.

Writing a strbuffer type and associated functions isn't particularly hard and the resulting interface is nicer to use, safer, and more efficient.

I argue strview (non-owning) is almost always what is needed. Most of string operations are searching and slicing.
You also need a strview. Not really relevant for avoiding strcpy and strcat though.
> The use of goto is unambiguously correct and elegant in some contexts.

For C, absolutely. For C++, it's likely a footgun.

It has fewer use cases in C++ but it still has use cases where the alternatives are worse.
What is a C++ use case where RAII doesn't solve the problem better? I imagine one exists, but I've never encountered it in 20 years. Conversely, I've seen it used inappropriately for cleanup many times (which would be fine in C).
Some usage of goto is still idiomatic in C if used in ways logically equivalent to structured programming constructs C lacks. It requires some care, but I mean, it's C.

(I'm not however fond at all of longjmp)

> (I'm not however fond at all of longjmp)

I don't think there is any justifiable reason to use setjmp/longjmp in modern C code. At best it's a crude imitation of throw/catch semantics; if you really want that, C++ has a real implementation.

There's nothing wrong with simple usages of goto.

The strxcpy family on the other hand is complete garbage and should never be used for any reason. I'm horrified that they're used in the kernel at all. All of those functions (and every failed attempt at "fixing" them) should have been nuked from orbit.

This is the approach taken in git https://github.com/git/git/blob/master/banned.h
> There's nothing wrong with simple usages of goto

Indeed a like a few gotos here and there for doing cleanup toward the end of the function.

Or to break out of nested loops. The problem is with unstructured goto spaghetti making the code impossible to follow without essentially running it in your head (or a debugger).

Goto + Switch (or the GCC computed goto extension) is also a wonderful way to implement state machines.

What's wrong with `strncpy`?
strncpy won't always write a trailing nul byte, causing out of bounds reads elsewhere. It's a nasty little fellow. See the warning at https://linux.die.net/man/3/strncpy

strlcpy() is better and what most people think strncpy() is, but still results in truncated strings if not used carefully which can also lead to big problems.

Speaking of strlcpy, Linus has some colorful opinions on it:

> Note that we have so few 'strlcpy()' calls that we really should remove that horrid horrid interface. It's a buggy piece of sh*t. 'strlcpy()' is fundamentally unsafe BY DESIGN if you don't trust the source string - which is one of the alleged reasons to use it. --Linus

Maybe strscpy is finally the one true fixed design to fix them all. Personally I think the whole exercise is one of unbeliavable stupidity when the real solution is obvious: using proper string buffer types with length and capacity for any sort of string manipulation.

> the real solution is obvious

If it were obvious it would have been done already. Witness the many variants that try to make it better but don't.

> using proper string buffer types with length and capacity

Which you then can't pass to any other library. String management is very easy to solve within the boundaries of your own code. But you'll need to interact with existing code as well.

> If it were obvious it would have been done already. Witness the many variants that try to make it better but don't.

Every other language with mutable strings, including C++, does it like that. It is obvious. The reason it is not done in C is not ignorance, it is laziness.

> Which you then can't pass to any other library. String management is very easy to solve within the boundaries of your own code. But you'll need to interact with existing code as well.

Ignoring the also obvious solution of just keeping a null terminator around (see: C++ std::string), you should only worry about it at the boundary with the other library.

Same as converting from utf-8 to utf-16 to talk to the Windows API for example.

For me the "real" solution looks something like this:

    ssize_t strxcpy(char* restrict dst, const char* restrict src, ssize_t len)
Strxcpy copies the string from src to dst. The len parameter is the number of bytes available in the dst buffer. The dst buffer is always terminated with a null byte, so the maximum length of string that can be copied into it is len - 1. strxcpy returns the number of characters copied on success, but can return the following negative values:

    E_INVALID_PARAMETER: Ether dst or src are NULL or len < 1, no data was copied
    W_TRUNCATED: len - 1 bytes were copied but more characters were available in src.
strxcat would work similarly. I have not decided if the return value should include the terminating null or not.
Wow yeah this seems to summarize well the usual api flakiness and just shuffling of C

It seems people come with "one more improvement" that's broken in one way or the other

The problem with strlcpy is the return value. You can be burned badly if you are using it to for example pull out a fixed chunk of string from a 10TB memory mapped file, especially if you're pulling out all of the 32 byte chunks from that huge file and you just wanted a function to stick the trailing 0 on the string and handle short reads gracefully.

It's even worse if you are using it because you don't fully trust the input string to be null terminated. Maybe you have reasons to be believe that it will be at least as long as you need, but can't trust that it is a real string. As a function that was theoretically written as "fix" for strncpy it is worse in some fundamental ways. At least strncpy is easy enough to make safe by always over-allocating your buffer by 1 byte and stuffing a 0 in the last byte.

strncpy() also zero pads the entire buffer. If it's significantly larger than the copied string you're wasting cycles on pointless move operations for normal, low-security string handling. This behavior is for filling in fixed length fields in data structures. It isn't suitable for general purpose string processing.
#define strncpyz(d,s,l) *(strncpy(d,s,l)+(l))=0

Of course this one is unsafe for macro expansion. But well, its C :)

I'd rather put the final nul at d+l-1 than at d+l, so that l can be the size of d, not "one more than the size of d":

  strncpyz(buf,src,sizeof buf);
As others have already pointed out it, it doesn't guarantee that the result is null-terminated. But that's not the only problem! In addition, it always pads the remaining space with zeros:

    char buf[1000];
    strncpy(buf, "foo", sizeof(buf));
This writes 3 characters and 9997 zeros. It's probably not what you want 99% of the time.
It's not possible to use it safely unless you know that the source string fits in the destination buffer. Every strncpy must be followed by `dst[sizeof dst - 1] = 0`, and even if you do that you still have no idea if you truncated the source string, so you have to put in a further check.

    strncpy (dst, src, (sizeof dst) - 1);
    dst[(sizeof dst) - 1] = 0;
    int truncated = strlen (dst) - strlen (src);
Without the extra two lines after every strncpy, you're probably going to have a a hard to discover transient bug.
if you really want to use standard C string functions, use instead:

    int ret = snprintf(dst, sizeof dst, "%s", src);
    if (ret >= n || ret < 0)
    {
        /* failed */
    }
or as a function:

    bool ya_strcpy(const char* s, char* d, size_t n)
    {
        int cp = snprintf(d, n, "%s", s);
        bool ok = cp >= 0 && cp < n;
        ok ? *s = *s : 0;
        return ok;
    }
snprintf only returns negative if an "encoding error" occurs, which has to do with multi-byte characters.

I think for that to possibly happen, you have to be in a locale with some character encoding in effect and snprintf is asked to print some multi-byte sequence that is invalid for that encoding.

Thus, I suspect, if you don't call that "f...f...frob my C program" function known as setlocale, it will never happen.

> Thus, I suspect, if you don't call that "f...f...frob my C program" function known as setlocale, it will never happen.

Of all the footguns in a hosted C implementation, I believe setlocale (and locale in general) is so broken that even compilers and library developers can't workaround it to make it safe.

The only other unfixable C-standard footgun that comes close, I think, are the environment-reading-and-writing functions, but at least with those, worst-case is leaking a negligible amount of memory in normal usage, or using an old value even when a newer one is available.

I actually do use `snprintf()` and friends.
except no one does that return code check and worse they often use the return code to advance a pointer in concatenated strings
`strncpy` is commonly misunderstood. It's name misleads people into thinking it's a safely-truncating version of `strcpy`. It's not.

I've seen a lot of code where people changed from `strcpy` to `strncpy` because they thought that was safety and security best practice. Even sometimes creating a new security vulnerability which wasn't there with `strcpy`.

`strncpy` does two unexpected things which lead to safety, security and performance issues, especially in large codebases where the destination buffers are passed to other code:

• `strncpy` does NOT zero-terminate the copied string if it limits the length.

Whatever is given the copied string in future is vulnerable to a buffer-read-overrun and junk characters appended to the string, unless the reader has specific knowledge of the buffer length and is strict about NOT treating it as a null-terminated string. That's unusual C, so it's rarely done correctly. It also doesn't show up in testing or normal use, if `strnlen` is "for safety" and nobody enters data that large.

• `strncpy` writes the entire destination buffer with zeros after the copied string.

Usually this isn't a safety and security problem, but it can be terrible for performace if large buffers are being used to ensure there's room for all likely input data.

I've seen these issues in large, commercial C code, with unfortunate effects:

The code had a security fault because under some circumstances, a password check would read characters after the end of a buffer due to lack of a zero-terminator, that authors over the years assumed would always be there.

A password change function could set the new password to something different than the user entered, so they couldn't login after.

The code was assumed to be "fast" because it was C, and avoided "slow" memory allocation and a string API when processing strings. It used preallocated char arrays all over the place to hold temporary strings and `strncpy` to "safely" copy. They were wrong: It would have run faster with a clean string API that did allocations (for multiple reasons, not just `strncpy`).

Those char arrays had the slight inconvenience of causing oddly mismatched string length limits in text fields all over the place. But it was worth it for performance, they thought. To avoid that being a real problem, buffers tended to be sized to be "larger" than any likely value, so buffer sizes like 256 or 1000, 10000 or other arbitrary lengths plucked at random depending on developer mood at the time, and mismatched between countless different places in the large codebase. `strncpy` was used to write to them.

Using `malloc`, or better a proper string object API, would have run much faster in real use, at the same time as being safer and cleaner code.

Even worse, sometimes strings would be appended in pieces, each time using `strncpy` with the remaining length of the destination buffer. That filled the destination with zeros repeatedly, for every few characters appended. Sometimes causing user-interactions that would take milliseconds if coded properly, to take minutes.

Ironically, even a slow scripting language like Python using ordinary string type would have probably run faster than the C application. (Also Python dictionaries would have been faster than the buggy C hash tables in that application which took O(n) lookup time, and SQLite database tables would have been faster, smaller and simpler than the slow and large C "optimised" data structures they used to store data).

It doesn't guarantee that the output is null terminated. Big source of exploits.
gotos are fine if used judiciously. strcpy and strcat are “fine” in that they work when you know your code is correct and you have big problems if you don’t. But this describes most of C, unfortunately.
> gotos are fine if used judiciously

Is there a language feature that is not? :)

If you use trigraphs in your code I will be very upset
> Last I checked they're almost as big no-nos as using goto.

I don't think so. Gotos are fine, strcat and strcpy without a malloc with the correct size in the same scope is a code smell.

> Last I checked they're almost as big no-nos as using goto.

Huh? Why is goto a no-no? It is there for good reason. I think we all agree with Dijkstra that, in his words, unbridled gotos are harmful, but C's goto is most definitely bridled. I doubt any language created in the last 50+ years has unbridled gotos. That's an ancient programming technique that went out of fashion long ago (in large part because of Dijkstra).

Languages other than C give you options for flow control so that you don't need goto for that. It is a spectrum, if you only use goto to jump to the end of a small function on error it is okay, though I prefer something better in my language. I've seen 30,000 line functions with gotos used for flow control (loops and if branches) - something you can do in C if you are really that stupid and I think we will all agree is bad. This 30,000+ line function with gotos as flow control was a lot more common in Dijkstra's day.
We all agree that you shouldn't write bad code. Not using goto, not using any language construct.

But when unbridled gotos were the only tool in the toolbox, bad code was an inevitability in a codebase of any meaningful size. Not even the best programmer was immune. This is what the "Go to statement considered harmful" paper was about.

It was written in 1968. We listened. We created languages that addressed the concerns raised and moved forward. It is no longer relevant. Why does it keep getting repeated in a misappropriated way?

In 1968 they had better languages and programmers were still using goto for control in them despite better options.
Of course. The ideas presented in said paper went back at least a decade prior, but languages were still showing up with unbridled gotos despite that. But that has changed in the meantime. What language are you or anyone you know using today that still has an unbridled goto statement?
> Languages other than C give you options for flow control so that you don't need goto for that.

The idiom `if (error) goto cleanup` is about the only thing I see goto used for. What flow control replaces that other than exceptions?

Jumping out of nested loops. Implementing higher level constructs like yield or defer. State machines. Compiler output that uses C as a "cross-platform" assembly language.

All of them are better served with more specialized language constructs but as a widely applicable hammer goto is pretty nice.

I don't expect C to have good error handling or generators any time soon but with goto I can deal with it.

I'm actually familiar with this, having used libprotothreads in production for about 4 years.

Something like libprotothreads can't actually be implemented in a language that doesn't have gotos, so yeah, I see the need for it.

Compiling HLL constructs in some of those scenarios ultimately produces a jump statement. So, it makes sense that a higher-level version of a jump would be helpful in the same situations.
> What flow control replaces that other than exceptions?

defer has gained in popularity for that situation.

RAII + destructors

Though gcc supports cleanup functions, just not very ergonomically.

> 30,000 line functions with gotos

The problem there is the 30K line function, not the goto!

30k functions are a problem but they are manageable if goto isn't used in them. I prefer not to but a have figured them out.
Wow! Longest single function I can think of having written is ~200 lines. I always feel bad when editing it but there's no useful way to break it down so I let it be. But a single 30,000 line function? Wow.
I'll take a 30k line function that does one thing over 30 1k line functions that are used once...
Agreed! Breaking into multiple functions for no reason other than style isn't smart either.
goto used in certain idiomatic ways (e.g. to jump to cleanup code after an error, or to go to a `retry:` label, or to continue or break out of a multiply nested loop) is fine. What's annoying is bypassing control flow with random goto spaghetti.