by asveikau 3372 days ago
Not OP, but I actually think string manipulation in C is really elegant. Many people who complain about it have too many allocations in their code and are trying to port the allocation-heavy non-C way of thinking to C. The C way I know focuses mainly on character at a time iteration with emphasis on not copying the source string.

I'm reminded of a time a colleague needed something like string.split, and working in c++ he filled a std::vector<std::string> with the result. Using a more C way he'd really only have needed a couple of pointers on the stack.
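The "couple of pointers" approach might look roughly like the sketch below. The names `struct token` and `next_token` are hypothetical (they do not come from the thread or from any library), and the sketch assumes a NUL-terminated input and a single-byte delimiter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A token is a view into the source string: no copy, no allocation. */
struct token {
    const char *start;  /* first byte of the token */
    size_t len;         /* number of bytes in the token */
};

/*
 * Hypothetical helper: stores the next delim-separated token in *out and
 * advances *cursor past it. Returns 1 if a token was produced, 0 once the
 * input is exhausted. Adjacent delimiters yield zero-length tokens.
 */
int next_token(const char **cursor, char delim, struct token *out)
{
    assert(cursor != NULL && out != NULL && delim != '\0');

    const char *p = *cursor;
    if (p == NULL)
        return 0;               /* the previous call consumed the last token */

    const char *end = strchr(p, delim);
    out->start = p;
    if (end != NULL) {
        out->len = (size_t)(end - p);
        *cursor = end + 1;      /* step over the delimiter */
    } else {
        out->len = strlen(p);
        *cursor = NULL;         /* mark the input as exhausted */
    }
    return 1;
}
```

A caller keeps one `const char *` cursor and one `struct token` on the stack and loops with `while (next_token(&cursor, ',', &tok))`. The tokens are not NUL-terminated, so they have to be handled with their length (for example, printed with `%.*s`).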


> The C way I know focuses mainly on character at a time iteration

Ah, yes, the good ol' C way that only works reliably in English.

It's a little naïve to think it works on English without the coöperation of the users. (Also, less glibly, things like emoji seem to be becoming more and more popular.)
Indeed. In fact as we speak I am fixing a bug related to string handling with emojis.
Funny.

You can parse UTF-8 a character at a time. Some characters advance the pointer by 4 bytes in an iteration, and some by fewer.

You can, and then you get a lead byte announcing a 4-byte character one byte before the end of your data, you skip over the null terminator and into the stack, and bang.

Yes, you can avoid this if you're careful and you understand the intricacies of utf-8 (or some other multi-byte encoding), but it very quickly stops being elegant.

What do you mean by "character"? If you mean code point or "unicode scalar value", sure, but if you mean user-visible character (grapheme), it's much more complicated: even something "simple" like ö could be one or two code points.
I mean your iterator is char* and you advance it by adding. That's it.

I do NOT mean that a char itself corresponds to a glyph or code point. You are seriously preaching to the choir by giving me that lecture.

>you advance it by adding

And when do you stop? UTF-8 strings can have zero bytes in them, so treating them as C strings is potentially error-prone, depending on the context.

> UTF-8 strings can have zero bytes in them

This is not true. A zero-byte in a utf-8 string is the null-terminator and utf-8 strings can be treated exactly like C strings in terms of where the string ends.

What you do need to look out for is malformed UTF-8: for example, a lead byte one byte before the null terminator that says the next character is 4 bytes long.

If you're not checking each byte for null and are just skipping based on the length indicated by the lead byte, then you're in for a crash.
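One way to advance without trusting the lead byte blindly is to check every following byte before stepping over it. This is a minimal sketch: `utf8_expected_len` and `utf8_next` are hypothetical names, and the code only checks the byte structure. It does not reject overlong encodings, surrogates, or values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Sequence length implied by a lead byte, or 0 if the byte cannot start one. */
size_t utf8_expected_len(unsigned char lead)
{
    if (lead < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* continuation byte or invalid lead */
}

/*
 * Returns a pointer to the start of the next sequence, or NULL if the
 * sequence at p is malformed or truncated. p must point at a non-NUL byte
 * inside a NUL-terminated string.
 */
const char *utf8_next(const char *p)
{
    assert(p != NULL && *p != '\0');

    size_t n = utf8_expected_len((unsigned char)*p);
    if (n == 0)
        return NULL;

    for (size_t i = 1; i < n; i++) {
        /* The NUL terminator is not a continuation byte (10xxxxxx), so this
         * test also stops the scan before it can run past the end. */
        if (((unsigned char)p[i] & 0xC0) != 0x80)
            return NULL;
    }
    return p + n;
}
```

Because the loop returns at the first byte that is not a continuation byte, it never reads beyond the terminator even when the lead byte promises more bytes than the string contains.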

Where utf-8 strings differ from C strings is slicing. You can't just slice the string at some random point without doing extra validation to make sure you only slice on codepoint boundaries.
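The boundary check for slicing can be sketched as below, assuming the input is already well-formed UTF-8. `utf8_is_continuation` and `utf8_floor_boundary` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Moves a proposed cut offset backwards until it no longer lands in the
 * middle of a multi-byte sequence. s holds len bytes of well-formed UTF-8.
 */
size_t utf8_floor_boundary(const char *s, size_t len, size_t cut)
{
    assert(s != NULL && cut <= len);

    while (cut > 0 && cut < len && utf8_is_continuation((unsigned char)s[cut]))
        cut--;
    return cut;
}
```

This only guarantees a cut on a code point boundary. It says nothing about grapheme boundaries, so a combining mark can still be separated from its base character.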

Apart from U+0000, no sequence of code points produces a 0x00 byte in UTF-8. I don't see this as a huge problem.

If you really do need it there are some C language libraries that use "pascal-ish" structs to do strings. UNICODE_STRING in Windows comes to mind. Doing strings in C doesn't force you to use C strings, it's just the most common thing to do.

It's the same as in ASCII: a zero byte in UTF-8 is NUL.
What are combinators?

Go parse some zalgo with your 4 per iteration algorithm. I'll be there, waiting and laughing.

C string handling is not elegant, nor does it fit the realities of the world.

So it's somehow C's fault that Unicode uses variable-length encoding, which is automatically going to be harder to process correctly at a byte-by-byte level than a fixed-length method, and also included known-C-incompatible null bytes?
> So it's somehow C's fault that Unicode uses variable-length encoding

Parent said string handling in C was elegant. My point is that it becomes fraught with (even more) issues once you throw non-English language at it.

It is C's decision to handle strings in this way, and the decision of many C programmers to treat all strings as if they are just iterable character pointers.

It's a recipe for bugs.

I am the parent you are talking about. I've made this argument many times with people: Unicode is crazy complicated in any programming language. People think that widening the char width will help. Well, you seem to be somebody who knows Unicode, so you probably know the horrors of surrogates, combining characters vs. pre-composed diacritics, zero-width joiners, Han unification, variation selectors, BiDi... Dealing with all that nonsense is in no way just a C thing. I've not seen any language or library that I'd say does it "well" and saves individual programmers from considering the above. They all punt the issue to the programmer.

I've heard (mostly here) that Swift does something different and treats glyphs as the basic unit. I haven't had a chance to look at precisely what that does. Given all the issues I've seen elsewhere I'm skeptical that someone, anyone can pull that off correctly.

UTF-8 at least has one elegance (there's that word again) in its design: you can do some "dumb" ASCII things, and if your code does not know what to do with fancy Unicode, you can check the high bit of any given octet and "safely" skip over it and any adjacent non-ASCII sequence. This may or may not be applicable to the task at hand.
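As a small illustration of the high-bit trick, the sketch below uppercases only ASCII letters and leaves every byte of a multi-byte sequence alone. `ascii_upper_utf8` is a hypothetical name, and the sketch assumes a NUL-terminated, mutable buffer.

```c
#include <assert.h>
#include <string.h>

/* Uppercases ASCII a-z in place; bytes with the high bit set are untouched. */
void ascii_upper_utf8(char *s)
{
    assert(s != NULL);

    for (; *s != '\0'; s++) {
        unsigned char b = (unsigned char)*s;
        if (b & 0x80)
            continue;   /* lead or continuation byte of a non-ASCII sequence */
        if (b >= 'a' && b <= 'z')
            *s = (char)(b - 'a' + 'A');
    }
}
```

This works because every byte of a multi-byte UTF-8 sequence has the high bit set, so no such byte can be mistaken for an ASCII letter. Non-ASCII letters such as ï are simply left in lowercase, which may or may not be acceptable.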

> This is in no way just a C thing to deal with all that nonsense. I've not seen any language or library that I'd say does it "well" and saves individual programmers from considering the above.

This is true; however, even something as simple as storing the (byte) length as part of the string reduces the complexity and the likelihood of bugs.
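A length-carrying string in C could be sketched as a (pointer, length) pair like the one below. `struct str_view` and the `sv_*` helpers are hypothetical names, not an existing library's API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical non-owning view: a pointer plus an explicit byte length. */
struct str_view {
    const char *data;
    size_t len;
};

/* Wraps a NUL-terminated string; the length is computed once, up front. */
struct str_view sv_from_cstr(const char *s)
{
    assert(s != NULL);
    struct str_view v = { s, strlen(s) };
    return v;
}

/* Returns the sub-view [start, start + count), clamped to the view's bounds. */
struct str_view sv_sub(struct str_view v, size_t start, size_t count)
{
    if (start > v.len)
        start = v.len;
    if (count > v.len - start)
        count = v.len - start;
    struct str_view out = { v.data + start, count };
    return out;
}

/* Byte-wise equality; no terminator is needed because lengths are known. */
int sv_equal(struct str_view a, struct str_view b)
{
    return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

With the length stored, a substring is just another view, and an out-of-range request is clamped instead of walking off the end of the buffer. The view does not own its bytes, so the underlying string has to outlive it.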

Other languages also prevent accidental buffer overruns so while they still need to deal with all the same Unicode problems you mentioned, the program likely won't crash if the programmer gets things wrong. The same is not necessarily true of C.

FWIW in Rust you also tend to avoid allocations, since all string manipulation is done via slices -- safe (ptr, len) pairs. It's pretty neat.

IIRC C++ is getting slices too, so it might be able to get better APIs around string manipulation. But I've seen decent string manipulation code that avoided allocations.

> Using a more C way he'd really only have needed a couple of pointers on the stack.

This is pretty much how it's done in Rust too via slices. For example, the standard way to split a string is to create an iterator and it won't do any allocations.