by floatboth 3372 days ago
Have you tried doing string manipulation in C? ;)

Not OP, but I actually think string manipulation in C is really elegant. Many people who complain about it have too many allocations in their code and are trying to port the allocation-heavy non-C way of thinking to C. The C way I know focuses mainly on character at a time iteration with emphasis on not copying the source string.

I'm reminded of a time a colleague needed something like string.split, and working in C++ he filled a std::vector<std::string> with the result. Doing it in a more C way, he'd really only have needed a couple of pointers on the stack.
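To make the "couple of pointers on the stack" idea concrete, here is a minimal sketch of an allocation-free split. The names `struct field` and `next_field` are made up for illustration; this is not what the colleague's code looked like, just one way the approach could go. Each field is reported as a pointer into the source string plus a byte length, so nothing is copied.

```c
#include <stddef.h>

/* A field is a view into the source string: start pointer plus byte length. */
struct field {
    const char *start;
    size_t len;
};

/*
 * Hypothetical helper: yields the next field of a delimiter-separated,
 * NUL-terminated string without copying or allocating.
 * *cursor points at the start of the next field, or is NULL once the
 * input is exhausted. Returns 1 and fills *out when a field is produced,
 * 0 when there are no more fields.
 */
static int next_field(const char **cursor, char delim, struct field *out)
{
    if (*cursor == NULL)
        return 0;

    const char *start = *cursor;
    const char *p = start;

    /* Walk forward until the delimiter or the terminator. */
    while (*p != '\0' && *p != delim)
        p++;

    out->start = start;
    out->len = (size_t)(p - start);

    /* Step past the delimiter, or mark the input as finished. */
    *cursor = (*p != '\0') ? p + 1 : NULL;
    return 1;
}
```

A caller would keep a `const char *cursor` and a `struct field` on the stack and loop while `next_field` returns 1. Note that the fields are not NUL-terminated, so they have to be printed or compared using their length.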

> The C way I know focuses mainly on character at a time iteration

Ah, yes, the good ol' C way that only works reliably in English.

It's a little naïve to think it works on English without the coöperation of the users. (Also, less glibly, things like emoji seem to be becoming more and more popular.)
Indeed. In fact as we speak I am fixing a bug related to string handling with emojis.
Funny.

You can parse utf-8 character at a time. Some characters advance the pointer by 4 at an iteration and some less.

You can, and then you get a 4-byte long character 1-byte before the end of your data, you skip over the null-terminator and into the stack, and bang.

Yes, you can avoid this if you're careful and you understand the intricacies of utf-8 (or some other multi-byte encoding), but it very quickly stops being elegant.
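For what it's worth, the careful version might look something like the sketch below. `utf8_expected_len` and `utf8_advance` are hypothetical names, and the code assumes a NUL-terminated buffer. It only guards against running past the terminator on a truncated sequence; it does not reject overlong encodings or surrogates, so it is not a full validator.

```c
#include <stddef.h>

/* Sequence length implied by the lead byte, or 0 if the byte
 * cannot start a sequence (continuation byte or invalid lead). */
static size_t utf8_expected_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

/*
 * Advance past one sequence without trusting the lead byte.
 * Instead of blindly adding 2, 3, or 4, each following byte is checked
 * to be a continuation byte (10xxxxxx). The NUL terminator is not a
 * continuation byte, so a truncated sequence stops at the terminator
 * rather than jumping over it.
 */
static const char *utf8_advance(const char *p)
{
    if (*p == '\0')
        return p;                       /* already at the end */

    size_t n = utf8_expected_len((unsigned char)*p);
    p++;                                /* consume the lead (or stray) byte */

    for (size_t i = 1; i < n; i++) {
        if (((unsigned char)*p & 0xC0) != 0x80)
            break;                      /* truncated or malformed: stop here */
        p++;
    }
    return p;
}
```

With a string that ends in a lone 0xF0 lead byte, `utf8_advance` lands on the terminator instead of three bytes past it. Whether that still counts as elegant is left to the reader.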

What do you mean by "character"? If you mean code point or "unicode scalar value", sure, but if you mean user-visible character (grapheme), it's much more complicated: even something "simple" like ö could be one or two code points.
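A small illustration of the ö point, as a sketch: the precomposed form is U+00F6 (bytes C3 B6), while the decomposed form is "o" followed by U+0308 COMBINING DIAERESIS (bytes 6F CC 88). The helper name `count_code_points` is made up here, and it assumes the input is already valid UTF-8.

```c
#include <stddef.h>

/* Two encodings of what a user sees as the same "ö". */
static const char O_PRECOMPOSED[] = "\xC3\xB6";   /* U+00F6 */
static const char O_DECOMPOSED[]  = "o\xCC\x88";  /* U+006F U+0308 */

/*
 * Counts code points in valid UTF-8 by counting every byte that is
 * not a continuation byte (10xxxxxx). This says nothing about
 * graphemes: both strings above render as a single character.
 */
static size_t count_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

The precomposed string is 2 bytes and 1 code point; the decomposed one is 3 bytes and 2 code points. Grouping code points into graphemes needs Unicode's segmentation rules and data tables, which this sketch does not attempt.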
I mean your iterator is char* and you advance it by adding. That's it.

I do NOT mean that char itself corresponds to a glyph or codepoint; you are seriously preaching to the choir by giving me that lecture.

>you advance it by adding

And when do you stop? UTF-8 strings can have zero bytes in them so treating them as C strings is potentially error prone depending on the context.

What about combining characters?

Go parse some zalgo with your 4 per iteration algorithm. I'll be there, waiting and laughing.

C string handling is not elegant, nor does it fit the realities of the world.

So it's somehow C's fault that Unicode uses variable-length encoding, which is automatically going to be harder to process correctly at a byte-by-byte level than a fixed-length method, and also included known-C-incompatible null bytes?
> So it's somehow C's fault that Unicode uses variable-length encoding

Parent said string handling in C was elegant. My point is that it becomes fraught with (even more) issues once you throw non-English language at it.

It is C's decision to handle strings in this way, and the decision of many C programmers to treat all strings as if they are just iterable character pointers.

It's a recipe for bugs.

I am the parent you are talking about. I've made this argument many times with people: Unicode is crazy complicated in any programming language. People think that widening the char width will help - well you seem to be somebody who knows Unicode so you probably know the horrors of surrogates, combining characters vs. pre-composed diacritics, zero-width joiners, Han unification, variation selectors, BiDi... This is in no way just a C thing to deal with all that nonsense. I've not seen any language or library that I'd say does it "well" and saves individual programmers from considering the above. They all punt the issue to the programmer.

I've heard (mostly here) that Swift does something different and treats glyphs as the basic unit. I haven't had a chance to look at precisely what that does. Given all the issues I've seen elsewhere I'm skeptical that someone, anyone can pull that off correctly.

UTF-8 at least has one elegance (there's that word again) in its design: you can do some "dumb" ASCII things, and if your code does not know what to do with fancy Unicode, you can check the high bit of any given octet and "safely" skip over it and any adjacent non-ASCII sequence. This may or may not be applicable to the task at hand.
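As a rough sketch of that high-bit trick (the function names `skip_non_ascii` and `ascii_upper_inplace` are made up for this example): every byte of a multi-byte UTF-8 sequence has its top bit set, so ASCII-only code can step over those bytes without decoding them.

```c
/* Returns a pointer to the first byte at or after p whose high bit is
 * clear. The NUL terminator has its high bit clear, so the loop
 * always stops at or before the end of the string. */
static const char *skip_non_ascii(const char *p)
{
    while ((unsigned char)*p & 0x80)
        p++;
    return p;
}

/* Uppercases ASCII letters in place and leaves every byte with the
 * high bit set untouched, so multi-byte sequences pass through intact. */
static void ascii_upper_inplace(char *s)
{
    for (; *s != '\0'; s++) {
        if ((unsigned char)*s & 0x80)
            continue;                   /* part of a non-ASCII sequence */
        if (*s >= 'a' && *s <= 'z')
            *s = (char)(*s - 'a' + 'A');
    }
}
```

This only works because no byte of a multi-byte sequence can be mistaken for an ASCII byte. It does nothing useful for the non-ASCII text itself: the "ï" in "naïve" stays lowercase.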

> This is in no way just a C thing to deal with all that nonsense. I've not seen any language or library that I'd say does it "well" and saves individual programmers from considering the above.

This is true; however, even something as simple as storing the (byte) length as part of the string reduces the complexity and the likelihood of bugs.
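Something along these lines, as a sketch only: `struct str`, `str_at`, and `str_sub` are hypothetical names, not from any particular library. Carrying the length means the end of the data is known up front, embedded zero bytes are just data, and out-of-range access can be refused rather than read.

```c
#include <stddef.h>

/* A string view that carries its byte length instead of relying on a
 * NUL terminator. */
struct str {
    const char *data;
    size_t len;
};

/* Bounds-checked byte access: returns 1 and stores the byte in *out,
 * or returns 0 if i is out of range. */
static int str_at(struct str s, size_t i, unsigned char *out)
{
    if (i >= s.len)
        return 0;
    *out = (unsigned char)s.data[i];
    return 1;
}

/* Sub-view of up to n bytes starting at start, clamped to the bounds
 * of s. No bytes are copied. */
static struct str str_sub(struct str s, size_t start, size_t n)
{
    struct str r;
    if (start > s.len)
        start = s.len;
    if (n > s.len - start)
        n = s.len - start;
    r.data = s.data + start;
    r.len = n;
    return r;
}
```

The checks are still on the programmer to call, which is the difference from languages that enforce them, but at least every function has what it needs to stay in bounds.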

Other languages also prevent accidental buffer overruns so while they still need to deal with all the same Unicode problems you mentioned, the program likely won't crash if the programmer gets things wrong. The same is not necessarily true of C.

FWIW in Rust you also tend to avoid allocations, since all string manipulation is done via slices -- safe (ptr, len) pairs. It's pretty neat.

IIRC C++ is getting slices too, so it might be able to get better APIs around string manip. But I've seen decent string manip code that avoided allocations.

> Using a more C way he'd really only have needed a couple of pointers on the stack.

This is pretty much how it's done in Rust too via slices. For example, the standard way to split a string is to create an iterator and it won't do any allocations.

You can see two of my current school assignments on GitHub.

I try to avoid manipulating strings as much as possible. ;)

+1