

	by Thiez 2803 days ago
	While it sure is possible to do text manipulation in C, I don't think it should ever be the first choice, even if 'fastest' is a goal. A 0 byte is perfectly acceptable in a utf8 string (or any unicode string, really). But C has those annoying zero-terminated strings, so if you want to manipulate arbitrary unicode strings the first thing you can do is kiss the string functions in the C standard library goodbye. Which you probably want to do anyway because pascal-strings are simply better. I would use Rust or C++ for this task.

2 comments

knome 2803 days ago

> A 0 byte is perfectly acceptable in a utf8 string (or any unicode string, really)

What? My understanding was that utf8 was crafted specifically so that the only null byte in it was literally NUL. That all normal human language described by a utf8 string will never contain a NUL. They're comparable to C strings in that way, where it can be used safely as an end of string marker. If you have embedded NULs, it's not really utf8, is it?


masklinn 2803 days ago

> They're comparable to C strings in that way, where it can be used safely as an end of string marker. If you have embedded NULs, it's not really utf8, is it?

It is. NUL termination is a C-string convention; as far as Unicode is concerned, NULL (U+0000) is a perfectly normal code point (very much unlike, e.g., the U+D800–U+DFFF surrogate range).
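
To make that concrete, here is a small Rust sketch. The helper names are invented for illustration; only `char::from_u32` and `encode_utf8` come from the standard library. It relies on Rust's `char` being defined as a Unicode scalar value, so surrogates are rejected while U+0000 is accepted.

```rust
/// Hypothetical helper: true if `cp` is a Unicode scalar value,
/// i.e. a code point that well-formed UTF-8 is allowed to encode.
fn is_scalar_value(cp: u32) -> bool {
    char::from_u32(cp).is_some()
}

/// Hypothetical helper: the UTF-8 bytes of a single character.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4]; // a UTF-8 sequence is at most 4 bytes long
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}
```

With these, `is_scalar_value(0)` holds, `is_scalar_value(0xD800)` does not, and `utf8_bytes('\0')` is the single byte `0x00`.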


Dylan16807 2803 days ago

> My understanding was that utf8 was crafted specifically so that the only null byte in it was literally NUL.

Correct.

> That all normal human language described by a utf8 string will never contain a NUL.

Correct.

> If you have embedded NULs, it's not really utf8, is it?

Incorrect.

NUL is a valid character. If you accept arbitrary utf-8, or arbitrary ascii, or arbitrary 8859-1, then there might be embedded NUL. You can filter them out if you want, but they're not invalid.
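
A rough Rust sketch of what that means in practice. The function names are made up for this example; `CString::new` and `NulError::nul_position` are the real standard-library calls. It shows the length a C-style scan would report, an optional filter, and where a conversion to a C string would fail.

```rust
use std::ffi::CString;

/// What a C-style `strlen` scan would report: bytes before the first 0.
fn c_style_len(bytes: &[u8]) -> usize {
    bytes.iter().position(|&b| b == 0).unwrap_or(bytes.len())
}

/// Drop embedded NULs, for callers that decide they do not want them.
fn strip_nuls(s: &str) -> String {
    s.chars().filter(|&c| c != '\0').collect()
}

/// Byte offset of the first interior NUL that would stop `s` from
/// becoming a C string, or `None` if the conversion would succeed.
fn interior_nul_position(s: &str) -> Option<usize> {
    CString::new(s).err().map(|e| e.nul_position())
}
```

For the valid UTF-8 string `"ab\0cd"`, the real length is 5 bytes, `c_style_len` sees only 2, and `interior_nul_position` reports offset 2.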


paavoova 2803 days ago

It's invalid for unix filenames to have a null character. Therefore, if your application is printing filenames in their unicode representation, it doesn't ever need to consider there to be a null byte. This of course isn't an arbitrary case, but it shows one can make assumptions regardless of the "validity" of a character. I believe for most cases of arbitrary input, the correct and safe thing to do is to assume a byte stream of unknown encoding.
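
One way to read "assume a byte stream of unknown encoding", sketched in Rust. The `Input` type and both function names are hypothetical, chosen only for this example: the bytes are kept untouched unless they happen to validate as UTF-8, and lossy decoding is used for display only.

```rust
use std::borrow::Cow;

/// Hypothetical classification of untrusted input.
enum Input<'a> {
    /// The bytes validated as UTF-8 and can be treated as text.
    Text(&'a str),
    /// Unknown encoding: carry the bytes around without interpreting them.
    Opaque(&'a [u8]),
}

/// Validate without copying; fall back to the raw bytes on failure.
fn classify<'a>(bytes: &'a [u8]) -> Input<'a> {
    match std::str::from_utf8(bytes) {
        Ok(s) => Input::Text(s),
        Err(_) => Input::Opaque(bytes),
    }
}

/// For display only: invalid sequences become U+FFFD.
fn display_lossy<'a>(bytes: &'a [u8]) -> Cow<'a, str> {
    String::from_utf8_lossy(bytes)
}
```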


Thiez 2803 days ago

Since we arrived at this null-character discussion by considering text manipulation in C, I suspect most comments in this thread are made on the assumption that the text must be manipulated in some way (mine are!), so treating it as a byte stream of unknown encoding doesn't really solve the problem.

While null in filenames may be forbidden on Unix (and also on Windows), there are more exotic systems where it is allowed [1]. When writing portable software it's probably best not to make assumptions about what characters will never be in a filename.

Naturally, if you have a problem where you can get away with just moving bytes around and never making assumptions about their contents, then that is a great solution.

[1]: https://en.wikipedia.org/wiki/Filename#Comparison_of_filenam...


paulddraper 2802 days ago

It's also invalid for filenames to have a slash, but I don't think that's very relevant to the discussion at hand.


masklinn 2803 days ago

> Which you probably want to do anyway because pascal-strings are simply better.

They're not though. While having an explicit length is great, p-strings means the length is the first item of the data buffer, which is just awful, and is why Pascal strings were originally limited to 255 bytes (the length lived in a single byte).

Rust and C++ use record strings, where the string type is a "rich" stack-allocated structure of (*buffer, length[, capacity], …) rather than just a buffer/pointer.
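
The two layouts can be sketched in Rust roughly as follows. The function names are invented for this example, and the p-string here is the classic one-length-byte form. The handle sizes are what current Rust toolchains produce; the three-word size of `String` is an observation, not a documented guarantee.

```rust
/// Classic p-string: one length byte, then the data, all in one buffer.
/// Returns `None` when the data does not fit in a single length byte.
fn to_pstring(data: &[u8]) -> Option<Vec<u8>> {
    if data.len() > 255 {
        return None;
    }
    let mut buf = Vec::with_capacity(1 + data.len());
    buf.push(data.len() as u8); // cannot truncate: checked above
    buf.extend_from_slice(data);
    Some(buf)
}

/// Read the payload back out of a p-string buffer.
fn pstring_data(buf: &[u8]) -> Option<&[u8]> {
    let (&len, rest) = buf.split_first()?;
    rest.get(..usize::from(len))
}

/// Sizes, in machine words, of Rust's owned (`String`) and borrowed
/// (`&str`) record-string handles.
fn handle_words() -> (usize, usize) {
    let word = std::mem::size_of::<usize>();
    (
        std::mem::size_of::<String>() / word, // pointer, length, capacity
        std::mem::size_of::<&str>() / word,   // pointer, length
    )
}
```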


Thiez 2803 days ago

That is a fair point, I misunderstood the term to refer to any type of string where the length is stored explicitly. I'll try and refer to them by their correct name ('record strings') from now on :-)


Dylan16807 2803 days ago

> p-strings means the length is the first item of the data buffer, which is just awful

You can represent it as a struct of (length, char[]) which isn't awful.


masklinn 2803 days ago

> You can represent it as a struct of (length, char[]) which isn't awful.

It kinda is still: if you're storing it on the stack you're dealing with an unsized on-stack structure which is painful, and if you're not you're paying a deref for accessing the length which you don't need to. If by `char[]` you mean `char*` then it's a record string, not a p-string.


Dylan16807 2803 days ago

I mean a variable-length array, all stored together.

Presumably you'd allocate it on the heap in general. But a record string also requires a heap allocation.

Most of the time you're touching the length you're probably touching the string data too, so that dereference isn't going to cost very much. And it comes with a tradeoff of more compact local data. So I stand by it being not awful! It may not be perfect, but it's a solid option.
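
A safe-Rust approximation of that "length and bytes stored together" layout, with a full-width length header instead of a single byte. All names here are hypothetical. Note that a `Box<[u8]>` still carries its own length next to the pointer, so this only models the in-buffer layout; a genuinely thin one-pointer handle would need `unsafe` code.

```rust
/// Size of the length header stored at the front of the buffer.
const HEADER: usize = std::mem::size_of::<usize>();

/// Pack a native-endian length header and the bytes into one allocation.
fn pack(s: &str) -> Box<[u8]> {
    let mut buf = Vec::with_capacity(HEADER + s.len());
    buf.extend_from_slice(&s.len().to_ne_bytes());
    buf.extend_from_slice(s.as_bytes());
    buf.into_boxed_slice()
}

/// Reading the length means looking inside the buffer (the extra deref).
fn packed_len(buf: &[u8]) -> Option<usize> {
    let mut header = [0u8; HEADER];
    header.copy_from_slice(buf.get(..HEADER)?);
    Some(usize::from_ne_bytes(header))
}

/// The payload sits right after the header in the same allocation.
fn packed_data(buf: &[u8]) -> Option<&[u8]> {
    let len = packed_len(buf)?;
    buf.get(HEADER..)?.get(..len)
}
```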


Thiez 2803 days ago

When you have record strings you get slicing for free though. Without the indirection of a pointer you have to copy data when you slice (or you must have a separate 'sliced string' type).
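
A small Rust sketch of the difference (helper names are invented for illustration). A `&str` slice is just a new pointer and length into the owner's buffer, whereas slicing a one-length-byte p-string has to build a new buffer, because the header must sit directly in front of the data.

```rust
/// If `slice` points into `owner`'s buffer, return its byte offset there.
fn offset_in(owner: &str, slice: &str) -> Option<usize> {
    let base = owner.as_ptr() as usize;
    let p = slice.as_ptr() as usize;
    if p >= base && p + slice.len() <= base + owner.len() {
        Some(p - base)
    } else {
        None
    }
}

/// Slice a p-string (length byte + data). This has to copy into a
/// fresh buffer with its own length byte.
fn pstring_slice(buf: &[u8], start: usize, end: usize) -> Option<Vec<u8>> {
    let (&len, data) = buf.split_first()?;
    let part = data.get(..usize::from(len))?.get(start..end)?;
    let mut out = Vec::with_capacity(1 + part.len());
    out.push(part.len() as u8); // part.len() <= len <= 255
    out.extend_from_slice(part);
    Some(out)
}
```

Taking `&owner[6..]` of `"hello world"` yields a slice for which `offset_in` reports 6, with no bytes copied.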


Dylan16807 2803 days ago

"free" if you ignore the cost of doing lifetime management. So beneficial in some use cases but not others.
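
In Rust that cost shows up in the type signature. A minimal sketch, with hypothetical function names: the borrowed version copies nothing but its result cannot outlive the input, while the owned version pays for an allocation to become independent of it.

```rust
/// Borrowed slice: no copy, but the result is tied to the lifetime of `s`.
fn first_word<'a>(s: &'a str) -> &'a str {
    s.split_whitespace().next().unwrap_or("")
}

/// Owned copy: allocates, but the result may outlive `s`.
fn first_word_owned(s: &str) -> String {
    first_word(s).to_owned()
}
```

Returning `first_word(&tmp)` out of the scope that owns `tmp` is rejected by the compiler, so the caller either keeps the owner alive or switches to `first_word_owned`.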
