by zzo38computer 34 days ago
I agree with most of the criticisms they make.

I agree that a pointer and length is better than null-terminated strings (although it is awkward in C, and as they mention you will have to use a macro or some additional functions to make it work).
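A minimal sketch of what such a macro and helper functions might look like. The names `str`, `STR`, `str_from_cstr`, `str_eq` and `str_slice` are hypothetical, chosen for illustration, and are not taken from sp.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pointer-plus-length string type (illustrative names). */
typedef struct {
    const char *data;
    size_t      len;
} str;

/* Build a str from a string literal. sizeof counts the trailing NUL,
 * so subtract 1. The "" s "" concatenation only compiles when s is a
 * literal, which guards against passing a char pointer by accident. */
#define STR(s) ((str){ "" s "", sizeof(s) - 1 })

/* Wrap an existing NUL-terminated string; this is the one O(n) scan. */
static str str_from_cstr(const char *s) {
    assert(s != NULL);
    return (str){ s, strlen(s) };
}

/* Equality compares lengths first, then bytes; embedded NULs are fine. */
static int str_eq(str a, str b) {
    return a.len == b.len &&
           (a.len == 0 || memcmp(a.data, b.data, a.len) == 0);
}

/* Take the substring [start, end) without copying or writing a terminator. */
static str str_slice(str s, size_t start, size_t end) {
    assert(start <= end && end <= s.len);
    return (str){ s.data + start, end - start };
}
```

With this, `str_slice(STR("hello world"), 6, 11)` yields a view of `world` that shares storage with the literal, which a NUL-terminated string can only do for suffixes.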

Implementing the C standard library directly against syscalls is also a good idea. In some cases an implementation might need to avoid this for some reason, but generally it is better for the standard library to be written directly against syscalls.

FILE object is sometimes useful especially if you have functions such as fopencookie and open_memstream; but it might be useful (although probably not with C) to be able to optimize parts of a program that only use a single implementation of the FILE interface (or a subset of its functions, e.g. that does not use seeking).
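As a rough illustration of why the FILE abstraction is handy with `open_memstream` (POSIX.1-2008), here is a sketch in which the same writer function can target a file, a pipe, or a growable memory buffer. The helper names `write_greeting` and `greeting_to_buffer` are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Writes through the generic FILE interface; it does not know or care
 * whether the stream is backed by a file descriptor or by memory. */
static void write_greeting(FILE *out, const char *name) {
    assert(out != NULL && name != NULL);
    fprintf(out, "hello, %s\n", name);
}

/* Render the greeting into a malloc'd buffer via open_memstream.
 * Returns NULL on failure; on success the caller must free() the result.
 * *len_out receives the length, not counting the terminating NUL. */
static char *greeting_to_buffer(const char *name, size_t *len_out) {
    char  *buf = NULL;
    size_t len = 0;
    FILE  *mem = open_memstream(&buf, &len);
    if (mem == NULL)
        return NULL;
    write_greeting(mem, name);
    if (fclose(mem) != 0) {   /* buf and len are final only after fclose */
        free(buf);
        return NULL;
    }
    *len_out = len;
    return buf;
}
```

Because `write_greeting` only takes a `FILE *`, the compiler generally cannot specialize it for the memory-backed case, which is the optimization opportunity the comment above alludes to.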


Making every C call a system call is not a good idea at all. Think about malloc() etc.: the OS shouldn't care about individual allocations and should only worry about providing brk() etc. Otherwise, performance will die if you're doing a thousand system calls per second!
No modern libc uses (or should use) brk() as the heap. Allocate virtual memory using mmap, VirtualAlloc, etc., and manage your set of heaps.
I believe glibc uses both mmap and brk depending on the situation.
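To make the point about batching concrete, here is a minimal sketch of a bump allocator that asks the kernel for memory once with `mmap` and then serves individual allocations in user space. It assumes a POSIX system that provides `MAP_ANONYMOUS`; the `arena` type and its functions are hypothetical and far simpler than what any real libc does.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical bump arena: one mmap call, many cheap allocations. */
typedef struct {
    unsigned char *base;
    size_t         cap;
    size_t         used;
} arena;

/* Reserve cap bytes of anonymous memory. Returns 0 on success, -1 on failure. */
static int arena_init(arena *a, size_t cap) {
    void *p = mmap(NULL, cap, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    a->base = p;
    a->cap  = cap;
    a->used = 0;
    return 0;
}

/* Hand out size bytes rounded up to 16; no system call is made here. */
static void *arena_alloc(arena *a, size_t size) {
    size_t aligned = (size + 15) & ~(size_t)15;
    if (aligned < size || aligned > a->cap - a->used)
        return NULL;            /* overflow or arena exhausted */
    void *p = a->base + a->used;
    a->used += aligned;
    return p;
}

/* Release the whole arena with a single munmap. */
static void arena_destroy(arena *a) {
    assert(a->base != NULL);
    munmap(a->base, a->cap);
    a->base = NULL;
    a->cap  = a->used = 0;
}
```

Whether the backing memory comes from `mmap` or `brk`, the allocator itself runs in user space, so the number of system calls stays small regardless of how many allocations the program makes.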
That is not what I meant, and it does not seem to be what sp.h means either.
Null-terminated strings have some merits, but they should be a completely different data type, like in FreeBASIC.
Are there other merits than availability of literals in C?

It seems like one of the worst data structures ever: the lookup complexity of a linked list with the expansion complexity of an array list, with security problems added as a bonus.

One I can think of is simplicity. No need to worry about what the type of the string length should be (size_t?) or where it should be stored. Just pass around a pointer. Pointers fit the size of a CPU register most of the time. In my opinion the drawbacks (O(N) length computation, NUL forbidden, etc.) outweigh this benefit, but we are stuck. Many kernel interfaces like open, getdents etc. assume NUL-terminated strings, therefore any low-level language or library has to support them.
But (i32 length, byte[] data) is as complex as (byte[] data, '\0'); it's two parts either way. Of course the terminator potentially allows for very long strings at the cost of just a single byte. Besides the rarity of such a case, the "space savings" might play a role on a PDP-11 or a Z80, but not on any of the modern architectures that need structures aligned to a 32- or even 64-bit boundary. The efficiency and security costs far outweigh any savings in space or simplicity (heh) of processing.

Null-terminated strings are the other billion-dollar mistake, along with the original NULL.

Arrays as glorified pointers were the mistake. Null terminated strings are a natural result of that design choice.

Null pointers, however, were not a mistake, despite how popular slandering them has become. A reasonable case can be made that any modern language should enforce null checks (and bounds checks, and ...) or at least provide them by default, but that is neither here nor there as far as C is concerned.

Tony Hoare himself called NULL a mistake. But the problem is not in the ability to set a pointer to a null value, of course. The problem is that all pointers are nullable, and there's no way to statically enforce their being non-null. I wonder how feasible data flow analysis would be in 1969 though.
It's fine as a serialization/deserialization primitive for on-disk files, as long as the NUL character is invalid within a string.

String tables in most object file formats work like that: a concatenated series of ASCIIZ strings. There is one byte of overhead per string (the NUL), a single offset into the table is enough to address a string, and strings with common suffixes can share storage. It's a very compact layout.
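A small sketch of that layout, with a made-up table and a hypothetical bounds-checked lookup helper `strtab_get`. The leading NUL byte follows the ELF convention that offset 0 names the empty string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative string table. Byte layout:
 *   offset 0  -> ""           (leading NUL)
 *   offset 1  -> "my_printf"
 *   offset 4  -> "printf"     (shares the tail of "my_printf")
 *   offset 11 -> "main"
 * The literal's implicit trailing NUL terminates "main". */
static const char strtab[] = "\0my_printf\0main";

/* Return the string at off, or NULL if off is out of range or the
 * string is not terminated inside the table (a corrupt file). */
static const char *strtab_get(const char *tab, size_t size, size_t off) {
    assert(tab != NULL);
    if (off >= size)
        return NULL;
    if (memchr(tab + off, '\0', size - off) == NULL)
        return NULL;
    return tab + off;
}
```

Here `strtab_get(strtab, sizeof strtab, 4)` returns `"printf"` without the table storing those bytes a second time; only suffixes can be shared this way, since every string must run up to a terminator.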

Nothing prevents you from using a shared pool of strings that don't have null terminator. It can even be more efficient, since you don't have the null byte to handle at string end. Depending on the maximum string length you want to support, it doesn't even have to take more space.
How do you represent that pool of strings on-disk?

If we concatenate the raw strings together without the null terminator, either all string references will require a length on top of the offset (a 25% size penalty for an Elf32_Sym), or we'll need a separate descriptor table that stores string offsets and lengths to index into.

If we prepend strings with a length (let's say LEB128), we'll be at best tied with null-terminated strings because we'd have a byte for the length vs. a byte for the terminator. At worst, we'll have a longer string table because we'd need more than one byte to encode a long string length and we would lose the ability to share string suffixes.
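For reference, a sketch of unsigned LEB128 as it would be used for such a length prefix. The function names are illustrative. Lengths up to 127 take one byte, the same cost as a terminator, and anything longer takes two or more.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode value as unsigned LEB128: 7 payload bits per byte, high bit
 * set on every byte except the last. out must have room for 10 bytes.
 * Returns the number of bytes written. */
static size_t uleb128_encode(uint64_t value, unsigned char *out) {
    size_t n = 0;
    assert(out != NULL);
    do {
        unsigned char byte = value & 0x7f;
        value >>= 7;
        if (value != 0)
            byte |= 0x80;       /* more bytes follow */
        out[n++] = byte;
    } while (value != 0);
    return n;
}

/* Decode from in (at most avail bytes). Returns bytes consumed, or 0 if
 * the input is truncated or longer than a 64-bit value allows. */
static size_t uleb128_decode(const unsigned char *in, size_t avail,
                             uint64_t *value) {
    uint64_t result = 0;
    unsigned shift  = 0;
    for (size_t i = 0; i < avail && shift < 64; i++, shift += 7) {
        result |= (uint64_t)(in[i] & 0x7f) << shift;
        if ((in[i] & 0x80) == 0) {
            *value = result;
            return i + 1;
        }
    }
    return 0;
}
```

A 127-byte string costs one prefix byte and a 128-byte string costs two, which is where the length-prefixed table starts to lose to the NUL-terminated one on size.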

Out of all the jank from a.out and COFF that was eliminated with ELF, that representation for the string table was kept (in fact, the only change was mandating a null byte at the beginning so that offset 0 indicates a null string). It has worked fine since the 1970s and doesn't cause undue problems, as nothing prevents a parser from handing out std::string_view instead of const char* to the application code.

For short strings (probably most of them), use a byte for the length, stored at the string/symbol definition site alongside the offset. This adds 1 byte per symbol; use the high bit if necessary to add bytes for longer strings. You need the offset into the table anyway. It isn't strictly better, but it isn't strictly worse, and it gives you the option to reuse arbitrary substrings.
When using null-terminated strings, parsing can be branchless because you don't need bounds checks and can use a jump table indexed by the byte.
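A rough sketch of the table-driven idea, using a word counter as a stand-in for a real parser. The class table and `count_words` are hypothetical, and the code is not literally branchless: the point is only that the terminator is one more table entry, so the loop needs no separate length comparison. Whether the `switch` becomes a jump table is up to the compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Character classes. CLS_OTHER is 0 so that every byte not listed in
 * the table below defaults to it. */
enum { CLS_OTHER = 0, CLS_SPACE, CLS_END };

/* One entry per possible byte value; NUL gets its own class. */
static const unsigned char cls[256] = {
    ['\0'] = CLS_END,
    [' ']  = CLS_SPACE,
    ['\t'] = CLS_SPACE,
    ['\n'] = CLS_SPACE,
};

/* Count whitespace-separated words in a NUL-terminated string. */
static size_t count_words(const char *s) {
    size_t words   = 0;
    int    in_word = 0;
    assert(s != NULL);
    for (const unsigned char *p = (const unsigned char *)s; ; p++) {
        switch (cls[*p]) {      /* dispatch on the byte's class */
        case CLS_OTHER:
            words  += !in_word; /* count a word at its first byte */
            in_word = 1;
            break;
        case CLS_SPACE:
            in_word = 0;
            break;
        case CLS_END:           /* the terminator ends the loop */
            return words;
        }
    }
}
```

With a pointer-and-length string the loop would instead compare an index against the length on every iteration, which is the extra check the comment above is referring to.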
Hearing someone mention FreeBASIC really brings me back. It was the first language I ever used pointers in.