by jmull 3215 days ago
I'm going to argue a little differently...

In C, strings were always semantically a sequence of characters (as they are commonly defined elsewhere).

For a while a character was one byte, so the distinction was unimportant (and became blurred).

A char was both a character and a byte. A string was both a sequence of characters and an array of characters... and an array of "char"s, and an array of bytes.

Code -- and programmers! -- became dependent on these equivalency assumptions.

Once it became clear we could no longer pretend that a maximum of 256 characters was tenable (fewer, actually, since the use of 0-31 for control/separation/termination had become standard), we were left with conflict, leading to a variety of uncomfortable choices and compromises.

One such conflict is "char"... should it retain its semantics or its size (one byte)?

The last time I developed seriously in C or C++ it had retained its size, but lost its semantics -- a char IS a byte now. (That was a while ago, and I don't know if that's changed -- from your post, it sounds like it hasn't.)

I guess UTF-8 has won out in C and C++ (and elsewhere), so now, while a char is a byte, a C/C++ string is: (1) an array of chars/bytes; (2) a sequence of characters. What's been dropped is the idea that a string is an array of characters.
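To make the bytes-versus-characters split concrete, here is a minimal sketch in C. It assumes the input is valid UTF-8 and counts Unicode code points, which is only an approximation of "characters" (combining sequences are not handled). `utf8_length` is a hypothetical helper name, not a standard library function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a NUL-terminated string.
   Assumes valid UTF-8. Continuation bytes look like 10xxxxxx; every
   other byte starts a new code point. */
size_t utf8_length(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the string "na\xC3\xAFve" ("naïve" encoded as UTF-8), strlen reports 6 because it counts chars (bytes), while utf8_length reports 5 because the two bytes 0xC3 0xAF form a single code point.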

(In case there's confusion: here, "array" means an ordered sequence of elements of uniform size with O(1) random access, while sequence is just an ordered sequence of elements that doesn't necessarily offer O(1) random access or elements with uniform size.)
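A rough illustration of the array/sequence difference, again assuming valid UTF-8: indexing by byte is a plain O(1) array access, but reaching the n-th character means walking the string from the start. `utf8_at` is a hypothetical name made up for this sketch.

```c
#include <stddef.h>

/* Hypothetical helper: return a pointer to the first byte of the n-th
   code point (0-based), or NULL if the string is too short.
   Assumes valid UTF-8. This is O(n): there is no way to jump straight
   to the n-th code point, because code points vary in size. */
const char *utf8_at(const char *s, size_t n)
{
    for (; *s != '\0'; ++s) {
        /* Only bytes that are not 10xxxxxx begin a code point. */
        if (((unsigned char)*s & 0xC0) != 0x80) {
            if (n == 0)
                return s;
            --n;
        }
    }
    return NULL;
}
```

With s = "na\xC3\xAFve", s[3] is the byte 0xAF (the second half of the "ï"), whereas utf8_at(s, 3) points at the 'v', which lives at byte offset 4.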

1 comment

It's all this vocabulary fighting that makes this stuff so damn hard for people new to trying to do interesting things with strings. Like, different languages use totally different terms in the API documentation, even.

So, for example, I can figure out how to take a document written in Microsoft Word with that Latin-1 business and make the characters stop sucking in Python 3, but I don't even know what to google to do the same thing in JavaScript, because people use terms like "encoding" and such totally differently.