by sw00pur 3214 days ago
>No such thing! Strings are an array of integer unicode code points.

I'd argue that, generally, strings are simply arrays of chars, which are bytes.

The failure here was keeping the name "string" for what are arrays of code points instead of bytes.

2 comments

C does not own the word "string". A string is a piece of text. It is not a byte array.

Unicode strings are arrays of code points, which are 21-bit numbers.

If the API requires fast subscripting (it usually does), then they would be stored as UTF-32 or as three code points per int64; otherwise a more compact internal representation is possible.
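To make the three-code-points-in-int64 idea concrete, here is a rough sketch. The names (`pack3`, `pack`, `code_point_at`) are made up for illustration, and a Python list of ints stands in for what would really be an array of 64-bit words. It only shows that 3 x 21 = 63 bits fit in one word and that subscripting stays O(1).

```python
BITS = 21                  # enough for U+0000..U+10FFFF
MASK = (1 << BITS) - 1     # low 21 bits set


def pack3(a, b, c):
    """Pack three code points into one integer that fits in 63 bits."""
    return a | (b << BITS) | (c << (2 * BITS))


def pack(text):
    """Turn a str into a list of packed words (hypothetical layout)."""
    cps = [ord(ch) for ch in text]
    cps += [0] * (-len(cps) % 3)   # pad to a multiple of three
    return [pack3(*cps[i:i + 3]) for i in range(0, len(cps), 3)]


def code_point_at(words, i):
    """O(1) subscript: pick the word, then shift and mask."""
    word, slot = divmod(i, 3)
    return (words[word] >> (slot * BITS)) & MASK


words = pack("h\u00e9llo\U0001F600")
print(hex(code_point_at(words, 5)))
```

A real implementation would also have to store the true length somewhere, since the padding zeros are indistinguishable from U+0000.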

If you don't need to support subscripting and allow only iteration over the code points, then the in-memory representation of strings can be more compact. It can use UTF-8 or even SCSU or BOCU-1.

Some languages use polymorphic Unicode strings, which store ASCII if the value is all-ASCII and switch to something else if it isn't (Python 3.3 and Factor come to mind).
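You can see the polymorphic representation from the outside in CPython 3.3 or later. This is only a sketch: `sys.getsizeof` reports implementation-specific numbers, so the exact sizes are not something to rely on, and other Python implementations may behave differently. What should be visible is that three strings of the same length take different amounts of memory depending on their widest code point.

```python
import sys

# Same length (1000 code points), different widest code point.
ascii_text = "a" * 1000                    # all ASCII
bmp_text = "a" * 999 + "\u20ac"            # one code point above U+00FF
astral_text = "a" * 999 + "\U0001F600"     # one code point above U+FFFF

for label, s in [("ascii", ascii_text), ("bmp", bmp_text), ("astral", astral_text)]:
    print(label, len(s), sys.getsizeof(s))
```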

I'm going to argue a little differently...

In C, strings were always semantically a sequence of characters (as they are commonly defined elsewhere).

For a while a character was one byte, so the distinction was unimportant (and became blurred).

A char was both a character and a byte. A string was both a sequence of characters and an array of characters... and an array of "char"s, and an array of bytes.

Code -- and programmers! -- became dependent on these equivalency assumptions.

Once it became clear we could no longer pretend that a maximum of 256 characters was tenable (fewer, actually, since the use of 0-31 for control/separation/termination had become standard), we were left with conflict, leading to a variety of uncomfortable choices and compromises.

One such conflict is "char"... should it retain its semantics or its size (one byte)?

The last time I developed seriously in C or C++ it had retained its size, but lost its semantics -- a char IS a byte now. (That was a while ago, and I don't know if that's changed -- from your post, it sounds like it hasn't.)

I guess UTF-8 has won out in C and C++ (and elsewhere) so now, while a char is a byte, a C/C++ string is: (1) an array of chars/bytes; (2) a sequence of characters. What's been dropped is the idea that a string is an array of characters.

(In case there's confusion: here, "array" means an ordered sequence of elements of uniform size with O(1) random access, while sequence is just an ordered sequence of elements that doesn't necessarily offer O(1) random access or elements with uniform size.)
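Here is a small sketch of that array-vs-sequence distinction, using Python's `bytes` as a stand-in for a UTF-8 C string. The helper `nth_code_point` is a made-up name: indexing the bytes is O(1) but can land in the middle of a character, while getting the n-th character means scanning from the start.

```python
text = "na\u00efve \U0001F600"      # 7 code points
data = text.encode("utf-8")         # the same text as a UTF-8 byte array

print(len(text), len(data))         # code point count vs byte count
print(hex(data[2]))                 # O(1), but only the lead byte of U+00EF


def nth_code_point(data, n):
    """Return the n-th code point of UTF-8 `data` by scanning (not O(1))."""
    # Continuation bytes look like 0b10xxxxxx; every other byte starts a character.
    starts = [i for i, b in enumerate(data) if b & 0xC0 != 0x80]
    start = starts[n]
    end = starts[n + 1] if n + 1 < len(starts) else len(data)
    return data[start:end].decode("utf-8")


print(nth_code_point(data, 2) == "\u00ef")
```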

It's all this vocabulary fighting that makes this stuff so damn hard for people new to trying to do interesting things with strings. Like, different languages use totally different terms in the API documentation, even.

So, for example, I can figure out how to take a document written in Microsoft Word with that Latin-1 business and make the characters stop sucking in Python 3, but I don't even know what to google to do the same thing in JavaScript, because people use terms like "encoding" and such totally differently.
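For what it's worth, the Python 3 side of that usually comes down to picking the right codec when decoding. A minimal sketch, with two assumptions of mine: the sample bytes are made up rather than taken from a real Word file, and the "Latin-1 business" is really Windows-1252 (`cp1252`), which is where curly quotes live in the 0x80-0x9F range.

```python
# Made-up sample bytes: "caf" + e-acute, then text wrapped in curly quotes.
raw = b"caf\xe9 \x93quoted\x94"

as_latin1 = raw.decode("latin-1")   # 0x93/0x94 become invisible C1 control characters
as_cp1252 = raw.decode("cp1252")    # 0x93/0x94 become U+201C/U+201D curly quotes

print(repr(as_latin1))
print(repr(as_cp1252))

# Once the text is decoded correctly, re-encode it however you like.
utf8_bytes = as_cp1252.encode("utf-8")
```

Both decodes succeed without error, which is part of what makes this confusing: the wrong codec gives you garbage characters rather than an exception.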