by julian37 1898 days ago
UTF-8 is just... so well designed.

One little feature I like in particular is that if you're looking for an ASCII-7 character in a UTF-8 stream -- say, a LF or comma -- you don't have to decode the stream first because all bytes in the encoding of non-ASCII-7 characters have the high bit set. Or as Wikipedia puts it:

> Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as / (slash) in filenames, \ (backslash) in escape sequences, and % in printf.
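To make that concrete, here is a minimal Python sketch of scanning raw UTF-8 bytes for an ASCII delimiter without decoding anything. The function name `find_ascii_byte` and the sample string are made up for illustration.

```python
def find_ascii_byte(data: bytes, target: str) -> list:
    """Return the byte offsets of an ASCII character in UTF-8 data, without decoding.

    This is safe because every byte of a multi-byte UTF-8 sequence has the
    high bit set (>= 0x80), so it can never equal a 7-bit ASCII value.
    """
    code = ord(target)
    if code > 0x7F:
        raise ValueError("target must be a 7-bit ASCII character")
    return [i for i, b in enumerate(data) if b == code]


# Sample text "naïve,café,日本語" plus a trailing LF, written with escapes
# so the example does not depend on the source file's encoding.
sample = "na\u00efve,caf\u00e9,\u65e5\u672c\u8a9e\n".encode("utf-8")

print(find_ascii_byte(sample, ","))   # [6, 12]
print(find_ascii_byte(sample, "\n"))  # [22]

# Every byte belonging to a non-ASCII character has the high bit set.
print(all(b >= 0x80 for b in "\u65e5\u672c\u8a9e".encode("utf-8")))  # True
```

The offsets are byte offsets, not character offsets, which is usually what you want when splitting a buffer on a delimiter.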

It's amazing to hear they put it together in one night at a diner! :-D

> It's amazing to hear they put it together in one night at a diner! :-D

I guess you're saying that in good humor. But I'll add this because it makes me appreciate how these things happen:

> What happened was this. We had used the original UTF from ISO 10646 to make Plan 9 support 16-bit characters, but we hated it.

"We hated it" -- there is just so much going on in those 3 words. They could have been suffering with the previous state for a year for all we know. And even if not, to know you hate something just takes a lot of system building experience to get to. And then when opportunity struck they probably already had a laundry list of grievances they had built up over that time and were ready to pounce.

Yes, exactly!

If they hadn't had on-the-ground experience of the Plan 9 version, and been able to see what parts of it they wanted to keep and what parts needed to be done differently from that actual experience...

Often you can't build the polished thing until you have experienced the thing before.

Lately I get discouraged that there seems to be so little attention to "prior art" in software development. Building on it is the only way to make progress!

But to build it in 4 days!

This still strikes me as the height of 1990s programming moxie.

While the design is nice, it doesn't seem -that- earthshattering that it was done in four days. Once you make the realization that 'wait, ASCII only needs the lower 7 bits, let's work off that', it's all just details past that.

Don't get me wrong, I love UTF-8 and it is well thought out and designed. But the end result is not so complicated, so much so that pretty much anyone reading the rules could understand it.
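For anyone who hasn't read the rules, here is a rough Python sketch of the encoding side. It follows the modern form of UTF-8, capped at four bytes and U+10FFFF; the original 1992 design allowed sequences of up to six bytes. The name `encode_code_point` is just illustrative, and real code should use the language's built-in encoder.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (modern four-byte-maximum form)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if cp < 0x80:
        # 0xxxxxxx: ASCII passes through unchanged.
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Cross-check against Python's own encoder for one character of each length.
for ch in ["A", "\u00e9", "\u65e5", "\U0001F600"]:
    assert encode_code_point(ord(ch)) == ch.encode("utf-8")
```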

I think there was just a lot of low hanging fruit in the 90s that doesn't exist today, as they are solved problems. Today's 'amazing' things would involve image recognition or processing, self driving cars, better ML/AI algos. Things that are hard to impossible to be done by a guy or two over the weekend.

Sadly, as a result, I think we'll have fewer 'programming heroes' than existed in previous decades.

> While the design is nice, it doesn't seem -that- earthshattering that it was done in four days.

And yet it may have needed a genius to design and write something so simple. UTF-8 was not the first multi-lingual encoding system; here's an entire list of them, worked on by a lot of probably very smart people:

* https://en.wikipedia.org/wiki/Template:Character_encodings

It only seems 'obvious' in hindsight:

* https://en.wikipedia.org/wiki/Hindsight_bias

Edit: A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away. — Antoine de Saint-Exupery

>I think there was just a lot of low hanging fruit in the 90s that doesn't exist today, as they are solved problems.

git was 2005, and that was probably similarly impactful in the version control space (in that it was much closer to fundamentally correct, than its predecessors). And there are quite a few standards out there that only survive by virtue of already having been established -- not because they meet any reasonable bar of quality. IPv4 (and all the grand schemes to work around the terror of NAT), email (the worst communication system, except for all the others), SQL (the language specifically -- a mishmash of keywords with almost no ability to properly compose), etc.

The bigger difference I think between the 90's and now is that it was probably much easier to make your new superior standard actually be used -- you could implement a new kernel today which was fantastically superior to linux, and you're much more likely than not to get zero traction (ex: plan9) simply by virtue of how well-entrenched linux already is.

> git was 2005

I'm not sure I'd consider git to be "low-hanging fruit"

Given that Torvalds apparently went from design to implementation in 3 days, and 2 months later had it officially managing the kernel, I wouldn’t say it was particularly high-hanging.

Pretty sure Git was a side project so that Linus could manage Linux source code like he wanted.

Yeah, this is great! I came across that recently when working on a parser in Zig, which treats strings as arrays of bytes. I didn't know much about UTF-8 other than that it's scary and programmers mess up text processing all the time. I was worried that a multi-byte code point could trick my simple char switch, which was looking for certain ASCII characters. But then I came across that bit you quoted and was both surprised and relieved!

Then, when I needed to minimally handle non-ASCII characters I found Zig's minimal unicode helper library and saw what I was looking for in a small function that takes a leading byte and returns how many bytes there are in the codepoint. I was impressed with the spec again!
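A rough idea of what such a function looks like, sketched in Python rather than Zig. This is not Zig's actual helper; the name `utf8_sequence_length` is hypothetical, and the sketch only inspects the bit pattern of the leading byte.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return how many bytes a UTF-8 sequence occupies, given its leading byte.

    Only the prefix bits are checked. A strict decoder would also reject the
    bytes 0xC0, 0xC1 and 0xF5-0xF7, which match a pattern below but never
    appear in valid UTF-8.
    """
    if lead < 0x80:           # 0xxxxxxx: single-byte ASCII
        return 1
    if lead >> 5 == 0b110:    # 110xxxxx: two-byte sequence
        return 2
    if lead >> 4 == 0b1110:   # 1110xxxx: three-byte sequence
        return 3
    if lead >> 3 == 0b11110:  # 11110xxx: four-byte sequence
        return 4
    # 10xxxxxx is a continuation byte; 0xF8 and above are never valid.
    raise ValueError("not a valid UTF-8 leading byte")


print(utf8_sequence_length("a".encode("utf-8")[0]))           # 1
print(utf8_sequence_length("\u00e9".encode("utf-8")[0]))      # 2
print(utf8_sequence_length("\u65e5".encode("utf-8")[0]))      # 3
print(utf8_sequence_length("\U0001F600".encode("utf-8")[0]))  # 4
```

Because continuation bytes always look like 10xxxxxx, a parser that lands in the middle of a sequence can tell immediately and resynchronize on the next leading byte.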

> It's amazing to hear they put it together in one night at a diner! :-D

On the one hand, sure. But on the other you have Ken Thompson.

I wonder how many pieces of computing technology used today were put together in a single evening by a team of motivated developers. Rubygems, for example, was written in a couple of hours at the back of a hotel bar, then demoed (complete with network install and versioning) at Rubyconf the following morning.

As I age, I'm starting to believe that the best technology is often built this way, rather than stewing for years in an ISO subcommittee. Limited development time can lead to features that provide the greatest value for the time spent.

Here's a picture of Thompson designing UTF-8 on a placemat that night at the diner:

https://www.youtube.com/watch?v=mhvaeHoIE24&t=23m34s

Thanks for that link!

> It's amazing to hear they put it together in one night at a diner! :-D

I will bet that he had half-formed ideas of how it could work from the previous pain with the "original UTF". The best people I work with are constantly looking at things that are wrong and coming up with ideas for how they could be better, even if 99% of them will never be used.

I think this is more of a case where we were lucky, since most applications used 7-bit ASCII and the high bit was available for UTF-8 encoding.