by ticking 4107 days ago
Even though I'm not a native English speaker and couldn't write my name in ASCII, I really despise Unicode.

It's broken technically and a setback socially.

Unicode itself is such an unfathomably huge project that it's impossible to do it right: too many languages, too many weird writing systems, and too many ways to do mathematical notation on paper that can't be expressed. Just look at the code pages; they are an utter mess.

Computers and ASCII were a chance to start anew, to establish English as a universal language, spoken by everybody.

The pressure on governments that wanted to partake in the digital revolution would have forced them to introduce it as an official secondary language.

Granted, English is not the nicest language, but it is the best candidate we have in terms of adoption and relative simplicity (Mandarin is another contender, but several thousand logograms are really impractical to encode).

Take a look at the open source world, where everybody speaks English and collaborates regardless of nationality. One of the main reasons this is possible is that we found a common language, forced on us by the tools and programming languages we use.

If humanity wants to get rid of wars, poverty and nationalism, we have to find a common language first.

A simple encoding and universal communication are features; fragmented communication is the bug.

Besides, UTF-8 is broken because it doesn't allow for constant-time random character access and length counting.

1 comments

Why do you think English is the best candidate for a universal language, and how do you define simplicity? First of all, pronunciation and spelling are almost unrelated, and you have to learn them separately. That results in really different accents throughout the world. Even AmE and BrE differ a lot at the word level. Which one would you choose? Besides, I personally find English really ambiguous and the density of idioms in an average text off-putting, although that's only a subjective opinion.

The use of the Latin alphabet in English seems like a plus, but there's at least one language that uses that simple alphabet better.

> Besides. UTF-8 is broken because it doesn't allow for constant time random character and length counting.

And why would you want that? And how do you define length? Are you a troll?

English is the best candidate because it has the second largest user base (1.2 billion vs. 1.3 billion for Mandarin), http://en.wikipedia.org/wiki/List_of_languages_by_total_numb... and has about twice as many speakers as the third most popular language, Spanish (0.55 billion).

If I got to pick the universal language, it would be Lojban (a few hundred speakers), but that is not a realistic goal. Teaching the other 6 billion people a language that is already spoken by 1/7th of the population is at least plausible.

> Why would you want that...

Why would you not want that?! Many popular programming languages are based on array indexing through pointer arithmetic. Having a variable-width encoding there is a horrible idea, because you have to iterate through the text to get to an index.
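To make the indexing complaint concrete, here is a rough Python sketch. The helper names `is_continuation` and `byte_offset_of` are made up for illustration, and the code assumes the input is valid UTF-8. It shows that finding the n-th code point means walking the buffer from the start, while in pure ASCII the n-th character is simply at byte n.

```python
def is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (byte & 0xC0) == 0x80


def byte_offset_of(data: bytes, index: int) -> int:
    """Return the byte offset where code point number `index` starts.

    Assumes `data` is valid UTF-8. The loop walks the buffer from the
    start, so the cost grows with `index` instead of being constant.
    """
    seen = -1
    for offset, byte in enumerate(data):
        if not is_continuation(byte):
            seen += 1
            if seen == index:
                return offset
    raise IndexError("code point index out of range")


# "naïve café" written with escapes so the precomposed forms are used.
utf8 = "na\u00efve caf\u00e9".encode("utf-8")

# ASCII-only data: code point i starts at byte i.
print(byte_offset_of(b"plain", 3))  # 3

# After a two-byte character the offset and the index drift apart.
print(byte_offset_of(utf8, 3))  # 4
```

Note that this only locates code points, which is not necessarily what a user would call a character.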

Length is the number of characters, which is just the number of bytes in ASCII, but has to be calculated by looking at every character in UTF-8.
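A minimal sketch of that difference, assuming "character" means code point and the input is valid UTF-8 (`count_code_points` is a hypothetical helper, not a library function). For ASCII the byte count is the answer; for UTF-8 every byte has to be inspected once.

```python
def count_code_points(data: bytes) -> int:
    # Each code point starts with exactly one byte that is NOT of the
    # form 10xxxxxx, so counting those bytes counts code points.
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


ascii_text = "hello".encode("ascii")
utf8_text = "h\u00e9llo".encode("utf-8")  # "héllo", é takes two bytes

print(len(ascii_text), count_code_points(ascii_text))  # 5 5
print(len(utf8_text), count_code_points(utf8_text))    # 6 5
```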

Even if 1.2 billion seems like a lot, that's still a small fraction of the world's population. So any choice of a universal language would force the majority of the world to learn a new one. That's why I think winning a popularity contest is a poor argument; we shouldn't look at that and should instead focus on things like simplicity (which I don't find in English), speed of learning, consistency, expressiveness, etc. I'd be happy to use Lojban (it's easier for machines too, I guess) or any other invented language. If I had to pick one of the popular ones, I'd prefer Spanish to English.

I was asking what your specific use cases are that forbid you from treating a UTF-8 string as a black-box blob of bytes. If you're dealing with international code, you'd rather want to use predefined functions. If you want to limit yourself to ASCII, just do it and simply don't touch bytes >= 0x80.
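As a rough illustration of the "black box" approach (the record format and the `is_ascii` helper are invented for this example): in UTF-8 every byte of a multi-byte sequence is >= 0x80, so splitting on ASCII delimiters at the byte level never cuts a character in half, and the non-ASCII parts can be passed through untouched.

```python
def is_ascii(data: bytes) -> bool:
    # ASCII bytes are all below 0x80; lead and continuation bytes of
    # multi-byte UTF-8 sequences are all 0x80 or above.
    return all(byte < 0x80 for byte in data)


# Hypothetical "key=value;key=value" record with non-ASCII values.
record = "name=Zo\u00eb;city=K\u00f6ln".encode("utf-8")

# Split on ASCII delimiters without ever decoding the values.
fields = dict(part.split(b"=", 1) for part in record.split(b";"))

print(is_ascii(b"name"))                 # True
print(is_ascii(fields[b"name"]))         # False
print(fields[b"city"].decode("utf-8"))   # Köln
```

The values stay opaque byte strings until something actually needs to render or compare them.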

And what is a character? Do you mean graphemes or code points? Or something else? A few years ago I thought like you: that calculating length is a useful feature. But most often, when you think about your use case, you realise either that you don't need length or that you need some other kind of length, like monospace width, rendered width or some kind of entropy-based amount of information. Twitter is the only case I know of where you really want to count "characters". And I find it really silly: e.g. a Japanese tweet vs. an English tweet.
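A small Python sketch of how those "lengths" diverge. Both helpers are hypothetical and deliberately crude: skipping combining marks only approximates grapheme counting (proper segmentation follows Unicode Standard Annex #29 and handles many more cases), and the width rule is a simplification based on the East Asian Width property.

```python
import unicodedata


def approx_grapheme_count(text: str) -> int:
    # Crude approximation: count code points that are not combining marks.
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


def approx_monospace_width(text: str) -> int:
    # Simplified rule: combining marks take 0 columns, wide/fullwidth
    # characters take 2, everything else takes 1.
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


decomposed = "e\u0301"              # 'e' + COMBINING ACUTE ACCENT, shown as "é"
japanese = "\u65e5\u672c\u8a9e"     # 日本語

for s in (decomposed, japanese):
    print(len(s.encode("utf-8")), len(s),
          approx_grapheme_count(s), approx_monospace_width(s))
# bytes, code points, approx. graphemes, approx. columns:
# 3 2 1 1
# 9 3 3 6
```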

With Unicode these predefined functions have to be large and complex. You won't be able to use them on embedded systems, for example.