by krdln 4106 days ago
Why do you think English is the best candidate for a universal language, and how do you define simplicity? First of all, pronunciation and spelling are almost unrelated, so you have to learn them separately. That results in very different accents throughout the world. Even AmE and BrE differ a lot at the word level. Which one would you choose? Besides, I personally find English really ambiguous, and the density of idioms in an average text repelling, although that's only a subjective opinion.

The use of the Latin alphabet in English seems like a plus, but there's at least one language that uses that simple alphabet better.

> Besides. UTF-8 is broken because it doesn't allow for constant time random character and length counting.

And why would you want that? And how do you define length? Are you a troll?

1 comments

English is the best candidate because it has the second-largest user base (1.2 billion vs. 1.3 billion for Mandarin, http://en.wikipedia.org/wiki/List_of_languages_by_total_numb...) and has roughly twice as many speakers as the third most popular language, Spanish (0.55 billion).

If I got to pick the universal language, it would be Lojban (a few hundred speakers), but that is not a realistic goal. Teaching the other 6 billion people a language that is already spoken by 1/7th of the population is at least plausible.

> Why would you want that...

Why would you not want that?! Many popular programming languages are based on array indexing through pointer arithmetic. Having a variable-width encoding there is a horrible idea, because you have to iterate through the text to get to an index.

Length is the number of characters, which is just the number of bytes in ASCII, but in UTF-8 it has to be calculated by looking at every character.
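To make the cost concrete, here is a minimal sketch of what counting and indexing look like on raw UTF-8 bytes. The function names are made up for illustration, and "character" is assumed to mean a code point here. Both operations are a linear scan, because the only way to find where a code point starts is to skip the continuation bytes (those of the form 10xxxxxx).

```python
def utf8_codepoint_count(data: bytes) -> int:
    """Count code points by counting every byte that is NOT a continuation byte."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_offset_of_index(data: bytes, index: int) -> int:
    """Return the byte offset at which the index-th code point starts.

    This is O(n): there is no way to jump straight to the offset,
    because earlier code points may be 1 to 4 bytes wide.
    """
    seen = 0
    for offset, b in enumerate(data):
        if (b & 0xC0) != 0x80:  # lead byte or ASCII byte: a code point starts here
            if seen == index:
                return offset
            seen += 1
    raise IndexError("code point index out of range")


encoded = "naïve 日本".encode("utf-8")
print(len(encoded))                      # number of bytes
print(utf8_codepoint_count(encoded))     # number of code points
print(utf8_offset_of_index(encoded, 6))  # byte offset of the 7th code point
```

In ASCII the byte offset and the index are the same number, which is what the pointer-arithmetic argument relies on.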

Even if 1.2 billion seems like a lot, that's still a small fraction of the world's population. So every choice of a universal language would force the majority of the world to learn a new one. That's why I think winning a popularity contest is a poor argument, and we should instead focus on things like simplicity (which I don't find in English), speed of learning, consistency, expressiveness etc. I'd be happy to use Lojban (it's easier for machines too, I guess) or any other invented language. If I had to pick one of the popular ones, I'd prefer Spanish to English.

I was asking what your specific use cases are that forbid you from treating a UTF-8 string as a black-box blob of bytes. If you're dealing with internationalized code, you'd rather use predefined functions. If you want to limit yourself to ASCII, just do it and simply don't touch bytes >= 0x80.
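As a rough illustration of the "don't touch bytes >= 0x80" approach: in UTF-8, every byte of a multi-byte sequence has its high bit set, so ASCII-only processing can never hit the middle of a non-ASCII character. The helper below is a hypothetical example, not a library function. It uppercases ASCII letters in place and passes everything else through.

```python
def ascii_upper(data: bytes) -> bytes:
    """Uppercase ASCII letters in UTF-8 encoded bytes without decoding.

    Bytes >= 0x80 belong to multi-byte sequences and are left untouched,
    so the result is still valid UTF-8 if the input was.
    """
    out = bytearray(data)
    for i, b in enumerate(out):
        if 0x61 <= b <= 0x7A:  # ASCII 'a'..'z'
            out[i] = b - 0x20
    return bytes(out)


print(ascii_upper("héllo wörld".encode("utf-8")).decode("utf-8"))
```

The same reasoning applies to splitting on an ASCII delimiter such as a comma or newline: the delimiter byte cannot occur inside a multi-byte sequence.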

And what is a character? Do you mean graphemes or code points? Or something else? A few years ago I thought like you, that calculating length is a useful feature. But most often, when you think about your use case, you realise either that you don't need the length or that you need some other kind of length: monospace width, rendered width, or some kind of entropy-based amount of information. Twitter is the only case I know of where you really want to count "characters". And I find it really silly: e.g. a Japanese tweet vs. an English tweet.
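A small sketch of how these "lengths" can disagree for the same string. The function names are invented for this example, and the width calculation is only a crude approximation (combining marks count as 0 columns, East Asian wide and fullwidth characters as 2, everything else as 1). Real terminals and proper grapheme segmentation follow more detailed rules than this.

```python
import unicodedata


def byte_length(text: str) -> int:
    """Number of bytes in the UTF-8 encoding."""
    return len(text.encode("utf-8"))


def codepoint_length(text: str) -> int:
    """Number of Unicode code points (what Python's len() reports for str)."""
    return len(text)


def approx_monospace_width(text: str) -> int:
    """Very rough column count for a monospace display (an approximation only)."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn on top of the previous character
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


for sample in ("abc", "e\u0301", "日本語"):
    print(repr(sample), byte_length(sample), codepoint_length(sample),
          approx_monospace_width(sample))
```

For "e" followed by a combining acute accent, the three numbers are all different, even though a reader sees one character.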

With Unicode, these predefined functions have to be large and complex. You won't be able to use them on embedded systems, for example.