
by jcranmer 2388 days ago

Windows-1252/ISO-8859-1 (the two charsets are so commonly conflated that it's often best to treat them as one) was the dominant [non-ASCII] charset of the web until around 2007 or 2008, and more recently its prevalence is only about 5%.
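To make the conflation concrete, here's a small Python sketch of where the two charsets actually differ. It assumes Python's built-in `cp1252` and `latin-1` codecs as stand-ins for Windows-1252 and ISO-8859-1; the sample bytes are made up for illustration.

```python
# Byte 0x93 is a curly opening quote in Windows-1252,
# but an invisible C1 control character in ISO-8859-1.
sample = b"\x93quoted\x94"

as_cp1252 = sample.decode("cp1252")
as_latin1 = sample.decode("latin-1")

# The range 0xA0-0xFF (accented letters etc.) decodes identically under both.
high_range = bytes(range(0xA0, 0x100))
same_high_range = high_range.decode("cp1252") == high_range.decode("latin-1")

print(repr(as_cp1252))
print(repr(as_latin1))
print(same_high_range)
```

Since the only disagreement is in 0x80-0x9F, where ISO-8859-1 has control characters that almost never appear in real text, decoding content labeled ISO-8859-1 as Windows-1252 is usually harmless, which is roughly why treating them as one tends to work.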

A collection of Usenet messages gathered in 2014 (see http://quetzalcoatal.blogspot.com/2014/03/understanding-emai... for full details) showed that out of 1,000,000 messages, about 530,000 were actually ASCII; 270,000 were ISO-8859-1 or Windows-1252; and only 75,000 were UTF-8. More modern numbers would probably show higher UTF-8 counts, although Usenet is notoriously conservative in terms of technology.
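For anyone wanting to produce a similar breakdown on their own corpus, here's a rough Python sketch of byte-level bucketing. This is a heuristic of my own for illustration, not how the survey above counted its messages, and the function name `classify_charset` is hypothetical. It assumes anything that is neither pure ASCII nor valid UTF-8 falls into the Windows-1252/ISO-8859-1 bucket, which will misfile other legacy charsets.

```python
def classify_charset(data: bytes) -> str:
    """Bucket raw bytes as 'ascii', 'utf-8', or a legacy single-byte fallback."""
    # Pure ASCII: every byte is below 0x80.
    if all(b < 0x80 for b in data):
        return "ascii"
    # Non-ASCII bytes that happen to form valid UTF-8 are very likely UTF-8.
    try:
        data.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: everything else is treated as Windows-1252/ISO-8859-1.
        return "windows-1252/iso-8859-1"


print(classify_charset(b"plain text"))
print(classify_charset("caf\u00e9".encode("utf-8")))
print(classify_charset("caf\u00e9".encode("cp1252")))
```

In practice you'd want to look at the declared charset in the message headers first and fall back to sniffing only when the label is missing or obviously wrong.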

What I'm trying to get at here is that the rise of UTF-8 isn't because most text is ASCII, but because there's been a rather more concerted effort to default content generation to UTF-8 and treat other charsets only as legacy inputs. The exception is the Japanese, who tend to be strongly averse to UTF-8. (I've been told that Japanese email users would rather have their text silently mangled than silently converted to UTF-8 when they quote an email containing a smart quote [not present in any of the 3 Japanese charsets], whereas every other locale was happy to change the default charset for writing to UTF-8.)