| HN Mirror



	by acdha 1384 days ago
	I think it would look a lot like UTF-8 with some of the legacy parts removed (e.g. drop the non-combining, precomposed characters which duplicate combining character sequences). One thing to remember is that there are a LOT of edge cases in the world, and you're looking at a lot of permutations when characters combine multiple accents, or with things like emoji, where skin tone modifiers are used to avoid needing to specify every permutation. I'm not sure all of that would fit in a 32-bit code point, but I would also consider what it would do to file and network sizes. There are real costs to making almost every document substantially larger, and while we have more headroom than we used to, I'd still be surprised if that didn't result in noticeable performance regressions.

	Where I would make the change isn't Unicode itself but the APIs. All of the problems you're talking about basically come down to legacy language design, where people think they're working with grapheme clusters but they're really working with code points. Making that more explicit in the tools would be good, similar to how Python 3 forced you to think about whether you wanted encoded binary data or a decoded string, but there's so much history around that, which makes it hard to do without getting a lot of griping from people who don't want to update decades of habit.
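
To make the code point vs. grapheme cluster gap concrete, here is a minimal Python 3 sketch using only the standard library. The helper name `describe` and the sample strings are illustrative assumptions, not anything from the thread. It shows that `len()` counts code points rather than what a reader sees as one character, that a precomposed character and its combining sequence only compare equal after normalization, and roughly what a fixed 32-bit encoding costs in size compared with UTF-8.

```python
import unicodedata

# Two spellings of the same user-perceived character "é".
precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE (one code point)
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# Thumbs-up emoji followed by a skin tone modifier: one visible symbol.
thumbs = "\U0001F44D\U0001F3FD"


def describe(text):
    """Hypothetical helper: report sizes of `text` under different views."""
    return {
        "code_points": len(text),                      # what len() counts
        "utf8_bytes": len(text.encode("utf-8")),
        "utf32_bytes": len(text.encode("utf-32-le")),  # fixed 4 bytes each
    }


# The two spellings differ as code point sequences...
same_raw = precomposed == decomposed
# ...and only match once both are in the same normalization form.
same_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

if __name__ == "__main__":
    for label, text in [
        ("precomposed", precomposed),
        ("decomposed", decomposed),
        ("thumbs", thumbs),
        ("ascii", "hello"),
    ]:
        print(label, describe(text))
    print("equal without normalization:", same_raw)
    print("equal after NFC:", same_nfc)
```

Counting actual grapheme clusters needs the segmentation rules from Unicode Standard Annex #29, which Python's standard library does not implement. That gap is a small instance of the API problem described above: the default string operations quietly work on code points.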