| HN Mirror


by arp242 1115 days ago

In many cases it's not very useful, but there are clearly cases where it is, e.g. if you want to normalize text, compose/change emojis, stuff like that.

A codepoint is the "smallest useful addressable unit" when dealing with Unicode text, so it makes sense that's the default.

It's also comparatively expensive to address grapheme clusters.
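To make the codepoint/grapheme distinction concrete, here is a minimal sketch in Python, whose `str` type happens to be indexed by codepoint. The variable names are only illustrative, and the choice of Python is an assumption; the comment above does not name a language.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two codepoints, one visible character.
decomposed = "e\u0301"

# NFC normalization works on codepoints and composes the pair into U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Thumbs up (U+1F44D) followed by a medium skin tone modifier (U+1F3FD):
# again one visible character, but two addressable codepoints.
thumbs = "\U0001F44D\U0001F3FD"

print(len(decomposed), len(composed), len(thumbs))
```

Indexing `thumbs[1]` is a constant-time lookup of the modifier, whereas finding grapheme cluster boundaries requires scanning the text and applying the Unicode segmentation rules.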


lmm 1112 days ago

> In many cases it's not very useful, but there are clearly cases where it is, e.g. if you want to normalize text, compose/change emojis, stuff like that.

I can see that iterating through by codepoint could be useful for some of those cases, but I still can't see why you'd ever want to index by codepoint?


arp242 1109 days ago

For the same reason you want to index anything: to slice, remove, or replace parts of it. E.g. to replace a skin tone in an emoji: "str[i] = 0x1f3ff", or to insert one: "str = str[:i] + 0x1f3ff + str[i:]".
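The two snippets above are pseudocode. A rough Python rendering might look like the following; Python strings are immutable, so the in-place assignment becomes a slice-and-rebuild. The helper names `replace_at` and `insert_at` are hypothetical, and the sketch assumes the index `i` is already known.

```python
DARK = "\U0001F3FF"  # U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6

def replace_at(s: str, i: int, cp: str) -> str:
    """Return s with the codepoint at index i replaced by cp."""
    return s[:i] + cp + s[i + 1:]

def insert_at(s: str, i: int, cp: str) -> str:
    """Return s with cp inserted before index i."""
    return s[:i] + cp + s[i:]

# Waving hand with a medium skin tone: the modifier sits at index 1.
wave_medium = "\U0001F44B\U0001F3FD"
wave_dark = replace_at(wave_medium, 1, DARK)

# Waving hand with no modifier: insert one right after the base emoji.
wave_plain = "\U0001F44B"
wave_dark_inserted = insert_at(wave_plain, 1, DARK)
```

Both results are the same two-codepoint sequence, waving hand plus U+1F3FF.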


lmm 1105 days ago

But that's a pointlessly inefficient way to do it - surely what you want there is to iterate and transform rather than scan through and then slice? (And don't you need to group by extended grapheme cluster rather than codepoint anyway for that to make sense?)
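For comparison, the iterate-and-transform approach suggested here could be sketched as a single pass over the codepoints, with no index lookup or slicing. The function name `retone` is hypothetical. Note that this sketch still works per codepoint and does not group by extended grapheme cluster, so it only illustrates the iteration style, not the segmentation question.

```python
FITZPATRICK_FIRST = "\U0001F3FB"  # lightest skin tone modifier
FITZPATRICK_LAST = "\U0001F3FF"   # darkest skin tone modifier

def retone(text: str, target: str = FITZPATRICK_LAST) -> str:
    """Replace every skin tone modifier in text with target, in one pass."""
    return "".join(
        target if FITZPATRICK_FIRST <= ch <= FITZPATRICK_LAST else ch
        for ch in text
    )

greeting = "hi \U0001F44B\U0001F3FD!"
darker = retone(greeting)
```

Text without any modifier passes through unchanged, so the caller never needs to know where a modifier is located.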
