by pegasuscollins 3109 days ago
Have you seen the relevant reference in RFC 3629, page 2? It's explicitly listed there as a feature: "The byte-value lexicographic sorting order of UTF-8 strings is the same as if ordered by character numbers."

Agreed, though, that specifying the keys to be ordered >by Unicode code points< instead of >lexicographically< would be less ambiguous.
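
A quick way to check the RFC's claim, and to see why "lexicographically" alone is ambiguous, is to sort the same keys three ways. This is only a sketch: the sample keys are made up for illustration, and the UTF-16 comparison is included purely as a contrast, not as something the spec under discussion proposes.

```python
# Hypothetical sample keys covering 1-, 2-, 3-, and 4-byte UTF-8 sequences.
keys = ["b", "a", "\u00e9", "\u20ac", "\U0001F600", "\uFFFD", "z", "ab", ""]

# Python compares str values code point by code point.
by_code_point = sorted(keys)

# Compare the raw UTF-8 bytes instead.
by_utf8_bytes = sorted(keys, key=lambda s: s.encode("utf-8"))

# Big-endian UTF-16 bytes compare the same way as UTF-16 code units.
by_utf16_units = sorted(keys, key=lambda s: s.encode("utf-16-be"))

print(by_code_point == by_utf8_bytes)   # True: UTF-8 byte order matches code point order
print(by_code_point == by_utf16_units)  # False: surrogates (U+D83D...) sort before U+FFFD
```

So an implementation that sorts by UTF-16 code units (as a naive string comparison does in some languages) can disagree with one that sorts UTF-8 bytes, which is the ambiguity that "by Unicode code points" would remove.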


I definitely meant ordering by Unicode code points. Someone very helpfully opened an issue and we're trying to figure out the right wording there: https://github.com/seagreen/Son/issues/13
Looks good. I still foresee interoperability problems between implementations, though. It is just too easy to mix up the ‘sort by key’ and ‘escape various control characters’ steps (a raw CR sorts before all printable ASCII characters, but its escaped form “\r” starts with a backslash, which sorts after the digits and uppercase letters).
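
To make the mix-up concrete, here is a minimal sketch comparing the two orders of operations. The keys are invented for illustration, and `escape` is a hypothetical helper that borrows Python's `json.dumps` escaping as a stand-in for whatever escaping an implementation applies.

```python
import json

# Hypothetical keys: a raw carriage return plus a few printable ASCII characters.
keys = ["\r", "A", "a", "0"]


def escape(key: str) -> str:
    """Return the JSON-escaped text of a key without the surrounding quotes."""
    return json.dumps(key)[1:-1]


# Sort first, escape later: raw CR (0x0D) comes before every printable character.
sort_then_escape = [escape(k) for k in sorted(keys)]

# Escape first, sort later: the backslash (0x5C) lands between "A" and "a".
escape_then_sort = sorted(escape(k) for k in keys)

print(sort_then_escape)
print(escape_then_sort)
```

Both lists contain the same escaped keys, but in a different order, so two implementations that differ only in where they sort would emit different bytes for the same object.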

Even if the spec requires it, I fear implementations will also canonicalize strings differently, breaking sort order.
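
One way this could happen, assuming "canonicalize" here includes Unicode normalization (that is my reading, not something stated above): the same key sorts differently depending on whether it is stored in NFC or NFD form. The keys below are made up for illustration.

```python
import unicodedata

# Hypothetical keys: "é" as a single precomposed code point (U+00E9), and "f".
keys = ["\u00e9", "f"]

# NFC keeps U+00E9, which sorts after "f" (U+0066).
nfc_sorted = sorted(unicodedata.normalize("NFC", k) for k in keys)

# NFD decomposes it to "e" + U+0301, which sorts before "f".
nfd_sorted = sorted(unicodedata.normalize("NFD", k) for k in keys)

print([k.encode("unicode_escape") for k in nfc_sorted])
print([k.encode("unicode_escape") for k in nfd_sorted])
```

An implementation that normalizes to NFD before sorting would put the "é" key first, while one that leaves it in NFC would put it last.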