Have you seen the relevant reference in RFC 3629, page 2? It's explicitly listed there as a feature: "The byte-value lexicographic sorting order of UTF-8 strings is the same as if ordered by character numbers."
Agreed that specifying the keys to be ordered "by Unicode code points" instead of "lexicographically" would be less ambiguous, though.
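For anyone who wants to check the RFC 3629 property themselves, here is a minimal Python sketch. The sample keys are made up for illustration and have nothing to do with Son's test suite. It also shows why a bare "lexicographically" is ambiguous: ordering by UTF-16 code units (which is what some languages do natively) disagrees once characters outside the BMP are involved.

```python
# Hypothetical sample keys: empty, ASCII, Latin-1, BMP, and astral-plane characters.
keys = ["b", "a", "é", "z", "\u20ac", "\U0001F600", "\uffff", "aa", ""]

# Python compares str values code point by code point.
by_code_point = sorted(keys)

# Lexicographic order of the UTF-8 encoded byte sequences.
by_utf8_bytes = sorted(keys, key=lambda k: k.encode("utf-8"))

# Lexicographic order of UTF-16 code units (big-endian bytes compare like code units).
by_utf16_units = sorted(keys, key=lambda k: k.encode("utf-16-be"))

print(by_code_point == by_utf8_bytes)   # code point order vs. UTF-8 byte order
print(by_code_point == by_utf16_units)  # code point order vs. UTF-16 code unit order
```

The difference in the UTF-16 case comes from surrogate pairs: U+1F600 is encoded with a lead surrogate of 0xD83D, which compares below the single code unit 0xFFFF even though the code point itself is larger.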
I definitely meant ordering by Unicode code points. Someone very helpfully opened an issue and we're trying to figure out the right wording there: https://github.com/seagreen/Son/issues/13
Looks good. I still foresee interoperability problems between implementations, though. It is just too easy to mix up the ‘sort by key’ and ‘escape various control characters’ steps (a raw CR sorts before all printable ASCII characters, but its escaped form “\r” starts with a backslash, which sorts after digits and uppercase letters).
Even if the spec requires it, I fear implementations will also canonicalize strings differently, breaking sort order.
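To make the mix-up concrete, here is a small Python sketch. It uses `json.dumps` purely as a stand-in for the escaping step (that is my assumption, it is not Son's serializer), and the two keys are made up. Sorting the raw keys and then escaping gives a different order than escaping first and then sorting the escaped text.

```python
import json

# Hypothetical keys: an uppercase letter and a raw carriage return (U+000D).
keys = ["A", "\r"]

# Order 1: sort the raw keys by code point, then escape each one.
sort_then_escape = [json.dumps(k) for k in sorted(keys)]

# Order 2: escape each key first, then sort the escaped text.
escape_then_sort = sorted(json.dumps(k) for k in keys)

print(sort_then_escape)
print(escape_then_sort)
```

In the first order the CR key comes first because U+000D is below U+0041. In the second, the escaped key begins with a backslash (U+005C) after the opening quote, so it lands after "A". Two implementations that each pick a different order would disagree on the serialized output for the same object.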
Looking at the reference implementation, it contains
So, the sorting is done before conversion to UTF-8. The relevant line, I think, is
So, it uses Data.String’s sort order, which seems to be lexicographic by Unicode code point (https://stackoverflow.com/a/3126287) ⇒ implementations cannot sort the UTF-8 byte sequences lexicographically. I think that’s a bad choice (if it was a choice and not an oversight).
I’ve never written a single line of Haskell, so corrections welcome.