| HN Mirror


	by nanidin 793 days ago
	It says it sorts one byte at a time. I think this would break for anything not utf-8.

2 comments

ts4z 793 days ago

Seems like it should work for arbitrary byte strings (any charset, any encoding), but obviously the performance characteristics will differ because of non-uniform distribution. But that happens even in ASCII.
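
For illustration, here is a minimal Python sketch of a byte-at-a-time sort (an MSD radix sort). It is not the algorithm from the linked article; the function name `msd_radix_sort` and the sample data are made up for this example. It only ever looks at raw byte values, so it should behave the same way whatever encoding produced them.

```python
def msd_radix_sort(items, depth=0):
    """Sort a list of bytes objects by inspecting one byte position at a time."""
    if len(items) <= 1:
        return list(items)
    # Strings that end exactly at this depth sort before any longer
    # string sharing the same prefix.
    finished = [s for s in items if len(s) == depth]
    # Group the remaining strings by the byte value at the current position.
    buckets = {}
    for s in items:
        if len(s) > depth:
            buckets.setdefault(s[depth], []).append(s)
    result = finished
    for byte in sorted(buckets):
        result.extend(msd_radix_sort(buckets[byte], depth + 1))
    return result


# Hypothetical sample data: some UTF-8 text plus one byte string
# that is not valid UTF-8 at all.
data = [s.encode("utf-8") for s in ["pear", "apple", "b", "\u00e0", "app", ""]]
data.append(b"\xff\x00\x80")

# The result agrees with Python's built-in bytewise comparison.
assert msd_radix_sort(data) == sorted(data)
```

How evenly the buckets fill depends on the data, which is where the non-uniform distribution shows up.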

nanidin 791 days ago

Yes, you’ll get something sorted based on the bytes in the string but it won’t be lexicographically correct - for example, à will be sorted after b.
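
A small Python sketch of the difference, for illustration. The `rough_key` helper is a made-up, crude stand-in for real collation (it just strips combining marks after decomposition); it is not the Unicode Collation Algorithm and is only meant to show the ordering people usually expect.

```python
import unicodedata

words = ["b", "\u00e0", "a"]  # "\u00e0" is à

# Default string comparison goes by code point:
# à is U+00E0, which is greater than b (U+0062).
by_code_point = sorted(words)


def rough_key(s):
    # Decompose, drop combining marks, and use the original string
    # as a tie-breaker so "a" still sorts before "à".
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base, s)


by_rough_key = sorted(words, key=rough_key)

print(by_code_point)  # ['a', 'b', 'à']
print(by_rough_key)   # ['a', 'à', 'b']
```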

brirec 793 days ago

This would even break UTF-8, since multi-byte characters are a thing!

dumbo-octopus 793 days ago

How would that break anything? The strings aren't being split.

ursusmaritimus 793 days ago

The lexicographic order of UTF-8 byte sequences matches the lexicographic order of the corresponding sequences of Unicode code points. So you can sort UTF-8 strings as byte strings. Not that sorting by code points has much meaning, but you can apply the Unicode Collation Algorithm first.
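
A quick way to convince yourself, sketched in Python. The random test data and the helper name `random_code_point` are assumptions made for this example; the check relies on Python comparing `str` values by code point and `bytes` values by byte.

```python
import random

random.seed(0)


def random_code_point():
    # Any Unicode code point except surrogates (U+D800..U+DFFF),
    # which cannot be encoded as UTF-8.
    while True:
        cp = random.randrange(0x110000)
        if not 0xD800 <= cp <= 0xDFFF:
            return cp


strings = [
    "".join(chr(random_code_point()) for _ in range(random.randint(0, 4)))
    for _ in range(1000)
]

by_code_point = sorted(strings)
by_utf8_bytes = sorted(strings, key=lambda s: s.encode("utf-8"))

# Both orderings agree, including across 1-, 2-, 3- and 4-byte characters.
assert by_code_point == by_utf8_bytes
```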

mzs 792 days ago

% printf "%s\n" 𝐴 𝐵 | sort

% printf "%s" 𝐴 𝐵 | xxd -b -c4

00000000: 11110000 10011101 10010000 10110100 ....

00000004: 11110000 10011101 10010000 10110101 ....

% printf "%s" 𝐴 𝐵 | xxd -c1 -ps | sort | xxd -r -ps | xxd -b -c4

00000000: 10010000 10010000 10011101 10011101 ....

00000004: 10110100 10110101 11110000 11110000 ....

% printf "%s" 𝐴 𝐵 | xxd -c1 -ps | sort | xxd -r -ps

????????
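
Roughly the same experiment in Python, as a sketch. It assumes the two input characters are U+1D434 and U+1D435, which is what the bytes in the first dump decode to. Sorting the individual bytes of the string, rather than sorting whole strings by their bytes, pulls the lead bytes away from their continuation bytes, so the result no longer decodes.

```python
# The two characters whose UTF-8 bytes appear in the first xxd dump
# (an assumption read off those bytes: f0 9d 90 b4 and f0 9d 90 b5).
text = "\U0001D434\U0001D435"

encoded = text.encode("utf-8")
print(encoded.hex())    # f09d90b4f09d90b5

# Sort the individual bytes, as the xxd | sort pipeline does.
scrambled = bytes(sorted(encoded))
print(scrambled.hex())  # 90909d9db4b5f0f0

try:
    scrambled.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    # The sequence now starts with continuation bytes, so decoding fails.
    still_valid = False

print(still_valid)      # False
```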

remram 792 days ago

TIL, thanks. That makes sense given what the length prefixes look like, but I never thought about it. I wonder if this was by chance or if the UTF-8 creator thought about it.

nanidin 791 days ago

Indeed, and is à less than b? Not in Unicode!

EmilyHughes 793 days ago

UTF-8 is backward compatible with ASCII, so any ASCII character (like every character in this comment) is encoded as a single byte.
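
A tiny Python check of that, for illustration; the sample strings are made up for this example.

```python
ascii_text = "plain ASCII text, one byte per character"

# For pure ASCII input, the UTF-8 and ASCII encodings are byte-for-byte identical.
utf8_bytes = ascii_text.encode("utf-8")
assert utf8_bytes == ascii_text.encode("ascii")
assert len(utf8_bytes) == len(ascii_text)

# Anything outside ASCII takes more than one byte in UTF-8.
accented = "\u00e0"  # à
accented_bytes = accented.encode("utf-8")
assert len(accented_bytes) == 2
```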
