| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by defen 465 days ago
	UTF-8 encodes a unicode codepoint into 1, 2, 3, or 4 bytes. Assuming that you have a valid UTF-8 encoding of a codepoint, then the first byte tells you how many bytes are in the encoding. 0-127 inclusive means one byte, 192-223 means 2, 224-239 means 3, and 240-247 means 4. If the first byte is 0xC0 (192), then the sequence is two bytes long. However, not every 2-byte sequence that starts with 0xC0 is valid UTF-8. The uppermost bits of the second byte must be `10` in a valid 2-byte UTF-8 sequence. 0x27 does not meet that criteria, so `0xC0 0x27` is not valid UTF-8. If your escape function operates at the level of unicode codepoints but doesn't actually verify that they're valid, you end up copying a single quote into your "escaped" buffer that downstream parts of the code will hit.

1 comments

account42 464 days ago

The funny part is that not having any Unicode support in this part of the code and treating the data as ASCII (plus mistery bytes) would have worked correctly.