UTF-8 encodes a Unicode codepoint into 1, 2, 3, or 4 bytes. Assuming you have a valid UTF-8 encoding of a codepoint, the first byte tells you how many bytes are in the encoding: 0-127 inclusive means one byte, 192-223 means 2, 224-239 means 3, and 240-247 means 4. So if the first byte is 0xC0 (192), its bit pattern says the sequence is two bytes long. However, the first byte alone doesn't make the sequence valid. The uppermost bits of the second byte must be `10` in a valid 2-byte UTF-8 sequence. 0x27 (an ASCII single quote, binary `00100111`) does not meet that criterion, so `0xC0 0x27` is not valid UTF-8. (Strictly speaking, 0xC0 can never start a valid sequence at all, since anything it could encode would be an overlong form of an ASCII character, but the continuation-byte check is the one that matters here.) If your escape function operates at the level of Unicode codepoints but doesn't actually verify that they're valid, it treats `0xC0 0x27` as one two-byte character and copies the single quote into your "escaped" buffer unescaped, where downstream parts of the code will hit it.
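Here is a minimal sketch of those two checks in Python. The function names are made up for illustration, and the lead-byte ranges are the bit-pattern ranges quoted above, not a full validator (a real one would also reject overlong forms, surrogates, and codepoints past U+10FFFF).

```python
def utf8_expected_length(first_byte: int) -> int:
    """Sequence length implied by the lead byte (0 if it cannot start a sequence)."""
    if first_byte <= 0x7F:
        return 1
    if 0xC0 <= first_byte <= 0xDF:   # 192-223
        return 2
    if 0xE0 <= first_byte <= 0xEF:   # 224-239
        return 3
    if 0xF0 <= first_byte <= 0xF7:   # 240-247
        return 4
    return 0  # 0x80-0xBF are continuation bytes; 0xF8 and up are never used


def is_continuation(byte: int) -> bool:
    """True if the top two bits are `10`, as every non-lead byte must be."""
    return (byte & 0xC0) == 0x80


data = bytes([0xC0, 0x27])
print(utf8_expected_length(data[0]))  # 2
print(is_continuation(data[1]))       # False

# Python's own strict decoder agrees that this is not UTF-8.
try:
    data.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)                     # False
```

The point is that the length from the lead byte is only a claim. Something still has to check the bytes that follow before treating them as part of the character.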
The funny part is that not having any Unicode support in this part of the code and treating the data as ASCII (plus mystery bytes) would have worked correctly.
A PHP app called a Postgres library function to "escape strings" for use in Postgres, and that called a function to get the byte length of a UTF-8 character, but the function was bullshit:
> The PQescapeStringInternal method doesn’t actually validate that the string it is parsing with pg_utf_mblen is valid Unicode. So, instead, it just takes the length of 2, and grabs the next byte.
So the bug was a shitty function in a generic open source library which was probably never properly tested or fuzzed, which ended up letting attackers move laterally through the database. And this is one reason you want full test coverage; tiny stupid functions matter.
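To make the failure concrete, here is a toy model of it in Python. This is not the libpq code: `naive_escape`, `bytewise_escape`, and `lead_byte_length` are hypothetical names, and the only escaping rule modeled is doubling single quotes. It just assumes an escaper that trusts the lead byte's length and copies that many bytes as one "character", next to one that ignores Unicode entirely.

```python
def lead_byte_length(b: int) -> int:
    """Length implied by the lead byte alone; no validation (the flaw being modeled)."""
    if b >= 0xF0:
        return 4
    if b >= 0xE0:
        return 3
    if b >= 0xC0:
        return 2
    return 1


def naive_escape(raw: bytes) -> bytes:
    """Toy escaper: walks 'characters' by trusted length, doubles lone single quotes."""
    out = bytearray()
    i = 0
    while i < len(raw):
        n = lead_byte_length(raw[i])
        chunk = raw[i:i + n]
        if chunk == b"'":
            out += b"''"      # escape a real single-quote character
        else:
            out += chunk      # multi-byte "character": copied without checking
        i += n
    return bytes(out)


def bytewise_escape(raw: bytes) -> bytes:
    """ASCII-plus-mystery-bytes escaper: doubles every 0x27 byte it sees."""
    return raw.replace(b"'", b"''")


payload = b"\xc0' OR 1=1 --"
print(naive_escape(payload))     # b"\xc0' OR 1=1 --"   quote slips through
print(bytewise_escape(payload))  # b"\xc0'' OR 1=1 --"  quote is doubled
```

The byte-level version is safe for UTF-8 input because every byte of a valid multi-byte sequence is 0x80 or higher, so a 0x27 byte is always a real single quote.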
(Another fix for this is to enforce at the boundaries of every function that the input data has been "blessed" or sanitized by some other function whose only purpose is to validate that the data is what it's supposed to be. That would have to happen before escaping, and every function that uses that data would need to confirm that it got blessed. Basically you want a home-rolled strong-typing system with types (or data classes?) for all your data. But that's a lot of work, and I don't expect many people would do that for most apps.)
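A rough sketch of that "blessed" idea in Python, assuming a data class is good enough as the type. `ValidUtf8` and `escape_literal` are hypothetical names, and the escaping shown (doubling single quotes) is only a stand-in for whatever the real escaper does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidUtf8:
    """Bytes that have passed strict UTF-8 validation at construction time."""
    data: bytes

    def __post_init__(self) -> None:
        # Strict decode raises UnicodeDecodeError on anything malformed,
        # so an instance can only exist for well-formed input.
        self.data.decode("utf-8")


def escape_literal(value: ValidUtf8) -> bytes:
    """Refuses anything that has not been blessed as ValidUtf8."""
    if not isinstance(value, ValidUtf8):
        raise TypeError("escape_literal requires a ValidUtf8 value")
    return value.data.replace(b"'", b"''")


print(escape_literal(ValidUtf8(b"O'Brien")))  # b"O''Brien"

try:
    ValidUtf8(b"\xc0' OR 1=1 --")
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(rejected)                               # True
```

The malformed input never reaches the escaper, because it fails at the boundary where the type is constructed. The cost is exactly the one mentioned above: every function has to take the wrapper type instead of raw bytes.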