by jkeiser
2173 days ago
simdjson doesn't care what is in the padding and won't modify it; it just needs the buffer the string lives in to have 32 extra addressable (allocated) bytes. It never uses those bytes to make decisions, but it may read them, because certain critical algorithms run faster when you overshoot a tiny bit and correct afterwards. Most real-world applications, such as those reading from sockets, read into a buffer and can easily meet this requirement.

If you're curious what the padding is for, the place where it parses/unescapes a string is a good example. Instead of copying byte by byte, it generally copies 16-32 raw bytes at a time and just sort of caps it off at the end quote, even though it might have copied a little extra.

Here's some pseudocode (note this isn't the full algorithm; I left out some error conditions and escape parsing for clarity):

// Check the quote
if (in[i] == '"') {
  i++;
  len = 0;
  while (true) {
    // Use SIMD to copy 32 bytes from input to output
    chunk = in[i..i+32];
    out[len..len+32] = chunk;
    // Note we already wrote 32 bytes, and NOW check if there was a quote in there
    quote = chunk.find('"'); // index of the first quote in the chunk, or -1 if none
    if (quote >= 0) {
      len += quote;
      break;
    }
    len += 32; // No quote, so keep parsing the string
    i += 32;
  }
}
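
For anyone who wants to try the idea, here is a minimal runnable C++ sketch of the same copy-then-check loop. It is not simdjson's actual code: `kChunk`, `copy_until_quote`, and `unquote_padded` are made-up names for illustration, a fixed-size `memcpy` stands in for the SIMD load/store, and escapes and unterminated strings are ignored just like in the pseudocode.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical constant for this sketch: bytes copied per step.
constexpr std::size_t kChunk = 32;

// Copies the string body starting at `in` (just past the opening quote) into
// `out` and returns its length. Assumptions of this sketch: a closing quote
// exists, at least kChunk readable bytes follow it in `in`, and `out` has
// room for the body length plus kChunk bytes of overshoot.
std::size_t copy_until_quote(const char* in, char* out) {
    std::size_t len = 0;
    while (true) {
        // Copy a whole chunk first, without looking at its contents.
        std::memcpy(out + len, in + len, kChunk);
        // Only now check whether the chunk we just wrote contains a quote.
        const void* q = std::memchr(out + len, '"', kChunk);
        if (q != nullptr) {
            // Cap the string at the quote; bytes copied past it are ignored.
            return len + static_cast<std::size_t>(
                             static_cast<const char*>(q) - (out + len));
        }
        len += kChunk; // No quote in this chunk, so keep going.
    }
}

// Convenience wrapper that sets up the padded buffers itself.
std::string unquote_padded(const std::string& json) {
    assert(!json.empty() && json[0] == '"');
    assert(json.find('"', 1) != std::string::npos);
    std::string in = json;
    in.append(kChunk, '\0');                    // the extra addressable bytes
    std::string out(in.size() + kChunk, '\0');  // room for the overshoot
    std::size_t len = copy_until_quote(in.data() + 1, &out[0]);
    out.resize(len);                            // drop whatever was over-copied
    return out;
}
```

Calling `unquote_padded` on a buffer that starts with a quoted string returns just the body; the padding is read and copied but never influences the result.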
|
|
I wish there was a standardized attribute that C++ knew about that pretty much just said "hey, we're not right next to some memory-managed disaster, and if you read off this buffer, you promise not to use the results".
It is awful practice to read off the end of a buffer and let those bytes affect your behavior, but it is almost always harmless to read extra bytes (and mask them off or ignore them) unless the read crosses a page boundary into unmapped memory or lands in some dangerous region of memory that's mapped to a device.
This attribute would also need to be understood by tools like Valgrind (to the extent that Valgrind can or can't track whether you're feeding this nonsense into a computation, which it handles pretty well).
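
To make "read extra bytes and mask them off" concrete, here is a small C++ sketch under stated assumptions: `prefix_equal16` is a hypothetical helper, and both pointers are assumed to have at least 16 readable bytes even when only `n` of them are meaningful. It always loads 16 bytes from each side, then masks out everything past `n` before the result can affect the answer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: returns true if the first n bytes (n <= 16) of a and b
// are equal. Assumption: both a and b have at least 16 readable bytes.
bool prefix_equal16(const unsigned char* a, const unsigned char* b,
                    std::size_t n) {
    assert(n <= 16);
    unsigned char x[16];
    unsigned char y[16];
    std::memcpy(x, a, 16); // deliberately reads past the n meaningful bytes
    std::memcpy(y, b, 16);
    unsigned diff = 0;
    for (std::size_t i = 0; i < 16; ++i) {
        // Bytes at index >= n are masked to zero, so they never matter.
        unsigned char mask = (i < n) ? 0xFF : 0x00;
        diff |= static_cast<unsigned>((x[i] ^ y[i]) & mask);
    }
    return diff == 0;
}
```

The over-read bytes are loaded but never reach the return value, which is the property the wished-for attribute would let you promise to the compiler and to tools.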