by mwcampbell 1486 days ago
I suppose it's safest to use a binary format where variable-length fields are prefixed with their length.
4 comments

Assuming properly-created data, yes. You aren't immune to problems but you will reduce them, especially in a memory-safe language.

Unfortunately, in a security context, properly-created data is not only not guaranteed, the format will be actively attacked, so in practice I'm not sure it buys you that much from a security perspective. A net positive, I think, but certainly not enough that you can metaphorically kick back and enjoy your lemonade.

Length-prefixed binary formats are behind one of the oldest classes of security vulnerability: the sender simply claims a length larger than the buffer the C program allocated. I'm inclined to credit that particular joy to C rather than to the data format itself, though. Nowadays there aren't many languages where simply claiming to be really long will get you anywhere like that.
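To make the "claiming to be really long" case concrete, here is a minimal sketch of a bounds-checked field reader. The 4-byte big-endian prefix and the names `read_field` and `ParseError` are assumptions for illustration; nothing in the thread specifies a wire format.

```python
import struct
from typing import Tuple


class ParseError(ValueError):
    """Raised when the input does not match the (hypothetical) format."""


def read_field(buf: bytes, offset: int = 0) -> Tuple[bytes, int]:
    """Read one field: a 4-byte big-endian length, then that many bytes.

    Returns the payload and the offset of the byte after it.
    """
    if len(buf) - offset < 4:
        raise ParseError("truncated length prefix")
    (length,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    # The declared length is attacker-controlled: compare it against the
    # bytes actually present before slicing anything.
    if length > len(buf) - start:
        raise ParseError("declared length exceeds available data")
    return buf[start:start + length], start + length
```

In a memory-safe language an unchecked slice would not corrupt memory, but the explicit check still turns a silently short payload into a clear error.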

More generally, if you want to include a block of untrustworthy structured data in a protocol, it’s very much preferable to do so in a way that does not require inspecting the data in question to figure out where it ends and thus where the outer protocol resumes.
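A rough sketch of that idea, assuming a made-up framing where every part of a message carries a 4-byte big-endian length (`frame` and `unframe` are hypothetical names): the outer parser finds the end of the untrusted block from the prefix alone and never scans the block's contents.

```python
import struct
from typing import List


def frame(*parts: bytes) -> bytes:
    """Concatenate parts, each preceded by its 4-byte big-endian length."""
    return b"".join(struct.pack(">I", len(p)) + p for p in parts)


def unframe(message: bytes) -> List[bytes]:
    """Split a framed message back into parts without inspecting payloads."""
    parts = []
    offset = 0
    while offset < len(message):
        if len(message) - offset < 4:
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack_from(">I", message, offset)
        offset += 4
        if length > len(message) - offset:
            raise ValueError("declared length exceeds available data")
        parts.append(message[offset:offset + length])
        offset += length
    return parts


# Payload bytes that would end a delimiter-based header early are inert here.
hostile = b"\r\n\r\nX-Admin: yes"
assert unframe(frame(b"header", hostile, b"trailer")) == [b"header", hostile, b"trailer"]
```

A delimiter-based outer protocol would have to escape or reject `hostile`; here it is just bytes of a known length.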

English is not immune. Think about “who’s on first” — there is no way to distinguish the untrustworthy name “who” from a grammatical part of the conversation.

Sure, if you like ingesting 4GB records. There is nothing inherently safer in binary formats. It's easy to write parsers that handle properly formatted files; it's when you're dealing with corrupt or malformed files that everything gets complicated.
> There is nothing inherently safer in binary formats.

Sure there is. Barring a pathologically bad wire format design, they’re easier to parse than an equivalent human editable encoding.

Eliminating the human-editing ability requirement also enables us to:

- Avoid introducing character encoding — a huge problem space just on its own — into the list of things that all parsers must get right.

- Define non-malleable encodings; in other words, ensure that there exists only one valid encoding for any valid message, eliminating parser bugs that emerge around handling (or not) multiple different ways to encode the same thing.
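As an illustration of the non-malleability point, here is a sketch of a decoder for a LEB128-style variable-length unsigned integer that rejects padded encodings. The choice of varints is mine, picked only because they are a common place where several byte strings can mean the same number; `decode_uvarint` and its `max_bytes` limit are hypothetical.

```python
from typing import Tuple


def decode_uvarint(buf: bytes, offset: int = 0, max_bytes: int = 10) -> Tuple[int, int]:
    """Decode a little-endian base-128 unsigned integer.

    Each byte carries 7 value bits; the high bit means "more bytes follow".
    Returns the value and the offset of the byte after it.
    """
    value = 0
    for i in range(max_bytes):
        if offset + i >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[offset + i]
        value |= (byte & 0x7F) << (7 * i)
        if byte & 0x80 == 0:
            # A final byte of zero after the first byte adds no value bits,
            # so the same number had a shorter encoding. Reject it so every
            # value has exactly one accepted byte string.
            if i > 0 and byte == 0:
                raise ValueError("non-canonical varint")
            return value, offset + i + 1
    raise ValueError("varint too long")
```

Without the canonical check, `b"\x00"` and `b"\x80\x00"` would both decode to 0, which is exactly the "multiple ways to encode the same thing" situation where two parsers can end up disagreeing.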

> Define non-malleable encodings; in other words, ensure that there exists only one valid encoding for any valid message, eliminating parser bugs that emerge around handling (or not) multiple different ways to encode the same thing.

I've said similar things to this before. E.g. if you want a boolean, there's nothing simpler and less error-prone than a single bit. It represents exactly the values you need; nothing more and nothing less. You could take a byte if you didn't want to pack, and use the "0 is false, nonzero is true" convention, which is naturally usable in a lot of programming languages; that way there are 256 different values, but the set of inputs is still small and finite with each one having a defined interpretation.
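A tiny sketch of the two byte-sized conventions, with hypothetical function names. The lenient one is the "0 is false, nonzero is true" rule described above; the strict one is an extra variant I'm adding for comparison, which accepts only 0 and 1 so that each boolean has a single encoding.

```python
def decode_bool_lenient(byte: int) -> bool:
    """0 is false, any other byte value is true: all 256 inputs are defined."""
    if not 0 <= byte <= 255:
        raise ValueError("not a byte value")
    return byte != 0


def decode_bool_strict(byte: int) -> bool:
    """Accept only 0 and 1, so each boolean has exactly one encoding."""
    if byte == 0:
        return False
    if byte == 1:
        return True
    raise ValueError("boolean byte must be 0 or 1")
```

Either way the input space is 256 values and can be tested exhaustively, which is hard to say about a textual "true"/"True"/"yes"/"1" field.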

Sure, until someone sets the prefix to 100MB large, and sends zero bytes of data :)
Which would be a lot easier to catch by bounds checks in the language / data types used / sanitizers / fuzzers / static analysis than cases like this where you can have two implementations seemingly successfully parse the data but disagree on the result.
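For the "100MB prefix, zero bytes of data" case, a sketch of what such a check might look like when reading from a stream. The 1 MiB cap, the 4-byte big-endian prefix, and the names `MAX_RECORD` and `read_record` are all assumptions, and the code assumes a buffered stream whose `read(n)` returns fewer than `n` bytes only at end of input (true for `io.BytesIO` and `io.BufferedReader`, not for a raw socket).

```python
import io
import struct

MAX_RECORD = 1 << 20  # hypothetical per-record cap of 1 MiB


def read_record(stream, max_len: int = MAX_RECORD) -> bytes:
    """Read one length-prefixed record from a buffered binary stream."""
    prefix = stream.read(4)
    if len(prefix) != 4:
        raise EOFError("truncated length prefix")
    (length,) = struct.unpack(">I", prefix)
    # Refuse oversized claims before reading or allocating anything.
    if length > max_len:
        raise ValueError("declared length exceeds limit")
    data = stream.read(length)
    # The prefix promised more bytes than the sender delivered.
    if len(data) != length:
        raise EOFError("record shorter than declared length")
    return data


# A prefix claiming 100 MB followed by no data is rejected by the cap.
lying = io.BytesIO(struct.pack(">I", 100 * 1024 * 1024))
try:
    read_record(lying)
except ValueError as exc:
    print("rejected:", exc)
```

Both failure modes are loud and local to one parser, unlike the case where two implementations each accept the input and quietly disagree about what it means.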