| HN Mirror


	by cptwunderlich 2297 days ago
	And you trust a message from an unknown source? You can't simply memcpy according to some length indicator, that's just not safe. You still have to parse and validate.

5 comments

kentonv 2297 days ago

Obviously you need to check that the length doesn't go past the end of the message, but that's a trivial O(1) check. You don't have to scan the bytes of the string first to decide if they are safe to memcpy.

laumars 2297 days ago

You might want to validate those byte sequences are valid character encodings.
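
For context, a validation pass of this kind could look like the sketch below. It is a minimal C illustration, not taken from any MessagePack library, and the function name `is_valid_utf8` is hypothetical. It rejects overlong encodings, surrogates, and code points above U+10FFFF, and it has to touch every byte once, which is the O(n) cost being debated in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if buf[0..len) is well-formed UTF-8. */
bool is_valid_utf8(const uint8_t *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        uint8_t b = buf[i];
        size_t need;    /* number of continuation bytes expected */
        uint32_t cp;    /* code point being assembled */
        uint32_t min;   /* smallest code point allowed for this length */

        if (b < 0x80) { i++; continue; }  /* plain ASCII byte */
        else if ((b & 0xE0) == 0xC0) { need = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
        else return false;  /* stray continuation byte or invalid lead byte */

        if (len - i - 1 < need) return false;  /* truncated sequence */
        for (size_t k = 1; k <= need; k++) {
            uint8_t c = buf[i + k];
            if ((c & 0xC0) != 0x80) return false;  /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                      /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* UTF-16 surrogate */
        if (cp > 0x10FFFF) return false;                 /* beyond Unicode range */
        i += need + 1;
    }
    return true;
}
```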

kelnos 2297 days ago

You should be doing that with JSON as well, so this isn't a pro/con of either format.

laumars 2297 days ago

That was obviously my point.

I.e., just because MessagePack is a binary format doesn't mean you can skip the same string checks that JSON requires, which means parsing MessagePack strings is unlikely to be any faster than parsing JSON strings (contrary to what others have implied with the "just memcpy" comments). The difference is that with JSON the validation is done as part of the parser (remember JSON only technically supports a subset of ASCII and any extended characters or unicode is encoded via escape codes), whereas with MessagePack you'd need to do that validation as an additional step.

Integers, on the other hand, might differ, since JSON would need additional validation (again, baked into the parser) which MessagePack would not: MessagePack encodes integers as binary integers, whereas JSON encodes them as ASCII digits that need converting back to binary integers.
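
To make the integer comparison concrete, here is a rough C sketch of both paths. The function names are hypothetical. The MessagePack side handles only the `uint 32` encoding (type byte 0xce followed by four big-endian bytes), and the JSON side handles only unsigned decimal digits, so this is an illustration of the difference in work, not a complete decoder for either format.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* JSON-style path: validate every digit and guard against overflow. */
bool parse_decimal_u32(const char *s, size_t len, uint32_t *out)
{
    if (len == 0) return false;
    uint32_t v = 0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] < '0' || s[i] > '9') return false;   /* not a digit */
        uint32_t d = (uint32_t)(s[i] - '0');
        if (v > (UINT32_MAX - d) / 10) return false;  /* v * 10 + d would overflow */
        v = v * 10 + d;
    }
    *out = v;
    return true;
}

/* MessagePack-style path: one length check, one type check, four shifts. */
bool decode_msgpack_u32(const uint8_t *buf, size_t len, uint32_t *out)
{
    if (len < 5 || buf[0] != 0xCE) return false;
    *out = ((uint32_t)buf[1] << 24) | ((uint32_t)buf[2] << 16)
         | ((uint32_t)buf[3] << 8)  |  (uint32_t)buf[4];
    return true;
}
```

The binary path does a constant amount of work regardless of the value, while the text path loops over the digits and has to reject malformed or out-of-range input along the way.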

(hint: read the message I'm replying to).

kentonv 2297 days ago

Many (most?) applications do not actually care whether a byte blob of text is structurally valid UTF-8. They are either passing it around as an opaque byte blob, or already applying much stricter application-specific validation. Validating UTF-8 automatically at the serialization layer is a huge waste of cycles, especially in a big distributed system.

laumars 2297 days ago

On closed systems where you control both the input and output, then sure (though I’d still recommend against that particular shortcut because it’s an easy way for bugs to go undetected).

However if you’re accepting MessagePack encoded data from insecure systems (such as end users) then you absolutely should be validating your input somewhere along the pipeline and it’s usually better to do that early on.

Also, it’s not generally the distributed systems you worry about when it comes to this specific degree of micro-optimisation (which is basically what this is). It’s the monolithic ones. Distributed architecture is meant to solve various problems (for example, but not limited to, high availability, reduced geographical latency, or running a single site on cheaper commodity hardware), but often at the cost of CPU cycles. Monolithic infrastructures with fewer servers (such as Stack Overflow's setup), on the other hand, are far more dependent on reducing computational overhead wherever corners can be cut. However, they’re also significantly less likely to need networked RPCs via MessagePack anyway (simply due to the monolithic design of their architecture).

lwf 2297 days ago

As long as you know the length of the entire buffer, you just ensure that:

  current_addr + message_len - start_addr < buffer_len

Or am I missing something?
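
A minimal C sketch of that check, written with offsets rather than raw addresses. The names `field_fits` and `copy_field` are hypothetical. Two choices here are assumptions on my part rather than anything stated above: the comparison is arranged as a subtraction so that a huge attacker-supplied length cannot wrap around, and it uses `<=` so that a field ending exactly at the end of the buffer is accepted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if field_len bytes starting at offset lie inside a buffer of buffer_len bytes. */
bool field_fits(size_t offset, size_t field_len, size_t buffer_len)
{
    if (offset > buffer_len) return false;     /* offset itself is out of range */
    return field_len <= buffer_len - offset;   /* cannot overflow after the check above */
}

/* Copy a field out of the message after the O(1) checks; no scan of the bytes. */
bool copy_field(void *dst, size_t dst_cap,
                const unsigned char *buf, size_t buffer_len,
                size_t offset, size_t field_len)
{
    if (!field_fits(offset, field_len, buffer_len)) return false;
    if (field_len > dst_cap) return false;     /* destination too small */
    memcpy(dst, buf + offset, field_len);
    return true;
}
```

Here `buffer_len` is meant as the number of bytes actually received, not the capacity of whatever allocation holds them.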

maxwindiff 2297 days ago

Invalid unicode sequences?

diabeetusman 2297 days ago

buffer_len could be larger than the message, copying some incorrect things into memory.

Similar to Heartbleed, where the length in the heartbeat message wasn't validated, and the server would echo back buffer_len bytes instead of just what was sent.

theamk 2297 days ago

I believe the author intended buffer_len to be the length of the incoming buffer (the size of the HTTP payload, the number of bytes read from a file, the length of the database entry, etc.). So the worst that can happen is that the entire input message is consumed, like a JS payload that is missing its closing quote.

I can think of a very contrived situation where this can be a problem, but in most cases this will be perfectly safe.

bmn__ 2297 days ago

The https://capnproto.org serialisation scheme skips the decoding step. Does that make it unsafe?

tebeka 2297 days ago

No serialization is safe

- https://docs.microsoft.com/en-us/security-updates/securitybu...
- https://en.wikipedia.org/wiki/Billion_laughs_attack
- https://en.wikipedia.org/wiki/Zip_bomb
- ...

strbean 2297 days ago

None of those are serialization schemes. XML can be used for serialization, but if you look at the whole ecosystem it is a Turing-complete complexity monster, so of course it isn't safe.

DougBTX 2297 days ago

It depends on what constraints apply to the data. Any bit pattern could be used for an int, but to guarantee a UTF-8 string it would need to be validated.

kingofpandora 2297 days ago

Genuine question - is it dangerous to memcpy X bytes that we know must be interpreted as, say, an integer?

byte1918 2297 days ago

No. Everything is 0s and 1s after all. Take a byte, for example. It has 8 bits, and every combination of 0s and 1s corresponds to one value of a signed byte: the numbers -128 to 127. So if you copy a byte from a random memory location, it will hold some combination of 0s and 1s which, interpreted as a signed int, is simply a number between -128 and 127.

kingofpandora 2297 days ago

That's what I thought ...

tangent128 2297 days ago

Potentially. Most network protocols are big-endian while x86 is little-endian.
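
A short C sketch of the byte-order point. Both helper names are hypothetical. `read_be_u16` assembles the value with shifts, so the result does not depend on the host's endianness, whereas a plain memcpy into a `uint16_t` yields the host-order interpretation of the same two bytes.

```c
#include <stdint.h>
#include <string.h>

/* Decode a big-endian (network order) 16-bit value portably. */
uint16_t read_be_u16(const uint8_t *p)
{
    return (uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}

/* Copy the raw bytes as-is; the numeric result depends on host byte order. */
uint16_t read_host_u16(const uint8_t *p)
{
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

For the wire bytes 0x01 0x02, `read_be_u16` returns 0x0102 on any host, while `read_host_u16` returns 0x0201 on a little-endian machine such as x86. The memcpy itself is memory-safe either way; the risk is a silently wrong value.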

hinkley 2297 days ago

Keep fighting the good fight.
