And you trust a message from an unknown source? You can't simply memcpy according to some length indicator; that's just not safe. You still have to parse and validate.
Obviously you need to check that the length doesn't go past the end of the message, but that's a trivial O(1) check. You don't have to scan the bytes of the string first to decide if they are safe to memcpy.
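To make the O(1) check concrete, here is a minimal sketch in Python. It assumes the MessagePack "str 8" layout (marker byte 0xd9, a one-byte length, then the raw bytes); the function name read_str8 is made up for illustration and is not part of any MessagePack library.

```python
from typing import Tuple


def read_str8(buf: bytes, offset: int = 0) -> Tuple[bytes, int]:
    """Return (raw string bytes, next offset) for a MessagePack str 8 at offset.

    Hypothetical helper for illustration only; handles just the str 8 format.
    """
    if offset < 0 or offset + 2 > len(buf):
        raise ValueError("truncated header")
    if buf[offset] != 0xD9:
        raise ValueError("not a str 8")
    length = buf[offset + 1]
    start = offset + 2
    end = start + length
    # The O(1) check: one comparison, no scan over the string's bytes.
    if end > len(buf):
        raise ValueError("declared length runs past the end of the message")
    # The slice is the Python equivalent of the memcpy.
    return buf[start:end], end
```

Note that the bytes come back unvalidated: whether they are well-formed UTF-8 is a separate question that this check does not answer.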
I.e. just because MessagePack is a binary format doesn't mean you can skip the string checks that JSON requires, which means parsing MessagePack strings is unlikely to be any faster than parsing JSON strings (contrary to what others have implied with the "just memcpy" comments). It's just that with JSON the validation is done as part of the parser (remember that JSON strings must escape quotes, backslashes and control characters, and can encode any other character via \u escape codes, so the parser has to look at every byte anyway), whereas with MessagePack you'd need to do that validation as an additional step.
Integers, on the other hand, might differ, since JSON would need additional validation (again, baked into the parser) which MessagePack would not: MessagePack encodes integers as binary integers, whereas JSON encodes them as ASCII digits that need converting back to binary integers.
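A rough sketch of that difference, in Python. The function names are hypothetical, the JSON side is a simplified digit loop (it ignores JSON's leading-zero rule, fractions and exponents), and the MessagePack side assumes the "uint 16" layout (marker 0xcd followed by two big-endian bytes).

```python
def parse_json_int(text: str) -> int:
    """Convert ASCII decimal digits to an int, checking every character."""
    negative = text.startswith("-")
    digits = text[1:] if negative else text
    if not digits:
        raise ValueError("no digits")
    value = 0
    for ch in digits:
        # Validation and conversion happen together, one character at a time.
        if not "0" <= ch <= "9":
            raise ValueError(f"invalid digit {ch!r}")
        value = value * 10 + (ord(ch) - ord("0"))
    return -value if negative else value


def parse_msgpack_uint16(buf: bytes) -> int:
    """Decode a MessagePack uint 16: marker 0xcd, then 2 big-endian bytes."""
    if len(buf) < 3 or buf[0] != 0xCD:
        raise ValueError("not a complete uint 16")
    # Any two bytes are a valid value; only the length needed checking.
    return int.from_bytes(buf[1:3], "big")
```

Both calls below yield the same number, but only the first one had to inspect the content of each byte: parse_json_int("12345") and parse_msgpack_uint16(b"\xcd\x30\x39").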
Many (most?) applications do not actually care whether a byte blob of text is structurally valid UTF-8. They are either passing it around as an opaque byte blob, or already applying much stricter application-specific validation. Validating UTF-8 automatically at the serialization layer is a huge waste of cycles, especially in a big distributed system.
On closed systems where you control both the input and output, sure (though I’d still recommend against that particular shortcut because it’s an easy way for bugs to go undetected).
However, if you’re accepting MessagePack-encoded data from insecure systems (such as end users), then you absolutely should be validating your input somewhere along the pipeline, and it’s usually better to do that early on.
Also, it’s not generally the distributed systems you worry about when it comes to this specific degree of micro-optimisation (which is basically what this is); it’s the monolithic ones. Distributed architecture is meant to solve various problems (for example high availability, reduced geographical latency, or running a single site on cheaper commodity hardware), but often at the cost of CPU cycles. Monolithic infrastructures with fewer servers (such as Stack Overflow’s setup), on the other hand, depend far more on reducing computational overhead wherever corners can be cut. However, they’d also be significantly less likely to need networked RPCs via MessagePack anyway (simply due to the monolithic design of their architecture).
buffer_len could be larger than the message, in which case the copy reads past the end of the message and pulls unrelated data into memory.
Similar to Heartbleed, where the length in the heartbeat message wasn't validated and the server would echo back buffer_len bytes instead of just what was sent.
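A toy Python simulation of that failure mode, not the actual OpenSSL code. MEMORY, PAYLOAD_LEN and both function names are invented for illustration: the "memory" is a request payload followed by unrelated data, and the only difference between the two functions is whether the sender's claimed length is compared against what was really received.

```python
# Simulated process memory: a 4-byte request payload followed by unrelated data.
MEMORY = b"ping" + b"SECRET-KEY"
PAYLOAD_LEN = 4  # number of bytes the peer actually sent


def echo_unchecked(claimed_len: int) -> bytes:
    """Trust the sender's length field and echo that many bytes back."""
    return MEMORY[:claimed_len]


def echo_checked(claimed_len: int) -> bytes:
    """Reject any claimed length that exceeds what was actually received."""
    if claimed_len < 0 or claimed_len > PAYLOAD_LEN:
        raise ValueError("claimed length exceeds received payload")
    return MEMORY[:claimed_len]
```

With a claimed length of 14, echo_unchecked returns b"pingSECRET-KEY", while echo_checked raises instead of leaking the trailing bytes.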
I believe the author intended buffer_len to be the length of the incoming buffer (size of the HTTP payload, number of bytes read from a file, length of the database entry, etc.). So the worst that can happen is that the entire input message is consumed, like a JS payload that is missing its closing quote.
I can think of a very contrived situation where this can be a problem, but in most cases this will be perfectly safe.
None of those are serialization schemes. XML can be used for serialization, but if you look at the whole ecosystem it is a Turing-complete complexity monster, so of course it isn't safe.
It depends on what constraints apply to the data. Any bit pattern could be used for an int, but to guarantee a UTF-8 string it would need to be validated.
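A small Python sketch of that distinction, using only the standard library. The helper names as_int32 and as_utf8 are made up for illustration. The same four bytes are always a valid 32-bit integer, but they are only a valid string if they happen to be well-formed UTF-8.

```python
import struct


def as_int32(raw: bytes) -> int:
    """Interpret exactly 4 bytes as a big-endian signed 32-bit integer."""
    (value,) = struct.unpack(">i", raw)
    return value


def as_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8; strict decoding raises on malformed input."""
    return raw.decode("utf-8")


blob = b"\xff\xfe\xfd\xfc"
number = as_int32(blob)  # always succeeds for any 4-byte pattern
try:
    text = as_utf8(blob)
except UnicodeDecodeError:
    text = None  # 0xff can never appear in well-formed UTF-8
```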
No. Everything is 0s and 1s after all. Take a byte, for example. It has 8 bits, and by going through every combination of 0s and 1s you end up with all the possible values of a signed byte: the numbers -128 to 127. So if you were to copy a byte from a random memory location, that byte would just contain some combination of 0s and 1s which, when interpreted as a signed int, is simply a number between -128 and 127.
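This is easy to check exhaustively. A short Python sketch (to_signed_byte is a made-up helper name) that reinterprets each of the 256 possible bit patterns as a signed byte:

```python
import struct


def to_signed_byte(pattern: int) -> int:
    """Reinterpret an 8-bit pattern (0..255) as a signed byte."""
    return struct.unpack("b", bytes([pattern]))[0]


# Every one of the 256 bit patterns maps to some value in -128..127.
values = [to_signed_byte(pattern) for pattern in range(256)]
```

No pattern is rejected and no value falls outside the range, so there is nothing to validate beyond the size of the copy itself.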