| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by sapek 4175 days ago

There are two advantages to generating code at runtime:

1) In some scenarios you have information at runtime that allows you do generate much faster code. The canonical example is untagged protocols, where serialized payload doesn't contain any schema information and you get schema at runtime. Bond supports untagged protocols (like Avro) in addition to tagged ones (like protobuf and Thrift) and the C# version generates insanely fast deserializer for untagged.

2) It allows programmatic customizations. If the work is done via codegen'ed source code then the only way for user to do something custom is to change the code generator to emit modified code. Even if codegen provides ability to do that, it is very hard to maintain such customizations. In Bond the serialization and deserialization are composed from more generic abstractions: parsers and transforms. These lower level APIs are exposed to the user. As an example imagine that you need to scrub PII information from incoming messages. This is a bit like deserialization, because you need to parse the payload, and a bit like serialization, because you need to write the scrubbed data. In Bond you can implement such an operation from those underlying abstractions and because you can emit the code at runtime you don't sacrifice performance.

BTW, Bond allows to do something similar in C++. The underlying meta-programming mechanism is different (compile-time template meta-programming instead of runtime JIT) but the principle that serialization and deserialization are not special but are composed from more generic abstractions is the same.

1 comments

Someone 4175 days ago

Ad 1): if you don't know at compile time what kind of objects you will deserialize, you need to do more than generate the serialization code; you also need to generate the class. So, basically, you need the entire schema compiler. I still don't see why separating those two generation steps is a gain.

Ad 2): does this mean that one can also do efficient schema migration at deserialization time (rename fields, add fields with default values), or that one can deserialize to something else than the class that got generated when the schema was compiled?

sapek 4175 days ago

1) I didn't do a good job explaining this. You are right that if you want to materialize an object during deserialization you need to know a schema at build time to generate your class. But the crucial things is, and this is true of all similar frameworks, you don't know the schema of the payload at that point. One big reason you use something Protobuf, Thrift or Bond is to get forward/backward compatibility. What this means in essence is that deserialization is always mapping between schema you built your code with and schema of the payload. There are two common ways to do that mapping: (a) payload has interleaved schema information within data and you perform branches at runtime based on what you find in the payload (this is what Protobuf, Thrift and Bond tagged protocols do) (b) you get schema of payload at runtime and use that information perform the mapping (this is what Avro and Bond untagged protocol do). The latter case is particularly suitable for storage scenarios: you read schema from file/stream header and then process many records that have that schema. This is the case where having ability to emit code at runtime results in a huge performance win: you JIT schema-specific deserializer once and amortize this over many records.

2) You can do both. You can also do type safe transformations/aggregations/etc on serialized data w/o materializing any object.