Anyone here built or used a data pipeline that validates event structures or pipes data to multiple destinations?

- Did you use JSON, Thrift, Protobuf, Avro or something else to define the schema for the events in the stream?
- Did you version the schema?
- Did you version each event or object in the schema?
- What type of versioning did you use? (E.g. semantic, incremental counters, git hashes, another type of hash, etc.)
- Any other cool things you'd like to share?

We are currently leaning toward a 'nested' JSON structure that lets sub-processes in the various systems inherit values from the parent. Something similar to:
    {
      "type": "schemaX",
      "version": 1,
      "payload": {
        "k1": "v1",
        "type": "schemaY",
        "version": 3,
        "payload": {
          "kk1": "v1"
        }
      }
    }
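
To make the inheritance idea concrete, here is a rough Python sketch of how a consumer might unwrap the nested envelope above. It assumes (my reading, not something we have settled) that a nested envelope is recognised by `type` and `version` keys inside a payload, and that a child sees its parent's values unless it overrides them. The name `unwrap` and the `values` key are made up for illustration.

```python
from typing import Any, Dict, Iterator, Optional

# Keys that describe the envelope itself rather than business values.
ENVELOPE_KEYS = {"type", "version", "payload"}


def unwrap(
    event: Dict[str, Any], inherited: Optional[Dict[str, Any]] = None
) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per envelope level, outermost first.

    Each record carries the values defined at its own level merged on top
    of everything inherited from the levels above it.
    """
    payload = event.get("payload", {})
    # Values defined at this level: payload keys that are not envelope keys.
    own = {k: v for k, v in payload.items() if k not in ENVELOPE_KEYS}
    # Child values win over inherited ones (an assumed precedence rule).
    values = {**(inherited or {}), **own}
    yield {"type": event["type"], "version": event["version"], "values": values}

    # Assumption: a payload holding "type" and "version" is a nested envelope.
    if "type" in payload and "version" in payload:
        child = {
            "type": payload["type"],
            "version": payload["version"],
            "payload": payload.get("payload", {}),
        }
        yield from unwrap(child, values)


event = {
    "type": "schemaX",
    "version": 1,
    "payload": {
        "k1": "v1",
        "type": "schemaY",
        "version": 3,
        "payload": {"kk1": "v1"},
    },
}
records = list(unwrap(event))
```

With the example event, `records` holds two entries: one for `schemaX` with only `k1`, and one for `schemaY` with both `k1` (inherited) and `kk1`.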
The structure itself will be documented using JSON Schema, and hopefully we will be able to validate events as they are processed, though validating every event might be too expensive in high-volume scenarios.
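
One option we are considering for the cost concern is validating only a sample of events. The sketch below is just that idea in Python, not a real JSON Schema validator: `check_envelope` is a hand-written stand-in for whatever validator we end up using, and the rule that `version` is an integer is an assumption based on the example above. `make_sampled_validator` and its `rate` parameter are hypothetical names.

```python
import random
from typing import Any, Callable, List, Optional


def check_envelope(event: Any) -> List[str]:
    """Stand-in structural check; returns a list of problems (empty if OK)."""
    if not isinstance(event, dict):
        return ["event must be an object"]
    errors = []
    if not isinstance(event.get("type"), str):
        errors.append("'type' must be a string")
    version = event.get("version")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(version, int) or isinstance(version, bool):
        errors.append("'version' must be an integer")
    if not isinstance(event.get("payload"), dict):
        errors.append("'payload' must be an object")
    return errors


def make_sampled_validator(
    rate: float, rng: Optional[random.Random] = None
) -> Callable[[Any], Optional[List[str]]]:
    """Build a validator that checks roughly `rate` of the events it sees.

    The returned function gives None when an event was skipped, otherwise
    the list of problems found (empty when the event is valid).
    """
    rng = rng or random.Random()

    def validate(event: Any) -> Optional[List[str]]:
        if rng.random() >= rate:
            return None  # not sampled, passed through unchecked
        return check_envelope(event)

    return validate


validate_all = make_sampled_validator(rate=1.0)
validate_none = make_sampled_validator(rate=0.0)
good = {"type": "schemaX", "version": 1, "payload": {}}
bad = {"type": "schemaX", "version": "1"}
```

Here `validate_all(bad)` reports the string version and the missing payload, while `validate_none(bad)` returns `None` because nothing is sampled. The trade-off is that invalid events can slip through between samples, so this only makes sense if downstream consumers tolerate that.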
The goal is to then build a low latency router that takes in routing policies to forward subsets of events to further data pipelines. The policies themselves will be defined using some kind of DSL (possibly using JSONPath?).
As a whole, this seems to be a fairly common problem [0] that other companies are trying to solve, but a lot of the low-level details are rarely discussed. One component that does seem to be common to services like this is Apache Flink [1].
[0] Netflix Keystone - https://www.youtube.com/watch?v=sPB8w-YXX1s

[1] https://flink.apache.org/