| HN Mirror


by londons_explore 1773 days ago

In reality, such continuous mapreduce jobs lead to unchangeable code and versioning nightmares.

Imagine you want to change part of your pipeline's logic. One option is to reprocess all the data. That is expensive and depends on you having retained past data. Will your low-latency continuous pipeline keep running while the backlog is cleared? Is the code really idempotent, or will a rerun lead to half the records failing to reprocess? The other option is to not reprocess old data. Now the historical records are inconsistent, and what do you do if you make a bad release that just outputs zeros?
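
As a rough illustration of the idempotency concern, here is a minimal sketch, assuming outputs are keyed by record ID plus a pipeline version tag so that a rerun overwrites rows instead of duplicating them. Every name here (`PIPELINE_VERSION`, `OutputStore`, `transform`, `reprocess`) is hypothetical and not taken from any real system.

```python
from dataclasses import dataclass, field

# Hypothetical version tag for the code that produced an output row.
PIPELINE_VERSION = "v2"


def transform(record: dict) -> dict:
    """Stand-in for the pipeline logic that is being changed."""
    return {"session_id": record["session_id"], "total": sum(record["events"])}


@dataclass
class OutputStore:
    # Keyed by (record id, pipeline version): a rerun overwrites, never appends.
    rows: dict = field(default_factory=dict)

    def upsert(self, record_id: str, version: str, value: dict) -> None:
        self.rows[(record_id, version)] = value

    def versions_of(self, record_id: str) -> list:
        """List every code version that has produced output for this record."""
        return sorted(v for (rid, v) in self.rows if rid == record_id)


def reprocess(records, store: OutputStore, version: str = PIPELINE_VERSION) -> None:
    """Re-run the transform over retained input records."""
    for record in records:
        store.upsert(record["session_id"], version, transform(record))
```

Calling `reprocess` twice over the same input leaves the store unchanged, and `versions_of` shows how many versioned copies of one record exist, which is exactly what a deletion request would have to track down.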

In any real organisation, you'll need both approaches, and it'll end up a mess of code versions and data versions. Now some customer comes along and demands a GDPR deletion of their session records, and you have no way to even find all the versions of all the copies of the records, let alone delete them and make everything else consistent...

1 comment

psfried 1773 days ago

Versioning is indeed an issue, but that's the case for anything with long-lived state. We currently rely on JSON schemas, TypeScript, and built-in testing support to help ensure compatibility. Those things actually help quite a bit in practice. But I think we may also want to build some more powerful features for managing versions of datasets, since there's a real need there, regardless of the processing model you use to derive the data.
