Hacker News new | ask | show | jobs
by stingraycharles 3222 days ago
As always in these types of the scenarios, the answer is: it depends. It depends on the amount of data you have. It depends upon how big the diversion from the original schema is. Etcetera.

My personal philosophy is to always leave event data at rest alone: data is immutable, you don't convert it, and you treat it like a historical artifact. You version each event, but never convert it into a new version in the actual event store. Any version upgrades that should be applied are done when the event is read; this requires automated procedures to convery any event version N to another version N + 1, but having these kind of procedures in place is good practice anyway. Some might argue that doing this every time an event is read is a waste of CPU cycles, but in my experience this far outweights possible downsides of losing the actual event stored at that time in the past, and this type of data is accessed far less frequently than new event data.

3 comments

I suppose you can always trade those CPU cycles off against storage and cache the N+1 version (in a separate Kafka topic or elsewhere), so now reading the latest-version data is fast, yet you still retain the original data intact, at the expense of more storage. This does complicate the storage though, as you now have multiple days a sources, but nothing that can't be solved.
Bingo. If you equate this technique to a DB migration, you could have "up" and "down" directions for the translations from version N <=> N+1.

Then if you have 90% confidence you'll only ever need to replay the upgraded stream, you can upgrade it and destroy the previous version.

If at some point (the remaining 10%) you need to rescue the old stream, you can run the "down" direction and rehydrate the old version of the stream.

It sounds good in theory. In practice, I haven't heard much around running backwards migrations on a data warehouse / massive collection of events but I'm sure some out there already do it.
I suppose one needs to take care that migrations are never lossy so that the full information for upgrading or downgrading a version is available.
Yeah, that's the challenge. For instance, how do you handle when a column was one data type but then down the road was changed to another type when the two aren't cross compatible or could potentially break?
You could retain this info in a meta field of flexible type. For a DB, it could a JSON type. For messages, it could be an extra _meta field on the message that the systems themselves ignore.
> this requires automated procedures to convery any event version N to another version N + 1, but having these kind of procedures in place is good practice anyway

Note that this "good practice" already has a name, it is usually called "migrations".

Migrations may be as simple as SQL Update / ALTER TABLE statements. But they may also be transformation of JSON/XML/... structures, or any other complex calculation.

It may not be the best term, as "migrations" usually imply that the result is written back to the data store. But apart from that, I don't see any problem using this term here.

Ok this makes sense.

It matches what some others have been saying as well. Thanks

There's a whole world out there about this kind of stuff. Take a look at CQRS and some of the posts by Greg Young; they're highly informative and one of the first people to really capture this way of dealing with data properly.
I wonder what happened to Greg Young. I think he had a book in the pipeline (which I was looking forward to), but as far as I can glean from social media, some burnout related stuff happened.