Hacker News new | ask | show | jobs
by daddykotex 3215 days ago
I've had the question for a while so I'll ask it here, maybe someone can help me.

Suppose you modeled your domain with events and your stack is build on top of it. As stuff happens in your application, events are generated and appended to the stream. The stream is consumed by any number of consumers and awesome stuff is produced with it. The stream is persisted and you have all events starting from day 1.

Over time, things have changed and you have evolved your events to include some fields and deprecate others. You could do this without any downtime whatsoever by changing your events in a way that is backward compatible way.

What is the good approach to what I'd call a `replay`?

When you want to replay all events, the version of your apps that will consume the events may not know about the fields that were in the event for day one.

12 comments

As always in these types of the scenarios, the answer is: it depends. It depends on the amount of data you have. It depends upon how big the diversion from the original schema is. Etcetera.

My personal philosophy is to always leave event data at rest alone: data is immutable, you don't convert it, and you treat it like a historical artifact. You version each event, but never convert it into a new version in the actual event store. Any version upgrades that should be applied are done when the event is read; this requires automated procedures to convery any event version N to another version N + 1, but having these kind of procedures in place is good practice anyway. Some might argue that doing this every time an event is read is a waste of CPU cycles, but in my experience this far outweights possible downsides of losing the actual event stored at that time in the past, and this type of data is accessed far less frequently than new event data.

I suppose you can always trade those CPU cycles off against storage and cache the N+1 version (in a separate Kafka topic or elsewhere), so now reading the latest-version data is fast, yet you still retain the original data intact, at the expense of more storage. This does complicate the storage though, as you now have multiple days a sources, but nothing that can't be solved.
Bingo. If you equate this technique to a DB migration, you could have "up" and "down" directions for the translations from version N <=> N+1.

Then if you have 90% confidence you'll only ever need to replay the upgraded stream, you can upgrade it and destroy the previous version.

If at some point (the remaining 10%) you need to rescue the old stream, you can run the "down" direction and rehydrate the old version of the stream.

It sounds good in theory. In practice, I haven't heard much around running backwards migrations on a data warehouse / massive collection of events but I'm sure some out there already do it.
I suppose one needs to take care that migrations are never lossy so that the full information for upgrading or downgrading a version is available.
Yeah, that's the challenge. For instance, how do you handle when a column was one data type but then down the road was changed to another type when the two aren't cross compatible or could potentially break?
> this requires automated procedures to convery any event version N to another version N + 1, but having these kind of procedures in place is good practice anyway

Note that this "good practice" already has a name, it is usually called "migrations".

Migrations may be as simple as SQL Update / ALTER TABLE statements. But they may also be transformation of JSON/XML/... structures, or any other complex calculation.

It may not be the best term, as "migrations" usually imply that the result is written back to the data store. But apart from that, I don't see any problem using this term here.

Ok this makes sense.

It matches what some others have been saying as well. Thanks

There's a whole world out there about this kind of stuff. Take a look at CQRS and some of the posts by Greg Young; they're highly informative and one of the first people to really capture this way of dealing with data properly.
I wonder what happened to Greg Young. I think he had a book in the pipeline (which I was looking forward to), but as far as I can glean from social media, some burnout related stuff happened.
Here's a pretty good, even if a bit too verbose, explanation of various issues and solutions related to event versioning: https://leanpub.com/esversioning/read.

The text is written by Greg Young - the lead on the EventStore [1] project.

[1] https://geteventstore.com/

thank you
I encountered that problem. The ad hoc fix, was to have a version field in each event and functions that translate the old event into new event(s). The code that processes the events only processes events of the current version. If your old events had been denormalized this might result into repetition of events when splitted.
To add to that, you can treat it like you would schema migration on databases: implement v1 to v2, v2 to v3, etc... and replay the migrations in order to migrate from whatever version of the event is to the latest version. This allows keeping event migration code as immutable as the event versions it migrates between.
Ok, thanks for the pointer
I've often wondered the same.

In data warehousing, particularly the Kimball methodology, if descriptive attributes are missing from dimensions, for example, it is common to represent them using some standard value, like, "Not Applicable" or "Unknown" for string values. For integers, one might use -1 or similar. For dates it might be a specially chosen token date which means "Unknown Date" or "Missing Date".

It doesn't solve the problem of truly unknown missing information, but it at least gives a standard practice for labeling it consistently. Think of trying to do analytics on something where the value is unknown?? Not too easy, but at least it is all in one bucket.

Certainly, if past values can be derived, even if they were not created at the time the data was originally created, that is one way of "deriving" the past when it was missing. But, otherwise, I don't think there is any other way to make up for the lack of data/information.

The best way I've seen it done, is to version all of your schemas and have a database that signals all the transformations needed to be done for any given schema version. That way, when your reading a particular event, you can query for all the operations needed to be done on such event and perform them.
What you are referring to is called a "temporal database". Your specific example is called a "bitemporal database".

https://en.wikipedia.org/wiki/Temporal_database

https://en.wikipedia.org/wiki/Bitemporal_Modeling

If the fields are changing then you effectively have DDL and migrations in your code already... so decouple them and version the schema officially. Then record these schema changes as events in the same event stream.

Build a view on against these schema change events as a table of schema version by timestamp to allow for parsing any arbitrary event.

Recently asked on the Kafka users mailing list https://lists.apache.org/thread.html/82692004eb2292e1240c339...
Thanks for sharing, it makes me realize that our messages are not independent.
I highly recommend you look into gRPC.

Building apps using event sourcing, CQRS and microservices can easily become hell if the data models are not thought through.

I thought RPC and CQRS are diametrically opposite patterns. (Although you can use RPC in a CQRS context, but only as a transport/encapsulation layer (so the response says "request queued" or "error", but does not divulge domain specific information ("item created", "item not created")).
I'm also wondering: how one deal with changes to events when using KSQL?
Mind clarifying what you mean by changes to events?

If you create a STREAM or TABLE using KSQL, it makes sure that they are kept updated with every single event that arrives on the source Kafka topics.

That's what you'd expect in a true event-at-a-time Streaming SQL engine, which is what KSQL is.

Suppose you build a STREAM or TABLE from a topic and assume a field in the event is `id`. Later on, you introduce an update to this event where where your replace `id` by `user_id`, how is KSQL reacting?
Don't persist the stream. The problem gets a lot easier if you stop thinking of a message bus as a data store.
How do you do a replay if you don't keep the event somewhere? I did not mean that messages were to be stored in the bus?