Hacker News new | ask | show | jobs
by moatra 4036 days ago

  if you want to build a new derived datastore, you can just start a new consumer
  at the beginning of the log, and churn through the history of the log, applying
  all the writes to your datastore.
For high-throughput environments with lots of appends to the log, how do you get around the ever-increasing size of your log file? I know the traditional answer is to take a periodic snapshot and compact the previous data, but is that built in to tools like Kafka?
2 comments

There's a log compaction cleanup policy yes. Never used it myself but if I'm not mistaken it works like this: for each message you send to Kafka, you set a key with it. When Kafka does log compaction, it keeps only the last value for each key.

The other cleanup policy is to just have a retention time. After X minutes/days/weeks segments of the log are simply deleted.

That sounds great if your messages in the logs are the complete state for that key, but I'm not seeing how to use that compaction system if the messages are change events.

Is there a system designed for snapshotting the aggregate and logging the delta?

A common pattern is to publish a "checkpoint" message. Not sure if the concept is built into Kafka or not.
It's easy to store messages in HDFS or S3 for long-term storage. It's also easy to replay messages from those mediums, if you need to re-ingest data later on.
One idea is to shard the logs. By analogy with git: any given repo has a log of its commits, but you can have as many repos as you like.

It does limit throughput for any given shard, though, and then you're left with a distributed transaction problem to solve when you need to commit changes to objects in different repos.