by btown 3864 days ago

Kafka is a replicated database, but not for randomly accessible or queryable data. Rather, it stores logs: you can start reading at any point in the log (including the live tail) and play it forwards from there. If you aggregate access logs into it, then use it to feed stream processing frameworks (or feed it into Hadoop for bulk processing), you can use it very effectively for analytics workloads.
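To make the "start at any point and play forwards" idea concrete, here is a minimal in-memory sketch. The `AppendOnlyLog` class and its method names are made up for illustration and are not Kafka's client API; a real Kafka topic is partitioned, persisted to disk, and replicated across brokers.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, List, Tuple


@dataclass
class AppendOnlyLog:
    """Toy stand-in for one log partition (hypothetical, not Kafka's API)."""
    _records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Add a record at the end and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset: int = 0) -> Iterator[Tuple[int, Any]]:
        """Yield (offset, record) pairs, playing forwards from `offset`."""
        for i in range(offset, len(self._records)):
            yield i, self._records[i]


log = AppendOnlyLog()
for path in ["/home", "/about", "/home", "/pricing"]:
    log.append({"path": path})

# One consumer replays everything; another starts partway through.
replayed_all = [record["path"] for _, record in log.read_from(0)]
replayed_tail = [record["path"] for _, record in log.read_from(2)]
```

Each consumer only needs to remember its own offset, which is why many independent readers can share one log without interfering with each other.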

Or you can run your entire business around events-as-ground-truth rather than SQL-style-domain-tables-as-ground-truth. Various modules consume from Kafka and write aggregated records back into Kafka - basically a directed graph of processing modules with Kafka implementing every edge - and finally there's a module that translates the final product into what is essentially a live-updating materialized view for your web backend to consume. LinkedIn does exactly this - they open-sourced Kafka, and its creators went on to found Confluent to help others use this model.
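As a rough sketch of that graph-of-modules shape, the snippet below uses plain Python lists as topics. The topic names (`page_views`, `view_counts`) and both functions are hypothetical, and everything runs as a one-shot batch here, whereas real modules would consume continuously.

```python
from collections import Counter, defaultdict

# Topic name -> append-only list of events (stand-in for Kafka topics).
topics = defaultdict(list)


def count_views_per_page(source: str, sink: str) -> None:
    """Processing module: read raw events, write running aggregates back."""
    counts = Counter()
    for event in topics[source]:
        counts[event["path"]] += 1
        topics[sink].append({"path": event["path"], "views": counts[event["path"]]})


def build_materialized_view(source: str) -> dict:
    """Final module: fold the aggregate topic into a key -> latest value view."""
    view = {}
    for event in topics[source]:
        view[event["path"]] = event["views"]  # later events overwrite earlier ones
    return view


topics["page_views"].extend({"path": p} for p in ["/home", "/about", "/home"])
count_views_per_page("page_views", "view_counts")
view = build_materialized_view("view_counts")
```

The web backend would read only from `view`; the raw `page_views` topic stays untouched as the ground truth.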

Stream processing is a very powerful way of thinking about data management - with the great side effect that "migrations" of data tables are limited only by your imagination, and they don't risk corrupting your data, because the underlying events are never modified; you just replay them into a new view. We use this paradigm, though not Kafka itself (Mongo supports this paradigm and simplifies a lot of things if you aren't yet at LinkedIn scale), in production at http://belstone.com.
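Here is one way to picture a "migration" in this model: instead of altering a table in place, you write a new projection and replay the same events through it. The event shapes and the `project_v1` / `project_v2` functions are invented for illustration.

```python
import copy

# The event log is the ground truth; it is only ever appended to.
events = [
    {"type": "user_registered", "id": 1, "name": "Ada Lovelace"},
    {"type": "user_renamed", "id": 1, "name": "Ada King"},
    {"type": "user_registered", "id": 2, "name": "Alan Turing"},
]
events_before = copy.deepcopy(events)


def project_v1(log):
    """Original view: one 'name' field per user."""
    view = {}
    for e in log:
        view[e["id"]] = {"name": e["name"]}
    return view


def project_v2(log):
    """'Migrated' view: split names, built by replaying the same log."""
    view = {}
    for e in log:
        first, _, last = e["name"].partition(" ")
        view[e["id"]] = {"first_name": first, "last_name": last}
    return view


old_view = project_v1(events)
new_view = project_v2(events)  # build alongside the old view, then switch over
```

If `project_v2` turns out to be wrong, you throw the new view away and replay again; the events themselves were never touched.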

Martin Kleppmann's talks are a great place to start if you want to learn more: http://www.confluent.io/blog/making-sense-of-stream-processi... is an excellent overview.