Hacker News new | ask | show | jobs
by yolesaber 3705 days ago
Let's say you have a CMS which pushes content to your site. You also want to make the whole site searchable, so you index your content into (e.g) Elasticsearch. Kafka is great for this because you can put the content onto Kafka's message queue and then have a service reading from it which then put's it into Elasticsearch. It scales well, too. So let's say your site takes off and you have hundreds of articles published a day (not to mention updates, deletions etc) - these events can all be sent to kafka and it will maintain the order as well as still be fast. You can also have many many services reading (consuming) from it simultaneously and it will handle it nicely.

Basically, if you want to get data from one place to another and care about order, Kafka is a good solution. It acts as a middleman between services.

1 comments

Hm but why would you not send it directly to ElasticSearch?
Kafka shines when you have multiple services that have data to publish and multiple services that need to read that data stream. If you have three services and they write to ES, publish metrics to some other store, and log events to the db, you could instead write that all to Kafka, and individual consumers can use the data (for instance, to put into ES). On the origin-service side, it has one integration point; it does not need to know about ES. Now let's say that your users want a near real-time dashboard of their data changes on your multiple services. All you do is make a new consumer from Kafka. You don't add it to your three services. Kafka simplifies your service relation graph.
Well I definitely support the example of using Kafka for analytics with a streaming solution like Flink or Spark, etc. However I asked the "why not directly to ES" question because the example of using Kafka just as a layer in front of ES I felt it kinda painted Kafka layer as something "we could do because we can, not because we need to".
Because then you have the problem of dual-writes.

https://martin.kleppmann.com/2015/05/27/logs-for-data-infras...

One key ability is to batch updates.

We just implemented Kinesis (AWS service similar to Kafka) to reduce load on our Elasticsearch database (~50GB) when running hundreds of individual jobs.

Individual tasks (implemented in Celery, actually running off Redis) push to a Kinesis stream which is then consumed in batches by a very simple processor.

When you need to reindex, all the original writes are still on your Kafka topic and the consumer that fed ElasticSearch and replay from the start.
because: * you can do ES indexing async * having articles index instantly is not critical (I guess)
Well ES is pretty fast by itself - lots of people use it to store log entries(ELK stack) and every log line triggers an indexing event in ES. Introducing Kafka into the mix just seems like an unnecessary complication.
It's not unnecessary if ES is just one of the endpoints. Kafka shines because it can accommodate a ton of consumers - so you can write once to kafka and then use it populate ES, databases, whatever. Furthermore, what happens if you need to re-index (say, if you update a mapping to an existing object)? It becomes trivial to reindex by replaying all the data from Kafka into ES thus saving you a lot of time.

If you are just dumping into ES, then yes, probably not the best tool (though it wouldn't necessarily hurt) - just use the HTTP API for that. However if you want to build a robust pipeline for multiple services or think you'll be needing to scale the feed into ES, Kafka is useful.