Heroku Kafka | HN Mirror

	Heroku Kafka (heroku.com)
	276 points by sixwing 3705 days ago

16 comments

mbseid 3705 days ago

As a former user of Kafka, this is awesome and it would have been a huge help for our company if this was available then. I'm glad to hear that a company is offering Kafka as opposed to other proprietary versions (AWS Kinesis, etc.).

One thing is odd though: there is no mention of disk space at all, only a configuration of retention time. One of Kafka's best features is the use of disk to store large amounts of messages, so you are not RAM bound. Heroku seems to only allow you to set retention times? This could be awesome if they are giving you "unlimited" disk space, but could also be a beta oversight. Interested to see how this progresses.

uhoh-itsmaciek 3704 days ago

Hi, I'm Maciek and I work on the Heroku Kafka team. You don't have to think about disk space--it's on us to make sure there's enough to satisfy the retention settings you configure. We're excited to provide another great open-source project as a managed service!

mbseid 3704 days ago

Thanks for the update. That is awesome. Excited to see what people do with it.

ktamura 3704 days ago

Don't forget that Heroku is the original multi-tenant shop. I wouldn't be surprised if a single Kafka instance stores multiple customers' messages and elastically scales as more customers/data are added.

sixwing 3704 days ago

I'm Rand Fitzpatrick, and this is one of the products I work on at Heroku. None of our current Kafka offerings are multi-tenant.

jonahx 3705 days ago

> What is Kafka?

> Apache Kafka is a distributed commit log for fast, fault-tolerant communication between producers and consumers using message based topics. Kafka provides the messaging backbone for building a new generation of distributed applications capable of handling billions of events and millions of transactions

Can anyone translate this into meaningful English for me?

simonw 3705 days ago

You should read this: https://engineering.linkedin.com/distributed-systems/log-wha... - it's long, but it's one of the most impactful essays I've read on software engineering in years.

Jgrubb 3704 days ago

I'm not able to find the one that made the light bulb go on in my head, but Martin Kleppmann gives some good conf talks around this topic. This one looks promising - https://www.youtube.com/watch?v=GfJZ7duV_MM

rakoo 3704 days ago

I liked this one [http://www.confluent.io/blog/turning-the-database-inside-out...] a lot, because it highlights the strength of Kafka beyond a simple distributed message queue.

koolba 3705 days ago

+1 regarding reading that essay on Kafka. The explanations, illustrations, evolution and use case are all well thought out.

superuser2 3705 days ago

You send a message (for example some JSON) to a Kafka topic. Any number of clients subscribe to that topic with a specific start time-stamp. Pluck a message off the queue, compute with it, send an acknowledgement. Kafka provides strong assurances that all readers get all the messages and report success (it retries otherwise), even if some participants come and go.

Very useful if, say, you have some real world event and dozens of different micro services need to do something about that event, independently.

You can also just use it for logging.

kod 3704 days ago

This is a really inaccurate description. Messages aren't indexed by timestamp in any meaningful way; that feature is currently under development. Messages don't need to be acknowledged, it's the client's responsibility to track what messages have been consumed. The server provides some facilities to make that easier, but ultimately clients can request whatever messages they want (repeating, skipping, whatever), as long as the messages haven't expired out of retention.
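
A minimal sketch of the model described above, where the consumer rather than the server tracks what has been read. `TopicLog` and its methods are hypothetical names for a toy in-memory log, not a Kafka client API.

```python
class TopicLog:
    """Toy in-memory stand-in for one partition of a topic (not a Kafka client)."""

    def __init__(self):
        self._messages = []      # retained messages, oldest first
        self._base_offset = 0    # offset of the oldest retained message

    def append(self, message):
        """Append a message and return the offset assigned to it."""
        self._messages.append(message)
        return self._base_offset + len(self._messages) - 1

    def expire_oldest(self, count):
        """Simulate retention dropping the `count` oldest messages."""
        count = min(count, len(self._messages))
        del self._messages[:count]
        self._base_offset += count

    def fetch(self, offset, max_messages=10):
        """Return (offset, message) pairs starting at `offset`."""
        if offset < self._base_offset:
            raise LookupError("offset %d has expired out of retention" % offset)
        start = offset - self._base_offset
        chunk = self._messages[start:start + max_messages]
        return [(offset + i, m) for i, m in enumerate(chunk)]


log = TopicLog()
for event in ["signup", "login", "purchase", "logout"]:
    log.append(event)

# The consumer, not the server, remembers where it is.
position = 0
batch = log.fetch(position, max_messages=2)
position = batch[-1][0] + 1          # next offset to read

# Nothing is acknowledged or removed on read, so re-reading or skipping is
# just a matter of asking for a different offset.
replayed = log.fetch(0, max_messages=2)
skipped_ahead = log.fetch(3)
```

Reads only fail here once `expire_oldest` has moved the base offset past the requested position, which mirrors the "as long as the messages haven't expired out of retention" caveat.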

If you're actually interested in Kafka, just read the documentation, it's quite good.

superuser2 3704 days ago

Interesting. Those are features my previous employer had and used extensively, in particular stateless clients. I guess we added those layers ourselves.

ashitlerferad 3705 days ago

Sounds like a mailing list.

okaram 3704 days ago

Yes, but for applications; instead of doing a remote call to your other system to create an order or send an email, you just stick the data in a queue, and the other system does it when it feels like it.

strictnein 3704 days ago

We do exactly that for our emails. We're not ultra high volume, but we send millions a day.

manigandham 3705 days ago

Kafka is a distributed logging system. Write lots of data very fast by using sequential I/O. Consuming apps can then read this just as fast and maintain their own state (of where they last read up to) which allows for multiple fast and simple consumers and an easy way to have a lasting "log" of all the data.

tschellenbach 3705 days ago

It's a message queue. You use it for everything you want to do outside of the general request cycle, e.g. making API calls, priming caches, sending emails, etc.

The biggest competitors of Kafka are RabbitMQ and Amazon SQS.

manigandham 3705 days ago

It's not a message queue, it's a logging system. Queues are meant for ephemeral messages that expire once consumed. Kafka is immutable log storage that can be read as many times as necessary by consumers.

Biggest competitors would be AWS Kinesis, Azure EventHubs and Google PubSub.

saryant 3705 days ago

The biggest difference, IMO, is that Kafka is typically used when a message will be consumed by multiple consumers, whereas RabbitMQ or SQS generally send a message to a single consumer.

We use it to ingest ~40mb/s and fan it out to a number of consuming applications.

I'll also add that if you put some thought behind your topic replication and partitioning you can build some incredibly resilient applications. Also that "immutable" isn't necessarily true, it's common for Kafka topics to roll off messages based on time or size. (That's just to clarify for those not familiar with Kafka. I realize that you mean messages are not deleted or modified once written to a topic, other than by topic retention settings)

manigandham 3705 days ago

It's very easy to set up multiple consumers with RabbitMQ topics though, and SQS is a very basic queueing system that just doesn't support much.

To me, Kafka is just meant for much larger magnitudes of scale and persistence (of the entire log of messages for however long you need) as a core feature.

Google's PubSub is still the best blend of traditional queue semantics with Kafka scale and persistence though.

asymmetric 3705 days ago

It's incorrect to say that RMQ sends messages to consumers. What it does is route messages to queues/exchanges. It's then entirely up to you to decide how many consumers will effectively consume them, i.e. you can have as many consumers as you wish.

dkersten 3704 days ago

> messages are not deleted or modified once written to a topic, other than by topic retention settings

Except if you have log compaction turned on, I guess.

A key "selling point" of Kafka for me is that each consumer can decide from when they wish to receive messages. That is, you can replay the messages.
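
The log compaction mentioned above can be sketched as keeping only the newest record per key. This is a toy model under stated assumptions: records are `(offset, key, value)` tuples, `None` stands in for a delete marker, and `compact` is a hypothetical helper rather than anything Kafka exposes.

```python
def compact(log):
    """Keep only the newest record per key, preserving offsets and order.

    `log` is a list of (offset, key, value) tuples, oldest first.
    A value of None is treated as a delete marker; in this sketch the marker
    itself is kept (a real system may drop markers later).
    """
    latest_offset = {}
    for offset, key, _value in log:
        latest_offset[key] = offset      # later records overwrite earlier ones
    return [rec for rec in log if latest_offset[rec[1]] == rec[0]]


log = [
    (0, "user:1", "alice@old.example"),
    (1, "user:2", "bob@example.com"),
    (2, "user:1", "alice@new.example"),
    (3, "user:2", None),                 # delete marker for user:2
]
compacted = compact(log)
```

Older values for a key disappear, so with compaction on, "not deleted or modified" only holds for the latest record of each key.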

amock 3705 days ago

It's a distributed message queue.

gshx 3705 days ago

It can be used as a queue but the bigger benefit is for streaming use cases. One of the key differences, among others, is that streaming assumes somewhat faster consumers as opposed to queueing. There's also the pub-sub use-case which is generally considered separate from that of a queue (considered a point to point transport).

amock 3705 days ago

That is more descriptive, but it still sounds like queue functionality. Streaming processing is just a queue that gets emptied quickly and pub-sub is just a set of queues.

dkersten 3704 days ago

Kafka doesn't generally get emptied quickly, but rather retains messages for a configured time/size. Because of this, consumers can choose to replay previously consumed messages, if they wish to do so.

gshx 3704 days ago

You're right. I was mostly commenting on the common idiomatic ways people differentiate streams vs queues. Indeed, it can be used in both scenarios.

ianvarley 3704 days ago

I made an attempt at this last week:

https://medium.com/salesforce-engineering/the-architecture-f...

franciscop 3705 days ago

I love Heroku and everything they are doing, it's doubtless a push forward for the web as a whole. However, the pricing for hobby sites (including SSL) is crazy from a personal point of view so I'm slowly moving my projects out of it [1][2]. I wish they had some kind of "Hobby Bundle".

[1] http://umbrellajs.com/ [2] http://picnicss.com/

redtuesday 3704 days ago

Check out Red Hat's OpenShift Online [1] if you haven't already. They offer 3 gears for free, each with 512 MB RAM and 1 GB disk space (install e.g. Postgres into one gear and you have a DB with 1 GB), and you can use Let's Encrypt with their Bronze plan (which is free if you only use the 3 free gears). Depending on what your hobby sites do this could be enough.

[1] https://www.openshift.com/pricing/

polymeris 3704 days ago

If you are OK with running a >3 year old version of Postgres, that is.

Other than that, openshift is nice, though, I agree.

redtuesday 3704 days ago

Yes, would be nice if they could upgrade the version of some of the dbs/languages/etc. they support. As a user I could write my own cartridge to support a newer version of Postgres (like I did for Playframework 2.4+), but then I lose the automatic updates and have to update the gear myself.

redtuesday 3704 days ago

btw: it seems you still can't automate the cert renewal for let's encrypt on openshift online.[1]

[1] https://openshift.uservoice.com/forums/258655-ideas/suggesti...

colinbartlett 3704 days ago

I am a longtime Heroku customer on an Enterprise service plan at the moment. I brought this up to my sales rep last time he checked in.

I explained how we moved a bunch of smaller sites to S3, reluctantly because I really like having a unified platform for all our sites. But even though (or perhaps precisely because) we are spending thousands of dollars a month with Heroku, I find the $20/month SSL charge insulting. SSL is not an option anymore.

The good news is, the sales rep said this has come up a lot, they hear us, and to "stay tuned".

rthomas6 3704 days ago

Sounds like a sales rep.

omfg 3704 days ago

They were saying the same thing when we left 2 years ago.

sudhirj 3705 days ago

Their pricing for hobby sites is $7 + $10 for the DB, which is very comparable with setting up your own IaaS like DO or AWS. Personally I think the developer experience is much better on Heroku and quite worth it.

SSL is a pain point, though I do empathize with them - I think they're doing something expensive for that. What I do is to use AWS Cloudfront and ACM for a free cert and site speedup - if they are personal projects the CF bill ought to be in the low few dollars anyway.

flurdy 3704 days ago

The $7 is comparable to one app per DO or AWS micro/nano server. So Heroku wins on convenience.

If you have, say, 10 apps then Heroku costs 10 * $7, but you might still only have used 1-3 servers depending on the memory use of the apps etc., so then Heroku loses on cost.

Naturally I got a total mix of quite a few on Heroku's classic or new free plan, some on their hobby plan, some on AWS, some on docker cloud, most proxied behind a SSL certificate running on AWS..... (https://flurdy.com/docs/letsencrypt/nginx.html)

franciscop 3703 days ago

Yeah, but when I started with PHP I could just choose among many hosting companies for $5-10/month and get shared space with unlimited domains, which was perfectly suited to my needs at that point. Of course, as I learned more (Node and such) I needed better technology, and that's why I moved to Heroku. So I'd love to see a "shared hosting for Heroku" or similar. I think it will happen given some time, when the Node.js environment stabilizes more and more big players come.

balamaci 3704 days ago

DO charges $5 + VAT for a 512 MB instance. I have no problem accommodating MySQL + web on that.

why-el 3704 days ago

+ $7 for a worker, I think.

mintplant 3704 days ago

It looks like those are static sites. Why host them on Heroku? You could stick them on GitHub Pages [1] for free.

[1] https://pages.github.com/

nickfrostatx 3704 days ago

The S3 Free Tier can be a good alternative as well. https://aws.amazon.com/s3/

colinbartlett 3704 days ago

Even off the free tier, S3 + Cloudfront is now my static host of choice. Incredibly cheap for my low traffic side projects and free (and instant) SSL certificate setup.

dkersten 3704 days ago

AWS free tier is free for 12 months only. It's basically a "new customer" special. GitHub Pages is not time limited.

franciscop 3704 days ago

Now they are static sites built with grunt. Before, they were dynamic so the documentation and tests could be joined and run dynamically. Now I just do that before pushing, which had me change the organization of a few things. I converted them and I'm now hosting them on GitHub Pages for that same reason. I learned a lot about grunt, phantomjs and SSL with Cloudflare in the process though, so I'm happy with the result.

njudah 3704 days ago

There will be news on this front soon; stay tuned.

mtw 3704 days ago

I host most of my sites on GitHub Pages (free) and Ruby / Go / Elixir / Node.js projects on DigitalOcean. I also host various logging/analytics/mailing services on DigitalOcean without paying for SaaS services. The total is very reasonable.

desireco42 3704 days ago

And what awesome projects those are. Thank you!

franciscop 3704 days ago

Thanks :)

mateuszf 3704 days ago

If you have multiple applications though - it's possible to use one ssl terminator per domain.

tlrobinson 3704 days ago

One option for HTTPS is to just stick CloudFlare in front of it.

cachemiss 3705 days ago

Kudos to Heroku. As someone who has had to make Kafka into a managed service, I know what a pain it is (I'm not a Kafka fan for a lot of reasons) to administer in a cloud environment.

sethammons 3705 days ago

Would love to hear what you don't care for in Kafka and what alternative solution(s) you prefer.

cachemiss 3705 days ago

To clarify, my feelings towards Kafka are from the POV of someone who has had to build a managed service on top of it, which is not the common use case (with which many people seem to be happy). Other people may have more positive experiences.

In my experience, Kafka is a solid system when you work in its wheelhouse, which is a relatively static set of servers / topics that you add to slowly and deliberately. If you can't use something like Kinesis, then it's a good choice.

In Kafka, programmatic administration is generally an afterthought. They have APIs for doing things, but they generally involve directly modifying znodes. Simple things don't work or have bugs, deleting topics didn't work at all until 0.8.2, and even now has bugs. We've seen cases where if you delete a topic while an ISR is shrinking or expanding, your cluster can get into an unrecoverable state where you have to reboot everything, and even then it doesn't always get fixed. Most of the time you are expected to use scripts to modify everything (there's a wide variety of systems out there that try to build mgmt on top of kafka).

Its dependency on ZooKeeper is a pain, and limits scalability of topic / partition counts. Rebalancing topics will reset retention periods because they use the last-modified timestamp of the segment files to check for oldness, meaning if you rebalance often, you need extra disk space lying around. ZK has some bugs with its DNS handling, which affects Kafka if you try to use DNS.
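
A rough sketch of why moving partitions can reset retention when expiry is judged by a segment file's last-modified time, as described above. The segment names, the 7-day retention and both helper functions are assumptions made up for illustration, not broker code.

```python
DAY = 24 * 3600
RETENTION_SECONDS = 7 * DAY   # hypothetical 7-day retention


def expired_segments(segment_mtimes, now, retention=RETENTION_SECONDS):
    """Return names of segments whose last-modified time is past retention."""
    return sorted(name for name, mtime in segment_mtimes.items()
                  if now - mtime > retention)


def move_partition(segment_mtimes, now):
    """Copying segments to another broker writes fresh files there, so in
    this model every copied segment gets `now` as its last-modified time."""
    return {name: now for name in segment_mtimes}


now = 100 * DAY
# Segment files on the original broker, keyed by a made-up name.
original = {
    "seg-000": now - 10 * DAY,
    "seg-001": now - 8 * DAY,
    "seg-002": now - 1 * DAY,
}

before_move = expired_segments(original, now)
after_move = expired_segments(move_partition(original, now), now)
```

In this model two segments are eligible for deletion before the move and none afterwards, so the old data occupies disk for another full retention period.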

It has throttling, but it's by client ID. What you'd like in some cases is to say that a node has X throughput, and have the broker be able to somewhat guarantee that throughput and create backpressure when clients are overwhelming it. Otherwise your latency can go through the roof. You also want replication to play nice with client requests, and it doesn't (if you add a new broker and move a bunch of partitions to it, you'll light up all your other brokers while it replicates, and cause timeouts).

Its replication story can cause issues when network partitions come into play.

It's highly configurable like many Apache projects, which is a blessing and a curse, as your team has to know all the knobs, both consumer / producer / broker side.

The alternative if you are at a company with the resources to do so (mine is), is to build something that fits your use case better than Kafka, or to use a hosted service like this, or Kinesis.

emfree 3705 days ago

Thanks for the insightful comment!

> The alternative if you are at a company with the resources to do so (mine is), is to build something that fits your use case better than Kafka

I'd love to hear more about this :) What did you end up doing differently from Kafka? How's it working out for you?

Dr_tldr 3705 days ago

Your comment is highly technical, critical, but still very fair. This is why I love HN.

sethammons 3705 days ago

Thanks for the thoughtful and detailed response! Very helpful.

ChartsNGraffs 3705 days ago

For anyone wanting to play with Kafka, Spotify's Kafka container was an invaluable resource for getting me up and running with Kafka. All the Zookeeper dependencies are taken care of allowing you to just start playing with Kafka right away. https://github.com/spotify/docker-kafka https://hub.docker.com/r/spotify/kafka/

Jarmo 3705 days ago

I never tried spotify's container. Tried wurstmeister's, and was able to run it on a single server for testing purposes, but kept running into issues while clustering on different servers. Decided to use Ambari and have it do all the work for me instead.

manigandham 3705 days ago

This will be interesting to try out. I've used all the major cloud event/logging systems (Kinesis, Azure EventHubs, etc) and so far Google PubSub is the best in features and performance.

Only downside with Google Pubsub can be latency (which I'm working on fixing by building a gRPC driver) but Kafka has proven to be too complicated to maintain in-house. If heroku can provide the speed without the ops overhead, it'll be some good competition to Google's option.

Also want to note that Jay Kreps who helped build Kafka at LinkedIn is now behind http://www.confluent.io/ which is like a better/enterprise version of Kafka.

alexatkeplar 3705 days ago

Not sure why you are comparing Google Cloud Pub/Sub to Kinesis - the former is a MQ system, not a distributed commit log.

When creating a Kinesis consumer, I can specify whether I want to start reading a stream from a) TRIM_HORIZON (which is the earliest events in the stream which haven't yet been expired aka "trimmed"), b) LATEST which is the Cloud Pub/Sub capability, c) AT_SEQUENCE_NUMBER {x} which means from the event in the stream with the given offset ID, d) AFTER_SEQUENCE_NUMBER {x} which is the event immediately after c), e) AT_TIMESTAMP to read records from an arbitrary point in time.

A Kinesis stream (like a Kafka topic) is a very special form of database - it exists independently of any consumers. By contrast, with Google Cloud Pub/Sub [1]:

> When you create a subscription, the system establishes a sync point. That is, your subscriber is guaranteed to receive any message published after this point.

[1] https://cloud.google.com/pubsub/subscriber

So the stream is not a first class entity in Cloud Pub/Sub - it's just a consumer-tied message queue.
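
The starting positions listed above can be modelled as choosing an index into a retained, ordered list of records. This is a toy sketch: the iterator type names come from the comment, while `start_index` and the record layout are assumptions and not the Kinesis API.

```python
import bisect


def start_index(records, iterator_type, sequence_number=None, timestamp=None):
    """Return the list index at which a new reader should start.

    `records` is a list of (sequence_number, timestamp, data) tuples that are
    still retained, sorted by sequence number and by timestamp.
    """
    if iterator_type == "TRIM_HORIZON":
        return 0                                  # oldest retained record
    if iterator_type == "LATEST":
        return len(records)                       # only records written later
    sequence_numbers = [r[0] for r in records]
    if iterator_type == "AT_SEQUENCE_NUMBER":
        return bisect.bisect_left(sequence_numbers, sequence_number)
    if iterator_type == "AFTER_SEQUENCE_NUMBER":
        return bisect.bisect_right(sequence_numbers, sequence_number)
    if iterator_type == "AT_TIMESTAMP":
        timestamps = [r[1] for r in records]
        return bisect.bisect_left(timestamps, timestamp)
    raise ValueError("unknown iterator type: %r" % iterator_type)


records = [(101, 10.0, "a"), (102, 20.0, "b"), (103, 30.0, "c"), (104, 40.0, "d")]

from_start = records[start_index(records, "TRIM_HORIZON"):]
from_now = records[start_index(records, "LATEST"):]
at_102 = records[start_index(records, "AT_SEQUENCE_NUMBER", sequence_number=102):]
after_102 = records[start_index(records, "AFTER_SEQUENCE_NUMBER", sequence_number=102):]
since_25 = records[start_index(records, "AT_TIMESTAMP", timestamp=25.0):]
```

Every option except `LATEST` needs the records to exist independently of the reader, which is the distinction the comment draws against a subscription that only sees messages published after it was created.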

nivertech 3703 days ago

Is there something like Kinesis' AT_TIMESTAMP in Kafka?

I think the only way to replay events in Google Cloud Pub/Sub is to create multiple subscriptions in advance, right after topic creation. But then I think you need to pay for the storage and event traversal requests.

rtehfm 3704 days ago

What are your thoughts on Kafka vs Flume?

andreasklinger 3705 days ago

For those wondering (all imo and only best guess)

The biggest advantage of kafka is that all of the heroku marketplace all of a sudden becomes "plug and play"

Essentially it's the "backend data" equivalent of what segment does for "frontend data".

Example: What's the benefit of having a graphDB service in the marketplace if most people don't want to / can't invest engineering in keeping the data in (realtime) sync?

With kafka they can establish standards that all partners can adapt to, they will simply offer piping of all heroku postgres/redis changes.

hmottestad 3705 days ago

Does anyone know if Kafka has improved on their data loss issues since tested by Aphyr? https://aphyr.com/posts/293-jepsen-kafka

A quote from the article: "At the end of the run, Kafka typically acknowledges 98–100% of writes. However, half of those writes (all those made during the partition) are lost."

lars_francke 3704 days ago

Yes, the suggestion discussed by Aphyr has been implemented. You can now set a lower bound on the ISR size (min.insync.replicas). Together with request.required.acks=-1 (acks=-1 or acks=all in the newer Java producer) you can wait for a message to be committed to at least min.insync.replicas nodes.

https://issues.apache.org/jira/browse/KAFKA-1555
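
A toy model of the rule described above: with acks set to -1, a write is refused when the in-sync replica set is smaller than the configured minimum. `ToyPartition` and the `NotEnoughReplicas` exception are hypothetical names defined here for illustration; this is not broker or client code.

```python
class NotEnoughReplicas(Exception):
    """Raised by this toy model when a write cannot be safely committed."""


class ToyPartition:
    def __init__(self, replicas, min_insync_replicas):
        self.in_sync = set(replicas)          # the ISR, leader included
        self.min_insync_replicas = min_insync_replicas
        self.committed = []

    def drop_from_isr(self, replica):
        """Simulate a replica falling behind or being partitioned away."""
        self.in_sync.discard(replica)

    def produce(self, message, acks):
        """With acks == -1, refuse the write unless the ISR is large enough."""
        if acks == -1 and len(self.in_sync) < self.min_insync_replicas:
            raise NotEnoughReplicas(
                "ISR has %d members, need %d"
                % (len(self.in_sync), self.min_insync_replicas))
        self.committed.append(message)
        return len(self.committed) - 1        # offset of the write


partition = ToyPartition(replicas=["b1", "b2", "b3"], min_insync_replicas=2)
first_offset = partition.produce("m0", acks=-1)   # 3 in sync: accepted

partition.drop_from_isr("b2")
partition.drop_from_isr("b3")                     # the leader is now alone

try:
    partition.produce("m1", acks=-1)
    rejected = False
except NotEnoughReplicas:
    rejected = True                               # refused instead of silently lost
```

The producer sees an error and can retry later, rather than getting an acknowledgement for a write that exists on a single isolated node.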

koolba 3705 days ago

I've wondered why there isn't a "big player" in the cloud space for this. Felt like a hole.

My operating theory is that the people who would really make use of something like this have grown beyond managed offerings and would take it in house. For smaller operations Redis is more than enough for pub/sub. Ditto for SQS for externally triggered eventing.

bjt 3705 days ago

> For smaller operations Redis is more than enough for pub/sub.

I didn't find that to be so at my last job, one of those smaller operations.

With Redis you're forced to pick between two severely constrained options:

1. Use PUBLISH/SUBSCRIBE. This is nice if you want to have several listeners all receive the same message. But if a listener is down, there's no way for it to recover a message that it missed. If there is no one listening, messages are just dropped.

2. Use LPUSH/BRPOP. This is nice if you want to have several workers all pulling from the same queue, but isn't sufficient if you want to have several queues streaming from the same topic. (E.g. one listener is responsible for syncing to ElasticSearch and another one is syncing to your analytics DB.)
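
The two options above can be contrasted with a small in-memory model. `ToyPubSub` and `ToyWorkQueue` are hypothetical classes that imitate the semantics described in the list; they do not talk to Redis or use a Redis client.

```python
from collections import deque


class ToyPubSub:
    """Fan-out: every listener connected at publish time gets a copy."""

    def __init__(self):
        self.listeners = {}                   # name -> list of received messages

    def subscribe(self, name):
        self.listeners[name] = []

    def unsubscribe(self, name):
        return self.listeners.pop(name)

    def publish(self, message):
        for inbox in self.listeners.values():
            inbox.append(message)
        return len(self.listeners)            # 0 means the message was dropped


class ToyWorkQueue:
    """Work queue: each message is handed to exactly one worker."""

    def __init__(self):
        self.items = deque()

    def push(self, message):
        self.items.appendleft(message)        # push on the left ...

    def pop(self):
        # ... pop from the right, so the oldest message comes out first
        return self.items.pop() if self.items else None


bus = ToyPubSub()
bus.subscribe("search-indexer")
bus.subscribe("analytics")
bus.publish("order-1")                        # both listeners receive it
missed_inbox = bus.unsubscribe("analytics")   # analytics goes down
bus.publish("order-2")                        # analytics never sees this one

queue = ToyWorkQueue()
queue.push("order-1")
queue.push("order-2")
worker_a = queue.pop()                        # gets "order-1"
worker_b = queue.pop()                        # gets "order-2"; worker_a never sees it
```

Neither model gives every listener its own durable copy of every message, which is the gap the comment is pointing at.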

I strongly prefer RabbitMQ. Its model of exchanges and queues supports mixing and matching these semantics much more flexibly.

manigandham 3705 days ago

Agreed, and this is because Redis is a database first, with some pub/sub and nice lists functionality. RabbitMQ is a proper message queue (mq) which provides the necessary features for bigger applications.

However RabbitMQ is also pretty fragile and terrible at scaling. NATS.io is another system that's got the messaging right and is working on persistence soon.

kinkdr 3705 days ago

How stable is RabbitMQ? I've been looking into moving away from Redis pub/sub for a bit now.

manigandham 3704 days ago

RabbitMQ is ok in single server and has lots of flexibility but struggles at high throughput ( > 100k/sec) and the clustering setup is not great. There are also lots of edge case bugs.

If you don't need persistence, look at using nats.io which is a much more stable and reliable pub/sub system. You can build persistence on top of it or wait a few months until they finish their new project STAN.

kinkdr 3702 days ago

Thanks! 100k is far more than I need, but I couldn't find something that would fit exactly my needs, so I ended up rolling my own.

alexatkeplar 3705 days ago

IBM Bluemix has been offering "pure" hosted Kafka as their MessageHub product for a little while: https://developer.ibm.com/messaging/message-hub/

MessageHub's lead engineer Oliver Deakin gave a talk at Unified Log London recently where he explained how MessageHub was architected under the hood; it was super interesting. Slides are available here: http://www.meetup.com/unified-log-london/events/229693782/

rhodin 3704 days ago

Not a "big player" (yet), but we've been offering Apache Kafka as a Service since June 2015: www.cloudkafka.com.

amock 3705 days ago

Depending on what you mean by "this" there are offerings by the big players. Google has Cloud Pub/Sub and AWS has Kinesis in addition to SQS, so two of the big players do have offerings. I'm not familiar enough with Azure to know what it has.

koolba 3705 days ago

By "this" I meant a managed Kafka cloud offering. I'm generally a fan of these types of services as there isn't as tight a binding as with proprietary ones. Migrating from Heroku Postgres to RDS or self hosted is well defined. Ditto for Redis migrations.

SQS, Kinesis, and other proprietary ones not so much. You can insulate your code base but if you're really going to leverage the ecosystem of those services then you're going to be stuck there. That's why I find something like this interesting. The "out" is there so it makes it easier to accept getting in.

manigandham 3705 days ago

There really isn't much lock-in when it comes to event logging systems. Just change the interface your code uses to whatever service you need. There might be a little refactoring to handle topics in the different ways but it's all ultimately the same thing.

Since logging by nature offers asynchronous processing, you can migrate your publishers first and then the consumers without any downtime.

manigandham 3705 days ago

Azure has Event Hubs that are very similar to Kinesis/Kafka.

https://azure.microsoft.com/en-us/services/event-hubs/

They also have simpler Queues and Service Bus for RPC/lightweight message handling.

plunchete 3705 days ago

Is the pricing public?

neovintage 3705 days ago

Not yet. We're working on it during our early access program. We'll be looking for lots of feedback from customers.

plunchete 3704 days ago

Thanks! Looking forward to be able to try it :)

nodesocket 3705 days ago

Can somebody provide a real-life use case for Kafka? I've seen comparisons with Redis, but what specifically does Kafka solve that Redis cannot?

ChartsNGraffs 3705 days ago

I'd say its biggest differentiator from a typical messaging system is the ability to rewind and reconsume messages. It's meant to offload a large volume of data quickly and then retain it for some time so that it can be processed later on. Data is published to topics and it is entirely feasible to read from one (or more) topics, process that data and then publish the results to a different topic. In comparison to Redis, I would say that while they overlap they're each better suited for different problems. Redis is blazing fast, but its parallelism/replication story isn't as great as Kafka's. Redis is a lot easier to get running though.

yolesaber 3705 days ago

Let's say you have a CMS which pushes content to your site. You also want to make the whole site searchable, so you index your content into (e.g.) Elasticsearch. Kafka is great for this because you can put the content onto Kafka's message queue and then have a service reading from it which then puts it into Elasticsearch. It scales well, too. So let's say your site takes off and you have hundreds of articles published a day (not to mention updates, deletions, etc.): these events can all be sent to Kafka and it will maintain the order as well as still be fast. You can also have many, many services reading (consuming) from it simultaneously and it will handle it nicely.

Basically, if you want to get data from one place to another and care about order, Kafka is a good solution. It acts as a middleman between services.

balamaci 3704 days ago

Hm but why would you not send it directly to ElasticSearch?

sethammons 3704 days ago

Kafka shines when you have multiple services that have data to publish and multiple services that need to read that data stream. If you have three services and they write to ES, publish metrics to some other store, and log events to the db, you could instead write that all to Kafka, and individual consumers can use the data (for instance, to put into ES). On the origin-service side, it has one integration point; it does not need to know about ES. Now let's say that your users want a near real-time dashboard of their data changes on your multiple services. All you do is make a new consumer from Kafka. You don't add it to your three services. Kafka simplifies your service relation graph.

balamaci 3704 days ago

Well I definitely support the example of using Kafka for analytics with a streaming solution like Flink or Spark, etc. However I asked the "why not directly to ES" question because the example of using Kafka just as a layer in front of ES I felt it kinda painted Kafka layer as something "we could do because we can, not because we need to".

dkersten 3704 days ago

Because then you have the problem of dual-writes.

https://martin.kleppmann.com/2015/05/27/logs-for-data-infras...

evgenyp 3704 days ago

One key ability is to batch updates.

We just implemented Kinesis (AWS service similar to Kafka) to reduce load on our Elasticsearch database (~50GB) when running hundreds of individual jobs.

Individual tasks (implemented in Celery, actually running off Redis) push to a Kinesis stream which is then consumed in batches by a very simple processor.
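
A minimal sketch of the batching idea above: records arrive one at a time but reach the search index in groups. `batches` and `bulk_index` are hypothetical helpers, and the list `bulk_requests` stands in for real bulk writes.

```python
def batches(stream, max_batch_size):
    """Group an iterable of records into lists of at most `max_batch_size`."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    if batch:                      # flush the final, possibly short, batch
        yield batch


bulk_requests = []                 # stands in for bulk writes to the search index


def bulk_index(batch):
    """Hypothetical sink: one call per batch instead of one per record."""
    bulk_requests.append(list(batch))


updates = [{"id": i, "status": "done"} for i in range(7)]
for batch in batches(updates, max_batch_size=3):
    bulk_index(batch)
```

Seven individual updates become three index requests here; the producers of those updates do not need to know that batching happens.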

saryant 3704 days ago

When you need to reindex, all the original writes are still on your Kafka topic and the consumer that fed ElasticSearch can replay from the start.

kubek2k 3704 days ago

because:
* you can do ES indexing async
* having articles index instantly is not critical (I guess)

balamaci 3704 days ago

Well ES is pretty fast by itself - lots of people use it to store log entries(ELK stack) and every log line triggers an indexing event in ES. Introducing Kafka into the mix just seems like an unnecessary complication.

yolesaber 3704 days ago

It's not unnecessary if ES is just one of the endpoints. Kafka shines because it can accommodate a ton of consumers - so you can write once to Kafka and then use it to populate ES, databases, whatever. Furthermore, what happens if you need to re-index (say, if you update a mapping to an existing object)? It becomes trivial to reindex by replaying all the data from Kafka into ES, thus saving you a lot of time.
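
A sketch of re-indexing by replay, as described above: the index is a derived view, so a changed mapping only needs the retained log to be read again from the start. The event shape, `build_index` and both mapping functions are assumptions for illustration; a plain dict stands in for the search index.

```python
def build_index(log, mapping):
    """Replay every retained event, in order, into a fresh index (a dict)."""
    index = {}
    for event in log:
        if event["op"] == "delete":
            index.pop(event["id"], None)
        else:                                   # "upsert"
            index[event["id"]] = mapping(event["doc"])
    return index


# The retained log of content changes, oldest first.
log = [
    {"op": "upsert", "id": 1, "doc": {"title": "Hello Kafka"}},
    {"op": "upsert", "id": 2, "doc": {"title": "Draft"}},
    {"op": "upsert", "id": 1, "doc": {"title": "Hello, Kafka!"}},
    {"op": "delete", "id": 2},
]


def mapping_v1(doc):
    return {"title": doc["title"]}


def mapping_v2(doc):
    # The new mapping adds a lowercased field, so old documents need re-indexing.
    return {"title": doc["title"], "title_lower": doc["title"].lower()}


index_v1 = build_index(log, mapping_v1)
index_v2 = build_index(log, mapping_v2)         # same log, replayed from the start
```

The source system is never asked to resend anything; only the consumer that feeds the index changes.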

If you are just dumping into ES, then yes, probably not the best tool (though it wouldn't necessarily hurt) - just use the HTTP API for that. However if you want to build a robust pipeline for multiple services or think you'll be needing to scale the feed into ES, Kafka is useful.

manigandham 3705 days ago

Kafka and Redis are very different things - see this: https://news.ycombinator.com/item?id=11577312

Redis is a database; Kafka is a data logging system built for scale and throughput. Event processing of any kind (stocks, ad impressions, ecommerce purchases) is a great fit. It's also good as a message queue unless you need ultra-low-latency RPC.

nodesocket 3705 days ago

Gotcha, so then advantage of Kafka over Logstash + ElasticSearch?

saryant 3705 days ago

Kafka, Redis and Logstash+ElasticSearch have really nothing to do with each other.

Kafka is a distributed, fault-tolerant and highly scalable message broker.

Redis is a very fast key/value (and other data types) store.

I suppose that at a high level Logstash can be compared to Kafka but IME Logstash can't handle scale. It's trivially easy to bring Logstash to its knees.

Elasticsearch is, well, a search engine.

010a 3705 days ago

Redis does a lot more than just store keys and values. Functionally speaking, Redis Pub/Sub and Kafka are interchangeable up to a certain level of throughput.

saryant 3705 days ago

They really serve very different use cases. Redis pub/sub consumers only receive messages while connected, whereas Kafka consumers can pick up where they left off. Event ordering, back pressure, etc.

There are many use cases Redis pub/sub can't serve beyond just scalability.

balamaci 3704 days ago

Well, Logstash can output data into Kafka or ElasticSearch. So you could, for example, transform logs to JSON or do simple text processing in Logstash and put the result into ElasticSearch for log searching, but you can also put it in Kafka and then do stream processing with all sorts of tools like Flink, Spark, etc. That gives you the possibility of doing some realtime analysis of what users do all over your stack. Too many "Login Failed" events and maybe you have an attacker trying to brute-force a password, and maybe you need to present them with a captcha screen.
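The "Login Failed" idea could look roughly like the sketch below. It is plain Python over a list of events, not Flink or Spark code; the event shape, the 60-second window, and the threshold of 5 are all assumptions made for illustration, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60  # assumed window size
MAX_FAILURES = 5     # assumed threshold before showing a captcha


def users_needing_captcha(events, window=WINDOW_SECONDS, limit=MAX_FAILURES):
    """Return users with more than `limit` failed logins inside `window` seconds."""
    recent = defaultdict(deque)  # user -> timestamps of recent failures
    flagged = set()
    for ev in events:
        if ev["type"] != "login_failed":
            continue
        times = recent[ev["user"]]
        times.append(ev["ts"])
        # Drop failures that have fallen out of the sliding window.
        while times and ev["ts"] - times[0] > window:
            times.popleft()
        if len(times) > limit:
            flagged.add(ev["user"])
    return flagged


events = (
    [{"type": "login_failed", "user": "mallory", "ts": t} for t in range(6)]
    + [{"type": "login_failed", "user": "bob", "ts": t} for t in range(0, 180, 30)]
    + [{"type": "login_ok", "user": "alice", "ts": 200}]
)
flagged = users_needing_captcha(events)
```

Here "mallory" fails six times within six seconds and gets flagged, while "bob" also fails six times but spread over minutes, so his count inside any one window stays low.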

manigandham 3705 days ago

ElasticSearch is a database optimized for searching, not related at all.

You can somewhat compare Kafka to Logstash, but Kafka has no processing; it's purely a distributed log writing/reading/storage system that also scales far more than Logstash can. You write data to it and then read from it with a basic messaging abstraction of topics and partitions.

ec109685 3705 days ago

ElasticSearch can store sequenced number data, which is really all that Kafka is doing, so I don't think it is fair to say it isn't related at all.

allengeorge 3704 days ago

So can an RDBMS... But that doesn't mean that Kafka and databases are related.

As multiple comments have stated above, Kafka is really a distributed message subsystem. Its core interface is a set of topics that one can publish to, and that consumers can read from (in other words, a pub-sub system). Kafka doesn't inspect the message payload at all.

Elasticsearch is an unstructured (to some extent) document store that's optimized around document search. So at the very least, the payload is important when using Elasticsearch.

manigandham 3704 days ago

That's like saying they all write data to disk so they're related.

Elasticsearch is all about saving, inspecting, indexing and retrieving your data through a rich document-based model and search-optimized methods.

ES might be able to do the same thing functionally because it operates at a higher level but ultimately will never scale or be as simple in access as Kafka.

jbob2000 3704 days ago

The comments in this thread are funny;

Hey, what is Kafka?

"It's a distributed logging system, not a message queue"

Ok, what's the use case?

describes a case where it's used as a message queue

tibbon 3705 days ago

Kafka vs Redis. I've only used Redis... what should I know?

manigandham 3705 days ago

Redis is an in-memory (with persistence) key-value database that also implements some basic structures like lists, sets and hashes natively.

Kafka is a distributed logging system that can ingest large amounts of data straight to disk, then allows for multiple consumers to read this data through a simple abstraction of topics and partitions. Consumers maintain their own position of where they last read up to (or re-read things if they want) and everything is sequential I/O which creates very high throughput.
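A toy version of the topics-and-partitions abstraction, to make the description above concrete. `Topic` and `Consumer` are hypothetical classes written for this sketch, not the Kafka client API, and the CRC32-based key hashing is an assumption chosen for simplicity rather than Kafka's actual partitioner.

```python
import zlib


class Topic:
    """Toy topic: a fixed number of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Same key -> same partition, so order is preserved per key.
        p = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p


class Consumer:
    """Toy consumer: tracks its own read position in every partition."""

    def __init__(self, topic):
        self.topic = topic
        self.positions = [0] * len(topic.partitions)

    def poll(self):
        out = []
        for p, records in enumerate(self.topic.partitions):
            out.extend(records[self.positions[p]:])
            self.positions[p] = len(records)
        return out

    def seek_to_beginning(self):
        # Re-reading is just a matter of resetting the positions.
        self.positions = [0] * len(self.positions)


topic = Topic(num_partitions=3)
for n in (1, 2, 3):
    topic.produce("user-1", n)
topic.produce("user-2", 99)

a, b = Consumer(topic), Consumer(topic)
a_first = a.poll()   # a reads everything
a_second = a.poll()  # nothing new for a
b_first = b.poll()   # b is unaffected by a's position
a.seek_to_beginning()
a_replay = a.poll()  # a re-reads the full log
```

Each consumer advancing its own position is what lets many independent readers share one log without interfering with each other.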

mtw 3704 days ago

What kind of companies or startups usually use this service?

rhodin 3704 days ago

Companies dealing with large amounts of data. A list with some companies using Apache Kafka can be found here: https://cwiki.apache.org/confluence/display/KAFKA/Powered+By

mtw 3704 days ago

thanks. I guess my sites are not big enough (yet)

poooogles 3704 days ago

It's pretty big in ad tech, or anywhere that really does lots and lots of centralised logging (Datadog/Loggly both use Kafka).

Lots of places also use it just as a message queue, some places for example write time series metrics to Kafka for monitoring.

tenismyanswer 3704 days ago

All kafkaesque to me ;->

elcct 3704 days ago

My impression of Kafka was that this thing is bloated. How does it compare to something like NSQ?

kasey_junk 3704 days ago

It's a completely different use case. Many times people call Kafka a "message queue" but it's not. It's a distributed log service. It's possible to build a message queue on top of a distributed log service but there are reasons not to.

It's better to think of Kafka as a database for events, not as a transport mechanism for those events.

As for being bloated, Kafka lives in a very empty space; that is, it supports fully ordered events to all consumers (and it has good HA options). The only other tool that I've come across that gives you the same data guarantees is Kinesis, and it requires AWS.

I've found that yes, Kafka is complex, but it's complex because it's solving a complex problem, not because it's bloated.

That said, if you want a non-ordered message queue, use NSQ instead of Kafka.

elcct 3704 days ago

Thanks for the explanation. I didn't know those things.