I've been using Kafka 0.8.2 for some time now together with Node.js for both consumer and producer.
Although the producer side is quite simple to use and has more than one option available, on the consumer side there is only one project that is "maintained" and works [1][2]; all other options either only have a producer available [3] or have not received a commit in years [4].
I am a bit disappointed by how little attention Node.js with Kafka has had so far, as there are a lot of issues with keeping connections alive and rebalancing that made it really hard to trust the system and automate zero-downtime deploys.
Although I still hope the changes in the new 0.9 consumer API solve these issues, I am really happy about the decision to be backwards compatible, which makes the transition/upgrade a much smoother process:
> To ensure a smooth upgrade paths for our users, the 0.8 producer and consumer clients will continue to work on an 0.9 Kafka cluster.
Your critique is well received. The Apache Kafka project supports the Java clients, and the non-Java clients will be developed and made available in a federated manner. At Confluent, we are focused on providing first-class non-Java clients that are API- and functionality-compatible with the Java clients. Forthcoming releases of the Confluent Platform will include C/C++, Python and Node.js clients. Stay tuned: http://www.confluent.io/developer#download
We've been using Kafka for almost two years already. A lot of our codebase is still Ruby, and Poseidon is not really that great a client library. It's slow and it's not thread-safe.
On our Scala side we're happy with the current offering.
The C/C++ client, will that be librdkafka? If I need a high-quality client for another language, would you recommend building that using librdkafka or the rest proxy?
The worst thing about Kafka in my experience has been the consumer libraries for languages like Python. That's not to say that they are terrible or unusable, just that they don't have nearly as much polish as the core of Kafka itself. I'm very much looking forward to new client libraries built against the new consumer API.
PyKafka is currently used in production at Parse.ly, and I've gotten feedback from a lot of other folks who are using it in production as well. The big benefit over kafka-python is that PyKafka supports multi-consumer groups that balance consumption via ZooKeeper with its BalancedConsumer interface. See this thread ( https://github.com/Parsely/pykafka/issues/334 ) for more detail on the differences between the two libraries.
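For anyone wondering what "balanced" consumption amounts to: the idea is that each partition is owned by exactly one consumer in the group, and ownership is recomputed when members join or leave. Below is a rough, in-memory sketch of a range-style split. It is not PyKafka's actual code and leaves out ZooKeeper coordination entirely; assign_partitions and the worker names are made up for illustration.

```python
def assign_partitions(consumers, partitions):
    """Split partitions across group members in contiguous ranges.

    Hypothetical helper for illustration only: every member sorts the
    same inputs, so all members arrive at the same assignment.
    """
    members = sorted(consumers)
    parts = sorted(partitions)
    if not members:
        return {}
    # Each member gets `base` partitions; the first `extra` members get one more.
    base, extra = divmod(len(parts), len(members))
    assignment = {}
    start = 0
    for index, member in enumerate(members):
        count = base + (1 if index < extra else 0)
        assignment[member] = parts[start:start + count]
        start += count
    return assignment


# Two workers share five partitions, then a third worker joins the group.
before = assign_partitions(["worker-a", "worker-b"], range(5))
after = assign_partitions(["worker-a", "worker-b", "worker-c"], range(5))
```

The part that is hard in practice is not this arithmetic but getting every process to agree on the current member list, which is what the coordination layer is for.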
The PyKafka project is prioritizing support for Kafka 0.9 in the next few weeks/months. This includes ensuring that the existing consumers work against the updates to the 0.8.2 consumer API as well as implementing support for the new consumer API introduced in 0.9. Roadmap information can be found here ( https://github.com/Parsely/pykafka/blob/master/doc/roadmap.r... ).
I'd say the Python library I used was borderline unusable. We stopped using Kafka (it was just a trial period and hadn't been rolled out to production yet) because of limits in one of the most popular Python interfaces. The interface worked well enough and the API was good, but they didn't (and the bug tracker seemed to imply they wouldn't) support synchronizing reads across processes for the same group. What's the point of a distributed synchronized log if you can't do synchronized distributed reads of the log?
Yeah, it's no longer relevant for that project, but I like the ideas behind Kafka and will probably use it again so I'll look at PyKafka before I look at kafka-python in the future.
While it feels a bit hacky and unclean, you may want to try using IKVM (http://www.ikvm.net/) to translate and import the Java client into your .NET project.
Given the difficulty in building a client period (distributed systems, race conditions, etc), being able to rely on the widely adopted & supported official client is quite attractive.
In my test cases the performance is on par with running natively on the JVM, except when compression is enabled.
Another option is using the REST proxy and accepting the trade-offs that imposes.
I have been using Kafka 0.8.2 in a production setting for six months, consuming real-time event traffic from our caching layer. The most difficult part of my experience was the occasional consumer lag that erupted without warning or apparent cause in the high-level Java consumer APIs. A lot of experimentation with their configuration proved futile, and we have now had to create a feedback system that triggers alerts to change the group IDs of our high-level consumers every time some consumers start lagging.
Otherwise the performance of Kafka has been impressive (giving a throughput of up to 15000 packets/sec to an 8-consumer pool), even though I have not had the chance to compare it with any other such tool/library.
Nevertheless, I think this update is a long-awaited one, and Kafka Connect may really be a good starting point for building more (and better) endpoints.
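To make the "start lagging" part concrete: lag per partition is just the log end offset minus the group's committed offset. Here is a minimal sketch of the check such an alerting system might run. How you fetch the two offset maps depends on your client and is not shown; find_lagging, the threshold and the numbers are all made up for illustration.

```python
def find_lagging(end_offsets, committed_offsets, max_lag):
    """Return {partition: lag} for partitions whose lag exceeds max_lag.

    Hypothetical helper: both arguments map partition id -> offset.
    A partition with no committed offset is treated as starting at 0.
    """
    lagging = {}
    for partition, end in end_offsets.items():
        lag = end - committed_offsets.get(partition, 0)
        if lag > max_lag:
            lagging[partition] = lag
    return lagging


# Made-up sample offsets for a three-partition topic.
end_offsets = {0: 10_500, 1: 9_800, 2: 12_000}
committed = {0: 10_450, 1: 4_300, 2: 11_990}

# Only partition 1 is far enough behind to trigger an alert.
alerts = find_lagging(end_offsets, committed, max_lag=1_000)
```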
Were the diagrams done with software, or hand drawn? If software, I'm curious what package/style you used. The style looks very similar to Martin Kleppmann's presentation at StrangeLoop; I assumed his were hand drawn, but I'm realizing now this might be an omni style or something.
Kafka is a replicating database, but not of randomly accessible or queryable data - rather, it's for logs where you can start at any point in time (including realtime), and play the log forwards from there. If you aggregate access logs into it, then use it to feed stream processing frameworks (or feed it into Hadoop for bulk processing), you can use it very effectively for analytics workloads.
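If the "start at any point and play forwards" part is unclear, here is a toy in-memory version of the idea. It is only a sketch of the access pattern, not how Kafka is implemented; the Log class and the sample records are made up.

```python
class Log:
    """Toy append-only log: records are only ever added at the end."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record and return the offset it was stored at."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        """Yield (offset, record) pairs from `offset` to the current end."""
        for position in range(offset, len(self._records)):
            yield position, self._records[position]


log = Log()
for line in ["GET /", "GET /about", "POST /login", "GET /"]:
    log.append(line)

# A reader picks its own starting offset; there is no lookup by key or query.
replayed = list(log.read_from(2))
```

Each reader keeps track of its own offset, so a new consumer can replay from 0 while an existing one keeps reading from the tail.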
Or you can run your entire business around events-as-ground-truth rather than SQL-style-domain-tables-as-ground-truth. Various modules consume from Kafka and write aggregated records back into Kafka - basically a directed graph of processing modules with Kafka to implement every edge - and finally there's a module that translates the final product into what is essentially a live-updating materialized view for your web backend to consume. LinkedIn does exactly this - they open-sourced Kafka and spun out Confluent to help others use this model.
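A tiny sketch of that shape, with plain Python lists standing in for topics: one module aggregates raw events into a second topic, and another folds that topic into a view the web backend could read. Topic names, event fields and function names are all made up for illustration, and nothing here talks to a real broker.

```python
from collections import defaultdict

# topic name -> append-only list of events (a stand-in for real topics)
topics = defaultdict(list)


def publish(topic, event):
    topics[topic].append(event)


def count_views(source, sink):
    """Module 1: consume raw page views, emit running per-page counts."""
    counts = defaultdict(int)
    for event in topics[source]:
        counts[event["page"]] += 1
        publish(sink, {"page": event["page"], "views": counts[event["page"]]})


def build_view(source):
    """Module 2: fold the aggregated topic into a materialized view."""
    view = {}
    for event in topics[source]:
        view[event["page"]] = event["views"]  # latest count wins
    return view


for page in ["/", "/about", "/"]:
    publish("page_views", {"page": page})

count_views("page_views", "view_counts")
view = build_view("view_counts")
```

Because the view is derived entirely from the events, changing its shape means rerunning build_view over the same topic rather than migrating a table in place.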
Stream processing is a very powerful way of thinking about data management - with the great side effect that "migrations" of data tables are limited by your imagination, and they never run the risk of data corruption. We use this paradigm, though not Kafka itself (Mongo supports this paradigm and simplifies a lot of things if you aren't yet at LinkedIn scale), in production at http://belstone.com.
Kafka can be used if you want to treat data as streams for some processing (think producer-consumer kinds of scenarios). You can start reading the stream from any point in time to see events 'as and when things happened'. Kafka's own nodes have replication enabled, and the data it holds can be consumed in a distributed setting as well (meaning multiple consumers acting as a single high-level consumer). But it is not a traditional database like MongoDB or MySQL.
That observation is correct. Currently, people misuse stream processing systems like Storm and Samza for data import/export. This is overkill. Kafka Connect is focused on providing scalable and operational connectors to various systems using Kafka as the underlying transport mechanism.
We have built a fairly robust system along the lines of Kafka Connect, and it's open sourced at https://github.com/flipkart/aesop. Currently, it supports MySQL and HBase as sources and MySQL, HBase, ES and Kafka as destinations.
[1] https://cwiki.apache.org/confluence/display/KAFKA/Clients#Cl...
[2] https://github.com/SOHU-Co/kafka-node/
[3] https://github.com/sutoiku/node-kafka
[4] https://github.com/wurstmeister/node-kafka-0.8-plus