

by barrkel 3366 days ago

I recently had cause to work with Kafka Connect; we needed to get data from MySQL into Hadoop. It was not a positive experience, and IMO Kafka Connect is pretty immature, and Kafka as it is currently constituted isn't well featured enough for this purpose.

Kafka Connect is architected around the idea of producers and consumers that either add messages to Kafka, or read messages from Kafka.

The MySQL producer isn't suited for anything other than the most basic of table replication, though; if you need any ETL, you'll be gluing more stuff into your pipeline downstream. And when the producer falls over, the first you'll know about it is when you read the logs or poll status indicators. It didn't give me warm fuzzies about reliability, visibility, or flexibility. It was very basic stuff.

The Hadoop consumer had an unpleasant surprise: you have zero choice over table name in Hive metastore; your Kafka topic will be the name of your Hive table, no ifs, no buts, no choices. And since Kafka doesn't have any namespaces, either you're going to be running multiple Kafka clusters, or you need global agreement on topics vs Hive metadata (which does have namespaces). We have a multi-tenancy architecture and use namespaces. A non-starter.

Why do I think Kafka doesn't have the right feature set? Because Kafka message expiry has only two policies, as far as I could tell: time or space. Either your message is too old and gets discarded (en bloc, IIRC); or Kafka hits a space limit and starts clearing out old messages. The natural question that arises when you're using Kafka to buffer a consumer vs a producer, then, is flow control / backpressure: how do you get the producer to slow down when the consumer can't catch up? And vice versa? Well, there's knobs you can manually control to throttle the producer, but it's in your own hands. You're dancing at the edge of a cliff if a consumer has died and messages start expiring; there's nothing stopping data loss.
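To make the data-loss window concrete, here is a toy model in plain Python (not Kafka code) of time- and size-based retention. The `Segment` type, the function names, and all numbers are hypothetical. The only real names referenced are the topic settings `retention.ms` and `retention.bytes`. Brokers delete whole log segments rather than single messages, which matches the "en bloc" behaviour described above.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int   # offset of the first message in the segment
    size_bytes: int    # bytes on disk for this segment
    newest_ts_ms: int  # timestamp of the newest message in the segment


def apply_retention(segments, now_ms, retention_ms=None, retention_bytes=None):
    """Return the segments that survive; whole segments are dropped, oldest first."""
    kept = list(segments)
    # Time-based: a segment goes once its newest message is older than the limit.
    if retention_ms is not None:
        kept = [s for s in kept if now_ms - s.newest_ts_ms <= retention_ms]
    # Size-based: drop the oldest segments until the log fits the byte budget.
    if retention_bytes is not None:
        while len(kept) > 1 and sum(s.size_bytes for s in kept) > retention_bytes:
            kept.pop(0)
    return kept


def lost_messages(log_start_offset, consumer_offset):
    """Messages the consumer had not read yet that have already been deleted."""
    return max(0, log_start_offset - consumer_offset)


# Hypothetical log: three segments of 100 messages each.
log = [Segment(0, 1000, 1_000), Segment(100, 1000, 2_000), Segment(200, 1000, 3_000)]
survivors = apply_retention(log, now_ms=3_500, retention_bytes=2000)
print(lost_messages(survivors[0].base_offset, consumer_offset=40))
```

In this model nothing checks where the consumer is before a segment is removed, which is the gap the comment is pointing at.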

The only way you can start to turn this situation into a win is if (a) you have such a big firehose of data that nothing else can cope, or (b) you can take advantage of network effects and use Kafka as a data bus, not just as a pipe with exactly two ends. But it has to overcome the downsides.

4 comments

Xorlev 3365 days ago

> Well, there's knobs you can manually control to throttle the producer, but it's in your own hands. You're dancing at the edge of a cliff if a consumer has died and messages start expiring; there's nothing stopping data loss.

At least how we run Kafka, our logs expire after 7 days, and our alerts go off pretty quickly if consumers fall behind. Additionally, we archive all our messages to S3 via a process based on Pinterest's Secor [1]. If we were ever to run so far behind that we needed to start over, we could just run MapReduce jobs to rebuild datastores and then let consumers catch back up.
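A lag alert of the kind mentioned here boils down to comparing, per partition, the log-end offset with the consumer group's committed offset. The sketch below assumes those two offset maps have already been fetched by some monitoring tool; the function name, the dictionary shapes, and the threshold are all hypothetical, not part of any Kafka API.

```python
def lagging_partitions(end_offsets, committed_offsets, max_lag):
    """Return {partition: lag} for partitions whose lag exceeds max_lag.

    end_offsets:       {partition: offset the next written message will get}
    committed_offsets: {partition: next offset the consumer group will read}
    """
    alerts = {}
    for partition, end in end_offsets.items():
        # Simplification: a partition with no committed offset counts as
        # unread from offset 0.
        committed = committed_offsets.get(partition, 0)
        lag = max(0, end - committed)
        if lag > max_lag:
            alerts[partition] = lag
    return alerts


# Hypothetical snapshot: partition 1 has fallen 400 messages behind.
print(lagging_partitions({0: 1000, 1: 500, 2: 50}, {0: 990, 1: 100}, max_lag=100))
```

Alerting on lag well before it approaches the retention window is what leaves time to react before anything expires.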

Since Kafka is explicitly a pub/sub replicated+partitioned log, it doesn't make sense to provide backpressure. A single ailing consumer would cascade failure through your system. If you need synchronous or bounded replication, Kafka isn't for you.

Having run Kafka in production for 2 1/2 years now, I can say with certainty that we've never felt like we were lacking in terms of features from Kafka itself, nor have we ever had a consumer fall so far behind that it could never catch back up. We do leverage our archives for batch jobs though.

[1] https://github.com/pinterest/secor


barrkel 3365 days ago

I think it's worth qualifying my criticism more explicitly. I think Kafka doesn't have the right feature set for Kafka Connect. When trying to use it as a data pipe for real-ish time updates between two persistent stores, rather than a persistent store in itself, it's inadequate.


saryant 3365 days ago

We've had a Kafka system in production for maybe 7 years that deals with this problem.

With our use case, we can have unpredictable spikes in volume, all of which we must consume. Those spikes can be an order of magnitude larger than our baseline average. We put Kafka topics between every stage of our processing pipeline and configure the various Kafka clusters to be:

1) Huge. Seriously, way overkill.

2) Able to sustain triple the largest spike we've ever seen without expiring data (size-based expiry)
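The sizing rule in point 2 is simple arithmetic. A rough sketch follows, with entirely hypothetical traffic numbers and a hypothetical helper name. It assumes load is spread evenly across partitions and relies on the fact that the `retention.bytes` topic setting applies per partition.

```python
def required_retention_bytes(peak_bytes_per_sec, spike_duration_sec,
                             safety_factor=3, partitions=1):
    """Per-partition byte budget so that a spike `safety_factor` times the
    largest one observed fits in the log without size-based expiry."""
    total = peak_bytes_per_sec * spike_duration_sec * safety_factor
    # Integer ceiling division; assumes an even spread across partitions.
    return -(-total // partitions)


# Hypothetical: largest spike seen was 50 MB/s for 2 hours, on 12 partitions.
print(required_retention_bytes(50_000_000, 2 * 3600, partitions=12))
```

The result covers only the spike itself; replication and baseline traffic add to the disk actually needed.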

Since most of our processing stages are essentially consuming, transforming and publishing back to Kafka, we've written them to not ack a message until the result of that stage has been safely published to the next Kafka topic. We require acks from all in-sync replicas. Since the subscriber part of a processing stage doesn't ack until its producer side has received acks from all ISRs, we're pretty confident in our data fidelity. In fact we have other infrastructure that verifies that everything coming out of this Kafka chain is correct and full-fidelity, so we know for certain that this setup can withstand huge spikes in volume without any load shedding.
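The ordering described here (publish downstream first, ack upstream second) can be sketched without a broker. `InMemoryTopic` and `run_stage` below are hypothetical stand-ins, not Kafka client classes. With a real producer, requiring acks from all in-sync replicas corresponds to the `acks=all` setting.

```python
class InMemoryTopic:
    """Stand-in for a downstream topic; `fail` simulates an unacked publish."""

    def __init__(self, fail=False):
        self.messages = []
        self.fail = fail

    def publish(self, message):
        if self.fail:
            raise IOError("publish not acknowledged by all in-sync replicas")
        self.messages.append(message)


def run_stage(source, start_offset, transform, sink):
    """Process source[start_offset:] and return the new committed offset.

    The offset only advances after the transformed message has been
    published, so a failed publish leaves the message to be re-read.
    """
    committed = start_offset
    for offset in range(start_offset, len(source)):
        try:
            sink.publish(transform(source[offset]))
        except IOError:
            break  # do not ack; the next run resumes at `committed`
        committed = offset + 1  # ack only after a successful publish
    return committed


next_topic = InMemoryTopic()
print(run_stage(["a", "b", "c"], 0, str.upper, next_topic), next_topic.messages)
```

This ordering gives at-least-once delivery: a crash after the publish but before the ack means the message is processed again on restart, so downstream stages may see duplicates.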

And then we run all of it redundantly in multiple AWS availability zones to just be sure.

If any stage in our processing pipeline cannot keep up with increased volume, that's fine; it'll catch up eventually because we know that our retention policies are sufficient. And since (almost) every stage is run redundantly, even if one instance somewhere does become slow (or goes down), the redundant pipeline will keep data flowing so we generally have no customer impact. In fact, if that does happen but the system as a whole keeps up, we don't even consider it a pageable event. If a machine falls over at 3am but its redundant cousin keeps up, we'll fix it the next day during business hours.

(Redundant pipelines are also great for deploys—take down an entire side while you redeploy and you've now got zero-downtime deployments)


dantiberian 3365 days ago

> Because Kafka message expiry has only two policies, as far as I could tell: time or space.

You can use log compaction to remove all messages apart from the latest for any given key. This gives you a Kafka topic with a bounded size that is proportional to the size of the table you're replicating.

https://kafka.apache.org/documentation/#compaction
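As a rough illustration of what compaction keeps, here is a toy model in plain Python; the `compact` function and the sample records are hypothetical. A real broker compacts in the background and retains null-valued tombstones for a while before removing them, whereas this sketch drops them immediately.

```python
def compact(log):
    """Keep only the latest record per key, preserving log order.

    `log` is a list of (key, value) pairs. A value of None is treated as a
    tombstone that removes the key, as Kafka does for null-valued records.
    """
    latest_index = {key: i for i, (key, _) in enumerate(log)}
    return [
        (key, value)
        for i, (key, value) in enumerate(log)
        if latest_index[key] == i and value is not None
    ]


# Hypothetical change log for a table keyed by row id.
changes = [("id1", "a"), ("id2", "b"), ("id1", "c"), ("id3", "d"), ("id2", None)]
print(compact(changes))
```

However many updates a row receives, the compacted log holds at most one record per live key, which is why its size tracks the table rather than the update volume.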


capkutay 3365 days ago

Disclaimer: I work on a product that does log-based replication from MySQL to Kafka, which includes the metadata needed for doing ETL.

Only sharing in case someone wants a production-ready solution in that area with service monitoring, guaranteed exactly-once processing (E1P) for every event, stream-level permissions, high availability, etc.

You can check it out here:

http://www.striim.com/


bcryptd 3365 days ago

Why is there no pricing info on the website?
