Apache Kafka – Scaling SQL with KStream and Kafka Connect (youtube.com)
50 points by Antwnis 3107 days ago
1 comments

I had an entertaining few days working with Confluent's Kafka Connect stuff. I was trying to connect a MySQL table to Kafka and then on out to Hadoop. Amusingly, Kafka Connect wanted to use a queue with the same name as my table (MySQL or Hive / Hadoop, I don't recall which end); but of course since Kafka doesn't have namespaces, I had better hope that my table name is unique across the whole cluster!

It was around about then that I figured out that Confluent was a bunch of kids playing at building stuff. I have zero doubt that it's a good base if you have an enormous firehose of data, but look for features beyond raw performance and basic correctness, and it's underdeveloped. Basic stuff like back-pressure: don't expect it. Either overallocate your storage or make sure you always have faster consumers than producers.

> Amusingly, Kafka Connect wanted to use a queue with the same name as my table (MySQL or Hive / Hadoop, I don't recall which end)

It'll be the MySQL end if it's a Connect source as opposed to sink.

Two options - in your Connect config, you can specify a topic prefix, or if you use a custom query, the topic prefix will be used as the entire topic name.
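To make the two options concrete, here is a rough Python sketch of the naming rule as described above. The config keys (`topic.prefix`, `table.whitelist`, `query`) are JDBC source connector options; the helper `topics_for`, the connection URL and the `prod.mysql.` prefix are made up for illustration, and the helper only mimics the naming behaviour, it is not connector code.

```python
# JDBC source connector settings. The keys are connector options;
# the values (URL, table, prefix) are hypothetical.
table_mode_config = {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://db.example.com:3306/shop",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "table.whitelist": "orders",
    "topic.prefix": "prod.mysql.",
}

# Same connector, but driven by a custom query instead of a table list.
query_mode_config = {
    k: v for k, v in table_mode_config.items() if k != "table.whitelist"
}
query_mode_config["query"] = "SELECT * FROM orders"
query_mode_config["topic.prefix"] = "prod.mysql.orders"


def topics_for(config):
    """Hypothetical helper: which topics would this config write to?"""
    if "query" in config:
        # With a custom query, the prefix is used as the whole topic name.
        return [config["topic.prefix"]]
    tables = [t.strip() for t in config.get("table.whitelist", "").split(",")]
    # Otherwise each table gets prefix + table name.
    return [config["topic.prefix"] + t for t in tables if t]


print(topics_for(table_mode_config))
print(topics_for(query_mode_config))
```

Either way the prefix is the only thing standing between your table name and every other topic on the cluster, so it is worth picking one per source database.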

> It was around about then that I figured out that Confluent was a bunch of kids playing at building stuff.

Kafka Connect saved me writing a load of boilerplate to monitor a PG database and propagate model updates in a medium suitable for streaming jobs. Kafka Connect + Kafka Streams' GlobalKTables is a nice fit, even if the Connect JDBC end is somewhat beta at this point: KTables rely on the Kafka message key for identity, and the JDBC source doesn't populate it by default, so you have to use Single Message Transforms (SMTs) to set it.
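For anyone hitting the same thing, this is roughly what the key-populating SMT chain looks like. The transform class names are the ones that ship with Apache Kafka; the `id` column and the aliases `createKey`/`extractKey` are assumptions about your table and naming. The two Python functions are only toy stand-ins so you can see what each step does to a record, they are not the Connect API.

```python
# Connector properties for two built-in transforms, shown as a dict.
# "id" is an assumed primary-key column.
smt_config = {
    "transforms": "createKey,extractKey",
    "transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
    "transforms.createKey.fields": "id",
    "transforms.extractKey.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.extractKey.field": "id",
}


def value_to_key(record, fields):
    """Toy model of ValueToKey: copy the named value fields into a struct key."""
    key = {f: record["value"][f] for f in fields}
    return {**record, "key": key}


def extract_field_key(record, field):
    """Toy model of ExtractField$Key: replace the struct key with one field."""
    return {**record, "key": record["key"][field]}


# A record as the JDBC source would emit it by default: no key.
record = {"key": None, "value": {"id": 42, "name": "widget"}}

keyed = extract_field_key(value_to_key(record, ["id"]), "id")
print(keyed)
```

With the key set to the row's primary key, a table on the Streams side sees later rows for the same `id` as updates rather than as unrelated records.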

I'd say beta, not kids.

I'm not quite certain I understand that last part. To the best of my understanding, Kafka's design is such that consumer ingestion is completely decoupled from production throughput. All messages from the producer are first copied to disk and then zero-copied from disk to network to the consumers at the byte array level. If a consumer falls behind, it has its own independent offset stored on the broker that keeps track of where in the byte array (or log) it left off. This, by design, allows Kafka to handle different profiles of consumers and even have consumers drop off entirely and later join to catch up. But perhaps I'm missing something about what you're saying.
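A toy model of what I mean, in Python. This is not the Kafka client API; the `Log` class and the consumer names are invented purely to show that appends never wait on readers and that each consumer only advances its own offset.

```python
class Log:
    """Toy append-only log with one stored offset per consumer."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # consumer name -> next offset to read

    def append(self, record):
        # The producer path never looks at consumer state.
        self.records.append(record)
        return len(self.records) - 1

    def poll(self, consumer, max_records):
        # Read from this consumer's own offset and advance only that offset.
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:start + max_records]
        self.offsets[consumer] = start + len(batch)
        return batch

    def lag(self, consumer):
        return len(self.records) - self.offsets.get(consumer, 0)


log = Log()
for i in range(10):
    log.append(i)

log.poll("fast", 10)  # keeps up with the producer
log.poll("slow", 3)   # falls behind
print(log.lag("fast"), log.lag("slow"))

# The slow consumer can come back later and resume where it left off.
caught_up = log.poll("slow", 100)
print(caught_up, log.lag("slow"))
```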

Regarding your first issue, this is very much a matter of defaults. I can't be sure of your exact pipeline and connectors, but if, for example, you were using the JDBC connector, it has included support for at least prefixing names since the original version, effectively supporting the namespacing you require https://docs.confluent.io/current/connect/connect-jdbc/docs/.... I agree this might not be as ideal as namespacing directly at the Kafka layer for some users. The addition of single message transforms to arbitrarily modify the topic names (based on the existing topic name or really any data in the record or any info in the transformation config) gives a lot more flexibility as of Kafka 0.10.2. On the Hadoop/Hive side, I think there may still be that limitation; transformations effectively remove it since you can arbitrarily adjust the topic the sink connector sees, but this probably isn't an obvious solution.

Also, we really would prefer to avoid any coding required when using Connect. It's a difficult tradeoff between standardization (same configs everywhere), usability (minimize configs the user has to set), and simplicity+immediate usability (transformations came later and introduce configuration complexity). I (and other Kafka contributors) certainly welcome thoughts on how to make this all simpler; I think most software, especially open source software, errs too heavily towards configurability, but clearly in your case you found things not configurable enough.
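As a sketch of the topic-renaming idea, here is what a `RegexRouter` transform on a sink connector might look like, plus a small Python imitation of its behaviour. `RegexRouter` and its `regex`/`replacement` settings are the built-in transform; the alias `route`, the `prod.mysql.` prefix and the `regex_router` function are assumptions for illustration. The imitation assumes the transform only rewrites a topic when the regex matches the whole name, and it translates Java-style `$1` group references into Python's.

```python
import re

# Sink-side config: strip a hypothetical "prod.mysql." prefix so the sink
# sees the bare table name as its topic.
router_config = {
    "transforms": "route",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": r"prod\.mysql\.(.*)",
    "transforms.route.replacement": "$1",
}


def regex_router(topic, regex, replacement):
    """Toy imitation of RegexRouter: rename only on a full-name match."""
    match = re.fullmatch(regex, topic)
    if match is None:
        return topic  # non-matching topics pass through unchanged
    # Convert Java group references ($1) to Python ones (\1).
    py_replacement = re.sub(r"\$(\d+)", r"\\\1", replacement)
    return match.expand(py_replacement)


regex = router_config["transforms.route.regex"]
replacement = router_config["transforms.route.replacement"]
print(regex_router("prod.mysql.orders", regex, replacement))
print(regex_router("other.topic", regex, replacement))
```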

re: the point about backpressure, there are plenty of cases where you don't want backpressure. If you want the thing that's producing data to keep humming along even if some downstream app falls behind (let's say Connect dumping the data into HDFS for some downstream batch analytics), you don't want to see backpressure. In Kafka you should just define your retention period to be long enough to cover any slowness/lag in consumer applications -- it's pretty fundamental to its design and use cases that it doesn't have explicit backpressure from consumers back to producers. (You do get backpressure from a single broker back to the producer via the TCP connection, but I assume you meant from consumer back to producer.)
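A back-of-the-envelope way to size that, as a Python sketch. `retention.ms` is the topic-level retention setting; the two helper functions, the safety factor and all the numbers are assumptions picked for the example, so plug in your own worst-case lag and produce rate.

```python
def required_retention_ms(max_expected_lag_s, safety_factor=2.0):
    """Retention needed to cover the slowest consumer, with some headroom."""
    return int(max_expected_lag_s * safety_factor * 1000)


def retained_bytes(produce_bytes_per_s, retention_ms, replication_factor=3):
    """Rough disk footprint across the cluster for one topic (no compression)."""
    return int(produce_bytes_per_s * (retention_ms / 1000) * replication_factor)


# Hypothetical numbers: the batch job may lag by up to 6 hours,
# and producers write about 5 MB/s into the topic.
retention_ms = required_retention_ms(max_expected_lag_s=6 * 3600)
disk = retained_bytes(produce_bytes_per_s=5_000_000, retention_ms=retention_ms)

print(f"retention.ms={retention_ms}")
print(f"approx. retained bytes: {disk}")
```

This is the "overallocate your storage" option from the parent comment, just made explicit: the cost of not having consumer-to-producer backpressure is disk proportional to produce rate times the lag you want to tolerate.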