by darkbatman 870 days ago
How are you ingesting from Kinesis into ClickHouse? Are you using some custom sink connector, or processes on EC2 or Lambda?

We actually use Kafka rather than Kinesis, although they're very similar. For writing to ClickHouse from Kafka, we use the ClickHouse Kafka sink connector: https://github.com/ClickHouse/clickhouse-kafka-connect.
We're actually trying something similar, but with either Kinesis + ClickHouse or Kafka + ClickHouse. Kinesis currently seems easier to deal with, but there's no good integration or sink connector available for processing records at scale from Kinesis into ClickHouse. Did you ever run into similar problems, where you had to process records at huge scale to insert them into ClickHouse without much delay?

One more thing: Kinesis can deliver duplicates, while Kafka supports exactly-once delivery.
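
To make the duplicate concern concrete, here's a rough sketch of dropping repeats before an insert. It assumes each record carries a shard ID and a sequence number that together identify it uniquely; the field names `shard_id` and `sequence_number` and the `dedupe` helper are made up for illustration, not taken from any SDK.

```python
from typing import Iterable, Iterator, Set, Tuple


def dedupe(records: Iterable[dict], seen: Set[Tuple[str, str]]) -> Iterator[dict]:
    """Yield each record at most once, keyed on (shard_id, sequence_number).

    `seen` is passed in by the caller so it survives across batches.
    Field names here are hypothetical placeholders.
    """
    for rec in records:
        key = (rec["shard_id"], rec["sequence_number"])
        if key in seen:
            continue  # already handled in this or an earlier batch
        seen.add(key)
        yield rec


batch = [
    {"shard_id": "s0", "sequence_number": "1", "value": "a"},
    {"shard_id": "s0", "sequence_number": "2", "value": "b"},
    {"shard_id": "s0", "sequence_number": "1", "value": "a"},  # redelivered
]
seen: Set[Tuple[str, str]] = set()
unique = list(dedupe(batch, seen))
```

In a real pipeline the `seen` set would need a bound (for example a time window), since it otherwise grows without limit.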

I'm not familiar with Kinesis's sink APIs, but yes I'd imagine you'll have to write your own connector from scratch.

To answer your question, though, no: in the Kafka connector, the frequency of inserts into ClickHouse is configurable relatively independently of the batch size, so you don't need massive scale for real-time CH inserts. To save you a couple of hours, here's an example config for the connector:

  # Snippet from connect-distributed.properties

  # Max bytes per batch: 1 GB
  fetch.max.bytes=1000000000
  consumer.fetch.max.bytes=1000000000
  max.partition.fetch.bytes=1000000000
  consumer.max.partition.fetch.bytes=1000000000

  # Max age per batch: 2 seconds
  fetch.max.wait.ms=2000
  consumer.fetch.max.wait.ms=2000

  # Max records per batch: 1 million
  max.poll.records=1000000
  consumer.max.poll.records=1000000

  # Min bytes per batch: 500 MB
  fetch.min.bytes=500000000
  consumer.fetch.min.bytes=500000000

You also might need to increase `message.max.bytes` on the broker/cluster side.
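
As a rough mental model of how the min-bytes and max-age settings above interact: a fetch comes back as soon as either enough bytes have accumulated or the wait time has elapsed, whichever happens first. The sketch below is a simplified model of that rule using the values from the config, not the actual Kafka client logic; `FetchLimits` and `fetch_ready` are names invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchLimits:
    # Values mirror the example config above.
    min_bytes: int = 500_000_000   # fetch.min.bytes
    max_wait_ms: int = 2_000       # fetch.max.wait.ms
    max_records: int = 1_000_000   # max.poll.records


def fetch_ready(buffered_bytes: int, waited_ms: int, limits: FetchLimits) -> bool:
    """Return True once either threshold is reached (simplified model)."""
    return buffered_bytes >= limits.min_bytes or waited_ms >= limits.max_wait_ms


def records_returned(available_records: int, limits: FetchLimits) -> int:
    """A single poll hands back at most max_records records."""
    return min(available_records, limits.max_records)


limits = FetchLimits()
low_traffic_early = fetch_ready(10_000, 500, limits)       # small buffer, 0.5 s waited
low_traffic_late = fetch_ready(10_000, 2_000, limits)      # small buffer, 2 s waited
high_traffic = fetch_ready(600_000_000, 0, limits)         # 600 MB already buffered
```

Under this model a low-traffic topic still produces a batch about every 2 seconds, which is why the insert latency doesn't depend on having huge volume.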

If you're still deciding, I'd recommend Kafka over Kinesis because (1) it's open source, so you have more options, e.g. self-hosting, Confluent, or AWS MSK, and (2) it has a much bigger community, meaning better support, more Stack Overflow answers, a plug-and-play CH Kafka connector, etc.

Thanks, these configs are helpful.