Hacker News new | ask | show | jobs
by maccard 604 days ago
I've heard wonderful things about ClickHouse, but every time I try to use it, I get stuck on "how do I get data into it reliably". I search around, and inevitably end up with "by combining clickhouse and Kafka", at which point my desire to keep going drops to zero.

Are there any setups for reliable data ingestion into Clickhouse that don't involve spinning up Kafka & Zookeeper?

18 comments

At my company we use Vector to ingest into ClickHouse. It works really well. Vector does buffering and retrying.

Vector is a relatively simple ingest tool that supports lots of sources and sinks. It's very simple to run — just a config file and a single binary, and you're set. But it can do a fair amount of ETL (e.g. enriching or reshaping JSON), including some more advanced pipeline operators like joining multiple streams into one. It's maybe not as advanced as some ETL tools, but it covers a lot of ground.

Since you mention Kafka, I would also mention Redpanda, which is Kafka-compatible, but much easier to run. No Java, no ZooKeeper. I think you'd still want Vector here, with Vector connecting Redpanda to ClickHouse. Then you don't need the buffering that Vector provides, and Vector would only act as the "router" than pulls from Redpanda and ingests into ClickHouse.

Another option is RudderStack, which we also use for other purposes. It's a richer tool with a full UI for setting up pipelines, and so on.

Interesting, that's not a problem I've come across before particularly - could you share more?

Are you looking for setups for OSS ClickHouse or managed ClickHouse services that solve it?

Both Tinybird & ClickHouse Cloud are managed ClickHouse services that include ingest connectors without needing Kafka

Estuary (an ETL tool) just released Dekaf which lets them appear as a Kafka broker by exposing a Kafka-compatible API, so you can connect it with ClickHouse as if it was Kafka, without actually having Kafka (though I'm not sure if this is in the open source Estuary Flow project or not, I have a feeling not)

If you just want to play with CH, you can always use clickhouse-local or chDB which are more like DuckDB, running without a server, and work great for just talking to local files. If you don't need streams and are just working with files, you can also use them as an in-process/serverless transform engine - file arrives, read with chDB, process it however you need, export it as CH binary format, insert directly into your main CH. Nice little pattern than can run on a VM or in Lambda's.

Sure - I work in games, and we stream eventsfrom clients that we want to store in Clickhouse. We've got a native desktop application written in C++ that generates a json payload (we control the format of this). We don't need OSS, but we don't want a SAAS service - we want on-prem (or self managed). Clickhouse Cloud would be fine, TinyBird not.

> Estuary (an ETL tool) just released Dekaf which lets them appear as a Kafka broker by exposing a Kafka-compatible API

This is definitely an improvement, but if it looks like kafka and sounds like kafka, I get a bit sus.

> If you just want to play with CH, you can always use clickhouse-local

I've done that, but getting from this to "streaming data" is where I get stuck.

> If you don't need streams

Afraid streams are what I'm dealing with..

I’ve got a lib/executable that you spin up and it gives you a rest api (read and write) for clickhouse if you are interested
I know this is a separate point, but the pricing link on your website in the header, is broken.
What's wrong with using Postgres, MySQL or SQL server?
They work fine, but this is a thread on Clickhouse.

Clickhouse sells itself as a DBMS for real time analytical reports, which is exactly what I want. But I can't compare the two because I've never managed to get it all stood up.

I work in gaming and stream events into a self-hosted Clickhouse db without Kafka. We just use the CH python connector and send records in batches of 100K, using ReplacingMergeTree for backfills, etc. It works very well. Unless you truly need up-to-the-minute analytics, it’s super easy to schedule with Dagster or Airflow or whatever. We process 100M+ events per day this way.
How do you ensure retries, single entries, and not losing 100k entries if your app goes down?

It's also kind of a bummer that the batches have to be inserted, when the tagline on Clickhouse's website is:

> Build real-time data products that scale

But, thanks for the clarification!

It’s fair tho. This conversation is “if you use clickhouse, then this is how you would solve x”

And a completely fair question is “why would I want to spin up a completely new stack when I have psql already installed?”

In my (very limited ) experience you almost never do want to, but when you do, you wish you had started 6 months ago.

> but every time I try to use it, I get stuck on "how do I get data into it reliably"

That's the same stage I get stuck every time.

I have data emitters (in this example let's say my household IoT devices, feeding a MQTT broker then HomeAssistant).

I have where I want the data to end up (Clickhouse, Database, S3, whatever).

How do I get the data from A to B, so there are no duplicate rows (if the ACK for an upload isn't received when the upload succeeded), no missing rows (the data is retried if an upload fails), and some protection if the local system goes down (data isn't ephemeral)?

The easiest I've found is writing data locally to files (JSON, parquet, whatever), new file every 5 minutes and sync the older files to S3.

But then I'm stuck again. How do I continually load new files from S3 without any repetition or edge cases? And did I really need the intermediate files?

Easiest way is to post csv/json/whatever through the http endpoint into a replacing merge tree table.

Duplicates get merged out, and errors can be handles at the http level. (Admittedly, one bad row in a big batch post is a pain, but I don’t see that much)

HTTP errors aren’t the most readable, although traditional database errors aren’t too readable most of the time.
What I meant is that you'll get an HTTP error code from the insert if it didn't work, so that can go through the error handling. This isn't really an "explore this thing", it's a "splat this data in, every minute/file/whatever". I've churned through TBs of CSVs this way, with a small preprocessor to fix some idiosyncratic formatting.
Cloudflare workers combined with their queues product https://developers.cloudflare.com/queues/ might be a cheap and easy way of solving this problem
This is _exactly_ my problem, and where I've found myself.
This isn't appropriate for all use-cases, but one way to address your and GP's problem is as follows:

1. Aggregate (in-memory or on cheap storage) events in the publisher application into batches.

2. Ship those batches to S3/alike, NFS that clickhouse can read, or equivalent (even a dead-simple HTTP server that just receives file POSTs and writes them to disk, running on storage that clickhouse can reach). The tool you use here needs to be idempotent (retries of failed/timed out uploads don't mangle data), and atomic to readers (partially-received data is never readable).

3. In ClickHouse, run a scheduled refresh of a materialized view pointed at the uploaded data (either "SELECT ... INFILE" for local/NFS files, or "INSERT INTO ... SELECT s3(...)" for an S3/alike): https://clickhouse.com/docs/en/materialized-view/refreshable...

This is only a valid solution given specific constraints; if you don't match these, it may not work for you:

1. You have to be OK with the "experimental" status of refreshable materialized views. My and other users' experience with the feature seems generally positive at this point, and it has been out for awhile.

2. Given your emitter data rates, there must exist a batch size of data which appropriately balances keeping up with uploads to your blob store and the potential of data loss if an emitter crashes before a batch is shipped. If you're sending e.g. financial transaction source-of-record data, then this will not work for you: you really do need a Kafka/alike in that case (if you end up here, consider WarpStream: an extremely affordable and low-infrastructure Kafka clone backed by batching accumulators in front of S3: https://www.warpstream.com/ If their status as a SaaS or recent Confluent acquisition turns you off, fair enough.)

3. Data staleness of up to emitter-flush-interval + worst-case-upload-time + materialized-view-refresh-interval must be acceptable to you.

4. Reliability wise, the staging area for shipped batches (S3, NFS, scratch directory on a clickhouse server) must be sufficiently reliable for your use case, as data will not be replicated by clickhouse while it's staged.

5. All uniqueness/transformations must be things you can express in your materialized view's query + engine settings.

Thanks for the well thought out reply here. I understand the solution you're proposing, but the thing is that it fails at the first hurdle.

> 1. Aggregate (in-memory or on cheap storage) events in the publisher application into batches.

Clickhouse's Tagline on their website is:

> Build real-time data products that scale

Except, the minute we start having to batch data to process it and stage it, we lose the "real time" part. If I'm shipping them to S3 to have clickhouse batch ingest them, I might as well be use Databricks, Snowflake, or just parquet-on-s3.

All very fair, though I think your issue may be more with the nature of real-time analytics ingestion pipelines in general than with Clickhouse itself.

Even if you could remove all of the operational burden from Kafka or equivalent, hooking it up to Clickhouse is still, at the end of the day, going to commit in batches (of max_insert_block_size, or kafka_max_block_size, or smaller batches polled from the message broker). Even with no consumer lag, that's still going to incur a delay before your data is SELECTable.

Heck, even Kafka publishers usually don't flush (actually send over the network) after every publish by default.

That same tradeoff comes up in Snowflake and Databricks (albeit mitigated when using Continuous Processing, which is experimental and expensive computationally and monetarily). Their ingestion systems are batching as well.

At the end of the day, "real time" means different things to different people, and you'll have to choose between one of several architectures:

- Clients synchronously insert data (which is then immediately visible) into your analytics store. ClickHouse is less good at handling a barrage of single-row INSERTs than other DBs, but none of them are good at this type of workload at even medium scale. Even manually shipping single-update files to S3 gets expensive and slow fast.

- Batch your inserts and accept bounded lag in data visibility. Doesn't matter whether batching is client-side, database-side, or in an intermediate broker/service.

- Ship your data asynchronously via messaging/streaming/batching and force point-in-time queries to wait for some indication that asynchronous data for the requested point in time has arrived. For example, when batching manually you could delay queries until a batch subsequent to the time-of-query has arrived, or when using Kafka you could wait for the system of record's last-committed-kafka-message-id to pass your topic's max ID at the time of query.

I insert data into ClickHouse by single records on each page view, and it is alright. The data is selectable at the same second:

    SELECT * FROM website_traffic WHERE time >= now() - INTERVAL 1 SECOND
My experience and knowledge with CH is about 3-4 years olds now, so I might be talking out of ignorance at this point.

There are plenty of ways to do it with batching, but I assume you want to real-time "insert into table" style or a direct "ch.write(data)", then no. There is no way as far as I know without batching. This is one of the main reason we stopped CH for our last project about 3 years ago for financial data analytic tooling. CH doesn't have a transaction log like WAL, so your data producers need to be smart or you need a "queue" type service to deal with it, whether it's S3 or Kafka or Kinesis to allow batching.

> I search around, and inevitably end up with "by combining clickhouse and Kafka"

Those are probably some old sources of knowledge. You need to use Kafka if you want it to handle batching for you. But Clickhouse can handle batching as well by using asynchronous inserts:

https://clickhouse.com/blog/asynchronous-data-inserts-in-cli...

It seems you can use JSON, CSV and Parquet: https://clickhouse.com/docs/en/integrations/data-formats
There is an HTTP endpoint, client database drivers, CLI tool and third party tools like Vector, Redpanda Connect?

What makes Clickhouse different that you're unable to load data into?

Yes, reliable data ingestion often involves Kafka, which can feel complex. An alternative is the transactional COPY INTO approach used by platforms like Snowflake and Databend. This command supports "exactly-once" ingestion, ensuring data is fully loaded or not at all, without requiring message queues or extra infrastructure.

https://docs.databend.com/sql/sql-commands/dml/dml-copy-into...

Not sure if it's enough for you but there is RedPanda, a Zookeeper-less Kafka.
I had success loading data with vector.dev
This is what we do - works well.
I run a fairly large Clickhouse cluster for advertising data with millions of events every minute streaming in. We use fluentd as a buffer which batches data for upto n records/n minutes and does batch inserts to clickhouse. Its not realtime but close enough and have found it to be pretty reliable.
I think Tinybird is a nice option here. It's sort of a managed service for ClickHouse with some other nice abstractions. For your streaming case, they have an HTTP endpoint that you can stream to that accepts up to 1k EPS and you can micro-batch events if you need to send more events than that. They also have some good connectors for BigQuery, Snowflake, DynamoDB, etc.
Not sure if ClickHouse needs ZK but FWIW Kafka has a raft implementation which now obviates need for ZK
ClickHouse does need ZK but they have their own implementation.
Where are you loading the data from! I had no trouble loading data from s3 parquet.
I'm streaming data from a desktop application written in C++. It's the step to get it into parquet in the first place.
We use this Rust library to do individual and batch inserts: https://docs.rs/clickhouse/latest/clickhouse/

The error messages for batch inserts are TERRIBLE, but once it’s working it just hums along beautifully.

I’d be surprised if there isn’t a similar library for C++, as I believe clickhouse itself is written in C++

There is an http API and it can eat json and csv too (as well as tons of others)
Fivetran has a destination for it: https://fivetran.com/docs/destinations/clickhouse
I was glad in the past few years to discover that I am not alone in finding Kafka off-putting / way too convoluted
Where is your data coming from? I’m curious what prevents you from inserting the data into Clickhouse without Kafka.
How do you do this with other DBs?