A new JSON data type for ClickHouse | HN Mirror

	A new JSON data type for ClickHouse (clickhouse.com)
	382 points by markhneedham 601 days ago

17 comments

ramraj07 600 days ago

Great to see it in ClickHouse.

Snowflake released a white paper back in its pre-IPO days and mentioned this same feature (secretly exploding JSON into columns). That explains why Snowflake feels faster than it should: they’ve quietly done a lot of amazing things and just offered it as a polished product, like Apple.

leetrout 600 days ago

Scratch data does this as well with duckdb

https://github.com/scratchdata/scratchdata

nojvek 600 days ago

Singlestore has been doing json -> column expansion for a while as well.

https://www.singlestore.com/blog/json-builtins-over-columnst...

For a colstore database, dealing with json as strings is a big perf hit.
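
To make "json -> column expansion" concrete, here is a minimal sketch of the idea, not how any of these engines implements it: every distinct path becomes its own column, and rows that lack a path get a null. The function names `flatten` and `to_columns` are made up for illustration.

```python
import json

def flatten(doc, prefix=""):
    """Yield (dotted_path, leaf_value) pairs for a nested JSON object."""
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value

def to_columns(json_lines):
    """Turn JSON strings into {path: [value per row]}, padding gaps with None."""
    rows = [dict(flatten(json.loads(line))) for line in json_lines]
    paths = sorted({p for row in rows for p in row})
    return {p: [row.get(p) for row in rows] for p in paths}

columns = to_columns([
    '{"user": {"id": 1, "name": "a"}, "ok": true}',
    '{"user": {"id": 2}, "latency_ms": 12.5}',
])
```

A query that only touches `user.id` can then read one list instead of re-parsing every string, which is where the perf difference comes from.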

statictype 600 days ago

Do you have a link to the Snowflake whitepaper?

JosephRedfern 600 days ago

Perhaps this: https://event.cwi.nl/lsde/papers/p215-dageville-snowflake.pd...

maccard 600 days ago

I've heard wonderful things about ClickHouse, but every time I try to use it, I get stuck on "how do I get data into it reliably". I search around, and inevitably end up with "by combining clickhouse and Kafka", at which point my desire to keep going drops to zero.

Are there any setups for reliable data ingestion into Clickhouse that don't involve spinning up Kafka & Zookeeper?

atombender 597 days ago

At my company we use Vector to ingest into ClickHouse. It works really well. Vector does buffering and retrying.

Vector is a relatively simple ingest tool that supports lots of sources and sinks. It's very simple to run — just a config file and a single binary, and you're set. But it can do a fair amount of ETL (e.g. enriching or reshaping JSON), including some more advanced pipeline operators like joining multiple streams into one. It's maybe not as advanced as some ETL tools, but it covers a lot of ground.

Since you mention Kafka, I would also mention Redpanda, which is Kafka-compatible, but much easier to run. No Java, no ZooKeeper. I think you'd still want Vector here, with Vector connecting Redpanda to ClickHouse. Then you don't need the buffering that Vector provides, and Vector would only act as the "router" that pulls from Redpanda and ingests into ClickHouse.

Another option is RudderStack, which we also use for other purposes. It's a richer tool with a full UI for setting up pipelines, and so on.

sdairs 600 days ago

Interesting, that's not a problem I've come across before particularly - could you share more?

Are you looking for setups for OSS ClickHouse or managed ClickHouse services that solve it?

Both Tinybird & ClickHouse Cloud are managed ClickHouse services that include ingest connectors without needing Kafka

Estuary (an ETL tool) just released Dekaf which lets them appear as a Kafka broker by exposing a Kafka-compatible API, so you can connect it with ClickHouse as if it was Kafka, without actually having Kafka (though I'm not sure if this is in the open source Estuary Flow project or not, I have a feeling not)

If you just want to play with CH, you can always use clickhouse-local or chDB which are more like DuckDB, running without a server, and work great for just talking to local files. If you don't need streams and are just working with files, you can also use them as an in-process/serverless transform engine - file arrives, read with chDB, process it however you need, export it as CH binary format, insert directly into your main CH. Nice little pattern that can run on a VM or in Lambdas.

maccard 600 days ago

Sure - I work in games, and we stream events from clients that we want to store in Clickhouse. We've got a native desktop application written in C++ that generates a json payload (we control the format of this). We don't need OSS, but we don't want a SaaS service - we want on-prem (or self managed). Clickhouse Cloud would be fine, TinyBird not.

> Estuary (an ETL tool) just released Dekaf which lets them appear as a Kafka broker by exposing a Kafka-compatible API

This is definitely an improvement, but if it looks like kafka and sounds like kafka, I get a bit sus.

> If you just want to play with CH, you can always use clickhouse-local

I've done that, but getting from this to "streaming data" is where I get stuck.

> If you don't need streams

Afraid streams are what I'm dealing with..

ruslan_talpa 600 days ago

I’ve got a lib/executable that you spin up and it gives you a rest api (read and write) for clickhouse if you are interested

latchkey 598 days ago

I know this is a separate point, but the pricing link on your website in the header, is broken.

DeathArrow 600 days ago

What's wrong with using Postgres, MySQL or SQL server?

maccard 600 days ago

They work fine, but this is a thread on Clickhouse.

Clickhouse sells itself as a DBMS for real time analytical reports, which is exactly what I want. But I can't compare the two because I've never managed to get it all stood up.

nrjames 600 days ago

I work in gaming and stream events into a self-hosted Clickhouse db without Kafka. We just use the CH python connector and send records in batches of 100K, using ReplacingMergeTree for backfills, etc. It works very well. Unless you truly need up-to-the-minute analytics, it’s super easy to schedule with Dagster or Airflow or whatever. We process 100M+ events per day this way.
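
A minimal sketch of the batching side of that setup, assuming nothing about the actual pipeline: `insert_batch` is a hypothetical callable standing in for whatever your client library uses to insert a list of rows, and the 100K default just mirrors the number above.

```python
from itertools import islice

def batched(records, size):
    """Yield lists of at most `size` records from any iterable."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def load(records, insert_batch, size=100_000):
    """Send records in batches; returns the number of batches sent."""
    sent = 0
    for chunk in batched(records, size):
        insert_batch(chunk)  # hypothetical: wraps your client's insert call
        sent += 1
    return sent

# Stand-in for a real insert: just collect the batches in memory.
received = []
batches_sent = load(range(250_000), received.append)
```

Because a ReplacingMergeTree table eventually collapses rows with the same sorting key, re-running `load` over an overlapping range for a backfill is less risky than it would be with a plain append-only table.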

jojohohanon 595 days ago

It’s fair tho. This conversation is “if you use clickhouse, then this is how you would solve x”

And a completely fair question is “why would I want to spin up a completely new stack when I have psql already installed?”

In my (very limited) experience you almost never do want to, but when you do, you wish you had started 6 months ago.

pbowyer 600 days ago

> but every time I try to use it, I get stuck on "how do I get data into it reliably"

That's the same stage I get stuck every time.

I have data emitters (in this example let's say my household IoT devices, feeding a MQTT broker then HomeAssistant).

I have where I want the data to end up (Clickhouse, Database, S3, whatever).

How do I get the data from A to B, so there are no duplicate rows (if the ACK for an upload isn't received when the upload succeeded), no missing rows (the data is retried if an upload fails), and some protection if the local system goes down (data isn't ephemeral)?

The easiest I've found is writing data locally to files (JSON, parquet, whatever), new file every 5 minutes and sync the older files to S3.

But then I'm stuck again. How do I continually load new files from S3 without any repetition or edge cases? And did I really need the intermediate files?

wiredfool 600 days ago

Easiest way is to post csv/json/whatever through the http endpoint into a replacing merge tree table.

Duplicates get merged out, and errors can be handled at the http level. (Admittedly, one bad row in a big batch post is a pain, but I don’t see that much)
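
Roughly what such a POST looks like, as a sketch that only builds the request and does not send it. The host, port 8123 (the default HTTP port) and the `events` table are assumptions, and the table name is interpolated without any escaping.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

def build_insert_request(base_url, table, rows):
    """Build (but do not send) a POST that inserts rows as JSONEachRow."""
    query = f"INSERT INTO {table} FORMAT JSONEachRow"
    # One JSON object per line is what the JSONEachRow format expects.
    body = "\n".join(json.dumps(row) for row in rows).encode("utf-8")
    url = f"{base_url}/?{urlencode({'query': query})}"
    return Request(url, data=body, method="POST")

req = build_insert_request(
    "http://localhost:8123",  # assumed local server
    "events",                 # hypothetical ReplacingMergeTree table
    [{"id": 1, "kind": "login"}, {"id": 2, "kind": "logout"}],
)
```

Sending it with `urllib.request.urlopen(req)` raises `urllib.error.HTTPError` on a 4xx/5xx response, which is the natural place to hang retry logic.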

Narhem 600 days ago

HTTP errors aren’t the most readable, although traditional database errors aren’t too readable most of the time.

wiredfool 600 days ago

What I meant is that you'll get an HTTP error code from the insert if it didn't work, so that can go through the error handling. This isn't really an "explore this thing", it's a "splat this data in, every minute/file/whatever". I've churned through TBs of CSVs this way, with a small preprocessor to fix some idiosyncratic formatting.

masterj 600 days ago

Cloudflare workers combined with their queues product https://developers.cloudflare.com/queues/ might be a cheap and easy way of solving this problem

maccard 600 days ago

This is _exactly_ my problem, and where I've found myself.

zbentley 600 days ago

This isn't appropriate for all use-cases, but one way to address your and GP's problem is as follows:

1. Aggregate (in-memory or on cheap storage) events in the publisher application into batches.

2. Ship those batches to S3/alike, NFS that clickhouse can read, or equivalent (even a dead-simple HTTP server that just receives file POSTs and writes them to disk, running on storage that clickhouse can reach). The tool you use here needs to be idempotent (retries of failed/timed out uploads don't mangle data), and atomic to readers (partially-received data is never readable).

3. In ClickHouse, run a scheduled refresh of a materialized view pointed at the uploaded data (either "SELECT ... INFILE" for local/NFS files, or "INSERT INTO ... SELECT s3(...)" for an S3/alike): https://clickhouse.com/docs/en/materialized-view/refreshable...

This is only a valid solution given specific constraints; if you don't match these, it may not work for you:

1. You have to be OK with the "experimental" status of refreshable materialized views. My and other users' experience with the feature seems generally positive at this point, and it has been out for a while.

2. Given your emitter data rates, there must exist a batch size of data which appropriately balances keeping up with uploads to your blob store and the potential of data loss if an emitter crashes before a batch is shipped. If you're sending e.g. financial transaction source-of-record data, then this will not work for you: you really do need a Kafka/alike in that case (if you end up here, consider WarpStream: an extremely affordable and low-infrastructure Kafka clone backed by batching accumulators in front of S3: https://www.warpstream.com/ If their status as a SaaS or recent Confluent acquisition turns you off, fair enough.)

3. Data staleness of up to emitter-flush-interval + worst-case-upload-time + materialized-view-refresh-interval must be acceptable to you.

4. Reliability wise, the staging area for shipped batches (S3, NFS, scratch directory on a clickhouse server) must be sufficiently reliable for your use case, as data will not be replicated by clickhouse while it's staged.

5. All uniqueness/transformations must be things you can express in your materialized view's query + engine settings.
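
For step 2 above, one way to get "idempotent" and "atomic to readers" on a local POSIX filesystem is a content-derived file name plus write-then-rename. This is a sketch under those assumptions; `stage_batch` and the `.jsonl`/`.tmp` suffixes are invented for the example, and behavior on NFS or object stores differs.

```python
import hashlib
import os
import tempfile

def stage_batch(directory, payload: bytes) -> str:
    """Write payload under a content-derived name; safe to retry."""
    name = hashlib.sha256(payload).hexdigest() + ".jsonl"
    final_path = os.path.join(directory, name)
    if os.path.exists(final_path):
        return final_path  # retry of a batch that already landed
    # Write to a temp file in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, final_path)  # atomic rename: readers see all or nothing
    return final_path

staging = tempfile.mkdtemp()
first = stage_batch(staging, b'{"id": 1}\n')
second = stage_batch(staging, b'{"id": 1}\n')  # simulated retry
staged_files = sorted(os.listdir(staging))
```

Whatever reads the staging directory has to ignore `*.tmp` files for the atomicity guarantee to hold.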

maccard 600 days ago

Thanks for the well thought out reply here. I understand the solution you're proposing, but the thing is that it fails at the first hurdle.

> 1. Aggregate (in-memory or on cheap storage) events in the publisher application into batches.

Clickhouse's Tagline on their website is:

> Build real-time data products that scale

Except, the minute we start having to batch data to process it and stage it, we lose the "real time" part. If I'm shipping them to S3 to have clickhouse batch ingest them, I might as well use Databricks, Snowflake, or just parquet-on-s3.

zbentley 599 days ago

All very fair, though I think your issue may be more with the nature of real-time analytics ingestion pipelines in general than with Clickhouse itself.

Even if you could remove all of the operational burden from Kafka or equivalent, hooking it up to Clickhouse is still, at the end of the day, going to commit in batches (of max_insert_block_size, or kafka_max_block_size, or smaller batches polled from the message broker). Even with no consumer lag, that's still going to incur a delay before your data is SELECTable.

Heck, even Kafka publishers usually don't flush (actually send over the network) after every publish by default.

That same tradeoff comes up in Snowflake and Databricks (albeit mitigated when using Continuous Processing, which is experimental and expensive computationally and monetarily). Their ingestion systems are batching as well.

At the end of the day, "real time" means different things to different people, and you'll have to choose between one of several architectures:

- Clients synchronously insert data (which is then immediately visible) into your analytics store. ClickHouse is less good at handling a barrage of single-row INSERTs than other DBs, but none of them are good at this type of workload at even medium scale. Even manually shipping single-update files to S3 gets expensive and slow fast.

- Batch your inserts and accept bounded lag in data visibility. Doesn't matter whether batching is client-side, database-side, or in an intermediate broker/service.

- Ship your data asynchronously via messaging/streaming/batching and force point-in-time queries to wait for some indication that asynchronous data for the requested point in time has arrived. For example, when batching manually you could delay queries until a batch subsequent to the time-of-query has arrived, or when using Kafka you could wait for the system of record's last-committed-kafka-message-id to pass your topic's max ID at the time of query.
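
The third option can be sketched as a small wait loop. Assumptions: batches arrive in order, and `latest_batch_start` is a hypothetical callable returning the start time of the newest batch that has landed; none of this is a ClickHouse API.

```python
import time

def wait_until_complete(query_time, latest_batch_start, poll_interval=0.5,
                        timeout=30.0, sleep=time.sleep, clock=time.monotonic):
    """Block until a batch that started after query_time has arrived.

    Returns True once data up to query_time can be assumed present,
    False if the timeout expires first.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if latest_batch_start() > query_time:
            return True
        sleep(poll_interval)
    return False

# Simulated arrivals: the newest batch start moves from 100 to 130.
arrivals = iter([100, 100, 130])
ready = wait_until_complete(120, lambda: next(arrivals), sleep=lambda s: None)
```

The Kafka variant has the same shape: swap the batch start time for the last committed message ID and compare it against the topic's max ID captured at query time.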

aynyc 600 days ago

My experience and knowledge with CH is about 3-4 years old now, so I might be talking out of ignorance at this point.

There are plenty of ways to do it with batching, but if you want a real-time "insert into table" style or a direct "ch.write(data)", then no. There is no way as far as I know without batching. This is one of the main reasons we stopped using CH for our last project about 3 years ago, for financial data analytics tooling. CH doesn't have a transaction log like WAL, so your data producers need to be smart or you need a "queue" type service to deal with it, whether it's S3 or Kafka or Kinesis to allow batching.

lossolo 600 days ago

> I search around, and inevitably end up with "by combining clickhouse and Kafka"

Those are probably some old sources of knowledge. You need to use Kafka if you want it to handle batching for you. But Clickhouse can handle batching as well by using asynchronous inserts:

https://clickhouse.com/blog/asynchronous-data-inserts-in-cli...
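
As a hedged sketch of what turning this on per request could look like over the HTTP interface: `async_insert` and `wait_for_async_insert` are the setting names as I understand them from the ClickHouse docs, so check them against your version. The host and table are assumptions.

```python
from urllib.parse import urlencode

def async_insert_url(base_url, table, wait_for_ack=True):
    """Build an insert URL that asks the server to batch on its side."""
    params = {
        "query": f"INSERT INTO {table} FORMAT JSONEachRow",
        "async_insert": 1,
        # 1: respond only after the buffered data has been flushed;
        # 0: respond as soon as the data is buffered (fire-and-forget).
        "wait_for_async_insert": 1 if wait_for_ack else 0,
    }
    return f"{base_url}/?{urlencode(params)}"

url = async_insert_url("http://localhost:8123", "events")
```

With acknowledgement enabled the client gets a stronger delivery signal at the cost of waiting for the server-side flush.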

DeathArrow 600 days ago

It seems you can use JSON, CSV and Parquet: https://clickhouse.com/docs/en/integrations/data-formats

turtlebits 600 days ago

There is an HTTP endpoint, client database drivers, CLI tool and third party tools like Vector, Redpanda Connect?

What makes Clickhouse different that you're unable to load data into it?

BohuTANG 596 days ago

Yes, reliable data ingestion often involves Kafka, which can feel complex. An alternative is the transactional COPY INTO approach used by platforms like Snowflake and Databend. This command supports "exactly-once" ingestion, ensuring data is fully loaded or not at all, without requiring message queues or extra infrastructure.

https://docs.databend.com/sql/sql-commands/dml/dml-copy-into...

two_handfuls 600 days ago

Not sure if it's enough for you but there is RedPanda, a Zookeeper-less Kafka.

shawabawa3 600 days ago

I had success loading data with vector.dev

jacobsenscott 599 days ago

This is what we do - works well.

matter_and_mind 598 days ago

I run a fairly large Clickhouse cluster for advertising data with millions of events every minute streaming in. We use fluentd as a buffer which batches data for up to n records/n minutes and does batch inserts to clickhouse. It's not realtime but close enough, and we have found it to be pretty reliable.
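
The "n records/n minutes" rule is easy to picture as a buffer that flushes on whichever limit is hit first. This is a toy sketch, not fluentd's implementation: the class name and defaults are invented, and age is only checked when a record arrives, so a real buffer would also need a timer.

```python
import time

class Buffer:
    """Collect records and flush when a count or an age limit is reached."""

    def __init__(self, flush, max_records=1000, max_age_s=60.0,
                 clock=time.monotonic):
        self._flush = flush          # callable that receives a list of records
        self._max_records = max_records
        self._max_age_s = max_age_s
        self._clock = clock
        self._records = []
        self._oldest = None          # arrival time of the first buffered record

    def add(self, record):
        if not self._records:
            self._oldest = self._clock()
        self._records.append(record)
        too_many = len(self._records) >= self._max_records
        too_old = self._clock() - self._oldest >= self._max_age_s
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self._records:
            self._flush(self._records)
            self._records = []

# Drive it with a fake clock so the behavior is deterministic.
flushed = []
now = [0.0]
buf = Buffer(flushed.append, max_records=3, max_age_s=10, clock=lambda: now[0])
for i in range(3):
    buf.add(i)       # the third add hits the count limit
buf.add(3)
now[0] = 15.0
buf.add(4)           # the age limit has passed since record 3 arrived
```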

_peregrine_ 600 days ago

I think Tinybird is a nice option here. It's sort of a managed service for ClickHouse with some other nice abstractions. For your streaming case, they have an HTTP endpoint that you can stream to that accepts up to 1k EPS and you can micro-batch events if you need to send more events than that. They also have some good connectors for BigQuery, Snowflake, DynamoDB, etc.

amanj41 600 days ago

Not sure if ClickHouse needs ZK but FWIW Kafka has a raft implementation which now obviates need for ZK

dtjohnnymonkey 600 days ago

ClickHouse does need ZK but they have their own implementation.

ramraj07 600 days ago

Where are you loading the data from? I had no trouble loading data from s3 parquet.

maccard 600 days ago

I'm streaming data from a desktop application written in C++. It's the step to get it into parquet in the first place.

mplanchard 600 days ago

We use this Rust library to do individual and batch inserts: https://docs.rs/clickhouse/latest/clickhouse/

The error messages for batch inserts are TERRIBLE, but once it’s working it just hums along beautifully.

I’d be surprised if there isn’t a similar library for C++, as I believe clickhouse itself is written in C++

andag 600 days ago

There is an http API and it can eat json and csv too (as well as tons of others)

hisnameisjimmy 600 days ago

Fivetran has a destination for it: https://fivetran.com/docs/destinations/clickhouse

VeejayRampay 600 days ago

I was glad in the past few years to discover that I am not alone in finding Kafka off-putting / way too convoluted

dtjohnnymonkey 600 days ago

Where is your data coming from? I’m curious what prevents you from inserting the data into Clickhouse without Kafka.

barumrho 600 days ago

How do you do this with other DBs?

everfrustrated 600 days ago

>Dynamically changing data: allow values with different data types (possibly incompatible and not known beforehand) for the same JSON paths without unification into a least common type, preserving the integrity of mixed-type data.

I'm so excited for this! One of my major bug-bears with storing logs in Elasticsearch is the set-type-on-first-seen-occurrence headache.

Hope to see this leave experimental support soon!

atombender 600 days ago

I never understood why ELK/Kibana chose this method, when there's a much simpler solution: Augment each field name with the data type.

For example, consider the documents {"value": 42} and {"value": "foo"}. To index this, index {"value::int": 42} and {"value::str": "foo"} instead. Now you have two distinct fields that don't conflict with each other.

To search this, the logical choice would be to first make sure that the query language is typed. So a query like value=42 would know to search the int field, while a query like value="42" would look in the string field. There's never any situation where there's any ambiguity about which data type is to be searched. KQL doesn't have this, but that's one of their many design mistakes.

You can do the same for any data type, including arrays and objects. There is absolutely no downside; I've successfully implemented it for a specific project. (OK, one downside: More fields. But that's the nature of the beast. These are, after all, distinct sets of data.)
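
A small sketch of the scheme, for illustration only. The `::int` and `::str` suffixes come from the example above; the other suffix names and the function names are assumptions.

```python
def type_suffix(value):
    """Map a JSON value to the suffix used in the indexed field name."""
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "str"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    return "null"

def typed_fields(doc):
    """Rewrite {"value": 42} as {"value::int": 42} before indexing."""
    return {f"{key}::{type_suffix(value)}": value for key, value in doc.items()}

def field_for_query(name, literal):
    """Pick the field a typed query literal should search."""
    return f"{name}::{type_suffix(literal)}"

indexed = [typed_fields({"value": 42}), typed_fields({"value": "foo"})]
```

So `value=42` resolves to `value::int` and `value="42"` resolves to `value::str`, with no ambiguity at query time.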

mr_toad 600 days ago

> For example, consider the documents {"value": 42} and {"value": "foo"}. To index this, index {"value::int": 42} and {"value::str": "foo"} instead. Now you have two distinct fields that don't conflict with each other.

But now all my queries that look for “value” don’t work. And I’ve got two columns in my report where I only want one.

atombender 600 days ago

The query layer would of course handle this. ELK has KQL, which could do it for you, but it doesn't. That's why I'm saying it's a design mistake.

If your data mixes data types, I would argue that your report (whatever that is) _should_ get two columns.

abe94 601 days ago

We've been waiting for more JSON support for Clickhouse - the new type looks promising - and the dynamic column and not needing to specify subtypes are particularly helpful for us.

breadwinner 600 days ago

If you're evaluating ClickHouse take a look at Apache Pinot as well. ClickHouse was designed for single-machine installations, although it has been enhanced to support clusters. But this support is lacking, for example if you add additional nodes it is not easy to redistribute data. Pinot is much easier to scale horizontally. Also take a look at star-tree indexes of Pinot [1]. If you're doing multi-dimensional analysis (Pivot table etc.) there is a huge difference in performance if you take advantage of star-tree.

[1] https://docs.pinot.apache.org/basics/indexing/star-tree-inde...

zX41ZdbW 600 days ago

> ClickHouse was designed for single-machine installations

This is incorrect. ClickHouse is designed for distributed setups from the beginning, including cross-DC installations. It has been used on large production clusters even before it was open-sourced. When it became open-source in June 2016, the largest cluster was 394 machines across 6 data-centers with 25 ms RTT between the most distant data-centers.

justCHurious 600 days ago

On a side note, can someone please comment on this part

> for example if you add additional nodes it is not easy to redistribute data.

This is precisely one of the issues I predict we'll face with our cluster as we're ramping up OTEL data and it's being sent to a small cluster, and I'm deathly afraid that it will continue sending to every shard in equal measure without moving around existing data. I can not find any good method of redistributing the load other than "use the third party backup program and pray it doesn't shit the bed".

MBkkt 599 days ago

It's like saying that postgres was designed for distributed setups, just because there are large postgres installations. We all understand that clickhouse (and postgres) are great databases. But it's strange to call them designed for distributed setups. How about insertion not through a single master? Scalable replication? And a bunch of other important features -- not just the ability to keep independent shards that can be queried in single query

zX41ZdbW 599 days ago

ClickHouse does not have a master replica (every replica is equal), and every machine processes inserts in parallel. It allocates block numbers through the distributed consensus in Keeper. This allows for a very high insertion rate, with several hundred million rows per second in production. The cluster can scale both by the number of shards and by the number of replicas per shard.

Scaling by the number of replicas of a single shard is less efficient than scaling by the number of shards. For ReplicatedMergeTree tables, due to physical replication of data, it is typically less than 10 replicas per shard, where 3 replicas per shard are practical for servers with non-redundant disks (RAID-0 and JBOD), and 2 replicas per shard are practical for servers with more redundant disks. For SharedMergeTree (in ClickHouse Cloud), which uses shared storage and does not physically replicate data (but still has to replicate metadata), the practical number of replicas is up to 300, and inserts scale quite well on these setups.

cvalka 600 days ago

Absolutely incorrect. ClickHouse was created by Yandex and it's cluster ready from day one.

anentropic 599 days ago

Or Apache Doris, which sounds more Clickhouse-y in its performance properties from what I've read

(disclaimer: I have not used either yet)

Plus it has a MySQL-flavoured client connector where Clickhouse does its own thing, so may be easier to integrate with some existing tools.

haolez 600 days ago

What's the use case? Analytics on humongous quantities of data? Something besides that?

breadwinner 600 days ago

Use case is "user-facing analytics", for example consider ordering food from Uber Eats. You have thousands of concurrent users, latency should be in milliseconds, and things like delivery time estimate must be updated in real-time.

Spark can do analysis on huge quantities of data, and so can Microsoft Fabric. What Pinot can do that those tools can't is extremely low latency (milliseconds vs. seconds), concurrency (1000s of queries per second), and ability to update data in real-time.

Excellent intro video on Pinot: https://www.youtube.com/watch?v=_lqdfq2c9cQ

listenallyall 600 days ago

I don't think Uber's estimated time-to-arrival is a statistic that a database vendor, or development team, should brag about. It's horribly imprecise.

akavi 600 days ago

Also isn't something that a (geo)sharded postgres DB with the appropriate indexes couldn't handle with aplomb. Number of orders to a given restaurant can't be more than a dozen a minute or so.

SoftTalker 600 days ago

Especially as restaurants have a limit on their capacity to prepare food. You can't just spin up another instance of a staffed kitchen. Do these mobile-food-ordering apps include any kind of backdown on order acceptance e.g. "Joe's Diner is too busy right now, do you want to wait or try someplace else?"

cyanydeez 600 days ago

What about its ability to choose pricing based on source-destination and projected incomes?

listenallyall 599 days ago

All you need for this is a dictionary of zip codes and a rating -- normal, high, very high. Given that ZIPs are 5 digits, that's 100,000 records max, just keep it in memory, you don't even need entries for the "normal" ZIPs. Even if you went street-level, I doubt you'd catalog more than a few hundred thousand streets whose income is significantly more than the surrounding area.

All of this ignores the fact that adjusting a restaurant's prices by the customer's expected ability to pay often leads to killing demand among your most frequent and desirable clientele, but that's a different story.

whalesalad 600 days ago

I thought “humongous quantities of data” was a baseline assumption for a discussion involving clickhouse et al.?

haolez 600 days ago

It was a genuine question. I was really curious about other use cases besides the obvious one.

notamy 600 days ago

Clickhouse is great stuff. I use it for OLAP with a modest database (~600mil rows, ~300GB before compression) and it handles everything I throw at it without issues. I'm hopeful this new JSON data type will be better at a use-case that I currently solve with nested tuples.

jabart 600 days ago

Similar for us except 700mil rows in one table, 2.5 billion total rows. That's growing quickly because we started shoving OTEL to the cluster. None of our queries seem to faze Clickhouse. It's like magic. The 48 cores per node also help.

philosopher1234 600 days ago

Postgres should be good enough for 300GB, no?

wiredfool 600 days ago

I had a postgres database where the main index (160gb) was larger than the entire equivalent clickhouse database (60gb). And between the partitioning and the natural keys, the primary key index in clickhouse was about 20k per partition * ~ 1k partitions.

Now, it wasn't a good schema to start with, and there was about a factor of 3 or 4 size that could be pulled out, but clickhouse was a factor of 20 better for on disk size for what we were doing.

marginalia_nu 600 days ago

At least in my experience, that's about when regular DBMS:es kinda start to suck for ad-hoc queries. You can push them a bit farther for non-analytical usecases if you're really careful and have prepared indexes that assist every query you make, but that's rarely a luxury you have in OLAP-land.

tempest_ 600 days ago

It depends, if you want to do any kind of aggregation, counts, or count distinct pg falls over pretty quickly.

notamy 600 days ago

Probably, but Clickhouse has been zero-maintenance for me + my dataset is growing at 100~200GB/month. Having the Clickhouse automatic compression makes me worry a lot less about disk space.

whalesalad 600 days ago

For write heavy workloads I find psql to be a dog tbh. I use it everywhere but am anxious to try new tools.

For truly big data (terabytes per month) we rely on BigQuery. For smaller data that is more OLTP write heavy we are using psql… but I think there is room in the middle.

jacobsenscott 599 days ago

Yes, but you're starting to get to the size where you need some real PG expertise to keep the wheels on. If your data is growing CH will just work out of box for a lot longer.

CSDude 600 days ago

When I tried it a few weeks ago, weird JSON keys resulted in very long filenames containing slashes, because ClickHouse names the files based on column names. That did not play well with the file system and gave errors. I wonder if that is fixed?

setr 600 days ago

Isn’t that the issue challenge #3 addresses?

https://clickhouse.com/blog/a-new-powerful-json-data-type-fo...

CSDude 600 days ago

Tried with the latest version, but it doesn't solve it.

    CREATE TABLE mk3
    ENGINE = MergeTree
    ORDER BY (account_id, resource_type)
    SETTINGS allow_nullable_key = 1
    AS SELECT
        *,
        CAST(content, 'JSON') AS content_json
    FROM file('Downloads/data_snapshot.parquet')

    Query id: 8ddf1377-7440-4b4d-bb8d-955cd0f2b723

    ↑ Progress: 239.57 thousand rows, 110.38 MB (172.49 thousand rows/s., 79.48 MB/s.) 22%
    Elapsed: 4.104 sec. Processed 239.57 thousand rows, 110.38 MB (58.37 thousand rows/s., 26.89 MB/s.)

    Received exception:
    Code: 107. DB::ErrnoException: Cannot open file /var/folders/mc/gndsp71j6zz64pm7j2wz_6lh0000gn/T/clickhouse-local-503e1494-c3fb-4a5e-9514-be5ba7940fec/data/default/mk3/tmp_insert_all_1_1_0/content_json.plan.features.available.core/audio.dynamic_structure.bin: , errno: 2, strerror: No such file or directory. (FILE_DOESNT_EXIST)

Thorrez 600 days ago

>For example, if we have two integers and a float as values for the same JSON path a, we don’t want to store all three as float values on disk

Well, if you want to do things exactly how JS does it, then storing them all as float is correct. However, The JSON standard doesn't say it needs to be done the same way as JS.
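
Python's standard `json` module is one example of a parser that does not follow the JS behavior: integers and floats come back as distinct types, and large integers keep their exact value. A short illustration:

```python
import json

values = json.loads("[1, 2, 3.5]")
type_names = [type(v).__name__ for v in values]

# 2**53 + 1 cannot be represented exactly as a 64-bit float,
# so forcing it through float loses the last digit.
big = json.loads("9007199254740993")
survives_as_int = (big == 2**53 + 1)
survives_as_float = (int(float(big)) == 2**53 + 1)
```

Here `type_names` shows two ints and a float, and only the integer representation round-trips the large value, which is one reason not to unify everything into floats on disk.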

barumrho 600 days ago

The new Variant type exists independently of JSON support, so it seems good that they handle it properly.

kreetx 600 days ago

This seems similar to storing any JSON type in the column instead of one specific part (int, string, array) of JSON, much like "enum with fields" in Swift, Kotlin or Rust, or algebraic data types in Haskell - a feature not present in many other languages.

jojohohanon 595 days ago

I’m a few years removed, but isn’t this how google capacitor stores protobufs (which are ~ equivalent to json in what they can express)?

jakozaur 600 days ago

Looks like Snowflake was the first popular warehouse to have variant type which could put JSON values into separate columns.

It turned out to be a great idea that inspired other databases.

karsinkk 600 days ago

Oracle 23ai also has a similar feature that "explodes" JSON into relational tables/columns for storage while still providing JSON based access API's : https://www.oracle.com/database/json-relational-duality/

officex 601 days ago

Great to see! I remember checking you guys out in Q1, great team

fuziontech 600 days ago

Using ClickHouse is one of the best decisions we've made here at PostHog. It has allowed us to scale performance all while allowing us to build more products on the same set of data.

Since we've been using ClickHouse long before this JSON functionality was available (or even before the earlier version of this called `Object('json')` was available) we ended up setting up a job that would materialize json fields out of a json blob and into materialized columns based on query patterns against the keys in the JSON blob. Then, once those materialized columns were created we would just route the queries to those columns at runtime if they were available. This saved us a _ton_ on CPU and IO utilization. Even though ClickHouse uses some really fast SIMD JSON functions, the best way to make a computer go faster is to make the computer do less and this new JSON type does exactly that and it's so turn key!

https://posthog.com/handbook/engineering/databases/materiali...
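
A toy sketch of the general pattern described here, not PostHog's actual job: count which JSON keys queries read, pick the hot ones to materialize, and route each key to its column when one exists. The `mat_` prefix, the `properties` column and the threshold are assumptions; `JSONExtractString` is ClickHouse's string extraction function.

```python
from collections import Counter

def pick_hot_keys(query_log, min_hits):
    """Return JSON keys read at least min_hits times: candidates to materialize."""
    counts = Counter(key for query_keys in query_log for key in query_keys)
    return {key for key, hits in counts.items() if hits >= min_hits}

def column_expr(key, materialized):
    """Route a key to its materialized column if present, else parse the blob."""
    if key in materialized:
        return f"mat_{key}"  # hypothetical column naming scheme
    return f"JSONExtractString(properties, '{key}')"

# Each entry lists the JSON keys one query touched.
query_log = [["browser", "plan"], ["browser"], ["browser", "country"], ["plan"]]
materialized = pick_hot_keys(query_log, min_hits=2)
hot_expr = column_expr("browser", materialized)
cold_expr = column_expr("country", materialized)
```

The routing step is what saves the CPU: hot keys never touch the JSON parser at query time, while rarely used keys still work through the fallback.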

The team over at ClickHouse Inc. as well as the community behind it moves surprisingly fast. I can't recommend it enough and excited for everything else that is on the roadmap here. I'm really excited for what is on the horizon with Parquet and Iceberg support.

baq 601 days ago

Clickhouse is criminally underused.

It's common knowledge that 'postgres is all you need' - but if you somehow reach the stage of 'postgres isn't all I need and I have hard proof' this should be the next tech you look at.

Also, clickhouse-local is rather amazing at csv processing using sql. Highly recommended for when you are fed up with google sheets or even excel.

mrsilencedogood 600 days ago

This is my take too. At one of my old jobs, we were early (very early) to the Hadoop and then Spark games. Maybe too early, because by the time Spark 2 made it all easy, we had already written a lot of mapreduce-streaming and then some RDD-based code. Towards the end of my tenure there, I was experimenting with alternate datastores, and clickhouse was one I evaluated. It worked really, really well in my demos. But I couldn't get buy-in because management was a little wary of the russian side of it (which they have now distanced/divorced from, I think?) and also they didn't really have the appetite for such a large undertaking anymore. (The org was going through some things.) (So instead a different team blessed by the company owner basically DIYd a system to store .feather files on NVME SSDs... anyway).

If I were still there, I'd be pushing a lot harder to finally throw away the legacy system (which has lost so many people it's basically ossified, anyway) and just "rebase" it all onto clickhouse and pyspark sparksql. We would throw away so much shitty cruft, and a lot of the newer mapreduce and RDD code is pretty portable to the point that it could be plugged into RDD's pipe() method.

Anyway. My current job, we just stood up a new product that, from day 1, was ingesting billions of rows (event data) (~nothing for clickhouse, to be clear. but obviously way too much for pg). And it's just chugging along. Clickhouse is definitely in my toolbox right after postgres, as you state.

osigurdson 600 days ago

Agree. CH is a great technology to have some awareness of. I use it for "real things" (100B+ data points) but honestly it can really simplify little things as well.

I'd throw in one more to round it out however. The three rings of power are Postgres, ClickHouse and NATS. Postgres is the most powerful ring however and lots of times all you need.

oulipo 600 days ago

would you recommend clickhouse over duckdb? and why?

nasretdinov 600 days ago

IMO the only reason to not use ClickHouse is when you either have a "small" amount of data or "small" servers (<100 GB of data, servers with <64 GB of RAM). Otherwise ClickHouse is a better solution since it's a standalone DB that supports replication and in general has very, very robust cluster support, easily scaling to hundreds of nodes.

Typically when you discover the need for OLAP DB is when you reach that scale, so I'm personally not sure what the real use case for DuckDB is to be completely honest.

justCHurious 600 days ago

There is another place where you should not use CH, and it's in a system with shared resources. CH loves, and earned the right, to have spikes of hogging resources. They even allude to this on the Keeper setup - if you put the nodes for the two systems in the same machine, CH will inevitably push Keeper off the bed and the two will come to a disagreement. You should not have it on a k8s Pod for that reason, for example. But then again, you shouldn't have ANY storage of that capacity in a k8s pod anyways.

geysersam 600 days ago

DuckDB probably performs better per core than clickhouse does for most queries. So as long as your workload fits on a single machine (it's likely that it does) it's often the most performant option.

Besides, it's so simple, just a single executable.

Of course if you're at a scale where you need a cluster it's not an option anymore.

zX41ZdbW 600 days ago

The good parts of DuckDB that you've mentioned, including the fact that it is a single-executable, are modeled after ClickHouse.

RyanHamilton 600 days ago

Can you provide a reference for that belief? To me that's not true. They started from solving very different problems.

geysersam 600 days ago

I didn't express myself well. What I meant to say was that Duckdb runs a single process. That simplifies things.

Clickhouse typically runs several processes (server, clients) interacting and that already makes things more complicated (and more powerful!).

That's not to say one is good and the other bad, they're just quite different tools.

PeterCorless 600 days ago

Note that every use case is different and YMMV.

https://www.vantage.sh/blog/clickhouse-local-vs-duckdb

hn1986 600 days ago

Great link. Curious how it compares now that Duckdb is 1.0+

theLiminator 600 days ago

Not to mention polars, datafusion, etc. Single node OLAP space is really heating up.

fiddlerwoaroof 600 days ago

Clickhouse scales from a local tool like Duckdb to a database cluster that can back your reporting applications and other OLAP applications.

CalRobert 600 days ago

Clickhouse and Postgres are just different tools though - OLTP vs OLAP.

fiddlerwoaroof 600 days ago

It’s fairly common in my experience for reports to initially be driven by a Postgres database until you hit data volumes Postgres cannot handle.

peteforde 600 days ago

I admit that I didn't read the entire article in depth, but I did my best to meaningfully skim-parse it.

Can someone briefly explain how or if adding data types to JSON - a standardized grammar - leaves something that still qualifies as JSON?

I have no problem with people creating supersets of JSON, but if my standard lib JSON parser can't read your "JSON" then wouldn't it be better to call it something like "CH-JSON"?

If I am wildly missing something, I'm happy to be schooled. The end result certainly sounds cool, even though I haven't needed ClickHouse yet.

ekimekim 600 days ago

There are two concepts which are being used interchangeably here.

The first is JSON as a data encoding, ie. the particular syntax involving braces and quotes and commas and string escapes.

The second is JSON as a data type, ie. a value which may be a string, number, bool, null, array of such values, or map from string to such values. The JSON data type is the set of values which can be represented by the JSON data encoding.

The article describes an optimized storage format for storing values which have the JSON data type. It is not related to JSON the data encoding, except in that it allows input and output using that encoding.

This is the same thing as postgres' JSONB type, which is also an optimized storage format for values of the JSON data type (internally it uses a binary representation).
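
The encoding-versus-data-type distinction above can be shown with a small Python sketch using only the standard `json` module. The two sample documents are made up for illustration; they differ in whitespace, key order, and string escapes, yet decode to the same value.

```python
import json

# Two different JSON *encodings*: different whitespace, key order, and escapes.
text_a = '{"name": "\\u0041BC", "tags": ["a", "b"], "n": 1}'
text_b = '{\n  "n": 1,\n  "tags": ["a","b"],\n  "name": "ABC"\n}'

# Both decode to the same *value* of the JSON data type.
value_a = json.loads(text_a)
value_b = json.loads(text_b)

print(text_a == text_b)    # False: the encodings differ
print(value_a == value_b)  # True: the values are equal

# A store can keep the value in any internal layout and still
# hand back standard JSON text on output.
round_tripped = json.loads(json.dumps(value_a))
```

A database only has to preserve the value; how it lays that value out internally is invisible to any standard JSON parser reading the output.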

chirau 600 days ago

The article is about the internal storage mechanics of ClickHouse and how it optimizes handling JSON data behind the scenes. The data types like Dynamic and Variant that are discussed are part of ClickHouse’s internal mechanisms to improve performance, specifically for columnar storage of JSON data. The optimizations just help ClickHouse process and store data more efficiently.

The data remains standard JSON and so standard JSON parsers wouldn’t be affected since the optimizations are part of the storage layer and not the JSON structure itself.

chipdart 600 days ago

> The data remains standard JSON and so standard JSON parsers wouldn’t be affected (...)

No, not really.

The blog post talks about storing JSON data in a column-oriented database.

The blog post talks about importing data from JSON docs into their database. Prior to this, they stored JSON documents in their database like any standard off-the-shelf database does. Now they parse the JSON document when importing, and they store those values in their column-oriented database as key-value pairs, and preserve type information.
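
As a rough illustration of "parse on import, store values per key, preserve type information", here is a minimal Python sketch. It is an assumption-laden toy, not ClickHouse's actual storage format: the helper names, the dotted-path convention, and the `(type_name, value)` cell layout are all invented for this example.

```python
from collections import defaultdict


def flatten(doc: dict, prefix: str = "") -> dict:
    """Flatten nested objects into {"a.b": value}; arrays stay as leaf values."""
    out = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, path))
        else:
            out[path] = value
    return out


def to_columns(docs: list[dict]) -> dict:
    """Build one column per path.

    Each cell is (type_name, value), or None when the path is absent
    from that document.
    """
    columns = defaultdict(lambda: [None] * len(docs))
    for row, doc in enumerate(docs):
        for path, value in flatten(doc).items():
            columns[path][row] = (type(value).__name__, value)
    return dict(columns)


docs = [
    {"user": {"id": 1, "name": "ann"}, "ok": True},
    {"user": {"id": "x7"}, "latency_ms": 12.5},
]
columns = to_columns(docs)
for path, cells in columns.items():
    print(path, cells)
```

Note that `user.id` holds an integer in one row and a string in the other, so a per-cell type tag is needed; a query that touches only `user.id` never has to read the other paths.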

The silly part is that this all sounds like a project for an intern who was tasked with adding support for importing data stored in JSON files into a column-oriented database, and an exporter along with it. But no, it seems an ETL job now counts as inventing JSON.

zahlman 600 days ago

Clickhouse is a DBMS. What I understood: by "a new JSON data type for ClickHouse", they don't mean "a new data type added to the JSON standard for the benefit of ClickHouse", but rather "a new data type recognized by ClickHouse (i.e., that can be represented in its databases) which is used for storing JSON data".

lemax 600 days ago

As far as I understand they're talking about the internal storage mechanics of ClickHouse, these aren't user exposed JSON data types, they just power the underlying optimizations they're introducing.

selcuka 600 days ago

Which is the same as PostgreSQL [1] or SQLite [2] that can store JSON values in binary formats (both called JSONB) but when you "SELECT" it you get standard JSON.

[1] https://www.postgresql.org/docs/current/datatype-json.html

[2] https://www.sqlite.org/json1.html

lucianbr 600 days ago

They both store JSON, each in some particular way, but they don't both store it in the same way. Just like they both store tabular data, but not in the same way, and therefore get different performance characteristics.

Are you arguing that since Clickhouse is a database like Postgres, there's no point for CH to exist as we already have Postgres? Column-oriented databases have their uses.

selcuka 600 days ago

> Are you arguing that [...] there's no point for CH to exist

Wow, that escalated quickly. You are reading too much into my comment. You should read the comment thread from the beginning to understand which question I'm replying to.

chipdart 600 days ago

> Can someone briefly explain how or if adding data types to JSON - a standardized grammar - leaves something that still qualifies as JSON?

I had to scroll way down the article, passing over tons of what feel like astroturfing comments advertising a vendor and their product line, to see the very first comment pointing out the elephant in the room.

I agree, whatever is described in the blog post is clearly not JSON. It's a data interchange format, and it might be mappable to JSON under the right circumstances, but JSON it is not. It's not even a superset or a subset.

I mean, by the same line of reasoning TOML, CSV, and YAML are all JSON. Come on. Even BSON is described as a different format that can be transcoded to JSON.

The article reads like a cheap attempt to gather attention to a format that otherwise would not justify it.

TRiG_Ireland 600 days ago

I don't think it's a data interchange format at all. It's entirely internal to the ClickHouse database. But it supports JSON semantics in a way that databases generally don't.

anonygler 600 days ago

I keep misreading this company as ClickHole and expecting some sort of satirical content.