Hacker News new | ask | show | jobs
by peteforde 600 days ago
I admit that I didn't read the entire article in depth, but I did my best to meaningfully skim-parse it.

Can someone briefly explain how or if adding data types to JSON - a standardized grammar - leaves something that still qualifies as JSON?

I have no problem with people creating supersets of JSON, but if my standard lib JSON parser can't read your "JSON" then wouldn't it be better to call it something like "CH-JSON"?

If I am wildly missing something, I'm happy to be schooled. The end result certainly sounds cool, even though I haven't needed ClickHouse yet.

5 comments

There are two concepts which are being used interchangably here.

The first is JSON as a data encoding, ie. the particular syntax involving braces and quotes and commas and string escapes.

The second is JSON as a data type, ie. a value which may be a string, number, bool, null, array of such values, or map from string to such values. The JSON data type is the set of values which can be represented by the JSON data encoding.

The article describes an optimized storage format for storing values which have the JSON data type. It is not related to JSON the data encoding, except in that it allows input and output using that encoding.

This is the same thing as postgres' JSONB type, which is also an optimized storage format for values of the JSON data type (internally it uses a binary representation).

The article is about the internal storage mechanics of ClickHouse and how it optimizes handling JSON data behind the scenes. The data types like Dynamic and Variant that are discussed are part of ClickHouse’s internal mechanisms to improve performance, specifically for columnar storage of JSON data. The optimizations just help ClickHouse process and store data more efficiently.

The data remains standard JSON and so standard JSON parsers wouldn’t be affected since the optimizations are part of the storage layer and not the JSON structure itself.

> The data remains standard JSON and so standard JSON parsers wouldn’t be affected (...)

No, not really.

The blog post talks about storing JSON data in a column-oriented database.

The blog post talks about importing data from JSON docs into their database. Prior to this, they stored JSON documents in their database like any standard off-the-shelf database does. Now they parse the JSON document when importing, and they store those values in their column-oriented database as key-value pairs, and preserve type information.

The silly part is that this all sounds like a intern project who was tasked with adding support to import data stored in JSON files into a column-oriented database, and an exporter along with it. But no, it seems an ETL job now counts as inventing JSON.

Clickhouse is a DBMS. What I understood: by "a new JSON data type for ClickHouse", they don't mean "a new data type added to the JSON standard for the benefit of ClickHouse", but rather "a new data type recognized by ClickHouse (i.e., that can be represented in its databases) which is used for storing JSON data".
As far as I understand they're talking about the internal storage mechanics of ClickHouse, these aren't user exposed JSON data types, they just power the underlying optimizations they're introducing.
Which is the same as PostgreSQL [1] or SQLite [2] that can store JSON values in binary formats (both called JSONB) but when you "SELECT" it you get standard JSON.

[1] https://www.postgresql.org/docs/current/datatype-json.html

[2] https://www.sqlite.org/json1.html

They both store JSON, each in some particular way, but they don't both store it in the same way. Just like they both store tabular data, but not in the same way, and therefore get different performance characteristics.

Are you arguing that since Clickhouse is a database like Postgres, there's no point for CH to exist as we already have Postgres? Column-oriented databases have their uses.

> Are you arguing that [...] there's no point for CH to exist

Wow, that escalated quickly. You are reading too much into my comment. You should read the comment thread from the beginning to understand which question I'm replying to.

> Can someone briefly explain how or if adding data types to JSON - a standardized grammar - leaves something that still qualifies as JSON?

I had to scroll way down the article, passing over tons of what feel like astroturfing comments advertising a vendor and their product line, to see the very first comment pointing out the elephant in the room.

I agree, whatever it's described in the blog post is clearly not JSON. It's a data interchange format, and it might be mappable to JSON under the right circumstances, but JSON it is not. It's not even a superset or a subset.

I mean, by the same line of reasoning both toml, CSV, and y'all are JSON. Come on. Even BSON is described as a different format that can be transcoded to JSON.

The article reads like a cheap attempt to gather attention to a format that otherwise would not justify it.

I don't think it's a data interchange format at all. It's entirely internal to the ClickHouse database. But it supports JSON semantics in a way that databases generally don't.