If it's tabular, self-describing formats have way too much overhead. I ran a query with a tabular result in the neighborhood of 100 columns by 215k rows, and exported it in multiple formats:
- CSV: 166 MB
- JSON: 795 MB
That said, not all data is tabular.
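To make the overhead concrete, here is a rough sketch that serializes the same synthetic table both ways and compares byte counts. The column names, the row and column counts, and the `serialized_sizes` helper are all made up for illustration; the exact ratio will depend on your real data.

```python
import csv
import io
import json


def serialized_sizes(rows: int, cols: int) -> dict[str, int]:
    """Serialize one synthetic integer table as CSV and as JSON records; return byte sizes."""
    header = [f"column_{c}" for c in range(cols)]  # hypothetical column names
    table = [[r * cols + c for c in range(cols)] for r in range(rows)]

    # CSV: the header is written once, then only values.
    csv_buf = io.StringIO()
    writer = csv.writer(csv_buf)
    writer.writerow(header)
    writer.writerows(table)

    # JSON records: every row repeats every key, which is the self-describing overhead.
    records = [dict(zip(header, row)) for row in table]
    json_text = json.dumps(records)

    return {
        "csv": len(csv_buf.getvalue().encode("utf-8")),
        "json": len(json_text.encode("utf-8")),
    }


sizes = serialized_sizes(rows=1_000, cols=100)
print(sizes, "json/csv ratio:", round(sizes["json"] / sizes["csv"], 1))
```

The longer the column names relative to the values, the worse JSON looks here, since each key is repeated on every row.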
DuckDB already supports Parquet, which supports structs and is a very good format for storing data for reporting workloads. But JSON is a standard interchange format, so a lot of people are going to want to do something with JSON payloads they receive from API calls.
I could definitely imagine a workload where you receive JSON from an API call, load it into DuckDB or similar to help with ETL, then store results in Parquet.
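A minimal sketch of that workflow, assuming the `duckdb` Python package is installed and the payload has already been saved to disk. The `json_to_parquet` helper, the `payload` table name, and the file paths are hypothetical; a real ETL step would do its transformations between the load and the export.

```python
from pathlib import Path


def sql_quote(path: Path) -> str:
    """Render a path as a single-quoted SQL string literal."""
    return "'" + str(path).replace("'", "''") + "'"


def json_to_parquet(json_path: Path, parquet_path: Path) -> int:
    """Load a JSON file into DuckDB, write it out as Parquet, return the row count."""
    import duckdb  # third-party dependency: pip install duckdb

    con = duckdb.connect()  # in-memory database
    try:
        # read_json_auto infers the schema, including nested structs and lists.
        con.execute(
            f"CREATE TABLE payload AS SELECT * FROM read_json_auto({sql_quote(json_path)})"
        )
        # Any cleanup or reshaping queries would go here.
        con.execute(f"COPY payload TO {sql_quote(parquet_path)} (FORMAT PARQUET)")
        row = con.execute(
            f"SELECT count(*) FROM read_parquet({sql_quote(parquet_path)})"
        ).fetchone()
        return row[0]
    finally:
        con.close()
```

Calling `json_to_parquet(Path("api_response.json"), Path("api_response.parquet"))` would then give you a Parquet file for the reporting side, with the JSON only ever used as the landing format.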
For me it depends a lot on the context. JSON is often very human readable (as long as it's not too deeply nested), fairly well defined (compared to CSVs), and most languages and software have easy out of the box support for parsing and manipulating it.
If I were building a system that had to deal with large amounts of tabular data that isn't directly consumed by humans, JSON would be neither my first choice nor my last.
It's interesting that JSON is still the format of choice for transmitting tabular data to SPAs and mobile apps. Granted, it's likely compressed. But it still seems that something more efficient, like CSV, would be better.
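For anyone who wants to check how much compression closes the gap, here is a small sketch that measures raw and gzipped sizes for the same synthetic table in both formats. The data, the column names, and the `raw_and_gzip_sizes` helper are invented for illustration, so treat the output as a way to test the claim on your own payloads rather than as a benchmark.

```python
import csv
import gzip
import io
import json


def raw_and_gzip_sizes(rows: int = 2_000, cols: int = 20) -> dict[str, tuple[int, int]]:
    """Return (raw_bytes, gzip_bytes) for one table serialized as CSV and as JSON."""
    header = [f"column_{c}" for c in range(cols)]  # hypothetical column names
    table = [[(r * 31 + c * 17) % 10_000 for c in range(cols)] for r in range(rows)]

    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(table)
    csv_bytes = buf.getvalue().encode("utf-8")

    # JSON repeats the keys on every row; gzip can exploit that repetition.
    json_bytes = json.dumps([dict(zip(header, row)) for row in table]).encode("utf-8")

    return {
        "csv": (len(csv_bytes), len(gzip.compress(csv_bytes))),
        "json": (len(json_bytes), len(gzip.compress(json_bytes))),
    }


for fmt, (raw, zipped) in raw_and_gzip_sizes().items():
    print(f"{fmt}: raw={raw} bytes, gzip={zipped} bytes")
```

Note that this only measures bytes on the wire. The client still has to parse the decompressed text, so the parsing cost of the larger JSON payload does not go away.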
This is very true. DuckDB does not support JSON because it’s a good tabular format, but because JSON is ubiquitous, and there are many use cases where querying JSON dumps for analytics is useful.
My love for line-based data formats has increased over time: CSV, one JSON string per line, and so on. You can always append to the data, and you can deserialize it line by line.
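Both properties are easy to see in a quick sketch of the one-JSON-object-per-line layout. The helper names `append_record` and `iter_records` and the `events.jsonl` file are hypothetical, and this assumes each record is a plain JSON-serializable dict.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def append_record(path: Path, record: dict) -> None:
    """Append one record as a single line; existing data is never rewritten."""
    with path.open("a", encoding="utf-8") as f:
        # json.dumps escapes newlines inside strings, so each record stays on one line.
        f.write(json.dumps(record) + "\n")


def iter_records(path: Path) -> Iterator[dict]:
    """Yield records one at a time without loading the whole file into memory."""
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate blank lines
                yield json.loads(line)


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "events.jsonl"
    append_record(log, {"id": 1, "note": "first"})
    append_record(log, {"id": 2, "note": "multi\nline text"})
    for record in iter_records(log):
        print(record)
```

Compare that with a single top-level JSON array, where appending means rewriting the closing bracket and a naive reader has to parse the whole document before it can hand back the first row.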