Hacker News new | ask | show | jobs
by brundolf 1847 days ago
Slightly OT: I've realized that CSVs are dramatically more information-dense than the equivalent JSON, and actually make a pretty reasonable API response format if your dataset is large and fits into the tabular shape. They can be a fraction of the size, mainly because keys aren't duplicated for every item.
6 comments

Yeah, lists of objects are pretty crap, because they're almost always homogeneous but unenforcédly so; it's not just a size issue but a parsing (or not - usage) issue too.

You could approximate CSV in a JSON response like:

    {
      "columns": ["a", ..., "z"],
      "rows": [[1, ..., 26], ..., [11, 2266]]
    }
Or:

    {
      "a": [1, ..., 11],
      ...
      "z": [26, ..., 2266]
    }
which I've never seen, but would save space, and sort of enforced in the sense that if you trust your serialiser for it as much as you trust an equivalent CSV serialiser, it's fine. (But the same argument could be made for more usual JSON object lists. Only arguable difference is that there's more of an assertion to the client that they should be expected to be homogeneous.)
A columnar structure is definitely popular in the analytics community, although there are binary formats that are faster to scan again and can be mmap'd for zero-copy access.

The fundamental problem with CSV is that it has no canonical form or formal construction. Even the RFC documents it by example and historical reference, rather than from first principles, and does so with liberal use of "maybe". Consequently being very easy to fling about as a human but much harder to reason about in the abstract, and this is most evident when you get bogged down in the gritty details of writing a CSV importer for your application.

+1 - If you are looking for something JSON-compatible, JSON Lines (one json object per line - https://jsonlines.org/) is pretty popular as well.

You could store a CSV in JSONL very easily. In fact, jsonlines' website shows it as its first example: https://jsonlines.org/examples/

And if you wanted something that is json file-wide, you can just add some commas and wrap in [] for a list of rows.

jsonlines (and ndjson too, since they're overlapping specs) is amazing. It's definitely the better way to convert between JSON and CSV.

  » ps aux | grep root | head -n5 | format jsonl
  ["root","87596","0.0","0.0","4359648","116","??","Ss","10:01am","0:00.02","com.apple.cmio.registerassistantservice"]
  ["root","81777","0.0","0.0","4321932","88","??","Ss","Wed12am","0:00.01","PlugInLibraryService"]
  ["root","71784","0.0","0.1","4365572","10504","??","Ss","Tue11pm","0:25.88","PerfPowerServices"]
  ["root","42906","0.0","0.0","4321572","88","??","Ss","Tue09am","0:00.01","com.apple.ColorSyncXPCAgent"]
  ["root","47415","0.0","0.0","4303172","88","??","Ss","Sat04am","0:00.01","aslmanager"]

It has the readability of CSV but the stricter formatting of JSON. Win win.
i've known about json lines for a bit (had to make a utility that parsed some output of a program that used jsonlines and was irritated it didn't use the json spec)...

but for some reason just this post made me realize something... i've always been irritated by CSV log files (or space delimited), but would prefer something more structured like JSON, but JSON log files are pretty obnoxious for the reasons mentioned in the comments, a lot of data duplication...

JSON lines for log files seems like a great fit, not sure why i didn't realize this until now, i suppose it was the context of the discussion!

Oh no, this is so similar to http://ndjson.org/
ndjson appears to be a fork of jsonlines with the addition of a spec (see https://github.com/ndjson/ndjson.github.io/issues/1)
I’ll note, your second item here is a structure of arrays, which is often higher performance in practice when you are only interested in certain portions of the data.

For this reason, your second structure is how I serialize code I’m interacting with in C.

... Also known as a very simple case of a “column-oriented database”, of which are several at various scales from Metakit[1] to Clickhouse[2]. It’s a neat way to have columns which are sparsely populated, required to accommodate large blobs, numerous but usually not accessed all at once, or frequently added and deleted.

Nothing’s perfect, of course: you can’t stream records in such a format, so no convenient Unix-style tooling.

[1]: http://www.equi4.com/metakit.html [2]: https://yandex.com/dev/clickhouse/

Higher performance, much smaller compressed and uncompressed, accepted by a wide range of tools, e.g. Pandas, and often more convenient to parse from statically typed languages.
> Yeah, lists of objects are pretty crap, because they're almost always homogeneous but unenforcédly so

We use JSON Schema heavily to enforce structure in JSON data. https://json-schema.org/

i've done the first when serializing an arbitrary table of trace data in a jsonb column. Did it to make it compact and then realized that if all i wanted to do was show it in a UI it was much easier to parse as a html table, and only marginally harder to present an array of objects.
Shouldn’t really matter too much if the response is being compressed.

If you’re rendering the table in the DOM, the response size is the least of your issues.

This is the real answer. All of the other answers are suggesting various changes to the JSON structure to eliminate key repetition, but this is irrelevant under compression.

Where it becomes relevant is if each record is stored as a separate document so you can't just compress them all together. Compressing each record separately won't eliminate the duplication, so you're better off with either a columnar format (like a typical database) or a schema-based format (like protobuf.)

To parse it, you need to check the keys. If there can’t be other keys, you can just use an array which is stable on JSON and you can save on keys.

So you just have an array of arrays. Or even a huge array and every X elements, it’s a new record.

If each one has 2 keys,

    [
        {
            key1: ‘a’,
            key2: ‘b’
        },
        {
            key1: ‘a’,
            key2: ‘b’
        }
    ]
Can become,

    [
        [
            ‘a’,
            ‘b’
        ],
        [
            ‘a’,
            ‘b’
        ]
     ]
Or just every 2 will be a new record,

    [
        ‘a’,
        ‘b’,
        ‘a’,
        ‘b’
    ]
But why? Why save on keys when compression will nearly eliminate them for you?
Compression mainly helps with transmission.

Trying to point out that the original structure allows for more flexibility.

If you only cared about space, this compresses better anyway and uncompressed, it still occupies less space.

Sometimes you fetch a large dataset and only show one page at a time in the DOM, or render it as a line in a chart or something. At a previous workplace we had CSV responses in the hundreds of megabytes.
70 GB CSV files aren't uncommon at my work. It's not really a problem since CSV streams well.
That sounds incredible inefficient

What was the rational for such enormous single payloads?

Without knowing more about the application, I'd guess probably caching and/or scaling. If you only need 1 payload then that can be statically generated and cached in your CDN. Which in turn reduces your dependence on the web servers so few nodes are required and/or you can scale your site more easily to demand. Also compute time is more expensive than CDN costs so there might well be some cost savings there too.
This was basically it. The dataset was the same across users so caching was simple and efficient, and the front-end had no difficulty handling this much data (and paging client-side was snappier than requesting anew each time)
It’s because it’s self describing right? If you look at protobuf, thrift, avro those are even denser
If you'd use a column-oriented format like {"col1":["a","b","c",...],"col2":[1,2,3...],...}, it's about the same density, no?
This is the default format used by pandas.DataFrame.to_dict()

I usually need a less dense version, e.g. to send to a jinja2 template, so mostly use to_dict(orient='index').

It would be much nicer for the consumer to just de-dupe the keys in your json than to serve an annoying format like CSV. Your JSON could basically be a matrix with a header row, there's nothing forcing you to duplicate keys.

  { header: [...columnNames], rows: [...values2DArray]}
Make rows 1 dimensional. You don’t need the second dimension, it’s implied by header length. Once you do this, the JSON gzips down to about the same size as CSV, according to the last time I tested this IIRC.
I edited-in the "2DArray" because I thought it was confusing... But you're right, just calculate offsets. The dominating term is still quadratic, and the term you mentioned is linear. It could be worth it for a scaled org like Google!

I wonder which parses faster. I guess CSV does but then the consuming code would still have to parse the strings into JS primitives...

I haven't tested this, but my guess is that because the browser built in JSON.parse will be faster than whatever CSV parser you can write in JS just because it's precompiled to native code. Then the question becomes how long does it take to do the unpacking loop, but it should be pretty quick.

I'd love it if someone did a benchmark though.

Heck, you could do a single 1-D list (no object), and just give the header count as the first element, which would be even more compact.
Smart idea.
Hell is other people's CSVs.