| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by brundolf 1847 days ago
	Slightly OT: I've realized that CSVs are dramatically more information-dense than the equivalent JSON, and actually make a pretty reasonable API response format if your dataset is large and fits into the tabular shape. They can be a fraction of the size, mainly because keys aren't duplicated for every item.

6 comments

OJFord 1847 days ago

Yeah, lists of objects are pretty crap, because they're almost always homogeneous but unenforcédly so; it's not just a size issue but a parsing (or not - usage) issue too.

You could approximate CSV in a JSON response like:

    {
      "columns": ["a", ..., "z"],
      "rows": [[1, ..., 26], ..., [11, 2266]]
    }

Or:

    {
      "a": [1, ..., 11],
      ...
      "z": [26, ..., 2266]
    }

which I've never seen, but would save space, and sort of enforced in the sense that if you trust your serialiser for it as much as you trust an equivalent CSV serialiser, it's fine. (But the same argument could be made for more usual JSON object lists. Only arguable difference is that there's more of an assertion to the client that they should be expected to be homogeneous.)

inopinatus 1847 days ago

A columnar structure is definitely popular in the analytics community, although there are binary formats that are faster to scan again and can be mmap'd for zero-copy access.

The fundamental problem with CSV is that it has no canonical form or formal construction. Even the RFC documents it by example and historical reference, rather than from first principles, and does so with liberal use of "maybe". Consequently being very easy to fling about as a human but much harder to reason about in the abstract, and this is most evident when you get bogged down in the gritty details of writing a CSV importer for your application.

scrollaway 1847 days ago

+1 - If you are looking for something JSON-compatible, JSON Lines (one json object per line - https://jsonlines.org/) is pretty popular as well.

You could store a CSV in JSONL very easily. In fact, jsonlines' website shows it as its first example: https://jsonlines.org/examples/

And if you wanted something that is json file-wide, you can just add some commas and wrap in [] for a list of rows.

hnlmorg 1847 days ago

jsonlines (and ndjson too, since they're overlapping specs) is amazing. It's definitely the better way to convert between JSON and CSV.

  » ps aux | grep root | head -n5 | format jsonl
  ["root","87596","0.0","0.0","4359648","116","??","Ss","10:01am","0:00.02","com.apple.cmio.registerassistantservice"]
  ["root","81777","0.0","0.0","4321932","88","??","Ss","Wed12am","0:00.01","PlugInLibraryService"]
  ["root","71784","0.0","0.1","4365572","10504","??","Ss","Tue11pm","0:25.88","PerfPowerServices"]
  ["root","42906","0.0","0.0","4321572","88","??","Ss","Tue09am","0:00.01","com.apple.ColorSyncXPCAgent"]
  ["root","47415","0.0","0.0","4303172","88","??","Ss","Sat04am","0:00.01","aslmanager"]

It has the readability of CSV but the stricter formatting of JSON. Win win.

aae42 1847 days ago

i've known about json lines for a bit (had to make a utility that parsed some output of a program that used jsonlines and was irritated it didn't use the json spec)...

but for some reason just this post made me realize something... i've always been irritated by CSV log files (or space delimited), but would prefer something more structured like JSON, but JSON log files are pretty obnoxious for the reasons mentioned in the comments, a lot of data duplication...

JSON lines for log files seems like a great fit, not sure why i didn't realize this until now, i suppose it was the context of the discussion!

DangitBobby 1847 days ago

Oh no, this is so similar to http://ndjson.org/

boygobbo 1847 days ago

ndjson appears to be a fork of jsonlines with the addition of a spec (see https://github.com/ndjson/ndjson.github.io/issues/1)

cowsandmilk 1847 days ago

I’ll note, your second item here is a structure of arrays, which is often higher performance in practice when you are only interested in certain portions of the data.

For this reason, your second structure is how I serialize code I’m interacting with in C.

mananaysiempre 1847 days ago

... Also known as a very simple case of a “column-oriented database”, of which are several at various scales from Metakit[1] to Clickhouse[2]. It’s a neat way to have columns which are sparsely populated, required to accommodate large blobs, numerous but usually not accessed all at once, or frequently added and deleted.

Nothing’s perfect, of course: you can’t stream records in such a format, so no convenient Unix-style tooling.

[1]: http://www.equi4.com/metakit.html [2]: https://yandex.com/dev/clickhouse/

dmw_ng 1847 days ago

Higher performance, much smaller compressed and uncompressed, accepted by a wide range of tools, e.g. Pandas, and often more convenient to parse from statically typed languages.

1vuio0pswjnm7 1847 days ago

https://shakti.com

jarofgreen 1847 days ago

> Yeah, lists of objects are pretty crap, because they're almost always homogeneous but unenforcédly so

We use JSON Schema heavily to enforce structure in JSON data. https://json-schema.org/

amichal 1847 days ago

i've done the first when serializing an arbitrary table of trace data in a jsonb column. Did it to make it compact and then realized that if all i wanted to do was show it in a UI it was much easier to parse as a html table, and only marginally harder to present an array of objects.

sonthonax 1847 days ago

Shouldn’t really matter too much if the response is being compressed.

If you’re rendering the table in the DOM, the response size is the least of your issues.

ludocode 1847 days ago

This is the real answer. All of the other answers are suggesting various changes to the JSON structure to eliminate key repetition, but this is irrelevant under compression.

Where it becomes relevant is if each record is stored as a separate document so you can't just compress them all together. Compressing each record separately won't eliminate the duplication, so you're better off with either a columnar format (like a typical database) or a schema-based format (like protobuf.)

rovr138 1847 days ago

To parse it, you need to check the keys. If there can’t be other keys, you can just use an array which is stable on JSON and you can save on keys.

So you just have an array of arrays. Or even a huge array and every X elements, it’s a new record.

If each one has 2 keys,

    [
        {
            key1: ‘a’,
            key2: ‘b’
        },
        {
            key1: ‘a’,
            key2: ‘b’
        }
    ]

Can become,

Or just every 2 will be a new record,

    [
        ‘a’,
        ‘b’,
        ‘a’,
        ‘b’
    ]

ludocode 1847 days ago

But why? Why save on keys when compression will nearly eliminate them for you?

rovr138 1847 days ago

Compression mainly helps with transmission.

Trying to point out that the original structure allows for more flexibility.

If you only cared about space, this compresses better anyway and uncompressed, it still occupies less space.

brundolf 1847 days ago

Sometimes you fetch a large dataset and only show one page at a time in the DOM, or render it as a line in a chart or something. At a previous workplace we had CSV responses in the hundreds of megabytes.

nly 1847 days ago

70 GB CSV files aren't uncommon at my work. It's not really a problem since CSV streams well.

cerved 1847 days ago

That sounds incredible inefficient

What was the rational for such enormous single payloads?

hnlmorg 1847 days ago

Without knowing more about the application, I'd guess probably caching and/or scaling. If you only need 1 payload then that can be statically generated and cached in your CDN. Which in turn reduces your dependence on the web servers so few nodes are required and/or you can scale your site more easily to demand. Also compute time is more expensive than CDN costs so there might well be some cost savings there too.

brundolf 1846 days ago

This was basically it. The dataset was the same across users so caching was simple and efficient, and the front-end had no difficulty handling this much data (and paging client-side was snappier than requesting anew each time)

killingtime74 1847 days ago

It’s because it’s self describing right? If you look at protobuf, thrift, avro those are even denser

quantumofalpha 1847 days ago

If you'd use a column-oriented format like {"col1":["a","b","c",...],"col2":[1,2,3...],...}, it's about the same density, no?

rahimnathwani 1846 days ago

This is the default format used by pandas.DataFrame.to_dict()

I usually need a less dense version, e.g. to send to a jinja2 template, so mostly use to_dict(orient='index').

anonytrary 1847 days ago

It would be much nicer for the consumer to just de-dupe the keys in your json than to serve an annoying format like CSV. Your JSON could basically be a matrix with a header row, there's nothing forcing you to duplicate keys.

  { header: [...columnNames], rows: [...values2DArray]}

earthboundkid 1847 days ago

Make rows 1 dimensional. You don’t need the second dimension, it’s implied by header length. Once you do this, the JSON gzips down to about the same size as CSV, according to the last time I tested this IIRC.

anonytrary 1847 days ago

I edited-in the "2DArray" because I thought it was confusing... But you're right, just calculate offsets. The dominating term is still quadratic, and the term you mentioned is linear. It could be worth it for a scaled org like Google!

I wonder which parses faster. I guess CSV does but then the consuming code would still have to parse the strings into JS primitives...

earthboundkid 1846 days ago

I haven't tested this, but my guess is that because the browser built in JSON.parse will be faster than whatever CSV parser you can write in JS just because it's precompiled to native code. Then the question becomes how long does it take to do the unpacking loop, but it should be pretty quick.

I'd love it if someone did a benchmark though.

dragonwriter 1847 days ago

Heck, you could do a single 1-D list (no object), and just give the header count as the first element, which would be even more compact.

earthboundkid 1846 days ago

Smart idea.

jpitz 1847 days ago

Hell is other people's CSVs.