Internet Object – A JSON alternative data serialization format

	Internet Object – A JSON alternative data serialization format (internetobject.org)
	84 points by Starz0r 1704 days ago

38 comments

nikeee 1704 days ago

Since strings don't need to be quoted, what happens during deserialization if you want the string "T"? Does this lead to the equivalent of the Norway-Problem of YAML [0]?

Is the space between the key and the type necessary? If not, how to distinguish between objects and types?

Does the validation offer some form of unions or mutual exclusion?

[0]: https://hitchdev.com/strictyaml/why/implicit-typing-removed/

cookiengineer 1704 days ago

YAML and its "Arrays" are really broken. The problem I see with Internet Object is that it's also implying this kind of mechanism.

Every time I read about new formats, they seem to get either the 1-n relations or the n-n relations implemented well, but not both. I guess that's what's so hard about map/reduce...

Regarding YAML: somebody on HN mentioned his project DIXY a couple years ago, and it's much much _much_ easier to parse than YAML. [1] I'm using this over YAML pretty much everywhere now.

[1] https://github.com/kuyawa/Dixy

BiteCode_dev 1703 days ago

Yaml has so many problems. Python 3.10 raised a new one to my attention when the core devs realized their arrays of versions contained 3.1 twice and no 3.10. Indeed, if you write unquoted ASCII, YAML gives you strings. Except if it can cast it to a number, that is.
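
To make the 3.10 surprise concrete, here is a toy sketch of implicit typing. It is not YAML's actual resolution algorithm; `implicit_scalar` is a made-up helper that simply tries numeric casts before falling back to a string, which is enough to reproduce the collision described above.

```python
def implicit_scalar(token: str):
    """Toy resolver (an assumption, not YAML's real rules): int, then float, else str."""
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            pass
    return token


versions = ["3.8", "3.9", "3.10"]  # what the author meant, written unquoted
resolved = [implicit_scalar(v) for v in versions]

# "3.10" is read as the float 3.1, so it can no longer be told apart from "3.1".
collides = implicit_scalar("3.10") == implicit_scalar("3.1")
print(resolved, collides)
```

Quoting the values, or using a format where the schema says "this is a string", avoids the cast entirely.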

TOML is better, but it still has more gotchas than necessary. So much so that I find it easier to just edit a Python file.

I'm thinking of giving a try to cue. Any feedback ?

irq-1 1703 days ago

Dixy looks easy, but "There is only one simple rule. In Dixy, everything is a dictionary [string:string]" isn't accurate or helpful.

It's also [string:dictionary] and [string:?] where ? means nil. White space matters, and tab is fixed at 4 spaces wide. When creating text from a dictionary it adds "# Dixy 1.0\n\n" which means loading and saving will change the file every time! Not sure what other issues there are, but I noticed this line:

    // TODO: if key is numeric, parse as Array

It does look simple though. It'd be nice if someone made strict rules and addressed the corner cases.

cabalamat 1704 days ago

> YAML and its "Arrays" are really broken.

Agreed. YAML does have some use cases. I find it useful when I want to manually write lots of JSON data for test scripts. But the format, because it tries to be concise, ends up being hard to manually parse.

I don't consider YAML a good serialisation format.

29athrowaway 1703 days ago

The annoyance of YAML is the possibility of doing things in different ways.

colejohnson66 1704 days ago

I’ll admit that YAML has its quirks, but a good syntax highlighter can take care of that in my experience. What’s wrong with YAML’s arrays?

cookiengineer 1704 days ago

> What’s wrong with YAML’s arrays?

That there are multiple ways to define Arrays: "- item", "-\n\titem", "\titem" or "item, item" for starters. Parsing YAML into Arrays requires context of its surroundings.

Without the previous context, you cannot know what type of data you're parsing when you are at a "-" at the beginning of a line or a "," in the middle of a line.

This is just unnecessary parser complexity and human ambiguity in my opinion.

As a question to you in case you disagree: What happens when you write down an indented/nested "\t- name: John, Doe"? It's pretty much unpredictable without the previously parsed data structures or their history in YAML.

(I don't wanna start the discussion of "<<" and how it influences the parsing context of YAML data structures. I think the merge key also has no place in a data serialization format.)

lifthrasiir 1704 days ago

It seems to be a typed CSV, so whether `T` is interpreted as a string or a boolean presumably depends on the schema. That sounds slightly better than YAML, though it can easily break when you allow heterogeneous types (say, string or boolean).
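
A small sketch of that point, under loud assumptions: the type names (`string`, `bool`, `string|bool`) and the `decode` helper are hypothetical and not taken from any Internet Object spec. With a single declared type the token `T` is unambiguous; with a union it is not.

```python
def decode(token: str, declared_type: str):
    """Decode one unquoted token according to a hypothetical declared type."""
    if declared_type == "string":
        return token
    if declared_type == "bool":
        if token in ("T", "F"):
            return token == "T"
        raise ValueError(f"not a bool: {token!r}")
    if declared_type == "string|bool":
        # Union type: the token alone cannot say which branch the writer meant.
        if token in ("T", "F"):
            raise ValueError(f"ambiguous token for union type: {token!r}")
        return token
    raise ValueError(f"unknown type: {declared_type!r}")


print(decode("T", "string"))  # the one-letter string
print(decode("T", "bool"))    # the boolean
```

A real format would need either mandatory quoting for strings inside unions or a distinct boolean spelling to resolve the last case.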

petre 1703 days ago

T is quite dumb. The author should have at least used #t and #f from Scheme.

DemocracyFTW 1703 days ago

The so-called "Norway Problem" of YAML is really the No-Way Problem of YAML. /s

Hurtak 1704 days ago

The whole thing seems to be dead. There is one blog post from 2019 (https://internetobject.org/the-story/) and the Twitter account also was active only in 2019 (https://twitter.com/InternetObject).

jdsampayo 1703 days ago

Moderator should add (2019) to the title, as there has not been any update.

flqn 1704 days ago

I'm sceptical about the value proposition of this without seeing much more than a simple example that offers little over existing hypermedia+json/csv practices.

If a compact columnar representation is what you're after to avoid having to repeat every field name in an array of objects (which CSV is good for) but you don't want to give up the ability to include metadata in your JSON, there are a ton of different ways to structure your document to solve this issue without inventing new document formats.
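
One such layout, sketched with plain JSON: field names appear once in a `columns` list and each record becomes a positional row. The `columns`/`rows`/`meta` key names and both helper functions are illustrative assumptions, not an established convention.

```python
import json


def to_columnar(records):
    """Pack a list of same-shaped dicts as {"columns": [...], "rows": [[...], ...]}."""
    columns = list(records[0]) if records else []
    rows = [[rec.get(col) for col in columns] for rec in records]
    return {"columns": columns, "rows": rows}


def from_columnar(doc):
    """Rebuild the list of dicts from the columnar layout."""
    return [dict(zip(doc["columns"], row)) for row in doc["rows"]]


people = [
    {"name": "Ann", "age": 31, "city": "Oslo"},
    {"name": "Bob", "age": 45, "city": "Lima"},
]
packed = to_columnar(people)
packed["meta"] = {"count": len(people)}  # metadata still fits next to the data
text = json.dumps(packed)
restored = from_columnar(json.loads(text))
print(text)
```

Any existing JSON parser reads this, so nothing new has to be standardised or implemented.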

Also this example is unclear (possibly ambiguous?); how is "int" as a type for the "age" column distinguished from "street", "city", etc as what I assume are field names?

samhw 1704 days ago

> If a compact columnar representation is what you're after to avoid having to repeat every field name in an array of objects (which CSV is good for)

Plus, as I wrote elsewhere, gzipping your JSON will result in essentially "avoiding having to repeat every field name" by dictionary coding it. The only case in which that wouldn't be true is when dealing with extremely unusual and heteromorphic data, but then this format doesn't seem to support such data at all.

I'm also mystified that the author claims this is readable. It looks eminently unreadable compared with JSON, if you have anything beyond one row of very simple data with all optional fields present. And, in that case, it's basically just 'JSON with the keys on a different row'.

(Congrats to the author, but this is more of a fun personal project rather than something to seriously present as a 'JSON killer'. If you do present it as a JSON killer, then you have to expect a rigorous review.)

fstrthnscnd 1703 days ago

> Plus, as I wrote elsewhere, gzipping your JSON will result in essentially "avoiding having to repeat every field name" by dictionary coding it.

Gzipping indeed helps in getting mostly back the space taken by the field names, but a parser will still have to parse these strings. On a large document, this might have a performance impact.

One good side of having the field names, however, is that one can reorder them ad lib.

samhw 1703 days ago

That's true, but the main argument made by the website is about the space advantage, so it's very relevant that that space advantage is basically nullified by the widespread use of compression.

If your worry is parsing speed, then JSON not only has battle-tested parsers, but also has SIMD-assisted parsers which can process gigabytes a second on a single core (e.g. https://github.com/simdjson/simdjson). It would take Internet Object years to develop parsers as performant as that, even if it did, by some miracle, achieve wide uptake. So the notional advantage afforded by not having keys on each row is neither here nor there.

And incidentally, as someone who's written a handful of parsers, I suspect that this scheme would not be particularly easy to parse. You need lookahead because of optional fields, as well as maintaining state and a lookup table for mapping positions to keys, etc. I can draw up a quick parser in pseudocode or Python to explain, if you disagree.

fstrthnscnd 1702 days ago

> If your worry is parsing speed

I am not personally worried by perf in either case, but I see your point.

> It would take Internet Object years to develop parsers as performant as that

Well, implementing a JSON parser is arguably difficult, for many reasons; I suppose the main one is the flexibility it provides. I don't know if this would be the case for this format however. TBH, it doesn't seem to add too much to CSV, and perhaps it would be simpler to use CSV with the first line of this format as a hint for the data structure.

Someone 1703 days ago

Looked for a spec, but couldn’t find it, so here’s a _guess_: there’s significant whitespace between the colon and the opening brace:

  age:{int, min:20},
  address: {street, city, state}

Alternatively, there may be a set of forbidden field names, including bool, int and string.

Of these two, I like neither, but would opt for the latter.

I also considered that min:20 implied the previous had to be a type, but I don’t see how that’s consistent with

  active?:bool

and

  tags?:[string]

tomrod 1704 days ago

I agree. CSV + Metadata/field types (which JSON can handle) plus zipping (dictionary coding) takes care of, what, 99.9999% of the issues folks have with one type or the other?

tom_ 1704 days ago

Where is the spec? Why are there spaces after the commas? Why does the example not include a string with commas in it?

Gys 1704 days ago

For me a JSON alternative should at the very least offer some spec for adding comments anywhere.

random478101 1704 days ago

One of the variants that permit comments: https://github.com/tailscale/hujson

danfritz 1704 days ago

Looks like CSV described. Gzip the json and don't care about the biggest selling point

samhw 1704 days ago

Precisely. Many people don't realise how exceptionally well JSON compresses. Provided you're using it the way most do, to send arrays of objects which share the same set of keys (or some subset thereof), then all the keys will end up dictionary-coded away, thus totally eroding the space advantage that this format notionally has.
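
A rough way to check this claim yourself, using only the standard library. The data is synthetic and made up for illustration (1000 records sharing three keys), so the exact numbers will differ for real payloads; the sketch only compares how much the keys cost before and after gzip.

```python
import gzip
import json

# Synthetic data: 1000 objects that all share the same three keys.
records = [
    {"name": f"user{i}", "age": 20 + i % 50, "active": i % 2 == 0}
    for i in range(1000)
]
# The same values without keys, one positional row per record.
rows = [[r["name"], r["age"], r["active"]] for r in records]

raw_objects = json.dumps(records).encode("utf-8")
raw_rows = json.dumps(rows).encode("utf-8")
gz_objects = gzip.compress(raw_objects)
gz_rows = gzip.compress(raw_rows)

raw_ratio = len(raw_objects) / len(raw_rows)  # cost of keys before compression
gz_ratio = len(gz_objects) / len(gz_rows)     # cost of keys after compression

print(f"raw:  {len(raw_objects)} vs {len(raw_rows)} bytes (x{raw_ratio:.2f})")
print(f"gzip: {len(gz_objects)} vs {len(gz_rows)} bytes (x{gz_ratio:.2f})")
```

The gap between the two layouts should shrink noticeably once both are compressed, which is the effect described above.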

Plus JSON's exceptionally wide support means you can benefit from SIMD-assisted decoders which will absolutely blow this out of the water – and much, much more besides. I wish people would devote their time to something more useful than 'yet another competing standard'.

Edit: Sorry, I want to be clear, this is an impressive and cool personal project. I hope it's a step on an exciting journey for the person who wrote it. It just doesn't actually have enough strengths to replace JSON - which would be a tall order for any new format.

dorongrinstein 1704 days ago

Looks neat. I don't see a formal spec. Question: if I have two optional fields of the same type and the first one isn't provided, how does a parser know which field is provided? The optional fields seem unclear to me.
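
The ambiguity can be shown with a toy positional parser. Everything here is hypothetical: the `SCHEMA` layout, the field names, and the left-to-right fill rule are assumptions for illustration, since no formal spec was available to check how Internet Object actually resolves this.

```python
# (field name, is_optional) in positional order; both optional fields are strings.
SCHEMA = [("name", False), ("nickname", True), ("city", True)]


def parse_row(line: str):
    """Naive positional parse: values fill the schema left to right."""
    values = [v.strip() for v in line.split(",")]
    required = sum(1 for _, optional in SCHEMA if not optional)
    if not required <= len(values) <= len(SCHEMA):
        raise ValueError("wrong number of values")
    return {field: value for (field, _), value in zip(SCHEMA, values)}


# The writer may have meant city="Oslo", but the parser cannot know that.
row = parse_row("Ann, Oslo")
print(row)  # {'name': 'Ann', 'nickname': 'Oslo'}

# One way out is an explicit placeholder for the skipped position.
row_with_gap = parse_row("Ann, , Oslo")
print(row_with_gap)  # {'name': 'Ann', 'nickname': '', 'city': 'Oslo'}
```

With keyed formats such as JSON the question does not arise, because each value carries its own field name.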

snidane 1704 days ago

Json is a good format to represent results of aggregation queries (group by in sql) using nesting and storing data in a single file.

Without that you would need to either

  1. store multiple not-nested (tabular, eg. csv) files and join them at the time of use.
  2. denormalize all these csvs into a single big csv duplicating the same values over and over. Compression should handle this at storage time, but you still pay the cost when reading.
  3. store values by columns, not by rows, adding various RLE and dict encodings to compress repeated values in columns, making the files not human friendly
  4. once you store it in columns and make it unreadable, just store it as binary instead of text. You get parquet
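
Step 3 above can be sketched in a few lines. This is a toy illustration of dictionary encoding followed by run-length encoding on a single column; the function names are made up, and real columnar formats such as Parquet use more elaborate, binary versions of the same ideas.

```python
from itertools import groupby


def dictionary_encode(column):
    """Replace each value with an index into a sorted list of distinct values."""
    values = sorted(set(column))
    index = {v: i for i, v in enumerate(values)}
    return values, [index[v] for v in column]


def run_length_encode(codes):
    """Collapse runs of equal codes into [code, run_length] pairs."""
    return [[code, len(list(run))] for code, run in groupby(codes)]


def decode(values, runs):
    """Invert both encodings to recover the original column."""
    return [values[code] for code, count in runs for _ in range(count)]


rows = [("NO", 1), ("NO", 2), ("NO", 3), ("PE", 4), ("PE", 5)]
country = [r[0] for r in rows]  # row-oriented data turned into one column
values, codes = dictionary_encode(country)
runs = run_length_encode(codes)
print(values, runs)  # ['NO', 'PE'] [[0, 3], [1, 2]]
```

The encoded form is compact but no longer something a person would want to read or edit, which is the trade-off the list describes.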

Json and csv are simple and for that reason they won and will stay with us no matter how hard you try to add features to them.

That said I think adding a trailing comma and comments to json wouldn't be a big stretch.

The battle will be for the best columnar binary format. Parquet is the closest to a standard, but it seems to be used only as a standard for storage. Big data systems still uncompress it and work with their own representation. The holy grail is when you get a columnar format which is good enough that big data systems use it as their underlying data representation instead of coming up with their own. I suspect such a format will come from something like open sourced Snowflake, Clickhouse, Chaossearch or something like that, which has battle tested performant algorithms on them, instead of designed by committee, such as parquet.

throwaway81523 1704 days ago

> That said I think adding a trailing comma and comments to json wouldn't be a big stretch.

Sadly, json's designers suffered from the same hubris as the designers of markdown and gemini, when they decided to not include a version number in the file format. So you are kind of hosed if you want to make a change like that.

Before json there was xml (ugh), but before xml there were Lisp S-expressions, which seem to have handled all these issues perfectly well 50 years ago. Yet we keep re-inventing them. Greenspun's tenth law is still with us.

snidane 1703 days ago

It's just a matter of parser implementation. These changes are backwards compatible. If python decided to add support for comments and trailing commas in json.loads, that would become the new standard, at least for data scientists, not for web devs. All the other ones would then follow.

throwaway81523 1703 days ago

Now whatever generates your data has to know what parser is going to read the data. The parser can't tell right away whether the data has those trailing commas. They are optional, so they might not start appearing until after gigabytes of output have gone by. So you can't count on a quick error message in the event of a version mismatch.

samhw 1703 days ago

If you have gigabytes of handwritten JSON (if it's not handwritten, trailing vs non-trailing commas surely don't matter), then I feel like you're doing something wrong.

Though I'm sure someone's going to step in and say "Have you not heard of [stupendously niche use case]? Are you living under a rock!?" etc etc ;)

throwaway81523 1703 days ago

It's silly to not write your software to handle every possible input instead of every input you think is likely based on some predictions about humans. Failure to do that is why YAML is so broken.

JSON isn't a format conducive to handwriting even if it probably should have made more accommodations for that at the start. Right now it can't even handle trailing newlines. But if you want to fix that, call it something different (maybe even JSON2), for heaven's sake.

I doubt anyone would handwrite an entire gigabyte JSON document, but they might hand-edit a machine-generated one to make a change someplace in it, end up putting in a trailing comma, and have the document pass their local tests but crash a remote parser.
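
The hand-edit scenario is easy to reproduce with Python's standard `json` module, which follows the strict grammar and rejects trailing commas. The two documents and the `strict_ok` helper below are made up for illustration; the lenient local parser from the scenario is not shown, only the strict side.

```python
import json

generated = '{"items": [1, 2, 3]}'
hand_edited = '{"items": [1, 2, 3,]}'  # trailing comma slipped in during an edit


def strict_ok(text: str) -> bool:
    """Return True if Python's standard (strict) JSON parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(strict_ok(generated))    # True
print(strict_ok(hand_edited))  # False
```

A document that a lenient parser accepted locally would fail here, and in a large file the error only shows up once the parser reaches the offending comma.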

liuliu 1704 days ago

You mean, Apache Arrow?

snidane 1703 days ago

Partially.

The problem with Apache Arrow and Parquet is that you have two - one for storage and one for computation - but in the end you only want one for both. You want to run fast algorithms on memory mapped compressed columns. Not doing this stupid deserialization from parquet to arrow.

Parquet and arrow are designed by committee and try to accomplish too much for that matter. While that's good for some cases, my prediction is that there will exist a data processing system in the future whose file format will support that and be good enough for most data intensive applications. It will not be feature complete, like json, but will be good enough. Some devs from then on will complain about adding this and that feature to that format, but the majority will be happy as they are now with json. Such a format can only come from industry, not from a committee.

liuliu 1703 days ago

Right. That's why I am more interested in arrow than parquet. Going from a pure compressed storage format to incorporate computation would be more difficult than going from memory-mapped / computation format to long-term storage. Arrow already made some good choices regarding data exchange over wire, these are translatable to data exchange over time.

Of course, I am only dealing with a few hundred GiB of data, not sure whether arrow fails at larger scale.

dang 1703 days ago

A couple small past threads:

JSON Alternative – Internet Object - https://news.ycombinator.com/item?id=21220405 - Oct 2019 (12 comments)

Show HN: Internet Object – a thin, robust and schema oriented JSON alternative - https://news.ycombinator.com/item?id=20982180 - Sept 2019 (8 comments)

beardyw 1704 days ago

> age:{int, min:20}, address: {street, city, state}

Unless the space after the colon is significant it seems we have to just "know" that int introduces a type definition instead of a structure.

Also

> Schema Details JSON doesn't have built-in schema support!

seems a little disingenuous. JSON provides a name for each type of value, so there is mostly no need for the schema when viewing the data. There is a JSON Schema definition.

kabes 1704 days ago

Yeah, this format looks really badly designed

aamironline 1694 days ago

Hey everyone,

I am the creator of the Internet Object. I have been silently working on the specs. But due to my busy schedule, I was not very active during the past couple of months. It is good to see all of you are discussing the pre-released format! However, I see many people have presumed many things in the wrong context. I want to share the draft of in-progress specs. It will probably bring in more clarity. Recently I have resumed working on this project again. If anyone would like to contribute to Internet Object, please join the discord channel (Just created).

Specs Draft - https://docs.internetobject.org/

Discord Channel - https://discord.gg/kZ6CD3hF

Thanks and Regards - Aamir

mccanne 1703 days ago

This is a very real problem being addressed here and I am intrigued by all the great comments in this thread.

In the Zed project, we've been thinking about and iterating on a better data model for serialization for a few years, and have concluded that schemas kind of get in the way (e.g., the way Parquet, Avro, and JSON Schema define a schema then have a set of values that adhere to the schema). In Zed, a modern and fine-grained type system allows for a structure that is a superset of both the JSON and the relational models, where a schema is simply a special case of the type system (i.e., a named record type).

If you're interested, you can check out the Zed formats here... https://github.com/brimdata/zed/tree/main/docs/formats

mccanne 1703 days ago

Also, if any of you find problems with the Zed spec(s), we'd love to hear about them. "Now" would be a good time to make changes / fix flaws.

petre 1703 days ago

I'd like to see more examples and probably data serialized as zed.

mccanne 1703 days ago

There are a few examples in the ZSON spec...

https://github.com/brimdata/zed/blob/main/docs/formats/zson....

And you can easily see whatever data you'd like formatted as ZSON using the "zq" CLI tool, but I just made this gist (with some data from the brimdata/zed-sample-data repo) so you can have a quick look (the bstring stuff is a little noisy and an artifact of the data source being Zeek)... https://gist.github.com/mccanne/94865d557ca3de8abfd3eb09e8ac...

kesor 1704 days ago

ffs please don't add yet another stupid standard. this looks like a complicated version of csv, which is horrible, and this also looks quite horrible.

account-5 1704 days ago

I've been looking at data serialisation formats recently.

- JSON
- TOML
- CSON
- INI
- ENO
- XML

I like CSV for tabular data obviously. This looks, as others have mentioned, like CSV with better metadata.

I like INI for its simplicity. JSON is good for more complicated data, but I have to say I like CSON.

random478101 1704 days ago

As far as I can see "IO" addresses the size issue, which is indeed a compression issue for the most part.

For a broader take on an alternative, there is Concise Encoding [1][2], which I believe addresses a few more issues with existing encodings (clear spec, schema not an afterthought, native support for a variety of data structures, security, ...).

[1] https://concise-encoding.org/

[2] The author gave a presentation on it here: https://www.youtube.com/watch?v=_dIHq4GJE14

wffurr 1704 days ago

People keep saying “just use gzip and JSON is plenty small” but gzip isn’t free. It takes time and power to do all the compression and decompression. The uncompressed size of the data takes up memory on client and server.

A smaller data format requires less compression time and power and you can fit more of it in memory at either end.

petre 1703 days ago

There's Messagepack and CBOR and Flat Buffers. All of them are faster and smaller than any text based format.

musicale 1703 days ago

Looks like CSV with a schema, which is OK but can become unreadable if your field (column) space is large and sparse (imagine 50 different optional fields of the same type.)

I still kind of like classic NeXT (and pre-XML OS X) property lists.

GNUstep seems to address some of their limitations:

http://wiki.gnustep.org/index.php/Property_Lists

https://everything.explained.today/Property_list/

I think Apple probably erred in switching to XML.

deadfish 1704 days ago

I see a benchmark for the data size... But as other comments have suggested gzip should remove the majority of that difference.

I'd be more interested to know about serialisation and deserialisation time.

only_as_i_fall 1704 days ago

If you follow that link which says "Read the Story Here" they have this json example which has a list of employees and then info about the pagination of that list. The caption is this

>If you look closely, this JSON document mixes the data employees with other non-data keys (headers) such as count, currentPage, and pageSize in the same response.

But they don't explain at all how changing the data format fixes the underlying issue of mixed concerns in one data object.

Azsy 1704 days ago

What ever the pros and cons are here... What the ** does this mean?

> Name , Email > Remain updated, we'll email you when it is available.

Why do this? Should I read that the format isn't ready? Is there going to be a mailing list of format enthusiasts? Are you planning on releasing a V2022 next year and every year? More use-case specific derivatives?

All a format needs is 3 short examples, a language definition, and a link to an implementation.

Everything else lowers my expectation and its appeal.

charles_f 1703 days ago

a) why would you want to remove the field names, this is making it so much harder to debug and very brittle, since now you're dependent on the order of fields. No mention of how you handle versioning as well. Back to csv

> However, this time, something felt wrong; I realized that with the JSON, we were exchanging a huge amount of unnecessary information to and from the server

b) Text size really ain't an issue given that we're talking about typically just a few kb on gzipped protocols over hundreds of mbps connections. Compactness sounds like a bad argument to me.

c) "json doesn't have schema built in" is a really dubious argument. If you want schemas you can still get them using json-schema, and if you don't you can still understand the message using the field names, which makes for a degraded schema; which doesn't exist in the case of internet objects. If you don't have the schema, go figure what's in there.

What really gives it to me is the comparison at the bottom between internet objects and json; json looks better to me.

Looks like it's an idea executed over a bad premise

29athrowaway 1704 days ago

It is less human readable than JSON.

Human readability is one of the most important aspects of JSON. Without that requirement you could use a binary serialization.

typingmonkey 1703 days ago

So the plain data is smaller because some information comes from the schema instead of the object. Guess what, you can do the same with json already [1]

[1] https://github.com/pubkey/jsonschema-key-compression
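
The general idea can be sketched in plain Python. This is not the linked library's API or its actual key scheme; `build_key_table`, `compress`, and `decompress` are hypothetical helpers, and the short keys are simply the position of each field in an assumed schema key list.

```python
def build_key_table(schema_keys):
    """Map each schema key to a short token derived from its position."""
    return {key: str(i) for i, key in enumerate(schema_keys)}


def compress(obj, table):
    """Swap long field names for their short tokens."""
    return {table[key]: value for key, value in obj.items()}


def decompress(obj, table):
    """Swap the short tokens back, using the same schema-derived table."""
    reverse = {short: key for key, short in table.items()}
    return {reverse[key]: value for key, value in obj.items()}


table = build_key_table(["firstName", "lastName", "active"])
doc = {"firstName": "Ann", "lastName": "Lee", "active": True}
small = compress(doc, table)
print(small)  # {'0': 'Ann', '1': 'Lee', '2': True}
```

Both sides need the same schema to rebuild the table, but the payload stays ordinary JSON throughout.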

wly_cdgr 1704 days ago

Chuck Severance has a nice interview about JSON with Doug Crockford where Crockford argues that one of the main reasons his baby has been so successful is that it's unversioned. No new versions, no new features, no bloat, no compatibility issues

hiyer 1703 days ago

60% savings won't really count for much when the traffic is compressed, which is the case for most of JSON's uses. For real savings I think you'd have to go with a binary format like protobuf or thrift.

Edit: 50 -> 60

Waterluvian 1704 days ago

We are paying a cost in clarity, human editability, and further splintering of formats.

Everything is a trade off. So what do we get in trade for those rather large costs?

40% bandwidth savings might be worth it. But what are the gzipped comparisons?

tzekid 1703 days ago

Compression and decompression (gzip) takes computing power and RAM. The resulting JSON (in memory) is still harder to parse because of the required field names...

jmrm 1702 days ago

This could be a nice alternative, and even faster to parse than others, if indicating the type is mandatory. Also, this could help a bit to find errors.

kabes 1704 days ago

The example schema has:

  > age:{int, min:20}

Why would a data serialization format bother with data validation like the minimum value here?

williamtwild 1704 days ago

Front-end validation perhaps?

kabes 1703 days ago

Shouldn't be a concern of a serialization library

SV_BubbleTime 1703 days ago

It’s part of the schema part, not the serialization part right? I don’t disagree you parse then validate, but in a schema that defines type of data, it’s not unreasonable to put limits on values.
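
The parse-then-validate split can be sketched as two separate steps. The `SCHEMA` dict is a hypothetical in-memory form of the thread's `age:{int, min:20}` example, and both functions are illustrative; nothing here reflects how an actual Internet Object library is organised.

```python
# Hypothetical in-memory form of the example `age:{int, min:20}`.
SCHEMA = {"age": {"type": int, "min": 20}}


def parse_record(raw: dict) -> dict:
    """Step 1 (serialization concern): turn text values into typed values."""
    return {key: SCHEMA[key]["type"](value) for key, value in raw.items()}


def validate(record: dict) -> list:
    """Step 2 (schema concern): check constraints and return error messages."""
    errors = []
    for key, rule in SCHEMA.items():
        if key in record and "min" in rule and record[key] < rule["min"]:
            errors.append(f"{key}={record[key]} is below min {rule['min']}")
    return errors


record = parse_record({"age": "18"})
print(record)            # {'age': 18}
print(validate(record))  # ['age=18 is below min 20']
```

Keeping the steps apart means a value like 18 still deserializes cleanly; whether it is acceptable is a separate decision made against the schema.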

tzekid 1703 days ago

I'm still very confused about how you'd do nested data, or data with children elements with IO ... this just takes us back to the CSV era

codeulike 1704 days ago

How are commas or speechmarks in strings escaped?
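
For comparison, this is how CSV (as implemented by Python's standard `csv` module with default settings) handles the same question: fields containing a comma or a quote are wrapped in double quotes, and embedded quotes are doubled. It says nothing about what Internet Object does, which the thread could not determine without a spec.

```python
import csv
import io

buffer = io.StringIO()
csv.writer(buffer).writerow(["John, Doe", 'said "hi"', "plain"])
line = buffer.getvalue()
print(repr(line))  # '"John, Doe","said ""hi""",plain\r\n'

# Reading the line back recovers the original three fields.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)
```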

__m 1704 days ago

How do you have an array with different types of objects? You either have to repeat the schema or have to reference the schema.

barelysapient 1704 days ago

They lost me when they declared an address type.

Addresses are so varied in implementation and meaning that it’s frankly ridiculous.

antihero 1704 days ago

Is there much space saving after response compression?

HKH2 1704 days ago

What advantages do commas have over semicolons?

mofosyne 1703 days ago

Hope it supports semantic tagging like in CBOR

SV_BubbleTime 1703 days ago

Are you using your own tags in CBOR? What is the use case?

I figured that because I need to describe the tag, it was just as easy to not use tags and describe the elements that would make one up.

ishche 1704 days ago

Are comments allowed in this format?

SV_BubbleTime 1703 days ago

Good one, I’d also want to see hex format! It’s just a pain to show all integers as decimal in JSON.
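
For context, a quick check with Python's standard `json` module: a hex literal is not valid JSON, so the usual workaround is to carry the value as a string and convert it after parsing. The `mask` field name and the `parses` helper are made up for the example.

```python
import json


def parses(text: str) -> bool:
    """Return True if the standard JSON parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses('{"mask": 0x1F}'))  # False: JSON numbers are decimal only

# Workaround: keep the hex form as a string and convert after parsing.
as_string = json.loads('{"mask": "0x1F"}')
mask = int(as_string["mask"], 16)
print(mask)  # 31
```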

galaxyLogic 1703 days ago

I would vote for that feature.

Also field-names which don't contain whitespace should not need to be quoted.

jFriedensreich 1703 days ago

when a project has more inspirational quotes than tech facts and relation to prior art that's often a red flag. also json is inherently schema less and non binary, this is not a flaw but critical for many usecases. if you want schemas there are many proven alternatives like protobufs, avro, cap'n proto, and message pack.

m0zg 1703 days ago

Protocol Buffers and other similar formats can already be serialized to/from text, and are also schema-first. This is a solution in search of a problem.

cyberpsybin 1704 days ago

No.
