by mannyv 958 days ago
If you're building a library you either need to explicitly call out your limits or do streaming.

I've pumped gigs of JSON data, so a streaming parser is appreciated. Plus, streaming shows the author is better at engineering and is aware of the various use cases.

Memory is not cheap or free except in theory.


Here people confidently keep repeating "streaming JSON". What do you mean by that? I'm genuinely curious.

Do you mean an XML SAX-like interface? If so, how do you deal with repeated keys in "hash tables"? Do you first translate JSON into intermediate objects (i.e. arrays, hash tables) and then transform them into application-specific structures, or do you try to skip the intermediate step?

I mean, streaming tokens is kind of worthless on its own. If you are going for a SAX-like interface, you want to be able to go all the way with streaming, i.e. no layer of the code that reads JSON should accumulate data (especially not indefinitely) before it can be passed to the layer above.

> If so, how do you deal with repeated keys in "hash tables"?

Depending on the parser, behaviour might differ. But looking at https://stackoverflow.com/questions/21832701/does-json-synta... , it seems like the "best" option is to have 'last key wins' as the resolution.
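For a concrete data point, here is a minimal sketch of what 'last key wins' looks like in one widely used parser, Python's standard `json` module. The key name and values are made up for illustration; other parsers may behave differently.

```python
import json

# Hypothetical document with the same key appearing twice in one object.
doc = '{"key": "first", "key": "second"}'

# Python's json module builds a dict, so the later pair overwrites the earlier one.
parsed = json.loads(doc)
print(parsed)
```

Note that this is a whole-document parse, not a streaming one; it only illustrates the resolution rule.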

This works fine under a SAX-like interface in a streaming JSON parser: your 'event handler' code will execute for a given key, and a second time for the duplicate.
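A rough sketch of that event order, assuming a hypothetical per-key handler called `on_pair`. It borrows `object_pairs_hook` from Python's `json` module to replay the pairs in document order; `json.loads` still buffers the whole input, so this only imitates what a real SAX-style streaming parser would deliver.

```python
import json

calls = []  # every (key, value) pair the hypothetical handler sees, in order

def on_pair(key, value):
    """Hypothetical per-key event handler; a real SAX-style API will differ."""
    calls.append((key, value))

def replay_pairs(pairs):
    # object_pairs_hook receives every pair in document order, duplicates included.
    for key, value in pairs:
        on_pair(key, value)
    # Building a dict afterwards gives the usual 'last key wins' result.
    return dict(pairs)

result = json.loads('{"mode": "a", "mode": "b"}', object_pairs_hook=replay_pairs)
```

The handler runs once per occurrence, so anything it does for the first value has already happened by the time the duplicate arrives.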

> This works fine

This is a very strange way of using the word "fine"... What if the value that lives in the key triggers some functionality in the application that should never happen due to the semantics you just botched by executing it?

Example:

    {
      "commands": {
        "bumblebee": "rm -rf /usr",
        "bumblebee": "echo 'I have done nothing wrong!'"
      }
    }

With the obvious way to interpret this...

So, you are saying that it's "fine" for an application to execute the first followed by second, even though the semantics of the above are that only the second one is the one that should have an effect?

Sorry, I have to disagree with your "works fine" assessment.

You're layering the application semantics into the transport format.

It's fine, in the sense that a JSON document with duplicate keys is already invalid, but the parser might handle it, and I suggested a way (just from reading the Stack Overflow answer).

It's the same "fine" that you get from undefined C compiler behaviour.

Why do you keep inventing stuff... No, JSON with duplicate keys is not invalid. The whole point of streaming is to be able to process data before it has completely arrived. What "layering semantics" are you talking about?

This has no similarity with undefined behavior. This is documented and defined.

A JSON object with duplicate keys is explicitly defined by the spec as undefined behavior, and is left up to the individual implementation to decide what to do. It's neither valid nor invalid.

Last key wins is terrible advice and has serious security implications.

See https://bishopfox.com/blog/json-interoperability-vulnerabili... or https://www.cvedetails.com/cve/CVE-2017-12635/ for concrete examples where this treatment causes security issues.

RFC 7493 (https://datatracker.ietf.org/doc/html/rfc7493) defines a stricter format, I-JSON, where duplicate keys are not allowed.
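If you want that stricter behaviour from a lenient parser, one possible approach is to reject duplicates yourself. A minimal sketch with Python's `json` module follows; `reject_duplicates` and `strict_loads` are hypothetical helper names, and this covers only the duplicate-key rule, not the rest of what RFC 7493 requires.

```python
import json

def reject_duplicates(pairs):
    """Build a dict from (key, value) pairs, failing on any repeated key."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen

def strict_loads(text):
    # The hook is applied to every object in the document, nested ones included.
    return json.loads(text, object_pairs_hook=reject_duplicates)
```

With this, `strict_loads('{"a": 1, "a": 2}')` raises `ValueError` instead of silently keeping one of the values.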

Last key wins is the most common behavior among widely-used implementations. It should be assumed as the default.

I guess it's all relative. Memory is significantly cheaper if you get it anywhere but on loan from a cloud provider.

RAM is always expensive no matter where you get it from.

Would you rather do two hours of work or force thousands of people to buy more RAM because your library is a memory hog?

And on embedded systems RAM is at a premium. More RAM = more cost.