Hacker News new | ask | show | jobs
by xfs 2677 days ago
If you're working with json objects with sizes on the higher end quite often you're not going to need the entirety of them, just a small part of them. If that is the workload what then to do is simply parse as little data as possible: skip the validation, locate the relevant bits, and then start parsing, validation and all the stuff. In this optimizing the json scanner/lexer gives much greater improvement than optimizing the parser.

Though this job is trickier than it may look. The logic to extract the "relevant" bits is often dynamic or tied to user input but for the scanner/lexer to be ultrafast it has to be tightly compiled. You can try jitting but libllvm is probably too heavyweight for parsing json.

3 comments

Jitting is a common tool that people seem to reach for whenever they are parsing or lexing anything at any time. It's really not necessary; there are plenty of fast search methods out there.

JIT approaches make a lot of sense for lex/yacc and their numerous descendants, as these typically need to put a lot of extra logic into the process of parsing. You don't need to JIT just to look up some strings and/or parse a fairly simple hierarchical structure.

Parsing itself doesn't need jitting but as soon as you start to use the parsed data to interface with some typed containers the data plumbing consumes much more time than parsing does and drags down all the optimization. For parsing to interact well with static languages jitting is a possible solution to look at.
I agree that's a good strategy for big JSON. Do you know of any such "lazy" parsers?

I think the problem is that to extract arbitrary keys, you really need to parse the whole thing, although you don't need to materialize nodes for the whole thing.

But if you have big JSON with a given schema, you may be able to skip things lexically. You basically need to count {} and [], while taking into account " and \ within quoted strings.

That doesn't seem too hard. I think a tiny bit of http://re2c.org/ could do a good job of it.

For node.js, I wrote a lib that can selectively parse JSON subtrees:

https://gitlab.com/philbooth/bfj

The specific function of interest here is `bfj.match`, which takes a readable stream and a selector as arguments:

https://gitlab.com/philbooth/bfj#how-do-i-selectively-parse-...

It still walks the full tree like a regular parser, but just avoids creating any data items unless the selector matches. Though there is an outstanding issue to support JSONPath in the selector, currently it only matches individual keys and values.

It’s not exactly the lazy parser you describe, but Sparser[1] builds filters to exclude json lines/files that can’t contain what you’re looking for, and only parses those that might.

The Morning Paper’s writeup[2] from last year provides a good summary

[1]: http://www.vldb.org/pvldb/vol11/p1576-palkar.pdf [2]: https://blog.acolyer.org/2018/08/20/filter-before-you-parse-...

This work is somewhat orthogonal to ours as it assumes that you can locate JSON records without doing parsing; if I remember correctly, it groups JSON records as lines. If your JSON has been formatted to conform to this, I suppose it would be quite effective.
That's what our first stage does, pretty much. I would imagine we do it way faster than re2c would do it.

Parsing the entire document lock stock and barrel is an easier thing to write about and benchmark. The problem is with skipping around and pulling out bits of JSON from a benchmarking framework is that attempting to present such data often amounts to "hey, we asked ourselves a question and then we got a really good answer for it!". It's hard to picture what a 'typical' query for some field over a JSON document would look like. Conversely, it's pretty easy to know when you finished parsing the Whole Thing.

> It's hard to picture what a 'typical' query for some field over a JSON document would look like.

Exactly. A "query" would have to define not only the path, type of the field in the source data but also the type/interface of where you want to put that data. Combining dynamic queries and typed data you get a fairly tricky problem, which is why I said this is tricky. I worked on a similar thing for protobuf and jitting was a solution I looked into (in that project libllvm was too unwieldy to use).

I'm not sure what you mean by arbitrary? What parsing in this case means e.g. turning a string of digits into a ieee754 float number in memory. I think this project is meant to accelerate this part with SIMD, but a greater improvement can be obtained by simply not doing this for as much data as possible. If the actual materialized data constitutes a small part of the original, there should be ways to do minimum work for the rest.
In jvm-land circe-fs2[1] is a streaming Parser.

[1]: https://github.com/circe/circe-fs2/blob/master/README.md

Depending on usecase, the JSON lines format can make this into a pretty simple task! Obviously has to fit in with one's data structure though.