Hacker News new | ask | show | jobs
by chubot 2677 days ago
I agree that's a good strategy for big JSON. Do you know of any such "lazy" parsers?

I think the problem is that to extract arbitrary keys, you really need to parse the whole thing, although you don't need to materialize nodes for the whole thing.

But if you have big JSON with a given schema, you may be able to skip things lexically. You basically need to count {} and [], while taking into account " and \ within quoted strings.

That doesn't seem too hard. I think a tiny bit of http://re2c.org/ could do a good job of it.

5 comments

For node.js, I wrote a lib that can selectively parse JSON subtrees:

https://gitlab.com/philbooth/bfj

The specific function of interest here is `bfj.match`, which takes a readable stream and a selector as arguments:

https://gitlab.com/philbooth/bfj#how-do-i-selectively-parse-...

It still walks the full tree like a regular parser, but just avoids creating any data items unless the selector matches. Though there is an outstanding issue to support JSONPath in the selector, currently it only matches individual keys and values.

It’s not exactly the lazy parser you describe, but Sparser[1] builds filters to exclude json lines/files that can’t contain what you’re looking for, and only parses those that might.

The Morning Paper’s writeup[2] from last year provides a good summary

[1]: http://www.vldb.org/pvldb/vol11/p1576-palkar.pdf [2]: https://blog.acolyer.org/2018/08/20/filter-before-you-parse-...

This work is somewhat orthogonal to ours as it assumes that you can locate JSON records without doing parsing; if I remember correctly, it groups JSON records as lines. If your JSON has been formatted to conform to this, I suppose it would be quite effective.
That's what our first stage does, pretty much. I would imagine we do it way faster than re2c would do it.

Parsing the entire document lock stock and barrel is an easier thing to write about and benchmark. The problem is with skipping around and pulling out bits of JSON from a benchmarking framework is that attempting to present such data often amounts to "hey, we asked ourselves a question and then we got a really good answer for it!". It's hard to picture what a 'typical' query for some field over a JSON document would look like. Conversely, it's pretty easy to know when you finished parsing the Whole Thing.

> It's hard to picture what a 'typical' query for some field over a JSON document would look like.

Exactly. A "query" would have to define not only the path, type of the field in the source data but also the type/interface of where you want to put that data. Combining dynamic queries and typed data you get a fairly tricky problem, which is why I said this is tricky. I worked on a similar thing for protobuf and jitting was a solution I looked into (in that project libllvm was too unwieldy to use).

I'm not sure what you mean by arbitrary? What parsing in this case means e.g. turning a string of digits into a ieee754 float number in memory. I think this project is meant to accelerate this part with SIMD, but a greater improvement can be obtained by simply not doing this for as much data as possible. If the actual materialized data constitutes a small part of the original, there should be ways to do minimum work for the rest.
In jvm-land circe-fs2[1] is a streaming Parser.

[1]: https://github.com/circe/circe-fs2/blob/master/README.md