|
SAX is a push parser; presumably this is on top of a pull API like StAX. The Jakarta JSON streaming API sort of gets at this (sort of): https://jakarta.ee/specifications/platform/9/apidocs/jakarta... The basic interface to a JSON document is something like an iterator, which lets you advance through the document, token by token, and read out values when you encounter them. So if you have an array of objects with x and y fields, you read a start of array, start of object, key "x", first x value, key "y", first y value, end of object, start of object, key "x", second x value, key "y", second y value, end of object, etc. You are reading tokens, not anything tree/DOM-like. But there are also methods getObject() and getArray(), which pull a whole structure out of the document from wherever the iterator has got to. So you could read start of array, read object, read object, etc. That lets you process a document incrementally, without having to materialise the whole thing as a tree, but still having a nice tree-like interface at the leaves. In principle, you could implement getObject() and getArray() in a way which does not eagerly materialise their contents: each node could know a range in a backing buffer, and parse its contents on demand. But i don't think implementations actually do this. Wrapping a tree-like interface round incremental parsing, in a way that doesn't require eager parsing or retaining arbitrary amounts of data, and doesn't leak implementation details, sounds astoundingly hard, perhaps even impossible. But then i am not Daniel Lemire. And i have not read the paper. |
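To make the "read start of array, read object, read object" pattern concrete, here is a rough Python analogue. It is not the Jakarta API: it leans on the standard library's json.JSONDecoder.raw_decode, which decodes one complete value starting at a given index, and the names iter_array_items and _skip_ws are made up for this sketch. It also assumes the whole document text is already in memory as a string; only the tree is built one element at a time.

```python
import json


def _skip_ws(text, i):
    """Return the index of the first non-whitespace character at or after i."""
    while i < len(text) and text[i] in " \t\r\n":
        i += 1
    return i


def iter_array_items(text):
    """Yield the elements of a top-level JSON array one at a time.

    Each element is materialised as an ordinary Python value (the
    "tree-like interface at the leaves"), but the array as a whole
    is never built.
    """
    decoder = json.JSONDecoder()
    i = _skip_ws(text, 0)
    if text[i:i + 1] != "[":
        raise ValueError("expected start of array")
    i = _skip_ws(text, i + 1)
    if text[i:i + 1] == "]":
        return  # empty array
    while True:
        # Pull one whole value out from wherever we have got to.
        value, i = decoder.raw_decode(text, i)
        yield value
        i = _skip_ws(text, i)
        if text[i:i + 1] == ",":
            i = _skip_ws(text, i + 1)
        elif text[i:i + 1] == "]":
            return
        else:
            raise ValueError("expected ',' or ']' at index %d" % i)


if __name__ == "__main__":
    doc = '[{"x": 1, "y": 2}, {"x": 3, "y": 4}]'
    for point in iter_array_items(doc):
        print(point["x"], point["y"])
```

Each yielded value is eagerly parsed, which matches what i'd guess the real getObject() implementations do; the lazy "node knows a range in a backing buffer" variant is not attempted here.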
I don't think they promise this, and I suspect it fails to parse some pathological but valid JSON files, e.g. one that starts with 50 GB of [ characters.
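For a sense of why that input is awkward: anything that tracks nesting has to remember how deep it is, so a fixed-size depth stack or a recursive descent gives out at some depth. A minimal Python sketch (the function name max_nesting_depth is made up here, and this says nothing about what any particular library actually does) that streams over chunks of text and only keeps a counter:

```python
def max_nesting_depth(chunks):
    """Return the deepest bracket nesting seen across an iterable of text chunks.

    Keeps only a few scalars of state, so memory use does not grow with
    nesting depth. It does not validate the JSON; it only counts brackets
    that appear outside string literals.
    """
    depth = 0
    deepest = 0
    in_string = False
    escaped = False
    for chunk in chunks:
        for ch in chunk:
            if in_string:
                if escaped:
                    escaped = False       # character after a backslash is literal
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch in "[{":
                depth += 1
                deepest = max(deepest, depth)
            elif ch in "]}":
                depth -= 1
    return deepest


if __name__ == "__main__":
    # A scaled-down stand-in for the pathological file described above.
    print(max_nesting_depth(["[" * 100000, "]" * 100000]))
```

A plain counter is enough only because nothing here checks that each ] or } matches its opener; a real parser mixing arrays and objects needs at least one bit per open level, which is where depth limits come from.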