by kccqzy 2674 days ago
I guess the question is, what do you parse it to? I'm guessing definitely not turning objects into std::unordered_map and arrays into std::vector or some such. So how easy is it to use the "parsed" data structure? How easy is it to add an element to some deeply nested array, for example?

The ParsedJson type is immutable and is accessed through mutating iterators (up and down the tree, forward and backward through members and indices).
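To make "immutable data, mutating iterator" concrete, here is a rough sketch of the idea. It is not simdjson's API: `Node`, `Cursor`, and the index-based links are all hypothetical names invented for illustration. The point is only that the cursor's position changes while the parsed data never does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical immutable node layout: links are indices into one vector,
// and -1 means "no such neighbour".
struct Node {
    std::string label;
    int parent;
    int first_child;
    int next;  // next sibling
    int prev;  // previous sibling
};

// The cursor is the only thing that mutates; the nodes are read-only.
class Cursor {
public:
    explicit Cursor(const std::vector<Node>& nodes) : nodes_(nodes), pos_(0) {
        assert(!nodes.empty());  // position 0 is taken to be the root
    }

    const std::string& label() const { return nodes_[pos_].label; }

    // Each move returns false and leaves the cursor in place if it cannot go there.
    bool down() { return move_to(nodes_[pos_].first_child); }
    bool up()   { return move_to(nodes_[pos_].parent); }
    bool next() { return move_to(nodes_[pos_].next); }
    bool prev() { return move_to(nodes_[pos_].prev); }

private:
    bool move_to(int target) {
        if (target < 0) return false;
        pos_ = static_cast<std::size_t>(target);
        return true;
    }

    const std::vector<Node>& nodes_;  // must outlive the cursor
    std::size_t pos_;
};
```

Adding an element to a nested array has no equivalent here, which is the tradeoff: with a layout like this you would have to build a new document rather than edit in place.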

My immediate thought is to compare it to rapidjson, which I've used before. The paradigm of mutating iterators seems awkward at first but should be just as powerful as rapidjson's Value. For example, both approaches end up doing a linear scan to find an object member by name.

The fact that rapidjson supports mutation of Values and simdjson does not has huge implications (as mentioned in the scope section of the simdjson README). I suspect this tradeoff explains most of the performance difference, since as far as I know rapidjson also uses SIMD internally.

Is there a reason these fast JSON libraries seem to favor a linear scan for object representation?
Faster to build than a hash map, less code (which is also better for icache), etc.

JSON Objects tend to have few enough values that it doesn't matter a ton anyway.
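A minimal sketch of what a linear-scan object can look like, assuming members are simply kept in parse order in one contiguous vector. `FlatObject` and its methods are made-up names, not taken from rapidjson or simdjson, and values are plain `int` to keep it short.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical flat object: (key, value) pairs stored in parse order.
struct FlatObject {
    std::vector<std::pair<std::string, int>> members;

    // Building is just an append: no hashing, no buckets, no rehashing.
    void add(const std::string& key, int value) {
        members.emplace_back(key, value);
    }

    // Lookup is a linear scan; returns the first match or nullptr.
    const int* find(const std::string& key) const {
        for (const auto& m : members) {
            if (m.first == key) return &m.second;
        }
        return nullptr;
    }

    // Convenience accessor for keys the caller knows are present.
    const int& at(const std::string& key) const {
        const int* p = find(key);
        assert(p != nullptr && "member not found");
        return *p;
    }
};
```

For an object with a handful of members, the scan touches one small contiguous block of memory, which is why it tends to hold up fine against a hash map at that size.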

The data is put into a "ParsedJson" object: https://github.com/lemire/simdjson/blob/master/include/simdj...
That header mentions a tape.md describing the format. It's really interesting:

https://github.com/lemire/simdjson/blob/master/tape.md

I can't speak for this project, but my own for CSV files ( https://github.com/dw/csvmonkey ) provides a high level interface that allows the tokenized data to be manipulated in-place without full decoding. The interface exported in Python is that of a plain old dictionary with one added magical semantic (lazy decode on element access). The internal representation of the parse result is a simple fixed array of (ptr, size) pairs.
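Roughly, the (ptr, size) idea looks like the sketch below. This is not csvmonkey's actual code: `Cell`, `tokenize_line`, and the two accessors are hypothetical, a `std::vector` stands in for the fixed array, and quoting and escaping are ignored entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// One tokenized field: a view into the original buffer, nothing is copied.
struct Cell {
    const char* ptr;
    std::size_t size;
};

// Split one line at commas. Simplification: no quoted fields, no escapes.
// The returned cells point into `line`, so `line` must outlive them.
std::vector<Cell> tokenize_line(const std::string& line) {
    std::vector<Cell> cells;
    std::size_t start = 0;
    for (std::size_t i = 0; i <= line.size(); ++i) {
        if (i == line.size() || line[i] == ',') {
            cells.push_back(Cell{line.data() + start, i - start});
            start = i + 1;
        }
    }
    return cells;
}

// Lazy decode: the cost of number parsing is paid only for cells that are read.
double cell_as_double(const std::vector<Cell>& cells, std::size_t index) {
    assert(index < cells.size());
    const Cell& c = cells[index];
    return std::strtod(std::string(c.ptr, c.size).c_str(), nullptr);
}

// Same idea for text: the copy happens on access, not during tokenizing.
std::string cell_as_string(const std::vector<Cell>& cells, std::size_t index) {
    assert(index < cells.size());
    return std::string(cells[index].ptr, cells[index].size);
}
```

A scan that sums one column out of fifty would call `cell_as_double` on that one index per row and never decode the other forty-nine.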

Methods like this are used for batch search / summation where only a fraction of the parsed data is actually relevant during any particular run. You'll find similar approaches used in e.g. the row format parser of a database like MongoDB or Postgres.

into a token stream?
Isn't that just lexing?