HN Mirror



	by kccqzy 2674 days ago
	I guess the question is, what do you parse it to? I'm guessing definitely not turning objects into std::unordered_map and arrays into std::vector or some such. So how easy is it to use the "parsed" data structure? How easy is it to add an element to some deeply nested array, for example?

4 comments

Falell 2674 days ago

The ParsedJson type is immutable and is accessed through mutating iterators (up and down the tree, forward and backward through members and indices).

My immediate thought is to compare it to rapidjson, which I've used before. The paradigm of mutating iterators seems awkward at first but should be just as powerful as rapidjson's Value. For example, both approaches end up doing a linear scan to find an object member by name.

The fact that rapidjson supports mutation of Values and simdjson does not has huge implications (as mentioned in the scope section of the simdjson README). I suspect this tradeoff explains most of the performance difference, as I know rapidjson also uses SIMD internally.


hnaccy 2674 days ago

Is there a reason these fast JSON libraries seem to favor a linear scan for their object representation?


yoklov 2674 days ago

Faster to build than a hash map, less code (which is also better for icache), etc.

JSON objects tend to have few enough members that it doesn't matter a ton anyway.
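To make the trade-off concrete, here is a minimal sketch of an object stored as a flat list of members and searched linearly. It is not the layout used by simdjson or rapidjson; `FlatObject`, `add`, `find`, and `at` are hypothetical names, and values are plain ints to keep it short.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical flat object: members are kept in parse order, with no hashing.
struct FlatObject {
    std::vector<std::pair<std::string, int>> members;

    // Building is a single append per member, with no buckets or rehashing.
    void add(std::string key, int value) {
        members.emplace_back(std::move(key), value);
    }

    // Lookup is a linear scan: O(n) comparisons, but n is usually small.
    // Returns nullptr when the key is absent; the first match wins.
    const int* find(const std::string& key) const {
        for (const auto& m : members) {
            if (m.first == key) return &m.second;
        }
        return nullptr;
    }

    // Convenience accessor for keys the caller knows are present.
    int at(const std::string& key) const {
        const int* v = find(key);
        assert(v != nullptr && "key must exist");
        return *v;
    }
};
```

For an object with a handful of members, the scan touches one contiguous array, which is the "faster to build, less code" argument above.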


saagarjha 2674 days ago

The data is put into a "ParsedJson" object: https://github.com/lemire/simdjson/blob/master/include/simdj...


scottlamb 2674 days ago

That header mentions a tape.md describing the format. It's really interesting:

https://github.com/lemire/simdjson/blob/master/tape.md


_wmd 2674 days ago

I can't speak for this project, but my own for CSV files (https://github.com/dw/csvmonkey) provides a high-level interface that allows the tokenized data to be manipulated in place without full decoding. The interface exported in Python is that of a plain old dictionary with one added magical semantic (lazy decode on element access). The internal representation of the parse result is a simple fixed array of (ptr, size) pairs.

Methods like this are used for batch search / summation where only a fraction of the parsed data is actually relevant during any particular run. You'll find similar approaches used in e.g. the row format parser of a database like MongoDB or Postgres.
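A rough sketch of the (ptr, size) idea, for illustration only. This is not csvmonkey's actual code or API: `Cell`, `tokenize`, `as_string`, and `as_double` are hypothetical names, and the tokenizer assumes no quoting or escaping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical cell: a (ptr, size) view into the original line buffer.
// The buffer must outlive every Cell that points into it.
struct Cell {
    const char* ptr;
    std::size_t size;

    // Decoding is deferred until a caller actually asks for the value.
    std::string as_string() const {
        assert(ptr != nullptr);
        return std::string(ptr, size);
    }

    double as_double() const {
        return std::strtod(as_string().c_str(), nullptr);
    }
};

// Split one line on commas without copying any field data.
// Assumption: no quoted fields or escapes, which a real CSV parser must handle.
std::vector<Cell> tokenize(const std::string& line) {
    std::vector<Cell> cells;
    std::size_t start = 0;
    for (std::size_t i = 0; i <= line.size(); ++i) {
        if (i == line.size() || line[i] == ',') {
            cells.push_back(Cell{line.data() + start, i - start});
            start = i + 1;
        }
    }
    return cells;
}
```

A scan that sums one column would call `as_double()` on that column only; the other cells stay as undecoded pointers into the buffer.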


AtlasBarfed 2673 days ago

into a token stream?


wtetzner 2673 days ago

Isn't that just lexing?
