by jammycrisp 1566 days ago

> I should mention that spyql leverages orjson, which has a considerable impact on performance

Even with orjson, you're still paying the cost of creating a new PyObject for every node in the JSON blob. orjson is well engineered (as is the backing serde-json decoder), but any JSON decoder that isn't using naive algorithms is mostly bound by the cost of creating PyObjects. Allocating in Python is _slow_.
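To make the per-node cost concrete, here is a minimal sketch using only the standard-library `json` module. The `count_nodes` helper and the sample document are hypothetical, for illustration only. It counts the values in a fully decoded document, which is a rough proxy for how many Python objects a full decode has to produce (dict keys are not counted, and singletons such as `True` or small ints are reused rather than freshly allocated):

```python
import json


def count_nodes(obj):
    """Count every value in a decoded JSON document, containers included."""
    if isinstance(obj, dict):
        return 1 + sum(count_nodes(v) for v in obj.values())
    if isinstance(obj, list):
        return 1 + sum(count_nodes(v) for v in obj)
    return 1  # scalar: str, int, float, bool, or None


# Hypothetical sample document; a real 13 MiB file has vastly more nodes.
doc = json.loads('{"name": "a", "deps": ["b", "c"], "meta": {"size": 10, "ok": true}}')
total = count_nodes(doc)
print(total)
```

Even if a query only reads `name`, a full decode still materializes every one of those values.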

I wrote a quick benchmark (https://gist.github.com/jcrist/de29815389eaed4eaf5b24fbcfdab...) showing a handwritten query that accesses only a few fields in a 13 MiB JSON file. The same query is repeated with a number of different Python JSON libraries. Results:

    $ python bench_repodata_query.py 
    msgspec: 45.018014032393694 ms
    simdjson: 61.94157397840172 ms
    orjson: 105.34720402210951 ms
    ujson: 121.9699690118432 ms
    json: 113.79130696877837 ms

While `orjson` is faster than `ujson`/`json` here, it's only ~7% faster than the standard-library `json` (in this benchmark). `simdjson` and `msgspec` (my library, see https://jcristharif.com/msgspec/) are much faster because they avoid creating PyObjects for fields that are never used.

If spyql's query engine can statically determine which fields it will access before processing, you might find that using `msgspec` for JSON gives a nice speedup (it'll also type check the JSON if you know the type of each field). If that information isn't known, you may find that `pysimdjson` (https://pysimdjson.tkte.ch/) gives an easy speed boost, as it should be more of a drop-in for `orjson`.