by Olreich
1523 days ago
simdjson can load the JSON into memory in a queryable form in ~1/3 of a second, so you can save yourself basically 40% of the runtime right there. Computing the average should take less than 1/2 a second on modern hardware (assuming <10 million books). So the back-of-envelope target for this benchmark should be under 1 second. At 11.7s you are about one order of magnitude off, which could be a fair price to pay if you never need this for large datasets (100s of GB or TBs of data you want to query). And the reason we use wrapped libraries in Python so often is that it’s abysmally slow to do anything in the interpreter. The average loop is 100x slower than it should be, and the more math you do, the worse it gets. Most pure Python code is 1000x slower than it should be.
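For concreteness, here is roughly what the slow, pure-Python version of "compute an average over a JSON Lines file" looks like. This is only a minimal sketch: it uses the standard-library `json` module rather than simdjson, and the function name `average_field` and the `price` field are hypothetical, not taken from the benchmark.

```python
import io
import json


def average_field(lines, field):
    """Stream JSON Lines records and return the mean of one numeric field.

    Records that lack the field (or hold a non-numeric value) are skipped.
    Returns None when no usable value was seen.
    """
    total = 0.0
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        value = record.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            total += value
            count += 1
    return total / count if count else None


# Hypothetical sample data standing in for the real books file.
sample = io.StringIO(
    '{"title": "A", "price": 10.0}\n'
    '{"title": "B", "price": 20.0}\n'
    '{"title": "C"}\n'
)
print(average_field(sample, "price"))
```

The per-record `json.loads` call and the interpreted loop are where the time goes; a wrapped native parser takes over the parsing part of that work.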
So my metrics of success in this scenario are based on the fact that I have to deal with 10-100 such queries per project in my day job. I would choose SpyQL because I can write and maintain a simple, readable 5-line query in under 5 minutes, with decent performance, to solve a trivial use case like computing an average.
P.S. I know the article is about performance, and your point about Python being slow is beyond accurate, and yet I will always choose to use it because it is not ashamed to sit on the shoulders of the fast and ugly.
[1] https://jsonlines.org/
[2] https://github.com/simdjson/simdjson/blob/master/doc/iterate...