by dmoura
1521 days ago
|
|
Please take this claim and these results with a pinch of salt. spyql was not created with the goal of being the fastest tool for querying data, and the same tools might well outperform spyql on different datasets or in different use cases. There might also be other tools that I was not aware of when I wrote the benchmark (I just learned about a new one that we will be adding to the benchmark).

For me the lesson was that in certain problems (e.g. I/O-intensive ones) the architecture/design might have a higher impact than the choice of programming language.

spyql can leverage both the Python standard library for parsing JSON (written in C) and orjson (written in Rust). In this benchmark we used the latter, which shows considerable performance improvements. Still, query processing (expression evaluation, filtering, aggregations, etc.) is implemented in Python. I guess it's in the nature of Python to leverage internal/external modules written in a statically typed compiled language to deliver high performance on core functionalities.

Here is a simple experiment with a 1GB file that shows that JSON decoding takes less than 40% of the processing time:

!spyql "SELECT avg_agg(json->overall) FROM orjson" < books.json
avg_agg_overall
4.31181166791025
time: 11.7 s (started: 2022-04-13 23:37:07 +00:00)
import orjson as json

acc = 0
cnt = 0
# Decode one JSON document per line and accumulate the 'overall' field
with open('books.json') as f:
    for line in f:
        acc += json.loads(line)['overall']
        cnt += 1
print(acc/cnt)
4.31181166791025
time: 4.55 s (started: 2022-04-13 23:37:19 +00:00)
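
To make the "less than 40%" figure explicit, here is a minimal sketch of the arithmetic using only the two timings reported above. The variable names are mine, chosen for illustration. Note that the 4.55 s of the plain loop also covers reading the file and the loop itself, so it is an upper bound on the decoding time rather than an exact measurement.

```python
# Timings reported above, in seconds
spyql_seconds = 11.7        # full spyql query
decode_loop_seconds = 4.55  # plain loop: read lines + orjson decoding + sum

# Upper bound on the share of the spyql run spent decoding JSON
share = decode_loop_seconds / spyql_seconds
print(f"{share:.1%}")  # 38.9%
```

The remaining ~61% would then be spent in spyql's own Python code (expression evaluation, aggregation, output), under the assumption that both runs read and decode the file at the same speed.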
|
|
11.7s puts you about one order of magnitude off, which could be a fair price to pay if you never need this for large datasets (hundreds of GB or TBs of data you want to query).
And the reason we use wrapped libraries in Python so often is that it’s abysmally slow to do anything in the interpreter. The average loop is 100x slower than it should be, and the more math you do the worse it gets. Most pure Python code is 1000x slower than it should be.
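
One rough way to see the interpreter overhead for yourself is to time the same summation written as an explicit Python loop and as a call to the C-implemented built-in `sum`. This is only a sketch: the function name `python_loop_sum` and the input size are made up for illustration, the ratio depends heavily on the machine and the Python version, and it does not by itself confirm or refute the 100x or 1000x figures.

```python
import timeit


def python_loop_sum(values):
    """Sum the values with an explicit loop executed by the interpreter."""
    total = 0
    for v in values:
        total += v
    return total


values = list(range(200_000))

# Time 5 repetitions of each variant on the same input
loop_s = timeit.timeit(lambda: python_loop_sum(values), number=5)
builtin_s = timeit.timeit(lambda: sum(values), number=5)

print(f"explicit loop: {loop_s:.4f} s")
print(f"built-in sum:  {builtin_s:.4f} s")
print(f"ratio:         {loop_s / builtin_s:.1f}x")
```

Both variants compute the same result; only where the loop runs (in interpreted bytecode or inside the C implementation of `sum`) differs.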