Hacker News new | ask | show | jobs
by dmoura 1521 days ago
Please take this claim and these results with a pinch of salt. spyql was not created with the goal of being the fastest tool for querying data, and it might be the case that the same tools with different datasets or in different use-cases outperform spyql. There might also be other tools that I was not aware when I wrote the benchmark (I just learned about a new one that we will be adding to the benchmark).

For me the lesson was that in certain problems (e.g. I/O intensive) the architecture/design might have a higher impact than the choice of the programming language.

spyql can both leverage the python standard lib for parsing json (written in C) as well as orjson (written in Rust). In this benchmark we used the later, which shows considerable performance improvements. Still, query processing (expression evaluation, filtering, aggregations, etc) are implemented in Python. I guess it's in the nature of Python to leverage internal/external modules written in a statically-typed compiled language to deliver high perfomance on core functionalities.

Here is a simple experiment with a 1GB file that shows that JSON decoding takes less than 40% of the processing time:

    !spyql "SELECT avg_agg(json->overall) FROM orjson" < books.json

    avg_agg_overall
    4.31181166791025 
    time: 11.7 s (started: 2022-04-13 23:37:07 +00:00)


    import orjson as json
    acc = 0 
    cnt = 0 
    with open('books.json') as f: 
      for line in f: 
        acc += json.loads(line)['overall'] 
        cnt += 1 
    print(acc/cnt)

    4.31181166791025
    time: 4.55 s (started: 2022-04-13 23:37:19 +00:00)
4 comments

simdjson can load the json into memory in a queryable form in ~1/3 of a second. So you can save yourself basically 40% of the runtime right there. Computing average should take less than 1/2 a second on modern hardware (assumes <10 million books). So back-of-envelope target speed should be less than 1 second for this benchmark.

11.7s puts you at one order of magnitude off, which could be a fair price to pay if you never need this for large datasets (100s of GB or TB of data you want to query).

And the reason we use wrapped libraries in Python so often is because it’s abysmally slow to do anything in the interpreter. The average loop is 100x slower than it should be. The more math you do the worse it gets too. Most pure Python code is 1000x slower than it should be.

The example of processing and querying a 1GB "JSON Lines" [1] file, where each line is a json document 0.1-10KB in size with a varying schema on every line is a very common use case in data engineering. On top of that, there are additional constraints where we might only be allowed to allocate 1vCPU to the task, there's additional IO overhead of downloading the file from S3 and finally, even though there's TBs of the same data we only ever need to process a few GBs per hour or day. How well can simdjson perform under these circumstances [2]? Probably quite well but not as fast as having to serialize a single 1TB json file.

So my metrics of success in this scenario are based on that fact that I have to deal with 10-100 such queries in a project in my day job, so I would choose SpyQL to write and maintain a simple and readable 5 line query in under 5 minutes with decent performance to solve a trivial use case of computing an average.

P.S. I know the article is about performance and your response about Python being is slow is beyond accurate and yet I will always choose to use it because it is not ashamed to sit on the shoulders of the fast and ugly.

[1] https://jsonlines.org/

[2] https://github.com/simdjson/simdjson/blob/master/doc/iterate...

Thank you!!
psyql is amazing, it's readme helped me to quickly grasping its value proposition and usage. immediately recommended to a colleague who is an SQL and CLI aficionado. the psyql repo and matplotcli should have many more github stars.
Thank you!! :-D
> I guess it's in the nature of Python to leverage internal/external modules written in a statically-typed compiled language to deliver high performance on core functionalities.

I'm aware of python leveraging compiled C or even FORTRAN to run certain functions, does it do message passing with a forked process or something else? I'm having a hard time googling for how it works since I don't know what it's called.

Are you familiar with foreign function interfaces? [0] Threading can be handled Either by the host program or the C / Fortran code.

[0] https://docs.python.org/3/extending/extending.html

Feels a little disingenuous to highlight that it's written in Python when actually Rust is doing the heavy lifting.