Processing and querying a 1GB "JSON Lines" [1] file, where each line is a JSON document 0.1-10KB in size and the schema varies from line to line, is a very common use case in data engineering. On top of that, there are often additional constraints: we might only be allowed to allocate 1 vCPU to the task, there's extra IO overhead from downloading the file from S3, and even though there are TBs of the same data, we only ever need to process a few GBs per hour or day. How well can simdjson perform under these circumstances [2]? Probably quite well, but not as fast as when it gets to parse a single 1TB JSON file.

So my metric of success in this scenario is based on the fact that I have to deal with 10-100 such queries in a project in my day job. I would choose SpyQL because I can write a simple, readable 5-line query in under 5 minutes, maintain it easily, and still get decent performance for a trivial use case like computing an average.
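
To make the use case concrete, here is a rough sketch of the "average over JSON Lines with a varying schema" job in plain Python. This is not SpyQL syntax and not simdjson; it only uses the standard library, and the field name `price`, the helper `average_field`, and the sample records are all made up for illustration.

```python
import io
import json


def average_field(lines, field):
    """Stream JSON Lines and return the mean of `field`.

    Returns None if no line carries a numeric value for `field`.
    `field` is a hypothetical key chosen for this example.
    """
    total = 0.0
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        value = record.get(field) if isinstance(record, dict) else None
        # The schema varies per line, so skip records without a numeric value.
        # bool is excluded explicitly because it is a subclass of int.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        total += value
        count += 1
    return total / count if count else None


# Made-up sample data standing in for a file downloaded from S3.
sample = io.StringIO(
    '{"id": 1, "price": 10.0}\n'
    '{"id": 2, "name": "no price here"}\n'
    '{"id": 3, "price": 20}\n'
)
result = average_field(sample, "price")
print(result)
```

Because it reads one line at a time, memory use stays bounded by the largest single record rather than the file size, which is what matters under the 1 vCPU constraint above. Passing an open file object instead of the `StringIO` works the same way.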

P.S. I know the article is about performance, and your response about Python being slow is beyond accurate, and yet I will always choose to use it because it is not ashamed to sit on the shoulders of the fast and ugly.

[1] https://jsonlines.org/

[2] https://github.com/simdjson/simdjson/blob/master/doc/iterate...