Hacker News
by hamandcheese 754 days ago
> you query rows from local db.

But the data is still remote (in object storage), right? If I understand correctly, this works better than the first solution because parquet is a much more efficient format to query?

3 comments

Long story short, you could either 1) query specific columns using the s3-parquet-duckdb stack, or 2) load the parquet file over the network and put it inside a local duckdb-wasm instance so that you can run queries on the client side.
My comment was a bit ambiguous. For sheets where we have to load all the data, we would load all columns at once as a parquet file. (I will explain the advantages of this approach in the next comment.)

On the other hand, let’s say we have to draw a chart from a single column. The chart type could be changed by the user: it could be a pie chart, a mean, a time series chart, a median, a table, or even a dot product. To achieve this, we would fetch just that column from s3 using duckdb and apply SQL queries on the client side, rendering the appropriate UI.
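
To make the "one column, many chart types" idea concrete, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for a local duckdb-wasm instance, so it is not the actual stack described above. The table name `col`, the `CHART_QUERIES` mapping, and both helper functions are hypothetical names made up for illustration.

```python
import sqlite3

# Hypothetical mapping from a chart type to the SQL that feeds it.
# Switching chart type only changes the query, not the loaded data.
CHART_QUERIES = {
    "pie": "SELECT value, COUNT(*) FROM col GROUP BY value ORDER BY value",
    "mean": "SELECT AVG(value) FROM col",
    "table": "SELECT value FROM col ORDER BY rowid",
}


def load_column(values):
    """Load a single column into a local in-memory database, once."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE col (value)")
    conn.executemany("INSERT INTO col VALUES (?)", [(v,) for v in values])
    return conn


def chart_data(conn, chart_type):
    """Run the local query for the chosen chart type and return its rows."""
    return conn.execute(CHART_QUERIES[chart_type]).fetchall()


if __name__ == "__main__":
    conn = load_column([1, 2, 2, 3])
    for chart_type in CHART_QUERIES:
        print(chart_type, chart_data(conn, chart_type))
```

The point of the design is that the column crosses the network once, and every later change of chart type is a local query.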

It's probably part of it, but also overhead from small requests and latency from round trips.
Great point.

The advantages of loading parquet on the client side are that 1) you only have to load the data from the server once and 2) parquet files compress surprisingly well.

1) If you load once from the server, there are no more small network requests while you are scrolling a table. Moreover, you could use the same duckdb table to visualize the data or show the raw data.

2) Sending the whole dataset as a parquet file is faster over the network than receiving it as json in a response.
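
A rough way to see where the size difference can come from is the toy comparison below. It is not Parquet itself. It only borrows the general idea of storing each column contiguously, dictionary-encoding a repetitive string column, and compressing the result (here with zlib). The row schema and all function names are assumptions made up for this sketch.

```python
import json
import random
import struct
import zlib


def make_rows(n, seed=0):
    """Generate hypothetical rows: an id, a low-cardinality string, an int."""
    rng = random.Random(seed)
    categories = ["a", "b", "c", "d"]
    return [
        {"id": i, "category": rng.choice(categories), "value": rng.randint(0, 1000)}
        for i in range(n)
    ]


def encode_json(rows):
    """Row-oriented encoding: every row repeats every field name."""
    return json.dumps(rows).encode("utf-8")


def encode_columnar(rows):
    """Toy column-oriented encoding: one packed array per column, then zlib."""
    n = len(rows)
    ids = struct.pack(f"<{n}i", *(r["id"] for r in rows))
    values = struct.pack(f"<{n}i", *(r["value"] for r in rows))
    # Dictionary-encode the string column: one byte per row (max 256 values).
    dictionary = sorted({r["category"] for r in rows})
    index = {c: i for i, c in enumerate(dictionary)}
    cats = bytes(index[r["category"]] for r in rows)
    header = json.dumps({"n": n, "dictionary": dictionary}).encode("utf-8")
    raw = struct.pack("<I", len(header)) + header + ids + values + cats
    return zlib.compress(raw)


def decode_columnar(blob):
    """Inverse of encode_columnar, to show that no information is lost."""
    raw = zlib.decompress(blob)
    (header_len,) = struct.unpack_from("<I", raw, 0)
    header = json.loads(raw[4:4 + header_len])
    n, dictionary = header["n"], header["dictionary"]
    offset = 4 + header_len
    ids = struct.unpack_from(f"<{n}i", raw, offset)
    offset += 4 * n
    values = struct.unpack_from(f"<{n}i", raw, offset)
    offset += 4 * n
    cats = raw[offset:offset + n]
    return [
        {"id": ids[i], "category": dictionary[cats[i]], "value": values[i]}
        for i in range(n)
    ]


if __name__ == "__main__":
    rows = make_rows(10_000)
    print("json bytes:    ", len(encode_json(rows)))
    print("columnar bytes:", len(encode_columnar(rows)))
```

Note that this compares against uncompressed json. If the server already gzips its json responses the gap will be smaller, so how much is gained depends on the data and the server setup.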

I wonder how much of the benefit is from just not using json vs reducing round trips. I guess if you had a large table you could stream (smaller than normal) row groups from parquet? Not sure how much work that would be though.

I'm not sure what the optimal size is for an http response, but there are probably diminishing efficiency returns above a MB or two, and more of a latency hit from reading the whole file. So if you used row groups of a couple of MB and streamed them in, you'd probably get the best of both worlds.
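
The tradeoff can be put into back-of-the-envelope numbers with a very simple model. It assumes requests are issued one after another with no pipelining, each paying one round trip, and that bandwidth is constant. The function name and the example figures (100 MB file, 50 ms round trip, 10 MB/s) are assumptions for illustration, not measurements.

```python
import math


def transfer_plan(total_bytes, chunk_bytes, round_trip_s, bandwidth_bytes_per_s):
    """Estimate request count, time to first chunk, and total transfer time.

    Simplified model: sequential requests, one round trip per request,
    constant bandwidth, no protocol overhead.
    """
    n_requests = math.ceil(total_bytes / chunk_bytes)
    first_chunk = min(chunk_bytes, total_bytes)
    time_to_first = round_trip_s + first_chunk / bandwidth_bytes_per_s
    total_time = n_requests * round_trip_s + total_bytes / bandwidth_bytes_per_s
    return {
        "requests": n_requests,
        "time_to_first_s": time_to_first,
        "total_time_s": total_time,
    }


if __name__ == "__main__":
    total = 100_000_000  # hypothetical 100 MB file
    for label, chunk in [
        ("whole file", total),
        ("2 MB chunks", 2_000_000),
        ("16 KB chunks", 16_000),
    ]:
        print(label, transfer_plan(total, chunk, 0.05, 10_000_000))
```

Under these assumptions, chunks of a couple of MB show the first data far sooner than the whole-file download while adding only a few seconds in total, whereas very small chunks are dominated by round trips.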