by youngbum 791 days ago

This is exactly why we adopted duckdb and duckdb-wasm in our service.

Our team is currently building a form builder SaaS. Most forms have fewer than 1,000 responses, but some have more than 50,000.

When a user explores all responses in our “response sheet” feature, the rows would normally be loaded via infinite scrolling (fetched as they scroll).

That uses up to 100MB of network transfer in total if the client has to fetch object arrays for 50,000 rows with 50 columns.
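As a rough check on that figure: 50,000 rows × 50 columns is 2.5 million cells, so 100MB works out to about 40 bytes per cell. Part of that cost is that an array of JSON objects repeats every column name in every row. The sketch below is only an illustration with made-up data (the `question_N` keys, the `answer r-c` values, and all function names are assumptions, not the service's real schema); it compares the row-object layout with one that sends the column names once.

```python
import json


def json_payload_bytes(payload):
    """Size in bytes of the compact JSON encoding of `payload`."""
    return len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))


def make_rows(n_rows, n_cols):
    """Build hypothetical form responses as a list of row objects."""
    return [
        {f"question_{c}": f"answer {r}-{c}" for c in range(n_cols)}
        for r in range(n_rows)
    ]


def to_header_plus_rows(rows):
    """Send column names once, then each row as a plain list of values."""
    names = list(rows[0].keys())
    return {"columns": names, "rows": [[row[n] for n in names] for row in rows]}


# A 1,000-row sample is enough to see the ratio; scale linearly for 50,000.
sample = make_rows(1_000, 50)
as_objects = json_payload_bytes(sample)
as_header_plus_rows = json_payload_bytes(to_header_plus_rows(sample))

print(f"row objects:      {as_objects:,} bytes")
print(f"header plus rows: {as_header_plus_rows:,} bytes")
```

This says nothing about Parquet itself, which also adds columnar encoding and compression; it only shows how much of a JSON payload can be repeated key names.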

That is where duckdb came in: just store the responses in S3 (in our case Cloudflare R2) as a parquet file.

Then load the whole file into duckdb-wasm on the client. When you scroll through the sheet, instead of fetching rows from the server, you query rows from the local db.
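A minimal sketch of the "query rows from a local db while scrolling" idea follows. It deliberately uses Python's built-in `sqlite3` as a stand-in for the duckdb-wasm instance, purely so the example runs anywhere; the `responses` table, its columns, and the `fetch_window` helper are hypothetical. The `LIMIT`/`OFFSET` query shape is the part that carries over.

```python
import sqlite3


def open_local_db(rows):
    """Load all rows once into an in-memory database (stand-in for duckdb-wasm)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE responses (id INTEGER PRIMARY KEY, answer TEXT)")
    conn.executemany("INSERT INTO responses (id, answer) VALUES (?, ?)", rows)
    return conn


def fetch_window(conn, offset, limit):
    """Return the slice of rows the sheet needs for the current scroll position."""
    cur = conn.execute(
        "SELECT id, answer FROM responses ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset),
    )
    return cur.fetchall()


# One bulk load up front, then every scroll step is a local query.
conn = open_local_db([(i, f"answer {i}") for i in range(50_000)])
first_page = fetch_window(conn, offset=0, limit=100)
last_page = fetch_window(conn, offset=49_950, limit=100)
print(len(first_page), len(last_page))
```

After the initial load, scrolling triggers no network requests; each window is answered from local memory.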

This made our sheet feature very efficient and consistent in terms of its speed and memory usage.

If network speed and memory are your bottleneck when loading “medium” data into your client, you should definitely give it a try.

PS. If you have any questions, feel free to ask!

PS. Our service is called Walla, check it out at https://home.walla.my/en

4 comments

tobilg 791 days ago

I'm currently rewriting https://github.com/ownstats/ownstats to this model, with the slight difference that I stream Arrow data from an AWS Lambda Function URL into DuckDB WASM in the frontend... Works great.

An improvement could be having pre-calculated DuckDB database files that are directly attached from the DuckDB WASM frontend, see https://duckdb.org/docs/guides/network_cloud_storage/duckdb_...

hamandcheese 791 days ago

> you query rows from local db.

But the data is still remote (in object storage), right? If I understand correctly, this works better than the first solution because parquet is a much more efficient format to query?

youngbum 791 days ago

Long story short, you could either 1) query specific columns using the s3-parquet-duckdb stack, or 2) load the parquet file over the network and put it inside a local duckdb-wasm instance so that you can run queries on the client side.

youngbum 791 days ago

My comment was a bit ambiguous. For sheets where we have to load all the data, we load all columns at once as a parquet file. (I will explain the advantage of this approach in the next comment.)

On the other hand, let’s say we have to draw a chart from a single column. The chart type can be changed by the user: it could be a pie chart, a mean, a time series chart, a median, a table or even a dot product. To achieve this, we fetch just that column from s3 using duckdb and run sql queries on the client side, rendering the appropriate ui.
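To illustrate the single-column case, here is a small sketch of the kind of SQL that could feed different chart types from one locally held column. Again `sqlite3` is used only as a runnable stand-in for duckdb, and the `rating_column` table, the sample values, and the helper names are all hypothetical.

```python
import sqlite3


def load_column(values):
    """Load one column of responses into a local in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rating_column (rating INTEGER)")
    conn.executemany(
        "INSERT INTO rating_column (rating) VALUES (?)", [(v,) for v in values]
    )
    return conn


def pie_counts(conn):
    """Counts per distinct value, the input a pie chart would need."""
    return conn.execute(
        "SELECT rating, COUNT(*) FROM rating_column GROUP BY rating ORDER BY rating"
    ).fetchall()


def mean_rating(conn):
    """Average of the column, for a 'mean' widget."""
    return conn.execute("SELECT AVG(rating) FROM rating_column").fetchone()[0]


column = load_column([5, 4, 4, 3, 5, 5])
print(pie_counts(column))
print(mean_rating(column))
```

Switching chart types then only changes the query, not what has to be downloaded.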

foota 791 days ago

It's probably part of it, but also overhead from small requests and latency from round trips.

youngbum 791 days ago

Great point.

The advantages of loading “parquet” on the “client side” are that 1) you only have to load data from the server once and 2) parquet files compress surprisingly well.

1) If you load once from the server, there are no more small network requests while you are scrolling a table. Moreover, you can use the same duckdb table to visualize data or show raw data.

2) Sending the whole dataset as a parquet file is faster over the network than receiving it as json in a response.

foota 791 days ago

I wonder how much of the benefit is from just not using json vs reducing round trips. I guess if you had a large table you could stream (smaller than normal) row groups from parquet? Not sure how much work that would be though.

I'm not sure what the optimal size is for an http response, but there are probably diminishing efficiency returns above a MB or two, and more of a latency hit from reading the whole file. So if you used row groups of a couple of MB and then streamed them in you'd probably get the best of both worlds.
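A sketch of the chunked-fetch idea, under a simplifying assumption: it splits a file into fixed-size byte ranges and formats HTTP `Range` header values for them. Real parquet row groups have their own offsets recorded in the file footer, so an actual reader would take its ranges from that metadata instead; the 2 MB chunk size and the function names here are illustrative only.

```python
def chunk_ranges(total_bytes, chunk_bytes):
    """Split [0, total_bytes) into inclusive (start, end) byte ranges."""
    if total_bytes < 0 or chunk_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and chunk_bytes must be > 0")
    ranges = []
    start = 0
    while start < total_bytes:
        end = min(start + chunk_bytes, total_bytes) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def range_header(start, end):
    """Format an HTTP Range header value for one inclusive byte range."""
    return f"bytes={start}-{end}"


# A hypothetical 5 MB file fetched in chunks of about 2 MB each.
for start, end in chunk_ranges(5_000_000, 2_000_000):
    print(range_header(start, end))
```

Each range could be requested and handed to the client as it arrives, so the first rows are usable before the whole file has downloaded.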

LunaSea 791 days ago

You wouldn't need DuckDB for this; you can simply store the parquet files in S3 and read them using a parquet NPM package.

youngbum 791 days ago

Exactly.

We have also tried arrow js and parquet wasm, and they were much lighter than the duckdb wasm worker.

DuckDB was still useful in our case, though: as a form builder service, we had to provide statistics features. It was cool to have an OLAP engine inside a web worker that could handle (as far as we checked) more than 100,000 rows with ease.

LunaSea 791 days ago

I'm still unconvinced.

A regular JavaScript array can also handle 100k object rows very fast.

just_testing 789 days ago

this is legit awesome!

laurels-marts 791 days ago

So you have duckdb running on the server (e.g. node.js) and duckdb-wasm running on the client? Or are you hitting S3 directly with duckdb-wasm?
