| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by cjonas 82 days ago
	We just create mini data "ponds" on the fly by copying tenant isolated gold tier data to parquet in s3. The users/agent queries are executed with duckdb. We run this process when the user start a session and generate an STS token scoped to their tenant bucket path. Its extremely simple and works well (at least with our data volumes).

5 comments

leetrout 82 days ago

I built something on top of DuckDB last year but it never got deployed. They wanted to trust Postgres.

I didn't use the in browser WASM but I did expose an api endpoint that passed data exploration queries directly to the backend like a knock off of what new relic does. I also use that same endpoint for all the graphs and metrics in the UI. Just filtered out the write / delete statements in a rudimentary way.

DuckDB is phenomenal tech and I love to use it with data ponds instead of data lakes although it is very capable of large sets as well.

And "data pond"? Glad I am not alone using this term! Somewhere between a data lake and warehouse - still unstructured but not _everything_ in one place. For instance, if I have a multi-tenant app I might choose to have a duckdb setup for each customer with pre-filtered data living alongside some global unstructured data.

Maybe there's already a term that covers this but I like the imagery of the metaphor... "smaller, multiple data but same idea as the big one".

link

mattaitken 82 days ago

This is cool. I think for our use case this wouldn’t work. We’re dealing with billions of rows for some tenants.

We’re about to introduce alerts where users can write their own TRQL queries and then define alerts from them. Which requires evaluating them regularly so effectively the data needs to be continuously up to date.

link

SOLAR_FIELDS 82 days ago

Billions still seems crunchable for DDB. It’s however much you can stuff into your RAM no? Billions is still consumer grade machine RAM depending on the data. Trillions I would start to worry. But you can have a super fat spot instance where the crunching happens and expose a light client on top of that then no?

Quadrillions, yeah go find yourself a trino spark pipeline

link

Waterluvian 82 days ago

Is that why it’s called DuckDb? Because data ponds?

link

QuantumNomad_ 82 days ago

The DuckDB website has the following to say about the name:

> Why call it DuckDB?

> Ducks are amazing animals. They can fly, walk and swim. They can also live off pretty much everything. They are quite resilient to environmental challenges. A duck's song will bring people back from the dead and inspires database research. They are thus the perfect mascot for a versatile and resilient data management system.

https://duckdb.org/faq#why-call-it-duckdb

link

cjonas 82 days ago

Idk but I named everything in the related code "duckpond" :)

link

mritchie712 82 days ago

Hannes (one of the creators) had a pet duck

link

otterley 82 days ago

How large are these data volumes? How long does it take to prepare the data when a customer request comes in?

link

cjonas 82 days ago

Small. We're dealing with financial accounts, holdings and transactions. So a user might have 10 accounts, thousands of holdings, 10s of thousands of transactions. Plus a handful of supplemental data tables. Then there is market data that is shared across tenants and updated on interval. This data is maybe 10-20M rows.

Just to clarify, the data is prepared when the user (agent) analytics session starts. Right now it takes 5-10s, which means it's typically ready well before the agent has actually determined it needs to run any queries. I think for larger volumes, pg_duckdb would allow this to scale to 10s of millions rows pretty efficiently.

link

boundlessdreamz 82 days ago

How do you copy all the relevant data? Doesn't this create unnecessary load on your source DB?

link

cjonas 82 days ago

We have various data sources (which is another benefit of this approach). Data from the application DB is currently pulled using the FE apis which handle tenant isolation and allow the application database to deal with the load. I think pg_duckdb could be a good solution here as well, but haven't gotten around to testing it. Other data come from analytics DB. Most of this is landed on an interval via pipeline scripts.

link