Hacker News new | ask | show | jobs
by ramraj07 595 days ago
I am just using duckdb on a 3TB dataset in a beefy ec2, and am pleasantly surprised at its performance on such a large table. I had to do some sharding to be sure but am able to match performance of snowflake or other cluster based systems using this single machine instance.

To clarify Clickhouse will likely match this performance as well, but doing things on a single machines look sexier to me than it ever did in decades.

2 comments

Where does your data reside, is it on an attached EBS volume, or in S3, or somewhere else?

I had some spare time and tinkered with duckdb with a 70GB dataset, but just getting the 70GB on to the EC2 took hours. Would be pretty rocking if duckdb team could somehow set up a ~1TB sized demo that anyone can setup and try for themselves in, say, under an hour.

Local drives. DONT USE EBS! you’ll incur a huge IO charge. You have to choose instances with attached nvme storage which means one of the storage optimized instances.

Reading the data off s3 will mean you will be slower than offerings like snowflake. Snowflake has optimized the crap out of doing analytics in s3, so you can’t beat it with something as simple as duckdb.

Importantly you need the data in some distributed format like parquet or split csv. Otherwise duckdb can’t read it in parallel.

Hi – DuckDB Labs devrel here. It's great that you find DuckDB useful!

On the setup side, I agree that local (instance-attached) disks should be preferred but does EBS incur an IO fee? It incurs a significant latency for sure but it doesn't have a per-operation pricing:

> I/O is included in the price of the volumes, so you pay only for each GB of storage you provision.

(https://aws.amazon.com/ebs/pricing/)

Can’t remember anymore, but it’s either (a) the gp2 volumes were way too slow for the ops or (b) the IOPs charges made it bad. To be clear I didn’t do it on duckdb but hosted a Postgres. I moved to light sail instead and was happy with it (you don’t get attached SSD in ec2 until you go to instances that are super large).
Also, I learned that Hive-partitioned Parquet on S3 is much slower than on disk.

S3 is high latency unless you use for S3 Express Zones (the low latency version).

We used EFS (not EBS) and it was much faster.

Test out the nvme drives though. It’s blazing.
I tried to spread large dataset into thousands of files on S3 and use StepFunctions Distributed Map to launch thousands of Lambda instances to process those files in parallel, using DuckDB (or other libs) in Lambda. The parallel loading and processing is way faster than doing this in a single big EC2 instance.
Lambda isn’t infinitely parallel. I thought it doesn’t do more than 100 parallel runners? I4i.metal has 96 cores and can be faster than that.
As per AWS said in https://aws.amazon.com/cn/blogs/aws/aws-lambda-functions-now...

> Each synchronously invoked Lambda function now scales by 1,000 concurrent executions every 10 seconds.

I’ve tried reading streamed parquet via PyArrow with Duck, and it’s been pretty promising. Depending on the query, you won’t need to download everything off HTTP.
we use partitioned parquet files in s3. we use a csv in the bucket root to track the files. i’m sure there’s a better way but for now the 2tb of data are stored cheaply and we get fast reads by only reading the partitions we need to read.
I'm curious how much simpler to build, manage, and run vs cost it would be to simply running a database on a large vultr/DO instance and paying for 2tb of storage?

I feel like you'd get away with the whole thing for around $500/mo depending on how much compute was needed?

You just need to try it once to see the issue. Merely loading this amount of data onto a Postgres db will be hell.
well that's not the infrastructure we have. we are primarily an aws shop so we use the resources available to us in the context of our infrastructure decisions. it would be a hard sell to buy something outside of that ecosystem.
I understand that's the infrastructure you have. But that's more describing vendor lock-in haha.

Most of my work is with clients that don't have any set infrastructure yet, so was curious if anyone had any anecdotes.

Huge fan of Clickhouse, but the minute you have to deal with somebody else's CSV is when Duck wins over Clickhouse.