by memco 469 days ago
Love this straightforward analysis of use cases:

> Using smallpond and 3FS depends largely on your data size and infrastructure:

> Under 10TB: smallpond is likely unnecessary unless you have very specific distributed computing needs. A single-node DuckDB instance or simpler storage solutions will be simpler and possibly more performant.

> 10TB to 1PB: smallpond begins to shine. You'd set up a cluster with several nodes, leveraging 3FS or another fast storage backend to achieve rapid parallel processing.

> Over 1PB (Petabyte-Scale): smallpond and 3FS were explicitly designed to handle massive datasets. At this scale, you'd need to deploy a larger cluster with substantial infrastructure investments.

Makes it very easy to determine if this would be useful for me and how much work I would expect to do to use it.
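
For what it's worth, the quoted thresholds boil down to a small decision rule. Here's a rough sketch in Python; the function name `suggest_setup`, the decimal TB/PB constants, and the returned labels are my own assumptions for illustration, not anything from the article or from smallpond itself:

```python
# Hypothetical helper: maps a dataset size to the article's rough guidance.
# The cutoffs (10 TB, 1 PB) come from the quoted text; everything else is assumed.
TB = 10**12  # decimal terabyte, in bytes
PB = 10**15  # decimal petabyte, in bytes


def suggest_setup(dataset_bytes: int) -> str:
    """Return a coarse recommendation for a dataset of the given size."""
    if dataset_bytes < 0:
        raise ValueError("dataset size cannot be negative")
    if dataset_bytes < 10 * TB:
        # Under 10TB: a single node is likely simpler and possibly faster.
        return "single-node DuckDB"
    if dataset_bytes <= PB:
        # 10TB to 1PB: a cluster with several nodes and fast storage.
        return "smallpond cluster with 3FS or similar"
    # Over 1PB: larger cluster, substantial infrastructure investment.
    return "large smallpond + 3FS deployment"


if __name__ == "__main__":
    for size in (2 * TB, 50 * TB, 3 * PB):
        print(f"{size / TB:>8.0f} TB -> {suggest_setup(size)}")
```

The boundaries are fuzzy in practice, so treat the hard cutoffs as a starting point rather than a rule.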


I very much felt like that entire portion of the article was ai generated, actually.

IMO pretty obvious, surface-level information with some prose on each bullet.

Saying something is “obvious” without specifying an audience is meaningless.

(because obviousness is subjective and depends on the knowledge, experience, and context of the audience)

Notice the “IMO pretty” before the word “obvious”

IMO means “in my opinion.” I used that phrase to express how the following statement is my opinion and not a universal truth. My “audience” in this case is myself.

I do that because otherwise there’s always a comment saying how things like “obvious” can be subjective.

I also used the word “pretty” to, again, soften the word “obvious” so that readers don’t think that it’s a universal truth.

with some "no s, sherlock" on the ">1PB will require additional infra."

go on...

Like people talking about 1 Gbit iSCSI when no one thought to say that 120 MB/s, which is technically slower than ATA/133 from twenty years ago, might be the bottleneck. Obviously 10 Gbit will be "as fast as a local drive"!
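
The arithmetic behind that comparison is easy to check. A quick sketch, assuming decimal units and the raw line rate before any protocol overhead (which is why it gives 125 MB/s rather than the ~120 MB/s usually seen in practice); the helper name `link_mb_per_s` is made up for this example:

```python
# Raw link throughput vs. the nominal peak of the ATA/133 interface.
ATA_133_MB_S = 133.0  # nominal interface peak, in MB/s


def link_mb_per_s(gbit: float) -> float:
    """Convert a link speed in Gbit/s to MB/s (decimal units, no overhead)."""
    return gbit * 1000 / 8  # 1 Gbit = 1000 Mbit, 8 bits per byte


if __name__ == "__main__":
    for gbit in (1, 10):
        rate = link_mb_per_s(gbit)
        verdict = "slower" if rate < ATA_133_MB_S else "faster"
        print(f"{gbit} Gbit/s = {rate:.0f} MB/s ({verdict} than ATA/133)")
```

Real iSCSI throughput will land below these numbers once TCP/IP and iSCSI overhead are counted.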

Yes, exactly right! This means you need to buy additional hardware, like network cards[0], and possibly GBICs and fiber optics.

I updated the post. In this case, I meant "exotic" infra... e.g. 3FS isn't like adding more EC2 instances.

Adding EC2 instances is trivial; setting up 3FS is hard.

You’ve been wanting to get this off your chest for a while, haven’t you?
The authors are Chinese, so they may simply use AI to make it sound right in English.
I had a Chinese co-worker and something like this was actually his style of writing, with no use of AI; I know because I was sitting next to him a few times when he was writing documents.
Some of it was AI generated, but I made sure everything was accurate. I'd normally rewrite everything, but I wrote this quickly before I had to leave the house. Didn't think it'd be on the front page!
Not judging you for using AI for a post like this!

Don’t feel bad. I just didn’t think AI generated bullet points were as impressive as the comment I was replying to did.

I wonder at which scale Spark fits into this picture and what the tradeoffs / benefits would be.
Spark is certainly the incumbent for this sort of thing.

One benefit for me personally: you should be able to move from local dev to cloud more easily.

Yeah I reeeaaally want to see benchmarks! Single-node DuckDB is absolutely insane (as in fast) performance-wise, especially compared to something like Spark. There's been a lot of speed-focussed work in the project and I don't know of any faster data processing (I'm not counting traditional SQL, since a lot of the speed benefits there come from indexing etc. and essentially doing additional work ahead of time).

I guess it comes down to how well written the distributed workflows are. There's a lot to get wrong, but in theory it should be able to achieve very impressive numbers.

My reasoning behind this is Dask, which uses pandas under the hood and is capable of better benchmarks than Spark. I think this is partly down to some good optimisations, but also simply that pandas is faster than Spark's row-based model. DuckDB is, on some benchmarks, more than 10x faster than pandas; you can see where this is going...