| HN Mirror



	by ironchef 1032 days ago
	Here was my situation: occasional queries over a couple of petabytes of data. Customer-facing, so responses in seconds per SLA, but > 95 percent of the time the warehouse isn’t running. Cached queries from within the last 24 hours don’t even require the warehouse to spin up. Our Snowflake costs were significantly less than an FTE. Would that potentially be a situation which “running your own” doesn’t make sense?

2 comments

ramesh31 1032 days ago

>Would that potentially be a situation which “running your own” doesn’t make sense?

Look into datalake architectures. RDBMS based data warehousing is obviously not economical at the petabyte scale. But storing all that data in S3 with Delta Lake/Iceberg format and querying with Spark changes things entirely. You only pay for object storage, and S3 read costs are trivial.


ironchef 1031 days ago

> Look into datalake architectures.

Yup .. comfy with iceberg/delta/hudi

> RDBMS based data warehousing is obviously not economical at the petabyte scale.

I never said it was .. I'm simply responding to "I simply cannot understand how anyone chooses this over running your own Spark clusters with Jupyterlab". I'm trying to help you understand why folks would choose a SaaS over run your own.

> But storing all that data in S3 with Delta Lake/Iceberg format and querying with Spark changes things entirely. You only pay for object storage, and S3 read costs are trivial.

No. You don't just pay for object storage + minor S3 read costs.

You pay for operations
You pay for someone setting up conventions
You pay to not have to optimize data layouts for streaming writes
You pay to not have to discover race conditions in S3 when running multiple Spark clusters writing to the same Delta tables
You pay to not have to discover that your partitions/clustering needs have changed based on new data or query patterns

But look .. I get it. You have chosen to optimize for cost structures in one way .. and I've chosen to optimize in a different way. In the past I've done exactly as you've said as well. I think seeking to see _why_ folks may have chosen a different path may help you understand other areas to consider in operations.
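
To make the trade-off concrete, here is a rough back-of-the-envelope sketch of the comparison being argued about: a pay-per-use warehouse that is idle more than 95 percent of the time versus a self-managed lake where storage is cheap but someone's time is not. Every number, class name, and field below is a hypothetical placeholder for illustration, not actual Snowflake, S3, or salary pricing. Plug in your own figures.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass
class ManagedWarehouse:
    """Pay-per-use warehouse: compute is billed only while it is running."""
    hourly_rate: float        # hypothetical price per running hour
    uptime_fraction: float    # share of the year the warehouse is actually up
    storage_per_year: float   # hypothetical annual storage bill

    def annual_cost(self) -> float:
        compute = self.hourly_rate * HOURS_PER_YEAR * self.uptime_fraction
        return compute + self.storage_per_year


@dataclass
class SelfManagedLake:
    """Object storage plus your own query engine, plus the people running it."""
    storage_per_year: float        # hypothetical object storage bill
    compute_per_year: float        # hypothetical cluster cost for the same queries
    engineer_cost_per_year: float  # hypothetical fully loaded cost of one FTE
    engineer_fraction: float       # share of that FTE spent on operations

    def annual_cost(self) -> float:
        people = self.engineer_cost_per_year * self.engineer_fraction
        return self.storage_per_year + self.compute_per_year + people


if __name__ == "__main__":
    # All figures are made up; only the shape of the comparison matters.
    managed = ManagedWarehouse(hourly_rate=100.0, uptime_fraction=0.05,
                               storage_per_year=500_000.0)
    lake = SelfManagedLake(storage_per_year=500_000.0, compute_per_year=20_000.0,
                           engineer_cost_per_year=200_000.0, engineer_fraction=0.5)
    print(f"managed:      {managed.annual_cost():,.0f}")
    print(f"self-managed: {lake.annual_cost():,.0f}")
```

The point of the sketch is only that the `engineer_fraction` term (operations, conventions, layout tuning, debugging concurrent writers) sits on one side of the ledger and is easy to leave out when comparing storage and read costs alone.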


agent281 1032 days ago

If you have petabytes of data, I don't think this article is talking about your use case.


sanderjd 1032 days ago

I think it is?

Or I guess, what data size do you think it's talking about? If you only have gigabytes of data, none of this matters, you can use anything pretty cheaply and easily. So is this article just for "terabytes" or does it go up to "hundreds of terabytes" but not "petabytes"?


agent281 1031 days ago

Hmm, I suppose it's a bit challenging to say. I initially thought that it wasn't for the 80% smallest companies, and petabytes of data probably puts you in the top 20%. (Most businesses are small businesses after all.)

However, I now realize that the biggest companies probably should manage their own data. If you're Google why would you use Snowflake?

So I don't know if you are the target audience for this blog post. It's pretty ambiguous.


sanderjd 1031 days ago

I guess I'll say what I think. I do think it is targeted at that smallest 80% of companies with some digital footprint, and also at most of the top 20%. Or more specifically, I think maybe it's targeted at like the 5th percentile to the 99th percentile. That bottom 5% probably just needs Excel, and that top 1% is probably writing or heavily modifying all their own tools.

But I'm not sure the advice is very good from the 5th percentile up to ... maybe that top 20%? A lot of the stuff in the article assumes the availability of sophisticated data architects and mature infrastructure groups that I really don't think the median company has.


agent281 1031 days ago

I agree. Really seasoned data people are not common enough. Small companies need to buy services to lighten the load.

We both seem to have a sense of the size of companies at different percentiles. At what percentile would you put your company with petabytes of data?


sanderjd 1030 days ago

Super hard to say, so ... 80th or 90th? With very low confidence.

But I do have very high confidence that the 99th percentile is much larger than petabytes (think: what's next after "exa"), and I believe that many companies these days crack into "peta" territory.

But as I saw another comment mention, I think another, probably more important, consideration besides size in bytes is cardinality and structure. So maybe this whole classification we're doing is kind of beside the point :)
