by ironchef 1032 days ago
Here was my situation. Occasional queries. Over a couple of petabytes of data. Customer facing, so response in seconds per SLA, but > 95 percent of the time the warehouse isn’t running. Queries cached within the last 24 hours don’t require the warehouse to even spin up. Our Snowflake costs were significantly less than an FTE.

Would that potentially be a situation in which “running your own” doesn’t make sense?
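A back-of-the-envelope sketch of the uptime arithmetic behind this, assuming pay-per-use compute billed only while the warehouse runs. The function names and the `hourly_rate` parameter are hypothetical placeholders, not figures from the comment, and real on-demand and self-managed rates differ, so this only shows the uptime side of the comparison.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def billed_hours(uptime_fraction: float) -> float:
    """Hours per year the compute is actually running (and billed)."""
    return uptime_fraction * HOURS_PER_YEAR


def yearly_compute_cost(uptime_fraction: float, hourly_rate: float) -> float:
    """Yearly compute spend at a given (hypothetical) hourly rate."""
    return billed_hours(uptime_fraction) * hourly_rate


# "> 95 percent of the time the warehouse isn't running" means at most 5% uptime.
on_demand_hours = billed_hours(0.05)   # roughly 438 of 8760 hours
always_on_hours = billed_hours(1.0)    # a cluster that never shuts down

# At an identical hourly rate, always-on compute costs about 20x as much.
print(always_on_hours / on_demand_hours)
```

Storage cost is left out on purpose: it is paid in both setups, so the difference comes from idle compute and from the people needed to operate it.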

2 comments

> Would that potentially be a situation in which “running your own” doesn’t make sense?

Look into datalake architectures. RDBMS based data warehousing is obviously not economical at the petabyte scale. But storing all that data in S3 with Delta Lake/Iceberg format and querying with Spark changes things entirely. You only pay for object storage, and S3 read costs are trivial.

> Look into datalake architectures.

Yup .. comfy with iceberg/delta/hudi

> RDBMS based data warehousing is obviously not economical at the petabyte scale.

I never said it was .. I'm simply responding to "I simply cannot understand how anyone chooses this over running your own Spark clusters with Jupyterlab". I'm trying to help you understand why folks would choose a SaaS over run your own.

> But storing all that data in S3 with Delta Lake/Iceberg format and querying with Spark changes things entirely. You only pay for object storage, and S3 read costs are trivial.

No. You don't just pay for object storage + minor S3 read costs.

You pay for operations.

You pay for someone setting up conventions.

You pay to not have to optimize data layouts for streaming writes.

You pay to not have to discover race conditions in S3 when running multiple Spark clusters writing to the same Delta tables.

You pay to not have to discover that your partitioning/clustering needs have changed based on new data or query patterns.
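To make the multi-writer race concrete, here is a toy sketch. It assumes a commit log where each table version is one numbered entry, and where a commit is only safe if "create this entry only if it does not exist yet" happens as a single atomic step. The function names and the dict standing in for object storage are hypothetical; this is a simulation of the idea, not Delta Lake's actual code or API.

```python
def racy_commit(log: dict, version: int, writers: list) -> list:
    """Check-then-write as two separate steps, with the worst-case interleaving."""
    # Step 1: every writer checks for the version before anyone has written it.
    saw_absent = [w for w in writers if version not in log]
    # Step 2: every writer that saw "absent" writes; later writes overwrite earlier ones.
    for w in saw_absent:
        log[version] = w
    return saw_absent  # writers that believe their commit succeeded


def atomic_commit(log: dict, version: int, writers: list) -> list:
    """Check and write as one indivisible step per writer (put-if-absent)."""
    winners = []
    for w in writers:
        if version not in log:
            log[version] = w
            winners.append(w)
    return winners  # exactly one writer wins; the others must retry on a new version


racy_log, safe_log = {}, {}
print(racy_commit(racy_log, 1, ["cluster_a", "cluster_b"]), racy_log)
print(atomic_commit(safe_log, 1, ["cluster_a", "cluster_b"]), safe_log)
```

In the racy case both clusters believe they committed version 1 and one commit is silently overwritten. Avoiding that takes either storage that offers an atomic put-if-absent or an external coordinator, which is exactly the kind of thing someone has to know about, set up, and maintain.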

But look .. I get it. You have chosen to optimize for cost structures in one way .. and I've chosen to optimize in a different way. In the past I've done exactly as you've said as well. I think seeking to see _why_ folks may have chosen a different path may help you understand other areas to consider in operations.

If you have petabytes of data, I don't think this article is talking about your use case.

I think it is?

Or I guess, what data size do you think it's talking about? If you only have gigabytes of data, none of this matters, you can use anything pretty cheaply and easily. So is this article just for "terabytes" or does it go up to "hundreds of terabytes" but not "petabytes"?

Hmm, I suppose it's a bit challenging to say. I initially thought that it wasn't for the 80% smallest companies, and petabytes of data probably puts you in the top 20%. (Most businesses are small businesses after all.)

However, I now realize that the biggest companies probably should manage their own data. If you're Google, why would you use Snowflake?

So I don't know if you are the target audience for this blog post. It's pretty ambiguous.

I guess I'll say what I think. I do think it is targeted at that smallest 80% of companies with some digital footprint, and also at most of the top 20%. Or more specifically, I think maybe it's targeted at like the 5th percentile to the 99th percentile. That bottom 5% probably just needs Excel, and that top 1% is probably writing or heavily modifying all their own tools.

But I'm not sure the advice is very good from the 5th percentile up to ... maybe that top 20%? A lot of the stuff in the article assumes the availability of sophisticated data architects and mature infrastructure groups that I really don't think the median company has.

I agree. Really seasoned data people are not common enough. Small companies need to buy services to lighten the load.

We both seem to have a sense of the size of companies at different percentiles. At what percentile would you put your company with petabytes of data?

Super hard to say, so ... 80th or 90th? With very low confidence.

But I do have very high confidence that the 99th percentile is much larger than petabytes (think: what's next after "exa"), and I believe that many companies these days crack into "peta" territory.

But as I saw another comment mention, I think another, probably more important, consideration besides size in bytes is cardinality and structure. So maybe this whole classification we're doing is kind of beside the point :)