Hacker News new | ask | show | jobs
by nerdponx 906 days ago
I think I agree with this to some extent in that it's hard for me to imagine a use case where I have a bunch of clean Parquet files, nicely partitioned, in some kind of cloud storage system.

If I'm already going through the trouble of doing ELT/ETL and making a clean copy of the raw data, why would I do that in cloud storage and not in an actual database?

I don't echo your dismissal of the idea because a whole lot of people seem to be excited about it. But I personally feel like I'm missing the use case compared to the lake + warehouse setup.

Is it about distributing responsibility across teams? Reducing storage cost? Open source good vibes?

Maybe a legitimate use case is being able to use the same data source for multiple query engine frontends? That is, you can use both Spark and Snowflake on the same physical data files.

I'd be interested to hear about this from someone who's using or planning to use a lakehouse.

2 comments

In my experience,

* Storing large amounts like petabytes in any database is phenomenally expensive, just for the storage alone.

* For some kinds of data, like image data, databases are generally the wrong tool.

* The consumers of these kinds of systems may have really dynamic workloads. Imagine ML jobs that kick off 1K machines simultaneously to hammer your DB and read from it as fast as possible. Cloud-managed object stores have solved this scaling issue already. If you can get infrastructure you manage out of the way, you get to leverage that work. If your DB is in the middle, you're on call for it.

> If I'm already going through the trouble of doing ELT/ETL and making a clean copy of the raw data, why would I do that in cloud storage and not in an actual database?

Well, depends on your requirements. You can definitely go point-to-point straight into another DB.

One reason to keep data in object storage, is it gives you a sort of “db independent” storage layer. At a previous $work, we had a tiered system: data would come in from source systems (primary application db’s, marketing systems, etc), and would be serialised verbatim in structured format in S3 (layer 1). Data eng systems would then process that data-refining it, enriching it, ensuring types and schemas, etc, which would be serialised into the next tier (layer 2). At this level they’d be nice to use, so the data analysts would operate against this data in their spark notebooks.

BI and reporting, and other applications could either use data from this layer directly, or if they had special requirements, or performed computationally difficult enough tasks, we would add another layer (layer 3) for specialised workloads and presentation layers. Layer 2 and 3 data may also be synced into data warehouses like ClickHouse.

This gave us complete lineage of data (no more mystery tables, no more “where did you get this data from”, etc), and the storage itself is reasonably cheap. Many services can query these storage layers directly, so setting up views, or projections into different layouts - even for huge quantities of data- becomes feasible and achievable with no more engineering effort than a query.

Was it a lot? Yep. Would I recommend or do it everywhere? Absolutely, 100% no I would not. Was it a good fit for that org? Yeah, arguably better than they could utilise, but for them, other approaches were anaemic and fragile at best.

Could it be done simpler now? Yep, but it got the job done then haha.

Edit to add: it was also language agnostic, which was a huge win and is an understated part of these new parquet-based solutions: you’re no longer limited to “fragile python app” or “spark cluster” to interact with your data. Rust, C#/F#, various FE tools for JS/TS (cube, etc). This is a huge win because you’re not longer tied to keeping around an aging spark/hadoop cluster that has gradually encrusted more garbage into it until it’s this massive, ultra-fragile time bomb nobody dares touch that powers mass amounts of back-office-business needs.