by nerdponx
906 days ago
I think I agree with this to some extent in that it's hard for me to imagine a use case where I have a bunch of clean Parquet files, nicely partitioned, in some kind of cloud storage system. If I'm already going through the trouble of doing ELT/ETL and making a clean copy of the raw data, why would I do that in cloud storage and not in an actual database?

I don't echo your dismissal of the idea because a whole lot of people seem to be excited about it. But I personally feel like I'm missing the use case compared to the lake + warehouse setup. Is it about distributing responsibility across teams? Reducing storage cost? Open source good vibes?

Maybe a legitimate use case is being able to use the same data source for multiple query engine frontends? That is, you can use both Spark and Snowflake on the same physical data files. I'd be interested to hear about this from someone who's using or planning to use a lakehouse.
|
* Storing large amounts of data, on the order of petabytes, in any database is phenomenally expensive for the storage alone.
* For some kinds of data, like image data, databases are generally the wrong tool.
* The consumers of these kinds of systems may have really dynamic workloads. Imagine ML jobs that kick off 1K machines simultaneously to hammer your DB and read from it as fast as possible. Cloud-managed object stores have solved this scaling issue already. If you can get infrastructure you manage out of the way, you get to leverage that work. If your DB is in the middle, you're on call for it.
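
To make the last point a bit more concrete, here's a rough sketch of the access pattern: many readers each pull their own partition file directly, with no server in the middle to coordinate or fall over. Everything in it is a stand-in I made up for illustration: a local temp directory plays the object store, CSV plays Parquet (so it runs on the standard library alone), threads play the 1K machines, and the names `write_partitions`, `read_partition`, and `parallel_sum` are hypothetical.

```python
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_partitions(root: Path, n_partitions: int, rows_per_partition: int) -> None:
    """Write one file per partition, Hive-style: root/part=<i>/data.csv."""
    for p in range(n_partitions):
        part_dir = root / f"part={p}"
        part_dir.mkdir(parents=True)
        with open(part_dir / "data.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "value"])
            for i in range(rows_per_partition):
                writer.writerow([p * rows_per_partition + i, i])


def read_partition(path: Path) -> int:
    """One 'worker': opens its own file directly and sums the value column."""
    with open(path, newline="") as f:
        return sum(int(row["value"]) for row in csv.DictReader(f))


def parallel_sum(root: Path, workers: int) -> int:
    """Fan out over the partition files; readers share nothing but the storage."""
    files = sorted(root.glob("part=*/data.csv"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(read_partition, files))


def demo() -> int:
    # Assumed toy sizes: 8 partitions of 100 rows each.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        write_partitions(root, n_partitions=8, rows_per_partition=100)
        return parallel_sum(root, workers=8)


if __name__ == "__main__":
    print(demo())
```

Scaling the readers up here only means more concurrent file reads. In the real version the object store absorbs that load, which is the part you no longer have to be on call for.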