Hacker News
by dchess 41 days ago
I feel like this is what Delta Lake and DuckLake are largely solving for. And then some.

They solve it, partially, for tabular data. Delta, Iceberg, DuckLake are all table formats. And yeah, they do more than dataset abstraction (transactions, time travel, schema evolution).

But that's just one slice of storage. Most teams also have logs, media, ML artifacts, raw dumps, etc., none of which fit into a table format. And even with tables, you often can't easily look at a Delta table and know what the underlying storage is costing you, whether it's still accessed, etc.

Another system might solve it for your media files, another for your log streams, and so on. That's the thing: you have a set of management nice-to-haves that are quite generic and aren't universally supported today, so you end up reinventing them separately in each domain. And even if you did reinvent them everywhere, you still wouldn't have a central aggregated view across all your storage.

> logs, media, ML artifacts, raw dumps, etc., none of which fit into a table format.

You would be appalled at the kind of stuff I have seen teams stuff into Parquet and Iceberg tables.

Ha. The fact that teams reach for Iceberg to organize things that aren't really tables is itself a symptom of needing better management tools for other types of data.

Sure, but that’s not an S3 concern, because the vast majority of people use S3 as it is, without needing additional management machinery.

The solution is just to spin up the machinery you need for your use case, rather than making S3 cover all possible bases.