Hacker News
by CodesInChaos 41 days ago
The "dataset" abstraction this article proposes feels rather specific to their use case, not universal. None of my S3 use cases would benefit from it.

Just store such metadata in your database, where you can organize, index and aggregate it whatever way you like.

Curious what your use cases look like. If you're storing data where you always know what's there, who created it, and whether it's still in use without needing to query for it, that's actually a great place to be. The post is about the much messier middle ground most teams I've talked to are in.
Some of the most important ones are:

1. Invoice PDFs. Individually small, but there are a hundred million of them. Deleted after 10 years or when the tenant deletes their account.

2. Reports and exports. Few but potentially big files. If an export logically consists of multiple files, it's stored as a zip file. They live for 30 days or until the tenant deletes their account.

3. Streaming database exports using AWS Database Migration Service for replication into Snowflake.

Every file has an entry in the database tracking its storage location and status.

Grouping them by tenant, (sub)type or time-interval makes sense for these. But "dataset" isn't an applicable concept.
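
A minimal sketch of what such a tracking table could look like, using SQLite from Python. The table, column, and function names here are hypothetical and only illustrate the idea; this is not the actual schema, and the retention values are just the lifetimes mentioned above.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical schema: one row per stored object. Names are illustrative only.
SCHEMA = """
CREATE TABLE stored_file (
    id          INTEGER PRIMARY KEY,
    tenant_id   INTEGER NOT NULL,
    kind        TEXT    NOT NULL,                   -- e.g. 'invoice_pdf', 'export'
    s3_key      TEXT    NOT NULL UNIQUE,            -- storage location
    size_bytes  INTEGER NOT NULL,
    status      TEXT    NOT NULL DEFAULT 'active',  -- 'active' or 'deleted'
    created_at  INTEGER NOT NULL,                   -- Unix seconds, UTC
    expires_at  INTEGER NOT NULL                    -- Unix seconds, UTC
);
CREATE INDEX idx_file_tenant_kind ON stored_file (tenant_id, kind);
CREATE INDEX idx_file_expiry ON stored_file (status, expires_at);
"""

# Assumed retention per kind, based on the lifetimes described in the list.
RETENTION = {
    "invoice_pdf": timedelta(days=3650),  # roughly 10 years
    "export": timedelta(days=30),
}


def open_db():
    """Create an in-memory database with the tracking table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def register_file(conn, tenant_id, kind, s3_key, size_bytes, now):
    """Record a newly uploaded object and its expiry time."""
    expires = now + RETENTION[kind]
    conn.execute(
        "INSERT INTO stored_file "
        "(tenant_id, kind, s3_key, size_bytes, created_at, expires_at) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (tenant_id, kind, s3_key, size_bytes,
         int(now.timestamp()), int(expires.timestamp())),
    )


def usage_by_tenant_and_kind(conn):
    """Aggregate active files: (tenant_id, kind, file_count, total_bytes)."""
    return conn.execute(
        "SELECT tenant_id, kind, COUNT(*), SUM(size_bytes) "
        "FROM stored_file WHERE status = 'active' "
        "GROUP BY tenant_id, kind ORDER BY tenant_id, kind"
    ).fetchall()


def expired_keys(conn, now):
    """Storage keys of active files whose retention period has passed."""
    rows = conn.execute(
        "SELECT s3_key FROM stored_file "
        "WHERE status = 'active' AND expires_at <= ? ORDER BY s3_key",
        (int(now.timestamp()),),
    )
    return [row[0] for row in rows]
```

With rows like these, grouping by tenant or type is an ordinary `GROUP BY`, and a cleanup job only needs `expired_keys` (plus a per-tenant lookup on account deletion) to know which objects to remove from S3.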

That's a clean architecture, and the dataset abstraction isn't really needed when every file has a DB row and a clear lifecycle.

The post is more about the pipeline / ML / log / export world where ownership isn't enforced by application code.

The DMS case sits somewhere in between - there's a per-table grouping that could be useful, but the files are usually transient enough that it doesn't matter much. Different problem from yours.