Hacker News new | ask | show | jobs
by FridgeSeal 906 days ago
> If I'm already going through the trouble of doing ELT/ETL and making a clean copy of the raw data, why would I do that in cloud storage and not in an actual database?

Well, depends on your requirements. You can definitely go point-to-point straight into another DB.

One reason to keep data in object storage, is it gives you a sort of “db independent” storage layer. At a previous $work, we had a tiered system: data would come in from source systems (primary application db’s, marketing systems, etc), and would be serialised verbatim in structured format in S3 (layer 1). Data eng systems would then process that data-refining it, enriching it, ensuring types and schemas, etc, which would be serialised into the next tier (layer 2). At this level they’d be nice to use, so the data analysts would operate against this data in their spark notebooks.

BI and reporting, and other applications could either use data from this layer directly, or if they had special requirements, or performed computationally difficult enough tasks, we would add another layer (layer 3) for specialised workloads and presentation layers. Layer 2 and 3 data may also be synced into data warehouses like ClickHouse.

This gave us complete lineage of data (no more mystery tables, no more “where did you get this data from”, etc), and the storage itself is reasonably cheap. Many services can query these storage layers directly, so setting up views, or projections into different layouts - even for huge quantities of data- becomes feasible and achievable with no more engineering effort than a query.

Was it a lot? Yep. Would I recommend or do it everywhere? Absolutely, 100% no I would not. Was it a good fit for that org? Yeah, arguably better than they could utilise, but for them, other approaches were anaemic and fragile at best.

Could it be done simpler now? Yep, but it got the job done then haha.

Edit to add: it was also language agnostic, which was a huge win and is an understated part of these new parquet-based solutions: you’re no longer limited to “fragile python app” or “spark cluster” to interact with your data. Rust, C#/F#, various FE tools for JS/TS (cube, etc). This is a huge win because you’re not longer tied to keeping around an aging spark/hadoop cluster that has gradually encrusted more garbage into it until it’s this massive, ultra-fragile time bomb nobody dares touch that powers mass amounts of back-office-business needs.