| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by snidane 593 days ago

The architecture is sound - typically called ELT these days. Dump contents of upstream straight into a database and apply stateless and deterministic operations to achieve the final result tables.

SQL server is where this breaks though. You'll get yelled by DBAs for bad db practices like storing wide text fields without casting them to varchar(32) or varchar(12), primary keys on strings or no indexes at all, and most importantly taking majority of storage on the db host for tbese raw dumps. SQL Server and any traditional database scales by adding machines, so you end up paying compute costs for your storage.

If you use a shared disk system with decoupled compute scaling from storage, then your system is the way to go. Ideally these days dump your files into a file storage like s3 and slap a table abstraction over it with some catalog and now you have 100x less storage costs and about 5-10x increased compute power with things like duckdb. Happy data engineering!

3 comments

WorldMaker 592 days ago

It amazes me how many DBAs think the limit on a varchar column impacts the disk space. The "on disk" size for `varchar(12)` and `varchar(32)` and `varchar(MAX)` are roughly the same and depends on the data itself more than the schema. That's what the "var" in "varchar" means: variable storage size. The limits like (32) were added for compatibility with `char` and for type-based "common sense" validation. Sure, it helps prevent footguns like accidental DDoS of ingesting too much data too quickly, but there are other ways to do that basic top-level validation of "is this too much data to insert?".

Five varchar(12) columns is more storage overhead than one varchar(60). There's a lot of great use cases for varchar(MAX) and everyone I ever had tell me that varchar(MAX) wasn't allowed didn't understand the internals of DB storage that they thought they did and somehow still believe in their internal model of the DB that varchar is just spicy char and fixed column size allocation.

link

icedchai 591 days ago

With Postgres, we mostly just use `text` everywhere, unless there is an actual reason to have a size limit.

In other news, I haven't seen a dedicated "DBA" at a company in over a decade.

link

WorldMaker 591 days ago

> With Postgres, we mostly just use `text` everywhere, unless there is an actual reason to have a size limit.

Yeah, there's still the very rare need to performance engineer out a fixed char field "to the left" of the table to speed up common table scans, but also so many of the reasons you might table scan strings have moved into proper full text search indexes or now all the rage is in vector embeddings.

> In other news, I haven't seen a dedicated "DBA" at a company in over a decade.

Yeah, anecdotally from LinkedIn and other sources it does seem like all the dedicated DBAs that have stayed that way have stuck to very specific niches and/or Oracle Products (including MySQL and derivatives these days; the "Oracle Effect" is strong). Especially in Amazon RDS and Azure SQL Server/Cosmos DB today, Postgres and Microsoft's SQL Server mostly run themselves and day-to-day administration is minor/trivial.

link

sevensor 592 days ago

My experience with delta was that the catalog, being stored in s3 itself, was unacceptably slow, and for our data volume, Airflow was prohibitively expensive. We spent a lot of engineering time working around both problems. Which is funny because the consultants who advised us to do this told us it was the best possible solution; tailor made for our application, foolproof in every way. After that we proceeded to pay for their “data” “science” “services,” which went about as well as my scare quotes would suggest.

link

jamesblonde 593 days ago

You're basically describing the Lakehouse Tables architecture. Store your data as tabular data in Iceberg/Hudi/Delta on S3. Save a bucket on storage. Query with whatever engine you like (Snowflake, Redshift, BQ, DuckDB, etc).

link

aoeusnth1 593 days ago

Yes, this is the vast majority of my data work at Google as well. Spanner + Files on disk (Placer) + distributed query engine (F1) which can read anything and everything (even google sheets) and join it all.

It’s amazingly productive and incredibly cheap to operate.

link