| Perhaps the first thing I’d clarify is not all the ‘bad’ things described happened to me personally, and out of the ones that did, I employed artistic licence in the recollection. We did start integrating dbt towards the end of my time in the role. Our data stack was built in 2018, so a fair bit of time before data infra-as-a-service became a thing. The idea is dbt would help our internal consumers to more easily self serve. That said I did see complaints about dbt pricing recently; as they say there’s no free lunch. Re: ORMs, I respectfully disagree. I’ve come across many teams that treat their Python/Rust/Go codebase with ownership and craft, I have not seen the same be said about SQL queries. It’s almost like a 'tragedy of the commons’ problem - columns keep getting added, logic gets patched, more CTEs to abstract things out but in the end adds to the obfuscation. ORMs don’t fix everything but it does help constraint the ‘degrees of freedom’ and help keeps logic repeatable and consistent, and generally better than writing your own string-manipulation functions. An idea I had I continued (I wrote the post early last year) was to use static analysis tools like Meta’s UPM to allow refactoring of tables / DAGs (keep interfaces the same but ‘flatter’ DAGs, less duplicate transforms). Interestingly enough, I currently work on ML and impressed to see how much modeling can be done in the cloud compared to my earlier stint in the space (which had a dedicated engineering team focused on features and inference). On the flipside I similarly see an explosion of SQL strings, some parts handled with care more than others. I’ve not looked into a data mesh but a friend did mention pushing his org to embrace it - self note to follow up to see how that's going. Looks like there are a couple of ‘dimensions’ to it; my broader take is that keeping things sensible is both a technical and organizational challenge. I look forward to future blog posts on ‘how we refactored our SQL queries’, maybe there’s a startup idea there somewhere. |
> ORMs don’t fix everything but it does help constraint the ‘degrees of freedom’ and help keeps logic repeatable and consistent, and generally better than writing your own string-manipulation functions. An idea I had I continued (I wrote the post early last year) was to use static analysis tools like Meta’s UPM to allow refactoring of tables / DAGs (keep interfaces the same but ‘flatter’ DAGs, less duplicate transforms).
I get what you're saying, but think about a large org with a lot of different teams and heterogenous data stores - it's gonna be pretty hard to implement a top-down directive to tell everyone to use such and such ORM library, or to ensure a common level of ownership and craft. This is where SQL is the lingua franca and usually the native language of the data stores themselves and is a common factor between most/all of them. This is also where tools like Trino / PrestoSQL can come in and provide a compatibility layer at the SQL level while also providing really nice features such as being able to do joins across different kinds of data stores / query optimization / caching / access control / compute resource allocation.
In general it's hard to get things to flow "top down" in larger orgs, so it's better to address as much as you can from the bottom up. This includes things like domain models - it's gonna be tough to get everyone to accept a single domain model because different teams have different levels of focus and granularity as they zoom into specific subsets so they will tend to interpret the data in their own ways. That's not to say any of them are wrong, there's a reason why that whole data lake concept of "store raw unstructured data" came in where the consumer enforces a schema on read. This gives them the power to look at the data from their own perspective and interpretation. The more interpretation and assumptions you bake into the data before it reaches the consumers, the more problems you tend to run into.
That's not to say that you can't have a shared domain model between different teams. There are unsurprisingly also products out there that provide the enterprise the capability to collaboratively define and refine shared domain models, which can then be used as a lens/schema to look at the data. Crucially the domain model may shift over time, so this decoupling of the domain model from the actual schema of the stored data allows for the domain model to evolve over time without having to go back and fix the stored data because we have not baked in any assumptions or interpretations into the stored data itself.