| > Could you share more specific details? Happy to look over / revise where needed. Sure thing! I'd say first off, the solutions may look different for a small company/startup vs. a large enterprise. It can help if you explain the scale you are solving for. On the enterprise side of things, they tend to buy solutions rather than build them in-house. Things like Informatica, Talend, etc. are common for large enterprises whose primary products are not data- or software-related. They just don't have the will, the expertise, or the capital to invest in building and maintaining these solutions in-house, so they buy them off the shelf. On the surface these are very expensive products, but even so it can still make sense for the bottom line of a large enterprise to buy rather than build. For startups and smaller companies, have you looked at something like `dbt` (https://github.com/dbt-labs/dbt-core)? I understand the desire to write some code, but oftentimes there are already existing solutions for the problems you might be encountering. ORMs should typically only exist on the consumer side of the equation, if at all. A lot of business intelligence people / business analysts are just going to use tools like Tableau and hook up to the data warehouse via a connector to visualize their data. You might have some consumers that are more sophisticated and may want to write some custom post-processing or aggregation code, and they could certainly use ORMs if they choose, but it isn't something you should enforce on them. It's a poor place to validate data because, as mentioned, there are different ways/tools to access the data and not all of them are going to go through your Python SDK. Indeed, in a large enough company you are going to have producers and consumers that use different tools and programming languages, so it's a little bit presumptuous to write an SDK in Python there. 
Another thing to talk about, and this probably mostly applies to larger companies - have you looked at an architecture like a distributed data mesh (https://martinfowler.com/articles/data-mesh-principles.html)? This might be something to bring to the CTO more than try to push for yourself, but it can completely change the landscape of what you are doing. > More broadly is the issue of the gap of what you think the role is, and what the role actually is when you join. There are definitely cases where this is accidental. The best way I can think of to close the gap is to maybe do a short-term contract, but may be challenging to do under time constraints etc. Yeah, this definitely sucks and it's not an enviable position to be in. I guess you have a choice between looking for another job and trying to stick it out with the company that did this to you. It's possible there is a genuine existential crisis for the company and a good reason why they did the bait-and-switch. Maybe it pays to stay, especially if you have equity in the company. On the other hand, it could also be the result of questionable practices at the company. It's hard to make that call. |
We did start integrating dbt towards the end of my time in the role. Our data stack was built in 2018, a fair bit of time before data infra-as-a-service became a thing. The idea was that dbt would help our internal consumers self-serve more easily. That said, I did see complaints about dbt pricing recently; as they say, there’s no free lunch.
Re: ORMs, I respectfully disagree. I’ve come across many teams that treat their Python/Rust/Go codebase with ownership and craft, but I have not seen the same for SQL queries. It’s almost like a ‘tragedy of the commons’ problem: columns keep getting added, logic gets patched, and more CTEs get introduced to abstract things out, which in the end only adds to the obfuscation.
ORMs don’t fix everything, but they do help constrain the ‘degrees of freedom’ and keep logic repeatable and consistent, and they are generally better than writing your own string-manipulation functions. An idea I’ve continued to think about since (I wrote the post early last year) is to use static analysis tools like Meta’s UPM to allow refactoring of tables / DAGs (keep interfaces the same but with ‘flatter’ DAGs and fewer duplicate transforms).
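To make the ‘degrees of freedom’ point a bit more concrete, here is a minimal sketch using only Python’s standard library. The `Table` class is a hypothetical stand-in for an ORM model, not a real ORM API, and the `orders` schema is made up for illustration. The point is only that a declared schema can reject a mistyped column before the query runs, which ad-hoc string building cannot do.

```python
from __future__ import annotations

import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    """Hypothetical stand-in for an ORM model: a table name plus its declared columns."""
    name: str
    columns: tuple[str, ...]

    def select(self, *cols: str, where: dict[str, object] | None = None):
        """Build a parameterized SELECT, rejecting columns the table does not declare."""
        cols = cols or self.columns  # no columns given: select all declared ones
        where = where or {}
        unknown = [c for c in (*cols, *where) if c not in self.columns]
        if unknown:
            raise ValueError(f"unknown column(s) for {self.name}: {unknown}")
        sql = f"SELECT {', '.join(cols)} FROM {self.name}"
        if where:
            # Values travel as bound parameters, never interpolated into the string.
            sql += " WHERE " + " AND ".join(f"{c} = ?" for c in where)
        return sql, tuple(where.values())


# Made-up schema and data, purely for illustration.
orders = Table("orders", ("id", "customer", "amount"))
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 10.0), (2, "globex", 25.5)],
)

sql, params = orders.select("id", "amount", where={"customer": "acme"})
rows = conn.execute(sql, params).fetchall()
```

A typo such as `orders.select("amout")` raises a `ValueError` at build time instead of surfacing later as a database error, or worse, as a silently wrong result in a hand-patched query string.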
Interestingly enough, I currently work on ML and am impressed to see how much modeling can be done in the cloud compared to my earlier stint in the space (which had a dedicated engineering team focused on features and inference). On the flip side, I see a similar explosion of SQL strings, some parts handled with more care than others.
I’ve not looked into a data mesh, but a friend did mention pushing his org to embrace it - note to self to follow up and see how that's going. Looks like there are a couple of ‘dimensions’ to it; my broader take is that keeping things sensible is both a technical and an organizational challenge.
I look forward to future blog posts on ‘how we refactored our SQL queries’; maybe there’s a startup idea in there somewhere.