by zamalek
297 days ago
One thing I have seen through my more recent exposure to experienced data engineers is the lack of repeatability rigor (CI/CD, IaC, etc.). There's a lot of doing things in notebooks and calling that production-ready. Databricks has git integration (GitHub only, from what I can tell), but that's just checking out and committing directly to trunk. If it's in git then we have SDLC, right? It's fucking nuts. Does anyone have workflows or tooling that are highly compatible with the entrenched notebook approach and easy to adopt? I want to prevent these people from learning well-trodden lessons the hard way.
There are plenty of us out here with many repos, dozens of contributors, and thousands of lines of terraform, python, custom GitHub actions, k8s deployments running airflow and internal full stack web apps that we're building, EMR spark clusters, etc. All living in our own Snowflake/AWS accounts that we manage ourselves.
The data scientists that we service use notebooks extensively, but it's my team's job to clean it up and make it testable and efficient. You can't develop real software in a notebook. It sounds like they need to upskill into a real orchestration platform like airflow and run everything through it.
Unit test the utility functions and helpers, data quality test the data flowing in and out. Build diff reports for understanding big swings in the data to sign off changes.
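To make that concrete, here is a rough sketch of the three pieces in plain Python (stdlib only). Everything in it is made up for illustration: `clean_amount`, `quality_issues`, `diff_report`, the `id`/`amount` columns, and the 20% threshold are assumptions, not anything from a real pipeline. In practice you would likely lean on a test runner and a data quality library instead of hand-rolling this.

```python
from statistics import mean


def clean_amount(raw):
    """Hypothetical helper lifted out of a notebook: parse '1,234.50' to a float."""
    return float(str(raw).replace(",", "").strip())


def quality_issues(rows, required=("id", "amount")):
    """Data quality check: report missing required values and duplicate ids."""
    issues = []
    seen_ids = set()
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) is None:
                issues.append(f"row {i}: missing {col}")
        row_id = row.get("id")
        if row_id is not None:
            if row_id in seen_ids:
                issues.append(f"row {i}: duplicate id {row_id}")
            seen_ids.add(row_id)
    return issues


def diff_report(old_rows, new_rows, column="amount", threshold=0.2):
    """Compare summary metrics between two runs and flag big relative swings."""

    def summarize(rows):
        values = [r[column] for r in rows if r.get(column) is not None]
        return {"row_count": len(rows), "mean": mean(values) if values else 0.0}

    old, new = summarize(old_rows), summarize(new_rows)
    report = {}
    for metric, base in old.items():
        current = new[metric]
        if base == 0:
            # Avoid dividing by zero: any move away from zero counts as infinite.
            change = 0.0 if current == 0 else float("inf")
        else:
            change = (current - base) / base
        report[metric] = {
            "old": base,
            "new": current,
            "change": change,
            "flagged": abs(change) > threshold,
        }
    return report


# Unit test for the helper; a test runner such as pytest would collect this.
def test_clean_amount():
    assert clean_amount(" 1,234.50 ") == 1234.5
```

The idea is that the helper gets an ordinary unit test, `quality_issues` runs against data going in and out of each job, and whoever signs off a change reads the `flagged` entries from `diff_report` rather than eyeballing the tables.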
My email is in my profile; I'm happy to discuss further! :-)