by FridgeSeal
1416 days ago
> Shift 1: “We know the lineage” to “We know what in god’s name is happening”

Bro, I can't even get my company to the _first_ part, and we're collectively already having issues with the second?

What is everyone else's read on this situation in general? Do you all have row- and table-level lineage for your data? For pipelines that people are actively using? Every company I've ever been at can hardly figure out where finance gets last year's "magical Excel sheet", let alone be close to a spot where they're actively using data lineage tools.

I also don't like Airflow, but for somewhat different reasons. I think it couples orchestration and transformation too tightly. I don't understand the desire to integrate everything with your actual runtime Python code; I think it's markedly the wrong level of abstraction/integration, and it limits your engineering capacity. There's undoubtedly some good engineering in it, it's come a long way, and it's mighty popular, but every time I look at a repo that uses it, the only read I get is "cross-cutting chaos".
|
In life sciences research supporting synthetic control arms, the FDA cares more about the lineage and manipulation of the data than about the data science models used to predict X/Y/Z.
I.e., what the data was originally, what it ended up as prior to ingestion into AI/ML, why it was changed, what steps were involved, etc.
There are not a ton of good out-of-the-box solutions for data lineage, and it's driving me nuts.
We have Apache NiFi, which promises data lineage out of the box and _appears_ to deliver. I've never implemented it, though.
We have Pachyderm, which has some support here, but I don't know much about it.
Beyond that, it appears to be roll-your-own.
I kind of wish there was an accepted best practice for data lineage, but it's, surprisingly, the wild west. And it's 100% required for industry use.
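
For what it's worth, here is roughly what the roll-your-own route tends to look like in its most minimal form. This is only a sketch, not how NiFi or Pachyderm do it: every name in it (`LineageLog`, `LineageStep`, `fingerprint`, the `site_a_export.csv` source) is made up for illustration, and it assumes the data is small enough to hash in memory as JSON-serializable rows. It records the questions listed above for each step: what the data was, what it ended up as, why it was changed, and which step did it.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def fingerprint(rows):
    """Stable SHA-256 over a list of JSON-serializable rows."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass
class LineageStep:
    name: str         # which step ran
    reason: str       # why the data was changed
    input_hash: str   # fingerprint of the data before the step
    output_hash: str  # fingerprint of the data after the step
    at: str           # UTC timestamp of when the step ran


@dataclass
class LineageLog:
    source: str                      # where the data originally came from
    steps: list = field(default_factory=list)

    def apply(self, rows, name, reason, transform):
        """Run one transformation and record what it did to the data."""
        out = transform(rows)
        self.steps.append(
            LineageStep(
                name=name,
                reason=reason,
                input_hash=fingerprint(rows),
                output_hash=fingerprint(out),
                at=datetime.now(timezone.utc).isoformat(),
            )
        )
        return out

    def to_json(self):
        """Serialize the whole trail so it can be stored next to the data."""
        return json.dumps(asdict(self), indent=2)


# Usage with hypothetical data: two steps before the rows go to a model.
log = LineageLog(source="site_a_export.csv")
rows = [{"id": 1, "age": 34}, {"id": 2, "age": None}]
rows = log.apply(
    rows,
    name="drop_missing_age",
    reason="age is required downstream",
    transform=lambda rs: [r for r in rs if r["age"] is not None],
)
rows = log.apply(
    rows,
    name="age_to_months",
    reason="model expects age in months",
    transform=lambda rs: [{**r, "age": r["age"] * 12} for r in rs],
)
print(log.to_json())
```

Because each step's input hash should equal the previous step's output hash, a break in that chain means the data was touched outside the log. That check is about all this approach buys you; it says nothing about row-level lineage.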