by aschwad 1336 days ago
Interesting initiative! Do I understand correctly that every push mechanism goes through the ODD API, and that pull mechanisms check the schema of the data sources? Do you already have a standard for providing ETL metadata? At what level of detail are you collecting this metadata?

At BMW, the data catalogue is growing continuously and the number of datasets is increasing rapidly. We therefore had a similar problem: finding out how datasets relate to each other and how they are transformed, so we needed coarse- and fine-grained data lineage. We found a way by leveraging the Spline Agent (https://github.com/AbsaOSS/spline) to make use of the execution plans, transforming them into a data model suited to our requirements, and developing a UI to explore these relationships. We also open-sourced our approach in a

- paper: https://link.springer.com/article/10.1007/s13222-021-00387-7

- and blog post: https://medium.com/@alex.schoenenwald/fishing-for-data-linea...


Thank you!

Actually, everything in ODD works on a push basis now. ODD Platform implements the ODD Specification (https://github.com/opendatadiscovery/opendatadiscovery-speci...), and all agents, custom scripts and integrations, Airflow/Spark listeners, etc. push metadata to a specific ODD Platform endpoint (https://github.com/opendatadiscovery/opendatadiscovery-speci...). ODD Collectors (agents) push metadata on a configurable schedule.
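
For readers who want a feel for the push model described above, here is a minimal sketch in Python. It is not ODD code: the `EntityMetadata` fields, the payload layout, and the `run_collector` helper are all hypothetical and do not reflect the actual schema in the ODD Specification. The transport is passed in as a function, so the sketch makes no assumption about the real endpoint.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Callable, List


@dataclass
class EntityMetadata:
    # Hypothetical fields for illustration only; NOT the ODD Specification schema.
    name: str
    kind: str
    inputs: List[str] = field(default_factory=list)   # upstream entities (entity-level lineage)
    outputs: List[str] = field(default_factory=list)  # downstream entities


def build_payload(source_id: str, entities: List[EntityMetadata]) -> str:
    """Serialize one batch of collected metadata as a JSON string."""
    body = {"source": source_id, "items": [asdict(e) for e in entities]}
    return json.dumps(body, sort_keys=True)


def run_collector(
    collect: Callable[[], List[EntityMetadata]],
    send: Callable[[str], bool],
    source_id: str,
    runs: int,
) -> int:
    """Collect and push once per run; return the number of successful pushes.

    A real collector would wait for its configured interval between runs;
    that wait is left out here so the sketch stays deterministic.
    """
    pushed = 0
    for _ in range(runs):
        payload = build_payload(source_id, collect())
        if send(payload):  # the collector initiates the transfer: push, not pull
            pushed += 1
    return pushed
```

In a real setup, `send` would presumably wrap an HTTP POST to the platform's ingestion endpoint; the point of the sketch is only that the collector initiates the transfer and the platform never polls the source.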

The ODD Specification is a standard for collecting and gathering such metadata, ETL included. We gather lineage metadata at the entity level now, but we plan to expand this to column-level lineage at the end of 2022 or the start of 2023. The Specification allows us to keep the system open, and it's really easy to write your own integration by looking at the format in which metadata needs to be injected into the Platform.

ODD Platform has its own OpenAPI specification (https://github.com/opendatadiscovery/odd-platform/tree/main/...), so the already indexed and layered metadata can be extracted via the platform's API.

Also, thank you for sharing the links with us! I'm thrilled to take a look at how BMW solved the problem of gathering lineage from Spark; that's something we are improving in our product right now.

I'm sorry, but if it's open source, where is the code? And if there is no code, please stop calling a blog post open source.

> open-sourced our approach

True, the code isn't, yet we figured that sharing the architecture, procedures and data model could be helpful for others too. IMO this is still a way of open-sourcing an architecture.

You can find the source code of our platform here: https://github.com/opendatadiscovery/odd-platform

We are 100% open source: not only the architecture but also the implementation.

The comment was for the GP who posted the blog and paper link. It was not for open data discovery, sorry if there was any confusion. Excellent work at Open Data Discovery. I intend to try out this software.
It is useful for sure, but open source has a very specific meaning: sharing the source code. It's always useful to share any and all relevant information, but calling it open source muddies the water.