| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ryan_green 1032 days ago

apologies for not getting into more detail--wanted to start by covering things at a high level. There are a few key concepts that might be helpful. * data state - this is contents of both your data and metadata at a given point in time. if your data doesn't fit into a single database, this can be difficult to manage. We use this technology to help us: https://lakefs.io/ * logical state - this is everything you use in processing the data in your pipeline (i.e. code, config, info for connecting external services, etc.). This can all reside in git

We found the key was associating our logical state (git branch) with our logical state (lakefs branch). We make this association during our branch deployment process.

Let me know if this helps at all. I was planning to write a follow up post about what we learned about managing the logical state of a data pipeline. If you have suggestions for a different topic to dive into, I'd love to hear about it.