by georgewfraser
2881 days ago
Data pipelines are not a great subject for an open-source project. We've been building these for the last 3+ years at Fivetran, and I can tell you that the challenge is:
- Studying each source to figure out the right data model
- Chasing down a million weird corner cases
- Working around dumb bugs in the data sources
This is the kind of problem where paying for software really works better. When people build data pipelines in-house, they tend to hack at it until it works for their use case and then stop. When we build data pipelines, we map out every feature of the data source, implement the whole thing at once, and then put it through a beta period with multiple real users. This is easy to do when you have a tight-knit dev team; much harder for a group of part-time open-source contributors.
|
Personally I work as a "lone wolf" (much to my own complaint) because I'm at a small company that can't afford a big team. Most of my (ETL) transforms are done in SQL, which happens to be pretty standardized compared with many of the ETL products I've seen so far.
This solution is probably far from ready, but I find the approach quite interesting, because it looks like a code-based ETL that uses SQL for transforms (so I might be biased). Overall this might result in a more maintainable and versionable data pipeline model than GUI-first ETL, which usually generates spaghetti code. Because you are regularly forced to adapt data pipelines to unstable external inputs, being able to easily diff an ETL process would be a blessing.
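To illustrate the diffing point, here is a minimal sketch, assuming the transforms live as plain SQL text under version control. The table name, the columns, and the file name `users.sql` are made up for the example and do not come from any particular tool; it only uses Python's standard `difflib`.

```python
import difflib

# Hypothetical "before" version of a SQL transform (names are illustrative).
OLD_TRANSFORM = """\
SELECT id,
       email
FROM raw_users
"""

# Hypothetical "after" version, adapted to a change in the upstream source.
NEW_TRANSFORM = """\
SELECT id,
       lower(email) AS email
FROM raw_users
WHERE deleted_at IS NULL
"""


def diff_transforms(old: str, new: str) -> list[str]:
    """Return a unified diff of two SQL transform texts, one line per entry."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="users.sql (old)",
            tofile="users.sql (new)",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Print the change the way a code review tool would show it.
    print("\n".join(diff_transforms(OLD_TRANSFORM, NEW_TRANSFORM)))
```

Because the transform is just text, the change to the pipeline shows up as a couple of added and removed lines, which is hard to get out of a GUI-first tool that stores its flows in an opaque format.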