by rubenfiszel
1292 days ago
I am working on something adjacent to this problem. We focus less on data pipelines and more on automation, but in the end we also have an abstraction for flows that one can use to build data pipelines. The lock-in issue is something I thought a lot about, and I ended up deciding that our generic steps should just be plain code in TypeScript/Python/Go/Bash; the only requirement is that those code snippets have a main function and return a result. We built https://hub.windmill.dev, where users can share their scripts directly, and we have a team of moderators to approve the ones to integrate directly into the main product. The goal with those snippets is that they are generic enough to be reusable outside of Windmill, and the Python ones might even work straight out of the box for Orchest. nb: author of https://github.com/windmill-labs/windmill
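To make the "plain code with a main function" idea concrete, here is a minimal sketch of what such a step could look like in Python. The only contract assumed is the one stated above (a `main` function that returns a result); the parameter names, the row shape, and the summary fields are hypothetical and not taken from Windmill or Orchest.

```python
# Hypothetical step: the only contract assumed is "a main function that returns a result".
from typing import Any


def main(rows: list[dict[str, Any]], min_amount: float = 0.0) -> dict[str, Any]:
    """Keep rows whose amount is at least min_amount and return a summary."""
    kept = [row for row in rows if row["amount"] >= min_amount]
    return {
        "count": len(kept),
        "total": sum(row["amount"] for row in kept),
        "rows": kept,
    }


if __name__ == "__main__":
    # The step has no orchestrator imports, so it also runs as a plain script.
    sample = [{"id": 1, "amount": 5.0}, {"id": 2, "amount": 12.5}]
    print(main(sample, min_amount=10.0))
```

Because the function takes plain values and returns a plain dict, nothing in it ties the step to one tool, which is the portability argument being made here.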
|
I’ve been leaning towards this direction. I think I/O is the biggest part that still needs fixing in the case of plain code steps: the input being data/streams plus parameterization/config, and the output being some sort of typed data/stream.
My “let’s not reinvent the wheel” alarm is going off when I write that though. Examples that come to mind are text-based (Unix / https://scale.com/blog/text-universal-interface) but also the Singer tap protocol (https://github.com/singer-io/getting-started/blob/master/doc...). And config obviously has many standard forms, like INI, YAML, JSON, environment key-value pairs, and more.
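For a feel of what a text-based, typed stream looks like, here is a rough sketch in the style of the Singer protocol: one JSON message per line, with `SCHEMA`, `RECORD`, and `STATE` message types. The `emit` and `run_tap` helpers, the `users` stream, its schema, and the `last_id` bookmark are all made up for illustration; this is not an official Singer library API.

```python
import io
import json
import sys
from typing import Any, Iterable, TextIO


def emit(message: dict[str, Any], out: TextIO = sys.stdout) -> None:
    """Write one message as a single line of JSON."""
    out.write(json.dumps(message) + "\n")


def run_tap(rows: Iterable[dict[str, Any]], stream: str, out: TextIO = sys.stdout) -> None:
    """Emit a schema, then one record per row, then a state bookmark."""
    schema = {
        "type": "object",
        "properties": {"id": {"type": "integer"}, "name": {"type": "string"}},
    }
    emit({"type": "SCHEMA", "stream": stream, "schema": schema, "key_properties": ["id"]}, out)
    last_id = None
    for row in rows:
        emit({"type": "RECORD", "stream": stream, "record": row}, out)
        last_id = row["id"]
    # Hypothetical bookmark shape; the state value is free-form JSON.
    emit({"type": "STATE", "value": {"last_id": last_id}}, out)


if __name__ == "__main__":
    run_tap([{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}], stream="users")
```

A downstream step only needs to read lines from stdin and dispatch on `type`, so the same output can be piped between tools in the Unix fashion.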
At the same time, text feels horribly inefficient as an encoding for some of the data objects being passed around in these flows. More specialized and optimized binary formats come to mind (Arrow, HDF5, Protobuf).
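A quick way to see the size gap is to encode the same column of floats as JSON text and as packed 64-bit doubles. This sketch uses Python's standard `struct` module as a simple stand-in for a binary format; it is not Arrow, HDF5, or Protobuf, and the sample data is arbitrary.

```python
import json
import math
import struct

# Arbitrary sample column: 1000 floats with long decimal representations.
values = [math.sin(i) for i in range(1000)]

# Text encoding: a JSON array of decimal strings.
text_bytes = json.dumps(values).encode("utf-8")

# Binary encoding: little-endian 64-bit doubles, 8 bytes per value.
binary_bytes = struct.pack(f"<{len(values)}d", *values)

# Both encodings round-trip to the same floats.
from_text = json.loads(text_bytes)
from_binary = list(struct.unpack(f"<{len(values)}d", binary_bytes))

if __name__ == "__main__":
    print(f"text:   {len(text_bytes)} bytes")
    print(f"binary: {len(binary_bytes)} bytes")
```

Real columnar formats add schemas, nulls, and zero-copy reads on top of this, so the trade-off is bigger than raw size, but the fixed 8 bytes per value already shows why text gets expensive for numeric data.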
Plenty of directions to explore, each with its own advantages and disadvantages. I wonder which direction is favored by users of tools like ours. It would be good to poll them (do they even care?).
PS Windmill looks equally impressive! Nice job.