| I definitely agree that there is a pattern to most data pipelines: (1) read from an input (source), (2) perform some sort of processing, (3) write the data to some output (sink). This may be either batch or continuous (stream). The inputs may change, and the outputs may change. I personally think that SQL and DuckDB are well positioned to do this. SQL is declarative, standardized, and has decades' worth of mature implementations. The “source” can be modeled as a table. The “sink” can also be modeled as a table. What does a custom DSL provide over SQL? I have a side project called Sqlflow which is attempting to do something similar: https://github.com/turbolytics/sql-flow It’s not a DSL, but the pipeline is standardized using the source, process, and sink stages.
Right now the process stage is pure SQL, while the source and sink are declarative. SQL has so much prior art, linters, and a huge ecosystem with many practitioners. |
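To make the "source and sink are both tables" idea concrete, here is a minimal sketch of the source, process, sink shape. It is not Sqlflow's actual API or configuration. It uses Python's standard-library `sqlite3` as a stand-in for DuckDB so that it runs without extra dependencies, and the function name `run_pipeline`, the table names `source` and `sink`, and the columns are all hypothetical.

```python
import sqlite3


def run_pipeline(rows):
    """Source -> process -> sink, with each stage modeled as a table.

    sqlite3 stands in for DuckDB here; all table and column names are
    made up for illustration.
    """
    conn = sqlite3.connect(":memory:")

    # Source: incoming records land in a table.
    conn.execute("CREATE TABLE source (city TEXT, amount REAL)")
    conn.executemany("INSERT INTO source VALUES (?, ?)", rows)

    # Sink: the output is just another table.
    conn.execute("CREATE TABLE sink (city TEXT, total REAL)")

    # Process: plain SQL that reads from the source and writes to the sink.
    conn.execute(
        """
        INSERT INTO sink
        SELECT city, SUM(amount) AS total
        FROM source
        GROUP BY city
        """
    )

    result = conn.execute(
        "SELECT city, total FROM sink ORDER BY city"
    ).fetchall()
    conn.close()
    return result
```

Swapping the input or the output then means changing what backs the `source` or `sink` table, while the SQL in the middle stays the same.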
Can we unify those worlds? If your project, Sqlflow, manages to let folks stay mostly in SQL while also elegantly handling side effects, that might be a huge step forward. For strictly data-focused workflows, I’m 100% on board that SQL alone is often the best "DSL" around. The complexity creeps in when we go from "write results to a table" to "call an external system" (possibly with partial commits, retries, or streaming needs). That’s usually where we end up rolling bespoke logic. If Sqlflow can bridge that gap, it’d be awesome. I’ll check it out. Thanks for sharing.
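As a rough illustration of the bespoke logic that tends to appear once the sink is an external system rather than a table, here is a small sketch of a sink with retries that reports partial delivery. Everything in it is hypothetical: `deliver_with_retries`, the `send` callable, and the retry parameters are made-up names, not part of Sqlflow or any library.

```python
import time


def deliver_with_retries(rows, send, max_attempts=3, backoff_seconds=0.0):
    """Send each row to an external system, retrying on failure.

    Returns (delivered, failed) so that a partial commit is visible to
    the caller instead of being silently swallowed.
    """
    delivered, failed = [], []
    for row in rows:
        for attempt in range(1, max_attempts + 1):
            try:
                send(row)  # side effect: call into the external system
                delivered.append(row)
                break
            except Exception:
                # Broad catch for the sketch; real code would catch only
                # the errors that are known to be retryable.
                if attempt == max_attempts:
                    failed.append(row)
                else:
                    time.sleep(backoff_seconds * attempt)
    return delivered, failed
```

None of this is expressible as "insert into a table", which is why it usually ends up outside the SQL.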