by ricklamers 1302 days ago
First want to say congrats to the Patterns team for launching a gorgeous looking tool. Very minimal and approachable. Massive kudos!

Disclaimer: we're building something very similar and I'm curious about a couple of things.

One of the questions our users have asked us often is how to minimize the dependence on "product specific" components/nodes/steps. For example, if you write CI for GitHub Actions you may use a bunch of GitHub Action references.

Looking at the `graph.yml` in some of the examples you shared you use a similar approach (e.g. patterns/openai-completion@v4). That means that whenever you depend on such components your automation/data pipeline becomes more tied to the specific tool (GitHub Actions/Patterns), effectively locking in users.

How are you helping users feel comfortable with that problem (I don't want to invest in something that's not portable)? It's something we've struggled with ourselves as we're expanding the "out of the box" capabilities you get.

Furthermore, would have loved to see this as an open source project. But I guess the second best thing to open source is some open source contributions and `dcp` and `common-model` look quite interesting!

For those who are curious, I'm one of the authors of https://github.com/orchest/orchest

2 comments

Yes, great point, we share that concern. All of our components (patterns/openai-completion@v4) are open-source and can be downloaded and "dehydrated" into your Patterns app. They all use the same public API available to all apps.

We're working towards a fully open-source execution engine for Patterns -- we want people to invest with full confidence in a long-term ecosystem. For us, sequencing meant dialing in the end-to-end UX and then taking those learnings to build the best framework and ecosystem with a strong foundation. Stay tuned!

Thank you for the kind words and congrats on the great work on Orchest!

I am working on something adjacent to this problem. We focus much less on data pipelines and more on automation, but in the end we also have an abstraction for flows that one can use to build data pipelines. The lock-in issue was something I thought a lot about, and I ended up deciding that our generic steps should just be plain code in TypeScript/Python/Go/Bash; the only requirement is that those code snippets have a main function and return a result. We built https://hub.windmill.dev, where users can share their scripts directly, and we have a team of moderators who approve the ones to integrate directly into the main product. The goal with those snippets is that they are generic enough to be reusable outside of Windmill, and the Python ones might even work straight out of the box in Orchest.
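To make the "plain code with a main function" idea concrete, here is a minimal sketch of what such a step could look like in Python. The parameter names and the returned fields are made up for illustration; the only convention taken from the comment above is "have a main, return a result".

```python
# Hypothetical step: it imports nothing tool-specific, so any runner
# (or a plain Python shell) can call it.
def main(name: str, repeat: int = 1) -> dict:
    """Typed parameters in, a plain JSON-friendly result out."""
    greeting = ", ".join([f"hello {name}"] * repeat)
    return {"greeting": greeting, "length": len(greeting)}


if __name__ == "__main__":
    # Outside any orchestrator the step is just a function call.
    print(main("orchest", repeat=2))
```

Since nothing in the file refers to the tool that runs it, moving the step elsewhere should mostly come down to how the arguments get passed in.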

nb: author of https://github.com/windmill-labs/windmill

Thanks for chipping in.

I’ve been leaning in this direction. I think I/O is the biggest part that still needs fixing in the case of plain code steps: the input being data/stream plus parameterization/config, and the output being some sort of typed data/stream.

My “let’s not reinvent the wheel” alarm is going off when I write that, though. Examples that come to mind are text-based interfaces (Unix / https://scale.com/blog/text-universal-interface) but also the Singer tap protocol (https://github.com/singer-io/getting-started/blob/master/doc...). And config obviously has many standard forms: ini, yaml, json, environment key-value pairs, and more.

At the same time, text feels horribly inefficient as encoding for some of the data objects being passed around in these flows. More specialized and optimized binary formats come to mind (Arrow, HDF5, Protobuf).
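A rough way to see the size gap, using only the standard library. This is a sketch: `struct` stands in for a real binary format here, and Arrow, HDF5 and Protobuf each have their own layout and overhead, so treat the numbers as an illustration and not a benchmark.

```python
import json
import struct

# 1000 floats, most of which need ~17 significant digits when written as text.
values = [i / 7 for i in range(1000)]

# Text encoding: every number is spelled out as decimal characters.
as_text = json.dumps(values).encode("utf-8")

# Binary encoding: a fixed 8 bytes per float64, little-endian, no separators.
as_binary = struct.pack(f"<{len(values)}d", *values)

# Both encodings round-trip to the same values.
decoded = list(struct.unpack(f"<{len(values)}d", as_binary))

if __name__ == "__main__":
    print(len(as_text), "bytes as JSON text")
    print(len(as_binary), "bytes as packed float64")
```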

Plenty of directions to explore, each with its own advantages and disadvantages. I wonder which direction is favored by users of tools like ours. It would be good to poll them (do they even care?).

PS Windmill looks equally impressive! Nice job

Yes, inputs/outputs are likely the most interesting problem for our differing flow specs.

Because data pipelines are not the primary concern of Windmill, we took the stance that the inputs and outputs of steps are simply JSON in, JSON out. For all the languages, we extract the JSON object into the different parameters of main, and then we wrap the return value with the language's native serializer for the output (e.g. JSON.stringify in TypeScript). Each step can then use a JavaScript expression executed by V8 to do some lightweight transformation from the output of any step to the input of that step.
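A minimal sketch of that JSON in, JSON out contract, in Python. This is not Windmill's actual code: `run_step`, `run_flow` and the two example steps are hypothetical names, and a Python callable stands in for the JavaScript transform expression.

```python
import json


def main(a: int, b: int = 2) -> dict:
    return {"sum": a + b}


def double(x: int) -> dict:
    return {"doubled": x * 2}


def run_step(fn, args_json: str) -> str:
    """Spread a JSON object over fn's parameters and serialize the result."""
    args = json.loads(args_json)
    if not isinstance(args, dict):
        raise TypeError("step input must be a JSON object")
    return json.dumps(fn(**args))


def run_flow(steps, flow_input: str) -> str:
    """steps is a list of (fn, transform) pairs.

    transform maps the previous step's output (a dict) to this step's
    arguments; None means "pass the payload through unchanged".
    """
    payload = flow_input
    for fn, transform in steps:
        if transform is not None:
            payload = json.dumps(transform(json.loads(payload)))
        payload = run_step(fn, payload)
    return payload


if __name__ == "__main__":
    flow = [(main, None), (double, lambda prev: {"x": prev["sum"]})]
    print(run_flow(flow, '{"a": 1, "b": 4}'))
```

The steps themselves never see JSON; only the runner does, which is what keeps them plain functions.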

A lot of the simplification comes from parsing the main function's parameters into the corresponding JSON Schema, supporting deeply nested objects where relevant.
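For the Python case, deriving a schema from the signature could look roughly like the sketch below. It is an assumption-heavy illustration, not how Windmill does it: `type_to_schema` and `main_to_schema` are hypothetical helpers, only a handful of types are mapped, and dataclasses are used here as one possible way to express nested objects.

```python
import dataclasses
import inspect
import typing

_PRIMITIVES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def type_to_schema(tp) -> dict:
    """Map a Python annotation to a (partial) JSON Schema fragment."""
    if tp in _PRIMITIVES:
        return {"type": _PRIMITIVES[tp]}
    if dataclasses.is_dataclass(tp) and isinstance(tp, type):
        hints = typing.get_type_hints(tp)
        fields = dataclasses.fields(tp)
        return {
            "type": "object",
            # Recurse so nested dataclasses become nested objects.
            "properties": {f.name: type_to_schema(hints[f.name]) for f in fields},
            "required": [
                f.name
                for f in fields
                if f.default is dataclasses.MISSING
                and f.default_factory is dataclasses.MISSING
            ],
        }
    origin = typing.get_origin(tp)
    if origin is list:
        (item,) = typing.get_args(tp) or (typing.Any,)
        return {"type": "array", "items": type_to_schema(item)}
    if tp is list:
        return {"type": "array"}
    if tp is dict or origin is dict:
        return {"type": "object"}
    return {}  # unknown or missing annotation: accept anything


def main_to_schema(fn) -> dict:
    """Build an object schema from fn's parameters and defaults."""
    hints = typing.get_type_hints(fn)
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        schema = type_to_schema(hints.get(name))
        if param.default is inspect.Parameter.empty:
            required.append(name)
        else:
            schema["default"] = param.default
        properties[name] = schema
    return {"type": "object", "properties": properties, "required": required}


@dataclasses.dataclass
class Target:
    host: str
    port: int


def main(target: Target, tags: list[str], retries: int = 3) -> dict:
    return {"host": target.host, "tags": tags, "retries": retries}
```

Calling `main_to_schema(main)` gives an object schema with `target` and `tags` required and `retries` defaulting to 3, which is enough to validate inputs or render a form.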

That works great for automations that do not have big inputs/outputs, but not for data. So what we do for data is use a folder that we symlink so it is shared by all steps if a specific flag for that flow is set. It also forces us to have the same worker process all the steps inside that flow, when otherwise flow steps could have been processed by any worker. It is very fast since it's all local filesystem, but not super scalable.
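A toy version of the shared-folder idea, to show the shape of it. All names here (`produce`, `consume`, `run_flow_with_shared_dir`, the `shared` link) are hypothetical, and it assumes a platform where creating symlinks is allowed. Bulky data lives in the folder; only a small JSON-friendly result travels between steps.

```python
import os
import tempfile
from pathlib import Path


def produce(shared: Path) -> dict:
    # Write the bulky data to the shared folder, return only its name.
    path = shared / "rows.csv"
    path.write_text("id,value\n1,10\n2,32\n")
    return {"file": path.name}


def consume(shared: Path, file: str) -> dict:
    # Read what the previous step left behind, skipping the header row.
    lines = (shared / file).read_text().splitlines()[1:]
    return {"total": sum(int(line.split(",")[1]) for line in lines)}


def run_flow_with_shared_dir() -> dict:
    with tempfile.TemporaryDirectory() as root_name:
        root = Path(root_name)
        data = root / "flow_data"
        data.mkdir()
        result: dict = {}
        for i, step in enumerate([produce, consume]):
            # Each step gets its own working directory with a "shared"
            # symlink pointing at the same flow-wide folder.
            workdir = root / f"step_{i}"
            workdir.mkdir()
            link = workdir / "shared"
            os.symlink(data, link, target_is_directory=True)
            result = step(link, **result)
        return result


if __name__ == "__main__":
    print(run_flow_with_shared_dir())
```

Because the folder is on local disk, both steps have to run on the same machine, which is the same-worker constraint described above.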

I am not pleased with that solution and believe that if we were to expand on the data problem, we would certainly rely on a fast network and HDFS/Amazon EFS/etc. to simply share that mounted folder across the network.

Anyway, sorry for the rambling, but I do feel like we're all taking different approaches to the same underlying problem of building the best abstraction for flows, and I believe we might learn from each other's choices.

ps: congrats Patterns on the launch, the tool looks absolutely amazing.