Just out of curiosity, it seems like the process function which you define has to run remotely on workers. How does it get serialized? Are there limitations to the process function due to serialization?
Great question! We actually looked at using the workflow abstraction for batch processing in our runner, but ultimately didn't because it was still in alpha (we use the dataset API for batch flows).
I think one area where we differ is our focus on stream processing, which I don't think is well supported by the workflow abstraction. We also put more emphasis on resource management and use-case-driven IO.
Makes a ton of sense! I was present at the demo for this at last year's Ray conference and I definitely got the sense that a lot of the orchestration details were still being thought through, and that it was not yet a first-class streaming product.
Definitely like seeing more streaming-focused orchestration tools out there. It's a growing niche without enough alternatives to Beam.
I think the most common limitation will be ensuring that your output is serializable. Typically, returning Python dictionaries or dataclasses is fine.
But if you had a specific limitation in mind, let me know. Happy to dive into it!
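To make the "is your output serializable" point concrete, here is a rough sketch. It is not the runner's actual code: it uses the standard library's `pickle` as a stand-in for whatever serializer is used under the hood (the thread doesn't say), and `Result`, `process`, and `is_picklable` are made-up names for illustration.

```python
import pickle
import threading
from dataclasses import dataclass


@dataclass
class Result:
    # Hypothetical output type; plain fields like these pickle cleanly.
    user_id: int
    score: float


def is_picklable(obj) -> bool:
    """Return True if obj survives a pickle round trip."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False


def process(payload: dict) -> Result:
    # Hypothetical process function: takes a dict, returns a dataclass.
    return Result(user_id=payload["user_id"], score=payload["value"] * 2.0)


out = process({"user_id": 7, "value": 1.5})

ok_dataclass = is_picklable(out)                              # dataclass output
ok_dict = is_picklable({"user_id": 7, "score": 3.0})          # plain dict output
bad_lock = is_picklable({"lock": threading.Lock()})           # holds an OS-level lock

print(ok_dataclass, ok_dict, bad_lock)
```

The usual offenders are outputs that carry live resources (locks, open sockets, database connections). Returning plain data and re-creating the resource on the worker side tends to avoid the problem.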