Hacker News new | ask | show | jobs
by KptMarchewa 2135 days ago
Why not use Apache Flink?
1 comments

While Flink is a fully-featured stream processing framework I think there's some notable differences. Off the top of my mind:

- Flink uses Zookeeper for metadata and coordination, Jet doesn't require any external systems for resilience.

- Flink uses RocksDB and HDFS for checkpointing/snapshotting, Jet stores it in distributed, replicated in-memory store.

- Flink allocates operators to slots, while Jet uses green threads/cooperative multi-threading. This means you can run many concurrent streaming jobs on the same cluster, with very low overhead.

- Jet is basically a single, self-contained JAR. It's all you need to run a production-grade service (+ some connectors, if you'd like)

- Jet can scale up/down with very little friction. You start a couple of processes and they will form a cluster automatically. Kill a couple of the processes, and the cluster goes on.

That said, Flink have a great set of overall features, especially around persistence and huge states. This is another area we're currently investing in as well as SQL support.

> Flink allocates operators to slots, while Jet uses green threads/cooperative multi-threading. This means you can run many concurrent streaming jobs on the same cluster, with very low overhead.

How does the shift to cooperative multi-threading change the way that the cluster is used? In the "slot" approach, Alice and Bob can run concurrent jobs with relatively little coordination needed to "share" effectively -- e.g. they might use different branches of the same shared repo. In exchange for the lower-overhead, does Jet's approach require that multiple use cases are more carefully planned?

This is indeed a question that we get asked a lot. We have so far not though about adding more advanced scheduling capabilities for the cooperative threads. With the slot system, if you have 48 core available in the cluster, and running 8 jobs, each job will only use 6 cores each. With cooperative threading, each job runs on all the 48 cores. We have tested something like 5,000 concurrent jobs on same cluster, but essentially they may be competing for the same resources, so you'll need to do your capacity planning accordingly. Simple way to work around that would be to create separate Jet nodes (a Jet node is very lightweight) so you could have separate execution pools.
I’m not really sure how to imagine SQL support for something like this. Can you point me anywhere that will give me a better idea?
It's not very different than normal SQL. Imagine that instead of a finite result set, you instead have a never ending result set. You can also roughly map operations like windowed aggregation into SQL with some additional syntax. This paper gives a pretty good overview, even though we don't fully agree with the model presented here: https://arxiv.org/abs/1905.12133