| Of note, Java !== JVM. Spark, for instance, is written mostly in Scala, and Flink partly so; Scala is alive and well :). My best effort at finding replacements for those tools that don't rely on the JVM: HDFS: Any cloud object store like S3 or Azure Blob Storage, really. In some workloads the data locality provided by HDFS may be important. Alluxio can help here (but I'm cheating, it's a JVM product). Spark: Different approach, but you could use Dask, Ray, or dbt plus any analytical SQL database like ClickHouse. If you're in the cloud and are not processing tens of TB at a time, spinning up an ephemeral HUGE VM and using something in-memory like DuckDB, Polars or DataFrames.jl can be much faster. YARN: Kubernetes Jobs. Period. At this point I don't see any advantage to YARN, including for running Spark workloads. Hive: Maybe ClickHouse for a SQL-like experience. Faster, but likely not at the same scale. Storm/Flink/Cassandra: no clue. My preferred "modern" FOSS stack (for many reasons) is Python-based, with the occasional Julia/Rust thrown in. For medium scale (i.e., a few TB of daily ingestion), I would go with: Kubernetes + Airflow + ad-hoc Python jobs + Polars + huge ephemeral VMs. |
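
To make the "few TB daily on one huge VM" idea a bit more concrete, here is a rough back-of-the-envelope sizing sketch in plain Python. Every number in it is an assumption picked for illustration (3 TB per day, a hypothetical VM with 4 TB of RAM, 5 GB/s aggregate read throughput, half of RAM treated as usable), and the function names are made up for this example; none of it comes from a real benchmark.

```python
# Rough sizing for a "few TB per day" batch workload on one large VM.
# All figures below are illustrative assumptions, not measurements.

TB = 10**12  # decimal terabyte, in bytes
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_mb_per_s(daily_bytes: int) -> float:
    """Average ingest rate (MB/s) needed to keep up with the daily volume."""
    return daily_bytes / SECONDS_PER_DAY / 10**6


def batch_minutes(daily_bytes: int, scan_bytes_per_s: float) -> float:
    """Wall-clock minutes to scan one day of data at a given throughput."""
    return daily_bytes / scan_bytes_per_s / 60


def fits_in_memory(daily_bytes: int, vm_ram_bytes: int, headroom: float = 0.5) -> bool:
    """True if the day's data fits in the usable fraction of the VM's RAM."""
    return daily_bytes <= vm_ram_bytes * headroom


daily = 3 * TB          # assumed: "a few TB" taken as 3 TB
vm_ram = 4 * TB         # assumed: RAM of a hypothetical very large VM
scan_rate = 5 * 10**9   # assumed: 5 GB/s aggregate read throughput

print(f"sustained ingest: {sustained_mb_per_s(daily):.1f} MB/s")
print(f"one-day scan:     {batch_minutes(daily, scan_rate):.0f} min")
print(f"fits in memory:   {fits_in_memory(daily, vm_ram)}")
```

With these assumed numbers, the average ingest rate is modest (about 35 MB/s) and a full scan of one day takes roughly ten minutes, but the whole day does not fit in half of the VM's RAM, so a job would work through it in partitions rather than load it all at once. Change the three assumed inputs to match your own workload before drawing any conclusion.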