by cjalmeida 1365 days ago
Of note, Java !== JVM. Spark, for instance, is written mostly in Scala (and Flink partly so), which is alive and well :).

My best effort at finding replacements for those tools that don't rely on the JVM:

HDFS: Any cloud object store like S3 or Azure Blob Storage, really. In some workloads the data locality provided by HDFS may be important. Alluxio can help here (but I'm cheating, it's a JVM product).

Spark: Different approach, but you could use Dask, Ray, or dbt plus any analytical SQL DB like ClickHouse. If you're in the cloud and aren't processing tens of TB at a time, spinning up a HUGE ephemeral VM and using something in-memory like DuckDB, Polars or DataFrames.jl is often much faster.
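To make the single-VM idea concrete, here's a minimal sketch of an in-process SQL aggregation with no cluster involved. It uses the standard library's `sqlite3` purely as a stand-in so it runs without extra dependencies; it is not DuckDB or Polars, and it says nothing about their performance. The table, the column names and the sample rows are all made up for illustration.

```python
import sqlite3

# Hypothetical event rows: (user_id, bytes_ingested). Real data would be read
# from object storage; these values are invented for the example.
ROWS = [("a", 100), ("b", 250), ("a", 50), ("c", 75), ("b", 25)]


def totals_per_user(rows):
    """Aggregate entirely in-process, inside one in-memory database."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE events (user_id TEXT, bytes_ingested INTEGER)")
        con.executemany("INSERT INTO events VALUES (?, ?)", rows)
        cur = con.execute(
            "SELECT user_id, SUM(bytes_ingested) FROM events "
            "GROUP BY user_id ORDER BY user_id"
        )
        return cur.fetchall()
    finally:
        con.close()


if __name__ == "__main__":
    print(totals_per_user(ROWS))
```

The shape is the point: load the data, run one query, get rows back, all in a single process on a single machine. An in-memory analytical engine would slot into the same place.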

YARN: Kubernetes Jobs. Period. At this point I don't see any advantage to YARN, even for running Spark workloads.
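For anyone who hasn't used Kubernetes Jobs, here's a rough sketch of what a run-to-completion batch job looks like, built as a plain Python dict and printed as JSON. The helper `make_job_manifest`, the job name, the image and the command are all hypothetical; the field names (`batch/v1`, `backoffLimit`, `ttlSecondsAfterFinished`, `restartPolicy`) are standard Job fields.

```python
import json


def make_job_manifest(name, image, command, backoff_limit=2, ttl_seconds=3600):
    """Build a batch/v1 Job manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Retry a failed pod at most this many times.
            "backoffLimit": backoff_limit,
            # Let the cluster garbage-collect the finished Job after this long.
            "ttlSecondsAfterFinished": ttl_seconds,
            "template": {
                "spec": {
                    # Jobs run to completion, so pods must not restart forever.
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": list(command)}
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Image and command are placeholders, not a real registry or script.
    manifest = make_job_manifest(
        "nightly-etl", "example.com/etl:latest", ["python", "etl.py"]
    )
    print(json.dumps(manifest, indent=2))
```

Since `kubectl` accepts JSON as well as YAML, the output can be piped straight into `kubectl apply -f -`.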

Hive: Maybe ClickHouse for a SQL-like experience. Faster, but likely not at the same scale.

Storm/Flink/Cassandra: no clue.

My preferred "modern" FOSS stack (for many reasons) is Python-based, with the occasional Julia/Rust thrown in. For medium scale (i.e., a few TB of daily ingestion), I would go with:

Kubernetes + Airflow + ad-hoc Python jobs + Polars + Huge ephemeral VMs.

1 comment

There's ScyllaDB as a replacement for Cassandra. https://www.scylladb.com/