| HN Mirror


	by madenine 2428 days ago
	depends what you’re doing. For querying large datasets? 100% with you. For data cleaning, processing, analytics, ML on decently large datasets? Spark wins out

2 comments

missosoup 2428 days ago

What does spark win at exactly?

Dask+Perfect is a much better experience all round including perf, with virtually none of the cluster management hell involved.

sandGorgon 2428 days ago

Could you talk about Prefect? We are in the process of moving from Spark to Dask. I have never heard of Prefect. What do you use it for?

missosoup 2428 days ago

tl;dr we use it for a similar set of tasks that one would use Airflow for.

Unlike Airflow, this lends itself to microbatching and streaming. Plus a bunch of housekeeping items ticked off that Airflow never got around to. With a bit of devops engineering time, you can have Prefect manage the size of your worker cluster on k8s and scale it up/down with ingest demand, etc.
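
For anyone unfamiliar with the term, here is a rough sketch of what microbatching means, in plain Python. It is not Prefect or Dask code; the `micro_batches` and `process` names are made up for illustration, and `process` stands in for whatever task an orchestrator would schedule per batch.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(stream: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to `size` items from `stream`."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(stream)
    while True:
        # Pull at most `size` items; an empty list means the stream is done.
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def process(batch: List[int]) -> int:
    """Hypothetical per-batch task; here it just sums the batch."""
    return sum(batch)


# Ten records arrive; they are handled four at a time instead of all at once.
results = [process(b) for b in micro_batches(range(10), 4)]
print(results)
```

The point is that each small batch is an independent unit of work, so a scheduler can hand batches to however many workers happen to be available.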

I'll say one thing though. The Prefect website used to be a lot more technical and explicit about what it is and isn't. Now it's mostly sales gobbledegook. Maybe not a good sign. I've seen this happen before with Dremio.

sandGorgon 2427 days ago

This is super interesting!

Do you run Dask on k8s? I have been concerned that Dask does not leverage the Kubernetes HPA (Horizontal Pod Autoscaler) for autoscaling, but instead chooses to run an external scheduler.

How has your experience been ?

ai_ja_nai 2428 days ago

Very interesting. Can't find references to "Perfect", though; could you please point to a link?

missosoup 2428 days ago

https://www.prefect.io

Not the most SEO-friendly choice of name. Great product though.

vamega 2427 days ago

Are you using their cloud product? The core/open source product doesn't have a way to persist schedule data.

maximente 2428 days ago

Looks like Dask is Python-only, so it's a nonstarter (loser) for already existing JVM code that runs on Spark.

missosoup 2428 days ago

Spark stacks inevitably end up with PySpark though. It's rework for people who already committed to Spark, sure. And for bigger projects that committed to Spark this change isn't justifiable. But for a greenfield project, choosing Spark is just silly today.

truth_seeker 2428 days ago

Say, for example, I am using PostgreSQL 12 + the CitusDB extension.

Data cleaning -> PL/pgSQL and various built-in functions for transforming the data (or a new UDF, if required at all)

Processing -> PostgreSQL parallel processing on the local node and the CitusDB extension for distributed computing and sharding

Analytics -> Many options here. Materialized views OR triggers OR streaming computation with the PipelineDB extension OR logical replication for stream computation

ML -> PG supports a variety of statistical functions. It also supports the PL/R and PL/Python extensions to interface with ML libraries.

Also, there are various kinds of Foreign Data Wrappers supported by PG - https://wiki.postgresql.org/wiki/Foreign_data_wrappers

missosoup 2428 days ago

Yeah that's not going to work for what people call analytics workloads today.

PG is great but it's not suitable to be a feature store and sure as hell not suitable to fan out ML workloads. In a modern ML stack, PG might play the role of the slow but reliable master store that the rest of the ML pipeline feeds off.

riku_iki 2428 days ago

> hell not suitable to fan out ML workloads

Depends on the scale? Not everyone processes petabytes of data.

> PG might play the role of the slow

Do you have any benchmark on hand to support this? I believe highly optimized C code in PG can be significantly faster than Scala inside Spark.

missosoup 2428 days ago

> I believe highly optimized C code in PG can be significantly faster than Scala inside Spark.

There's no question about this. If you can express your task in terms of PG on a single instance, then you probably should.

When you get to more complex tasks, like running input through GloVe and pushing ngrams to a temporal store, PG offers very little, which is fine; it's not at all what PG is designed for. Inter-node IO eclipses single-node perf, which is why Spark is used despite being terribly inefficient (although Spark is so inefficient that for intermediate-sized workloads you'd actually be better off vertically scaling a single node and using something else). PG won't help at all with these tasks.
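
To make the n-gram half of that task concrete, here is a minimal single-process sketch in plain Python. It leaves out GloVe entirely, and the "temporal store" is just a list of `(timestamp, ngram)` rows; the function names and the whitespace tokenization are assumptions for illustration, not how any particular pipeline does it.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every run of `n` consecutive tokens, in order."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def timestamped_ngrams(text: str, n: int, ts: float) -> List[Tuple[float, Tuple[str, ...]]]:
    """Tokenize on whitespace and tag each n-gram with the ingest time `ts`.

    The returned rows stand in for writes to a temporal store.
    """
    tokens = text.lower().split()
    return [(ts, gram) for gram in ngrams(tokens, n)]


rows = timestamped_ngrams("Spark wins out today", 2, ts=1_600_000_000.0)
for row in rows:
    print(row)
```

Per document this is trivial; the cost being argued about is running it over a large corpus and moving the resulting rows between nodes.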

Also, that smorgasbord of extensions GP listed isn't offered by any cloud vendor as a managed service afaik, meaning you must roll and manage your own. Depending on your needs, that might be a show stopper.

riku_iki 2427 days ago

> like running input through GloVe and pushing ngrams to a temporal store

Why exactly do you think PG will not do this well?

missosoup 2427 days ago

Tell me how you'd do it and I'll tell you why it won't work :)
