by disgruntledphd2 1984 days ago
So do I, theoretically at least.

But Spark is super cool and actually has algorithms which complete in a reasonable time frame on hardware I can get access to.

Like, I understand that the SQL portion is pretty commoditised (though even there, the SparkSQL Python and R APIs are super nice), but I'm not aware of any other frameworks for doing distributed training of ML models.

Have all the hipsters moved to GPUs or something? /s

> Snowflake is a killer in managed distributed DWH that you don't have to tinker with and tune

It's so very expensive though, and their pricing model is frustratingly annoying (why the hell do I need tickets?).

That being said, tuning Spark/Presto or any of the non-managed alternatives is no fun either, so I wonder if it's the right tradeoff.

One thing I really, really like about Spark is the ability to write Python/R/Scala code to solve the problems that cannot be usefully expressed in SQL.

All the replies to my original comment seem to forget that, or maybe Snowflake has such functionality and I'm unaware of it.


>I'm not aware of any other frameworks for doing distributed training of ML models.

TensorFlow, PyTorch (not sure if Ray is needed) and MXNet all support distributed training across CPUs/GPUs on a single machine or multiple machines. So does XGBoost if you don't want deep learning. You can then run them with Kubeflow or on whatever platform your SaaS provider has (GCP AI Platform, AWS SageMaker, etc.).

edit:

>All the replies to my original comment seem to forget that, or maybe Snowflake has such functionality and I'm unaware of it.

Snowflake has support for custom JavaScript UDFs and a lot of built-in features (you can do absurd things with window functions). I also found it much faster than Spark.

> Snowflake has support for custom JavaScript UDFs and a lot of built-in features (you can do absurd things with window functions). I also found it much faster than Spark.

UDF support isn't really the same, to be honest. You're still a prisoner of the SELECT ... FROM pattern. Don't get me wrong, SQL is wonderful where it works, but it doesn't work for everything that I need.

I completely agree that it's faster than Spark, but it's also super-expensive and more limited. I suspect it would probably be cheaper to run a managed Spark cluster vs Snowflake and just eat the performance hit by scaling up.

> TensorFlow, PyTorch (not sure if Ray is needed) and MXNet all support distributed training across CPUs/GPUs on a single machine or multiple machines. So does XGBoost if you don't want deep learning.

I forgot about XGBoost, but I'm a big fan of unsupervised methods (as input to supervised methods, mostly) and Spark has a bunch of these. I haven't ever tried to do it, but based on my experience of running deep learning frameworks and distributed ML, I suspect combining the two would be exponentially more annoying ;) (And I deal mostly with structured data, so it doesn't buy me as much.)
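To make the "unsupervised as input to supervised" idea concrete, here is a minimal sketch in plain Python rather than Spark. In Spark this role would be played by something like MLlib's KMeans. The tiny 1-D k-means, the `spend` values and the starting centers are all made up for illustration. The point is only that the cluster id becomes one more feature column for whatever supervised model comes next.

```python
from typing import List


def nearest_center(value: float, centers: List[float]) -> int:
    """Index of the center closest to `value` (ties go to the lower index)."""
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))


def kmeans_1d(values: List[float], centers: List[float], iters: int = 10) -> List[float]:
    """Tiny Lloyd's-algorithm k-means on 1-D data, starting from given centers."""
    centers = list(centers)
    for _ in range(iters):
        buckets = [[] for _ in centers]
        for v in values:
            buckets[nearest_center(v, centers)].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(b) / len(b) if b else c for b, c in zip(buckets, centers)]
    return centers


# Hypothetical structured data: one numeric column.
spend = [1.0, 2.0, 3.0, 50.0, 52.0, 54.0]
centers = kmeans_1d(spend, centers=[0.0, 100.0])

# The cluster id is appended as an extra feature for a downstream supervised model.
features = [[v, nearest_center(v, centers)] for v in spend]
```

The supervised step is left out on purpose. Any classifier or regressor that accepts the extra column would do.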

> You can then run them with Kubeflow or on whatever platform your SaaS provider has (GCP AI Platform, AWS SageMaker, etc.).

Do people really find these tools useful? Again, I'm not really sure what SageMaker (for example) buys me on AWS, and their pricing structure is so opaque that I'm hesitant to even invest time in it.

>UDF support isn't really the same, to be honest. You're still a prisoner of the SELECT ... FROM pattern. Don't get me wrong, SQL is wonderful where it works, but it doesn't work for everything that I need.

Not sure how it's different from what you can do in Spark in terms of data transformations. Taking a list of objects as an argument basically allows your UDF to do arbitrary computations on tabular data.
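As a rough illustration of that point, here is plain Python standing in for a UDF body. This is not Snowflake's actual UDF API, and the function name, the `ts` field and the 30-minute gap are made up. Once a group's rows have been collected into a list, the function can run any stateful logic over them, for example splitting events into sessions:

```python
from typing import Dict, List


def sessionize(events: List[Dict], gap_seconds: int = 1800) -> List[Dict]:
    """Hypothetical UDF body: tag each event with a session number.

    `events` is a list of row-like dicts with a numeric `ts` field (seconds).
    A new session starts whenever the gap to the previous event exceeds
    `gap_seconds`.
    """
    out = []
    session = 0
    prev_ts = None
    for row in sorted(events, key=lambda r: r["ts"]):
        if prev_ts is not None and row["ts"] - prev_ts > gap_seconds:
            session += 1
        out.append({**row, "session": session})
        prev_ts = row["ts"]
    return out


# Rows as they might arrive after being aggregated into an array per user.
rows = [{"ts": 0}, {"ts": 600}, {"ts": 4000}, {"ts": 4100}]
sessions = [r["session"] for r in sessionize(rows)]
```

Whether this is nicer than the window-function equivalent is a separate question. The sketch only shows that the list argument leaves the computation unconstrained.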

> I forgot about XGBoost, but I'm a big fan of unsupervised methods (as input to supervised methods, mostly) and Spark has a bunch of these.

That's true, distributed unsupervised methods aren't done in most other places I know of. I'm guessing there are ways to do that with neural networks, although I haven't looked into it. The datasets I deal with have structure in them between events even if they're unlabeled.

>I completely agree that it's faster than Spark, but it's also super-expensive and more limited. I suspect it would probably be cheaper to run a managed Spark cluster vs Snowflake and just eat the performance hit by scaling up.

I used to do that on AWS. For our use case, Athena ate its lunch in terms of performance, latency and cost by an order of magnitude. Snowflake is priced based on demand so I suspect it'd do likewise.

Spark has a superset of the functionality Athena has. Athena is faster, but it's also very limited. They're not designed to do the same thing.