| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jaychia 686 days ago

I work on Daft and we’ve been collaborating with the team at Amazon to make this happen for about a year now!

We love Ray, and are excited about the awesome ecosystem of useful + scalable tools that run on it for model training and serving. We hope that Daft can complement the rest of the Ray ecosystem to enable large scale ETL/analytics to also run on your existing Ray clusters. If you have an existing Ray cluster setup, you absolutely should have access to best-in-class ETL/analytics without having to run a separate Spark cluster.

Also, on the nerdier side of things - the primitives that Ray provides gives us a real opportunity to build a solid non-JVM based, vectorized distributed query engine. We’re already seeing extremely good performance improvements here vs Spark, and are really excited about some of the upcoming work to get even better performance and memory stability.

This collaboration with Amazon really battle-tested our framework :) happy to answer any questions if folks have them.

1 comments

thedood 686 days ago

Good to see you here! It's been great working with Daft to further improve data processing on Ray, and the early results of incorporating Daft into the compactor have been very impressive. Also agree with the overall sentiment here that Ray clusters should be able to run best-in-class ETL without requiring a separate cluster maintained by another framework (Spark or otherwise). This also creates an opportunity to avoid many inefficient, high-latency cross-cluster data exchange ops often run out of necessity today (e.g., through an intermediate cloud storage layer like S3).