| HN Mirror


	by haltingproblem 2179 days ago
	This does not solve the issue of compute scalability: slow, fundamentally opaque computations applied to large data frames. Given a series of data frames (or one large one that can be chunked), how do I apply a long-running function to each chunk? For that you need scalability across cores and machines, hence Dask.
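
To make the chunked setup concrete, here is a minimal sketch using only Python's standard library rather than Dask. The names `split_into_chunks`, `slow_fn`, and `map_chunks` are hypothetical, and plain lists stand in for data frame chunks.

```python
from concurrent.futures import Executor, ThreadPoolExecutor


def split_into_chunks(rows, chunk_size):
    """Split a list of rows into consecutive chunks of at most chunk_size."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def slow_fn(chunk):
    """Hypothetical stand-in for an opaque, long-running function."""
    return sum(x * x for x in chunk)


def map_chunks(func, chunks, executor: Executor):
    """Apply func to every chunk via the executor, preserving chunk order."""
    return list(executor.map(func, chunks))


rows = list(range(10))
chunks = split_into_chunks(rows, 4)

# A thread pool keeps the sketch portable. For CPU-bound work, a
# ProcessPoolExecutor (run under an `if __name__ == "__main__":` guard)
# is the usual way to spread chunks across cores.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = map_chunks(slow_fn, chunks, pool)
```

This only covers a single machine. Spreading the same map over several machines is the part a scheduler such as Dask is meant to handle.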

jzwinck 2179 days ago

Why do you consider computations to be opaque? Do you not have the source code?

There is a ton of low-hanging speedup in many computations that people treat as black boxes, often as the result of knowing something extra about the specific input data rather than relying on a generic implementation.

In some cases all you need is to write NumPy code instead of Pandas code for a 2-3x speedup. Then suddenly your small-cluster program runs on one machine.
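
As a rough illustration of what "NumPy code instead of Pandas code" can look like, here is a small sketch. The column names and values are made up, and the actual speedup will depend on the data and the operation, so no timings are claimed here.

```python
import numpy as np
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1.0, 2.0, 3.0]})

# Pandas version: each step goes through Series machinery
# (index alignment, intermediate Series objects).
total_pd = (df["price"] * df["qty"]).sum()

# NumPy version: extract the raw arrays once, then work on them directly.
price = df["price"].to_numpy()
qty = df["qty"].to_numpy()
total_np = float(np.dot(price, qty))
```

Both versions compute the same number. The NumPy one simply skips the per-operation overhead that Pandas adds on top of the underlying arrays.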

isoprophlex 2179 days ago

Besides the speedup from using native NumPy, there's also the potential for a 50-100x speedup if your code isn't vectorized to begin with, and anywhere from 1-1000x if there are a couple of joins in there that you can optimize.

But for the latter, see the discussion elsewhere in these comments on shifting the Pandas compute to an RDBMS.
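
For the non-vectorized case mentioned above, a minimal sketch of the difference follows. The function names and the arithmetic are hypothetical, chosen only to show an element-by-element Python loop next to the equivalent whole-array expression.

```python
import numpy as np

values = np.arange(1_000, dtype=np.float64)


def scaled_loop(xs, factor):
    """Non-vectorized: one Python-level iteration per element."""
    out = np.empty_like(xs)
    for i in range(len(xs)):
        out[i] = xs[i] * factor + 1.0
    return out


def scaled_vectorized(xs, factor):
    """Vectorized: the same arithmetic as a single array expression."""
    return xs * factor + 1.0


loop_result = scaled_loop(values, 2.0)
vec_result = scaled_vectorized(values, 2.0)
```

The two functions return identical arrays. The vectorized one pushes the loop down into NumPy's compiled code, which is where the large speedups typically come from.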

haltingproblem 2178 days ago

SK Learn is one of the most popular ML libs. It is well written, the source code is available, etc. But I am not opening it up to optimize it, and neither should anyone unless they are already an SK Learn contributor or have a ton of time on their hands.