|
|
|
|
|
by haltingproblem
2179 days ago
|
|
This does not solve the issue of compute scalability - slow computations, which are fundamentally opaque, applied to large data frames . Given a series of data frames (or one large one that can be chunked) how do I apply a long running function to each chunk. For that you need scalability across cores and machines hence Dask. |
|
There is a ton of low hanging speed in many computations that people treat as black boxes. Often as the result of knowing something extra about the specific input data rather than relying on a generic implementation.
In some cases all you need is to write NumPy code instead of Pandas code for a 2-3x speedup. Then suddenly your small cluster program runs on one machine.