| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by hcrisp 3764 days ago

1/3 time is spent in the p_reduce step, and another 1/3 in elemwise. Not exactly sure what those do, but I'm guessing it's related to the reduce-map-reduce steps of evaluating the standard deviation and then dividing the elements by this value. The mean has to be calculated twice in the formula of the z score. It sounds like the client-worker communication mechanism might have extra latency.

I wonder if this would work if the dask arrays are not equal in length, for example if the files were time series of unequal duration.

Also, are there any plans for dask to support distributed numpy functions requiring kernel computation at the array boundaries? For example scipy.signal.lfilt? I believe it would require ghosting or further inter-dask-array communication that is not yet present.

1 comments

mrocklin 3764 days ago

See http://dask.pydata.org/en/latest/ghost.html

link