by jzwinck 2179 days ago
The subtitle is "How can you process more data quicker?"

NumPy. It scores an A in Maturity and Popularity, and either an A or a B in Ease of Adoption depending on which Pandas features you use (e.g. GroupBy).

When you're using NumPy as the main show instead of an implementation detail inside Pandas, it is easier to adopt Numba or Cython, and there are huge gains to be made there. Most Pandas workloads on small clusters of, say, 10 machines or fewer could be implemented on a single machine.

Even simple operations on smallish data sets are often much faster in NumPy than Pandas.
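A rough way to check this on your own machine: run the same reduction on a Series and on its underlying array. The column name `x`, the data size and the repeat count below are made up for illustration, and the timings will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical small data set: one float column with 1,000 rows.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.random(1_000)})

series = df["x"]
array = series.to_numpy()  # the same values as a plain NumPy array


def pandas_version():
    # Arithmetic and reduction through the Pandas Series machinery.
    return ((series * 2.0) + 1.0).sum()


def numpy_version():
    # The same arithmetic and reduction directly on the array.
    return ((array * 2.0) + 1.0).sum()


# Both paths should compute the same number.
assert np.isclose(pandas_version(), numpy_version())

# Only the ratio between the two timings is interesting.
t_pandas = timeit.timeit(pandas_version, number=2_000)
t_numpy = timeit.timeit(numpy_version, number=2_000)
print(f"pandas: {t_pandas:.4f}s  numpy: {t_numpy:.4f}s")
```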

You don't have to leave Pandas behind; just try using NumPy and Numba for the hot parts of your code. Numba even lets you write Python code that runs with the GIL released, which can lead to linear speedup in the number of cores with much less work than multiprocessing, and without the overhead of copying data to multiple processes.
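A minimal sketch of the GIL-released pattern, assuming Numba is installed. The function names, the chunking scheme and the thread count are hypothetical, and the `ImportError` fallback is only there so the sketch still runs without Numba (in which case it is plain Python and the GIL is not released).

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback for illustration only: without Numba this is ordinary
    # Python, so the threads below will not run in parallel.
    def njit(*args, **kwargs):
        def decorator(func):
            return func
        return decorator


@njit(nogil=True)  # nogil=True releases the GIL while the compiled code runs
def sum_of_squares(values):
    total = 0.0
    for v in values:
        total += v * v
    return total


def parallel_sum_of_squares(values, n_threads=4):
    # array_split on a 1-D array returns views, so the worker threads
    # share the original buffer instead of receiving copies.
    chunks = np.array_split(values, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(sum_of_squares, chunks))


data = np.arange(100_000, dtype=np.float64)
sum_of_squares(data[:1])  # trigger compilation once before starting threads
result = parallel_sum_of_squares(data)
```

Threads share the array in place, which is where the saving over multiprocessing comes from. How close you get to linear scaling depends on the workload.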


This does not solve the issue of compute scalability: slow computations, which are fundamentally opaque, applied to large data frames. Given a series of data frames (or one large one that can be chunked), how do I apply a long-running function to each chunk? For that you need scalability across cores and machines, hence Dask.
Why do you consider computations to be opaque? Do you not have the source code?

There is a ton of low-hanging speedup in many computations that people treat as black boxes, often as a result of knowing something extra about the specific input data rather than relying on a generic implementation.
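One made-up example of "knowing something extra": if you know the lookup keys are sorted, unique and non-empty, a binary search via `np.searchsorted` can replace a generic join or hash lookup. The function name `lookup_sorted` and the sample data are hypothetical.

```python
import numpy as np


def lookup_sorted(keys, values, queries):
    """Look up each query in keys.

    ASSUMES keys is non-empty, sorted ascending and unique; a generic
    implementation cannot rely on that, but here we know our data.
    Returns (looked-up values with NaN for misses, boolean found mask).
    """
    pos = np.searchsorted(keys, queries)   # binary search per query
    pos = np.clip(pos, 0, len(keys) - 1)   # keep indices in range
    found = keys[pos] == queries           # exact matches only
    return np.where(found, values[pos], np.nan), found


keys = np.array([10, 20, 30, 40])
values = np.array([1.0, 2.0, 3.0, 4.0])
queries = np.array([30, 10, 25, 50])

result, found = lookup_sorted(keys, values, queries)
```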

In some cases all you need is to write NumPy code instead of Pandas code for a 2-3x speedup. Then suddenly your small-cluster program runs on one machine.

Besides the speedup from using native NumPy, there's also the potential for a 50-100x speedup if your code isn't vectorized to begin with, and anywhere from 1-1000x if there are a couple of joins in there that you can optimize.
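To make "not vectorized" concrete, here is a small sketch with made-up column names (`price`, `qty`). It only shows the two styles and checks that they agree; the actual speedup depends on the data and the function.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10,000 rows with a float and an integer column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.random(10_000),
    "qty": rng.integers(1, 10, 10_000),
})

# Not vectorized: a Python function is called once per row.
slow = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Vectorized: one array operation over whole columns.
fast = df["price"].to_numpy() * df["qty"].to_numpy()

# Same answer either way.
assert np.allclose(slow.to_numpy(), fast)
```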

But for the latter, see the discussion elsewhere in these comments on shifting the Pandas compute to an RDBMS.

SK Learn is the most popular ML lib. It's well written, the source code is available, etc. But I am not opening it up to optimize it, and neither should anyone else unless they are already an SK Learn contributor or have a ton of time on their hands.