by sambe 3234 days ago
I rarely hear people complain about genuine use-cases but this would seem to be one. However, aren't most/all of the dataframe operations done in C extensions in these cases?

While a lot of NumPy is C and Fortran, Pandas is mostly pure Python and some Cython. And mostly it does not release the GIL.

You often end up having to implement your own C extensions or use Numba for the core of your processing. Even with BLAS enabled, NumPy has almost zero intrinsic parallelism, np.dot() being the notable exception which releases the GIL and uses multicore by itself.

> Even with BLAS enabled, NumPy has almost zero intrinsic parallelism, np.dot() being the notable exception which releases the GIL and uses multicore by itself.

Is there any sort of list (comprehensive or otherwise) that denotes which NumPy functions are parallelism-friendly? I mean this whether it's in terms of releasing the GIL, in terms of SIMD support, or in terms of being multi-core.

Why are you asking this using a throwaway?

np.dot() is multicore. np.load() (and family) releases the GIL. SIMD mostly depends on the build system, so if you want it you might need to build NumPy from source.

https://stackoverflow.com/questions/24022723/where-can-i-fin...

Is there a way to disable this? In an HPC environment, I don't want routines going multi-core without my explicit permission, under any circumstances. I will already have manually set up the parallelization to be at the highest logical level. If using Python, that usually means I have planned out the number of processes to be equal to the number of cores. If each process then starts doing its own multicore calculation (badly load-balanced!) it overtaxes the node and slows everything down.

I really wish numpy/pandas/scipy wouldn't do this kind of uncontrollable parallelization.

Underlying implementations often have a way to disable parallelism, e.g., OMP_NUM_THREADS=1 or MKL_NUM_THREADS=1.
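A minimal sketch of setting these from inside Python rather than from the shell, in case that is more convenient for a one-process-per-core setup. The helper name limit_native_threads is made up for illustration, and OPENBLAS_NUM_THREADS is an addition not mentioned above (it only matters if NumPy is linked against OpenBLAS). These variables are generally read when the native library is first loaded, so this should run before NumPy is imported for the first time in the process.

```python
import os

# OMP_NUM_THREADS and MKL_NUM_THREADS come from the comment above;
# OPENBLAS_NUM_THREADS is an extra assumption for OpenBLAS-linked builds.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def limit_native_threads(n_threads=1):
    """Hypothetical helper: cap the thread count of common BLAS/OpenMP backends.

    Call this before the first `import numpy`, since the backends
    typically read these variables once, at load time.
    """
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(n_threads)
    # Return what was set so the caller can log or check it.
    return {name: os.environ[name] for name in THREAD_ENV_VARS}


settings = limit_native_threads(1)
print(settings)
```

Worker processes started afterwards inherit the environment, so calling this once in the parent before spawning one process per core should be enough in most setups.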