Hacker News
by longemen3000 2309 days ago
In Julia, where the parallelization options are explicit (SIMD, AVX, threads or multiprocessing), it always depends on the load. For small operations (around 10,000 elements) a single thread is faster, simply because of the thread-spawning overhead (around 1 microsecond). And there is the issue of the independent BLAS threading model, where the BLAS threads sometimes interfere with Julia threads... In a nutshell, parallelization is not a magic bullet, but it is a good bullet to have at your disposal anyway.
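
To make the size trade-off concrete, here is a minimal sketch of a reduction that only goes multithreaded above a cutoff. The names `SERIAL_CUTOFF`, `sum_serial`, `sum_threaded` and `sum_adaptive` are hypothetical, and the cutoff value is just the rough figure quoted above, not a measured one; the real break-even point depends on the machine and on the work done per element.

```julia
using Base.Threads

# Hypothetical cutoff taken from the rough figure in the comment above;
# measure on your own hardware before relying on it.
const SERIAL_CUTOFF = 10_000

# Single-threaded loop; @simd allows the compiler to vectorize the reduction.
function sum_serial(xs::AbstractVector{Float64})
    acc = 0.0
    @inbounds @simd for i in eachindex(xs)
        acc += xs[i]
    end
    return acc
end

# One contiguous chunk per thread, each summed in its own task,
# then the partial sums are combined.
function sum_threaded(xs::AbstractVector{Float64})
    n = length(xs)
    nchunks = max(1, min(nthreads(), n))
    chunk = cld(n, nchunks)
    tasks = map(1:nchunks) do c
        lo = (c - 1) * chunk + 1
        hi = min(c * chunk, n)
        Threads.@spawn sum_serial(view(xs, lo:hi))
    end
    return sum(fetch, tasks)
end

# Skip task spawning entirely for small inputs.
sum_adaptive(xs) = length(xs) < SERIAL_CUTOFF ? sum_serial(xs) : sum_threaded(xs)
```

Start Julia with more than one thread (for example `julia -t 4`) for the threaded path to actually run in parallel; with a single thread it still returns the same result.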
2 comments

> And there is the issue of the independent BLAS threading model, where the BLAS threads sometimes interfere with Julia threads

Julia has composable multithreading, and adopting that model fixed how FFTW's threads compose with Julia's. The same can be done for OpenBLAS, and IIRC there is a PR open for it.
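
Until BLAS composes with Julia's scheduler, one common workaround (not the composable fix described above) is to restrict BLAS to one thread and parallelize the outer loop with Julia threads. A minimal sketch, assuming the work is a batch of independent matrix products; `batched_matmul` is a hypothetical helper name:

```julia
using LinearAlgebra

# Assumption: the outer loop already uses Julia threads, so each BLAS call
# is limited to one thread to avoid oversubscribing the cores.
BLAS.set_num_threads(1)

# Multiply matching pairs of matrices, one pair per loop iteration.
function batched_matmul(As, Bs)
    Cs = Vector{Matrix{Float64}}(undef, length(As))
    Threads.@threads for i in eachindex(As)
        Cs[i] = As[i] * Bs[i]   # each product runs single-threaded inside BLAS
    end
    return Cs
end
```

For a few large matrices the opposite setting (many BLAS threads, no Julia-level threading) may well be faster, so this is worth benchmarking rather than assuming.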

Yeah, I'm waiting for that PR, hahaha
Do you know if Julia will add OpenMP support? It's clearly the way to go for offloading to hardware in a productive way.
Julia actually had OpenMP-backed parallelism initially (ParallelAccelerator.jl), but they're moving away from OpenMP towards a novel, native task-parallelism framework more inspired by things like Cilk[0].

[0] https://julialang.org/blog/2019/07/multithreading/
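
For a feel of what that Cilk-style model looks like, here is a minimal divide-and-conquer sketch using `Threads.@spawn` (available since Julia 1.3). The function name `psum` and the `cutoff` default are hypothetical choices for illustration, not something taken from the blog post:

```julia
# Cilk-style divide and conquer: spawn one half as a task, compute the other
# half in the current task, then join with fetch.
function psum(f, xs::AbstractVector, lo::Int=firstindex(xs), hi::Int=lastindex(xs);
              cutoff::Int=4096)
    # Below the (hypothetical) cutoff, fall back to a plain serial loop.
    if hi - lo + 1 <= cutoff
        acc = 0.0
        for i in lo:hi
            acc += f(xs[i])
        end
        return acc
    end
    mid = (lo + hi) >>> 1
    left = Threads.@spawn psum(f, xs, lo, mid; cutoff=cutoff)
    right = psum(f, xs, mid + 1, hi; cutoff=cutoff)
    return fetch(left) + right
end
```

Because tasks are scheduled dynamically, a function like this can be called from inside other threaded code without the caller needing to know how many threads it uses, which is the composability being discussed here.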

I don't know about "clearly the way to go". I think Julia's parallelism models have proven themselves to be very robust, performant and composable, more so than OpenMP as far as I'm aware.
How can I annotate an existing loop to offload it to the GPU and/or AVX and/or CPU cores? Without this ability, in practice I use far less parallelism.
There is currently no official solution in Julia that I'm aware of. However, several people are working on it, and a few experimental solutions are under active development:

https://github.com/JuliaDiffEq/AutoOffload.jl

https://juliagpu.gitlab.io/GPUifyLoops.jl/

AutoOffload is something different: it tries to do linear algebra in a way that auto-offloads to GPUs or heterogeneous hardware. GPUifyLoops is the right answer here, and its next incarnation is KernelAbstractions.jl. These auto-construct GPU kernels and such from loops.