Julia actually initially had OpenMP-backed parallelism (ParallelAccelerator.jl), but they're moving away from OpenMP towards a novel, native task-parallelism framework more inspired by things like Cilk[0].
I don't know about "clearly the way to go". I think Julia's parallelism models have proven themselves to be very robust, performant and composable, more so than OpenMP as far as I'm aware.
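To make the composability point concrete, here is a minimal sketch of the Cilk-style spawn/fetch pattern using `Threads.@spawn` from Base (Julia 1.3 or later). The function name `psum` and the `cutoff` parameter are made up for illustration, and the cutoff value is a guess rather than a tuned number.

```julia
# Divide-and-conquer sum. Each level spawns one task for the left half
# and handles the right half itself, so nested calls compose naturally.
# `psum` and `cutoff` are hypothetical names for this sketch.
function psum(xs::AbstractVector{<:Number}; cutoff::Int = 10_000)
    n = length(xs)
    # Small inputs are summed serially to keep task overhead down.
    n <= cutoff && return sum(xs)
    mid = n ÷ 2
    # The left half runs as a task that the scheduler may place on any thread.
    left = Threads.@spawn psum(view(xs, 1:mid); cutoff = cutoff)
    right = psum(view(xs, mid+1:n); cutoff = cutoff)
    return fetch(left) + right
end
```

Start Julia with more than one thread (for example `julia -t 4`) to actually run the spawned tasks in parallel; with a single thread the code still works, just serially.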
How can I annotate an existing loop to offload it to the GPU, and/or to AVX, and/or to CPU cores?
Without this ability, in practice I use far less parallelism.
There is currently no official solution in Julia that I'm aware of. However, several people are working on it, and a few experimental solutions are under active development.
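For the CPU side only, Base does have per-target loop annotations, just not a single one that also covers the GPU. A rough sketch, where the function names `axpy_threads!` and `axpy_simd!` are made up for illustration: `Threads.@threads` splits the iterations across CPU threads, and `@simd` gives the compiler permission to vectorize the loop (it does not guarantee that AVX instructions are emitted).

```julia
# Spread loop iterations across the available CPU threads.
function axpy_threads!(y, a, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] = a * x[i] + y[i]
    end
    return y
end

# Allow the compiler to reorder and vectorize the loop body (SIMD).
function axpy_simd!(y, a, x)
    @inbounds @simd for i in eachindex(y, x)
        y[i] = a * x[i] + y[i]
    end
    return y
end
```

Both functions compute the same `y .= a .* x .+ y` update; the only difference is which annotation is on the loop.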
AutoOffload is something different: it tries to do linear algebra in a way that automatically offloads to GPUs or other heterogeneous hardware. GPUifyLoops is the right answer here, and its next incarnation is KernelAbstractions.jl. These automatically construct GPU kernels and the like from loops.
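As a rough idea of what that looks like, here is a sketch that assumes the KernelAbstractions.jl 0.9-series API (`@kernel`, `@index`, `@Const`, `get_backend`, `synchronize`); earlier releases used an event-based interface instead, so check the version you have installed. The names `axpy_kernel!` and `axpy_ka!` and the workgroup size of 64 are made up for illustration.

```julia
using KernelAbstractions  # third-party package; 0.9-series API assumed

# The loop body is written once, per index, and compiled for whichever
# backend the arrays live on.
@kernel function axpy_kernel!(y, a, @Const(x))
    i = @index(Global)
    @inbounds y[i] = a * x[i] + y[i]
end

# Hypothetical wrapper: pick the backend from the output array and launch.
function axpy_ka!(y, a, x; workgroupsize = 64)
    backend = KernelAbstractions.get_backend(y)  # CPU() for a plain Array
    kernel! = axpy_kernel!(backend, workgroupsize)
    kernel!(y, a, x; ndrange = length(y))
    KernelAbstractions.synchronize(backend)      # wait for the launch to finish
    return y
end
```

With plain `Array`s this runs on the CPU backend. Passing GPU arrays (for example from CUDA.jl) should dispatch the same kernel to the GPU, though that part is untested in this sketch.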
[0] https://julialang.org/blog/2019/07/multithreading/