Just to complement the previous answer: it varies from kernel to kernel. For reductions, the strategy implemented in the TornadoVM JIT compiler lets reductions offloaded from sequential Java code run within 85% of the performance of hand-written OpenCL kernels:
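To make "reductions offloaded from Java sequential code" a bit more concrete, here is a minimal plain-Java sketch. It does not use the TornadoVM API and is not the code the JIT compiler generates. It only contrasts a sequential reduction loop with the general two-stage pattern commonly used for reductions on GPUs (per-group partial results, then a final combine). The class name `ReductionSketch`, the method names, and the `groupSize` parameter are assumptions made for illustration.

```java
import java.util.Arrays;

public class ReductionSketch {

    // Sequential reduction: a plain accumulation loop over the input.
    static float sequentialSum(float[] input) {
        float acc = 0.0f;
        for (int i = 0; i < input.length; i++) {
            acc += input[i];
        }
        return acc;
    }

    // Two-stage reduction (illustrative only, run sequentially here).
    // Stage 1: each hypothetical "work-group" reduces its own chunk.
    // Stage 2: the per-group partial results are combined.
    // groupSize must be positive.
    static float twoStageSum(float[] input, int groupSize) {
        int numGroups = (input.length + groupSize - 1) / groupSize;
        float[] partial = new float[numGroups];

        // Stage 1: on a GPU, each iteration of this outer loop would be
        // one work-group running in parallel with the others.
        for (int g = 0; g < numGroups; g++) {
            int start = g * groupSize;
            int end = Math.min(start + groupSize, input.length);
            float acc = 0.0f;
            for (int i = start; i < end; i++) {
                acc += input[i];
            }
            partial[g] = acc;
        }

        // Stage 2: combine the partial results into the final value.
        float result = 0.0f;
        for (float p : partial) {
            result += p;
        }
        return result;
    }

    public static void main(String[] args) {
        float[] data = new float[1024];
        Arrays.fill(data, 1.0f);
        System.out.println(sequentialSum(data));
        System.out.println(twoStageSum(data, 64));
    }
}
```

Both methods compute the same sum for this input. With floating-point data in general, the two-stage version may differ slightly from the sequential one because the additions are grouped differently.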