by alcidesfonseca 2295 days ago
If I am not mistaken, Aparapi included some templates where those low-level aspects were hidden.

The runtime can be added via a jar file, but lambda-based operations must be converted to OpenCL/CUDA/PTX/LLVM or another low-level GPGPU language.

Aparapi did the latter using runtime bytecode instrumentation. TornadoVM does the same thing, as a JDK compiler plugin. AeminiumGPU did it using a transpiler [0] that ran before the actual Java compilation step.

[0] https://github.com/AEminium/AeminiumGPUCompiler
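
To make "lambda-based operations must be converted" concrete, here is a minimal sketch in plain Java. It is not code from Aparapi, TornadoVM, or AeminiumGPU; the class `LambdaMapSketch` and the method `map` are hypothetical names, and the OpenCL kernel in the comment is only a hand-written guess at what a translated lambda body could look like.

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;
import java.util.stream.IntStream;

// Hypothetical illustration only: these names do not come from any of the
// frameworks discussed in this thread.
class LambdaMapSketch {

    // Applies `body` to every element, one logical "work-item" per index.
    // A GPGPU framework would have to turn the lambda passed as `body`
    // into kernel code roughly like this (hand-written, assumed shape):
    //
    //   __kernel void map(__global const int* in, __global int* out) {
    //       int i = get_global_id(0);
    //       out[i] = in[i] * in[i] + 1;
    //   }
    static int[] map(int[] in, IntUnaryOperator body) {
        int[] out = new int[in.length];
        // Each index writes a distinct slot, so iterations are independent
        // and can safely run in parallel on the CPU as well.
        IntStream.range(0, in.length)
                 .parallel()
                 .forEach(i -> out[i] = body.applyAsInt(in[i]));
        return out;
    }

    public static void main(String[] args) {
        int[] result = map(new int[] {1, 2, 3, 4}, x -> x * x + 1);
        System.out.println(Arrays.toString(result)); // prints [2, 5, 10, 17]
    }
}
```

The point of the sketch is the split: the host-side `map` is ordinary library code that can ship in a jar, while the lambda body is the part that needs a compiler, instrumentation, or a transpiler to reach the device.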

1 comment

Aparapi translates Java bytecode directly to OpenCL. To do so, it provides a compiler and a runtime system that automatically handle data and execute the generated OpenCL kernel.

TornadoVM also compiles Java bytecode to OpenCL. In addition, it optimizes and specializes the code by interleaving Graal compiler optimizations (such as partial escape analysis, canonicalization, loop unrolling, and constant propagation) with GPU/CPU/FPGA-specific optimizations (e.g., parallel loop exploration, automatic use of local memory, and exploration of parallel skeletons such as reductions). TornadoVM generates different OpenCL code depending on the target device, so the code generated for GPUs differs from the code generated for FPGAs and multi-cores. This is because OpenCL code is portable across devices, but its performance is not. TornadoVM addresses this challenge by applying compiler specializations per device.
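
Reductions are a good example of why a skeleton needs restructuring: a sequential `sum += a[i]` loop carries a dependency between iterations. Below is a minimal plain-Java sketch of the usual two-stage shape (per-group partial sums, then a final combine). It is not TornadoVM's generated code; `ReductionSketch`, `reduce`, and `groupSize` are hypothetical names, with `groupSize` standing in for an OpenCL work-group size.

```java
import java.util.stream.IntStream;

// Hypothetical illustration of a two-stage parallel reduction.
class ReductionSketch {

    static long reduce(int[] data, int groupSize) {
        if (groupSize <= 0) {
            throw new IllegalArgumentException("groupSize must be positive");
        }
        // Number of groups, rounded up so the tail chunk is covered.
        int groups = (data.length + groupSize - 1) / groupSize;
        long[] partial = new long[groups];

        // Stage 1: every group sums its own chunk independently. On a GPU
        // this is where local memory would typically hold the partial sums.
        IntStream.range(0, groups).parallel().forEach(g -> {
            int start = g * groupSize;
            int end = Math.min(start + groupSize, data.length);
            long sum = 0;
            for (int i = start; i < end; i++) {
                sum += data[i];
            }
            partial[g] = sum;
        });

        // Stage 2: combine the (few) partial results sequentially.
        long total = 0;
        for (long p : partial) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = IntStream.rangeClosed(1, 1000).toArray();
        System.out.println(reduce(data, 64)); // prints 500500
    }
}
```

A device-specific compiler could plausibly pick a different `groupSize`, or a different stage-2 strategy, per target, which is one way the same Java source can end up as different kernels.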

Additionally, TornadoVM performs live task migration between devices: it decides where to execute the code to increase performance (if possible). In other words, TornadoVM switches devices if it knows the new device offers better performance. As far as we know, this is not available in Aparapi (in which device selection is static). With task migration, TornadoVM's approach is to switch devices only if it detects that the application can run faster there than on the CPU using the code compiled by C2 or Graal JIT; otherwise it stays on the CPU. So TornadoVM can be seen as a complement to C2 and Graal. This is because no single piece of hardware executes all workloads efficiently. GPUs are very good at exploiting SIMD applications, and FPGAs are very good at exploiting pipelined applications. If your applications follow those models, TornadoVM will likely select heterogeneous hardware. Otherwise, it will stay on the CPU using the default compilers (C2 or Graal).
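
The decision rule described above can be sketched in a few lines of plain Java. This is only the "switch if strictly faster than the CPU baseline" policy, under the assumption that timings per device are already available; how TornadoVM actually measures and migrates is covered in the live task-migration reference, not here. `MigrationPolicySketch`, `choose`, and the device labels are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the device-selection rule only.
class MigrationPolicySketch {

    // Returns the label of the device to run on. The CPU (C2/Graal JIT code)
    // is the baseline; another device is chosen only if it was measured
    // strictly faster than the best time seen so far.
    static String choose(long cpuNanos, Map<String, Long> deviceNanos) {
        String best = "CPU";
        long bestTime = cpuNanos;
        for (Map.Entry<String, Long> e : deviceNanos.entrySet()) {
            if (e.getValue() < bestTime) {
                best = e.getKey();
                bestTime = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Long> measured = new LinkedHashMap<>();
        measured.put("GPU", 2_000_000L);  // assumed sample timings
        measured.put("FPGA", 5_000_000L);

        System.out.println(choose(10_000_000L, measured)); // prints GPU
        System.out.println(choose(1_000_000L, measured));  // prints CPU
    }
}
```

Ties go to the CPU in this sketch, which matches the "otherwise it will stay on CPU" wording.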

Some references:

* Compiler specializations: https://dl.acm.org/doi/10.1145/3237009.3237016

* Parallel skeletons: https://dl.acm.org/doi/10.1145/3281287.3281292

* Live task-migration: https://dl.acm.org/doi/10.1145/3313808.3313819